In silico Phenotyping via Co-training
Summary
This work provides a proof of principle that co-training can be used to augment training datasets and thereby improve phenotype prediction from genotype.
Code
A beta version of the co-training pipeline can be downloaded here (GZ, 13 KB). This code will soon be uploaded as a new project on sourceforge.net.
Sample results
For a partition of the dataset into set I = 10%, set II = 70% and set III = 20%, the results generated by our co-training pipeline can be downloaded here (GZ, 137.9 MB). The results are organized into the following subdirectories:
- cv_set: contains the 100 random permutations of the data into sets I, II and III. In each random fold, patients are assigned a value in {1, 2, 3} to indicate the set in which they are placed.
- pheno_imp: contains the imputed labels in set II for all random folds. The labels are soft (not binary) because the bagged predictor returns the probability of a sample belonging to class 1 (migraine with aura).
- random_forest: contains the final results of the hg classifier when applied to set III. The file roc_auc.csv contains the AUC scores for all 100 random folds. Additionally, the files mean_tpr.csv and mean_fpr.csv contain the averaged values used to plot the ROC curves.
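The partitioning described for cv_set can be illustrated with a short sketch. This is not the pipeline's own code: the function names `assign_sets` and `make_folds`, the seeding scheme, and the shuffle-then-slice approach are all assumptions made for illustration. It only shows one plausible way to assign each patient a value in {1, 2, 3} with a 10% / 70% / 20% split, repeated over 100 random folds.

```python
import random


def assign_sets(patient_ids, fractions=(0.10, 0.70, 0.20), seed=0):
    """Randomly assign each patient to set 1, 2 or 3 (sets I, II, III).

    Hypothetical helper: shuffles the patients, then slices the shuffled
    list according to `fractions`. Set III receives the remainder.
    """
    rng = random.Random(seed)
    ids = list(patient_ids)
    rng.shuffle(ids)

    n = len(ids)
    n1 = round(fractions[0] * n)  # size of set I
    n2 = round(fractions[1] * n)  # size of set II

    assignment = {}
    for i, pid in enumerate(ids):
        if i < n1:
            assignment[pid] = 1
        elif i < n1 + n2:
            assignment[pid] = 2
        else:
            assignment[pid] = 3
    return assignment


def make_folds(patient_ids, n_folds=100, seed=0):
    """Build `n_folds` independent random partitions (one dict per fold)."""
    ids = list(patient_ids)
    return [assign_sets(ids, seed=seed + k) for k in range(n_folds)]
```

For example, `make_folds(range(1000))` would return 100 dictionaries, each mapping a patient identifier to its set label for that fold. The actual file layout inside cv_set is not specified here, so writing these assignments to disk is left out.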
Publication
In silico phenotyping via co-training for improved phenotype prediction from genotype
Damian Roqueiro, Menno Witteveen, Verneri Anttila, Gisela Terwindt, Arn van den Maagdenberg and Karsten Borgwardt
ISMB 2015 and Bioinformatics 2015, 31 (12): i303-i310
Online | ETH Research Collection | Project page with code