In silico Phenotyping via Co-training


Damian Roqueiro, Menno Witteveen, Verneri Anttila, Gisela Terwindt, Arn van den Maagdenberg, Karsten Borgwardt

In silico phenotyping via co-training for improved phenotype prediction from genotype

Summary

This work provides a proof of principle that co-training can be used to augment training datasets and thereby improve phenotype prediction from genotype.

Code

A beta version of the co-training pipeline can be downloaded here (GZ, 13 KB). This code will soon be uploaded as a new project on sourceforge.net.

Sample results

For a partition of the dataset into set I = 10%, set II = 70% and set III = 20%, the results generated by our co-training pipeline can be downloaded here (GZ, 137.9 MB). The results are organized into the following subdirectories:

cv_set: contains the 100 random permutations that split the data into sets I, II and III. In each random fold, every patient is assigned a value in {1, 2, 3} indicating the set in which they are placed.
pheno_imp: contains the imputed labels in set II for all random folds. The labels are soft (not binary) because the Bagged predictor returns the probability that a sample belongs to class 1 (migraine with aura).
random_forest: contains the final results of the h_g classifier when applied to set III. The file roc_auc.csv has the AUC scores for all 100 random folds. Additionally, the files mean_tpr.csv and mean_fpr.csv contain the averaged values used to plot the ROC curves.
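
The set assignment stored in cv_set can be illustrated with a short sketch. This is not the pipeline's own code: the function name assign_sets, the cohort size of 1000 patients and the use of the fold index as a random seed are all assumptions made for illustration. It only shows how each patient could receive a value in {1, 2, 3} under the 10% / 70% / 20% split, repeated over 100 random folds.

```python
import random


def assign_sets(n_patients, fractions=(0.10, 0.70, 0.20), seed=0):
    """Return one set label (1, 2 or 3) per patient.

    Hypothetical helper: the fractions follow the 10/70/20 split
    described above; the seeding scheme is an assumption.
    """
    rng = random.Random(seed)
    n_one = round(n_patients * fractions[0])
    n_two = round(n_patients * fractions[1])
    # Set III takes the remainder so the counts always sum to n_patients.
    n_three = n_patients - n_one - n_two
    labels = [1] * n_one + [2] * n_two + [3] * n_three
    rng.shuffle(labels)  # random placement of patients into the sets
    return labels


# 100 random folds; the cohort size (1000) is a made-up example value.
folds = [assign_sets(n_patients=1000, seed=fold) for fold in range(100)]
```

Each entry of folds is then a list with one label per patient, which mirrors the per-fold assignments described for cv_set; the actual file layout in the archive may differ.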

Publication

In silico phenotyping via co-training for improved phenotype prediction from genotype

Damian Roqueiro, Menno Witteveen, Verneri Anttila, Gisela Terwindt, Arn van den Maagdenberg and Karsten Borgwardt
ISMB 2015 and Bioinformatics 2015, 31 (12): i303-i310
Online | ETH Research Collection | Project page with code

Contact damian.roqueiro@bsse.ethz.ch for questions regarding usage of the pipeline or to report bugs.