PaRFR - Parallel Random Forest Regression for Hadoop
Random Forest (RF) is among the best-performing machine learning algorithms for classification tasks and has been successfully applied to the identification of genome-wide associations in case-control studies. RF can also be applied to population association studies with multivariate quantitative traits, in which case the classification task is replaced by a regression task. For instance, high-dimensional traits arise naturally in recent neuroimaging genetics studies, where phenotypic variability in the human brain is measured by means of 3D neuroimaging data. We have developed a parallel version of RF for regression tasks with both univariate and multivariate responses, called PaRFR (Parallel Random Forest Regression), to support multivariate quantitative trait loci mapping in unrelated subjects. PaRFR takes advantage of the MapReduce programming model and is deployed on Hadoop. Notable speed-ups have been obtained by introducing a distance-based criterion for node splitting.
Wang Y., Wong L. and Montana G. (2012) PaRFR: Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Preprint.
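The description above does not spell out the distance-based splitting criterion, so the following is only a minimal sketch of one way such a criterion can work, not PaRFR's actual implementation. It assumes the criterion relies on the standard identity that the within-node sum of squared deviations from the mean response equals the sum of pairwise squared Euclidean distances divided by the node size. Under that assumption, the pairwise distances between the multivariate responses are computed once, and every candidate split is then scored from that matrix without revisiting the high-dimensional trait vectors. The function names (pairwise_sq_dists, node_impurity, best_split) and the toy data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def pairwise_sq_dists(Y: Sequence[Sequence[float]]) -> List[List[float]]:
    """Squared Euclidean distance between every pair of response vectors."""
    n = len(Y)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = float(sum((a - b) ** 2 for a, b in zip(Y[i], Y[j])))
            D[i][j] = D[j][i] = d
    return D


def node_impurity(D: List[List[float]], idx: List[int]) -> float:
    """Within-node sum of squares computed from distances only.

    Uses: sum_i ||y_i - mean||^2 == (sum over pairs i<j of D[i][j]) / n.
    """
    n = len(idx)
    if n == 0:
        return 0.0
    total = sum(D[i][j] for a, i in enumerate(idx) for j in idx[a + 1:])
    return total / n


def best_split(
    x: Sequence[float], D: List[List[float]], idx: List[int]
) -> Tuple[Optional[float], float]:
    """Threshold on one predictor minimising left + right impurity.

    Returns (None, inf) when the predictor is constant within the node.
    """
    best: Tuple[Optional[float], float] = (None, float("inf"))
    # The largest observed value is skipped: it would leave the right child empty.
    for t in sorted({x[i] for i in idx})[:-1]:
        left = [i for i in idx if x[i] <= t]
        right = [i for i in idx if x[i] > t]
        score = node_impurity(D, left) + node_impurity(D, right)
        if score < best[1]:
            best = (t, score)
    return best


if __name__ == "__main__":
    # Toy data (made up): one SNP coded 0/1/2 and a two-dimensional trait.
    genotype = [0, 0, 1, 2]
    traits = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
    D = pairwise_sq_dists(traits)
    print(best_split(genotype, D, list(range(len(genotype)))))
```

In a MapReduce setting, one natural arrangement would be for each map task to grow trees on its own bootstrap samples using a routine like this, with the reduce step collecting the trees into a forest; that arrangement is likewise an assumption rather than something stated above.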