Department of Human Genetics and Department of Biostatistics
University of California, Los Angeles, CA 90095
ABSTRACT
A random forest (RF) predictor (Breiman 2001) is an
ensemble of individual tree predictors. As part of their construction, RF
predictors naturally lead to a dissimilarity measure between the observations.
One can also define an RF dissimilarity measure between unlabelled data: the
idea is to construct an RF predictor that distinguishes the 'observed' data
from suitably generated synthetic data (Breiman 2003). The observed data are
the original unlabelled data, while the synthetic
data are drawn from a reference distribution. Recently, RF dissimilarities have
been used successfully in several unsupervised learning tasks involving genomic
data. Unlike standard dissimilarities, the relationship between the RF
dissimilarity and the variables can be difficult to disentangle. Here we
describe the properties of the RF dissimilarity and make recommendations on how
to use it in practice. An RF dissimilarity can be attractive because it
handles mixed variable types well, is invariant to monotonic transformations of
the input variables, is robust to outlying observations, and accommodates
several strategies for dealing with missing data. The RF dissimilarity easily
deals with a large number of variables due to its intrinsic variable selection.
For example, the Addcl1 RF dissimilarity weights the contribution of each
variable to the dissimilarity according to how dependent it is on other
variables.
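
The construction described above can be sketched in Python. This is a minimal illustration, not the authors' implementation. It makes several assumptions that the abstract does not state: scikit-learn's RandomForestClassifier stands in for the RF predictor, the reference distribution is taken to be the product of the empirical marginals (each variable resampled independently from its observed values), and the proximity is turned into a dissimilarity as sqrt(1 - proximity). The function name rf_dissimilarity and its parameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rf_dissimilarity(X, n_trees=500, seed=0):
    """Sketch of an RF dissimilarity for unlabelled data (assumptions in lead-in)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, p = X.shape

    # Synthetic data (assumed reference distribution): resample every column
    # independently, which keeps each marginal but breaks dependence
    # between variables.
    synthetic = np.column_stack(
        [rng.choice(X[:, j], size=n, replace=True) for j in range(p)]
    )

    # Label observed rows 1 and synthetic rows 0, then fit a forest that
    # tries to tell them apart.
    data = np.vstack([X, synthetic])
    labels = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(data, labels)

    # leaves[i, t] is the leaf index that observed row i reaches in tree t.
    leaves = forest.apply(X)

    # Proximity: fraction of trees in which two observed rows share a leaf.
    proximity = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        column = leaves[:, t]
        proximity += column[:, None] == column[None, :]
    proximity /= leaves.shape[1]

    # Assumed transformation from proximity to dissimilarity.
    return np.sqrt(np.clip(1.0 - proximity, 0.0, None))
```

The returned matrix is symmetric with a zero diagonal, so it can be passed to any clustering or scaling method that accepts a precomputed dissimilarity. Because the synthetic data are drawn at random, the result varies with the seed. Averaging over several forests is one way to stabilize it.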
We find that the RF dissimilarity is useful for detecting tumor sample clusters
on the basis of tumor marker expressions. In this application, biologically
meaningful clusters can often be described with simple thresholding rules.
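
To illustrate what a simple thresholding rule for a cluster might look like, the sketch below searches a single marker for the cut that best separates one cluster from the remaining samples. It is a generic single-variable search and not the procedure used in the report. The function name best_threshold_rule and its inputs are hypothetical.

```python
def best_threshold_rule(values, in_cluster):
    """Find the cut on one variable that best separates a cluster from the rest.

    values: list of floats, one marker value per sample.
    in_cluster: list of bools, True if the sample belongs to the cluster.
    Returns (threshold, direction, accuracy) with direction ">" or "<=",
    or None when all values are identical and no cut exists.
    """
    ordered = sorted(set(values))
    # Candidate cuts are midpoints between consecutive distinct values.
    cuts = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best = None
    for cut in cuts:
        # Samples classified correctly by the rule "value > cut means in cluster".
        above_hits = sum((v > cut) == c for v, c in zip(values, in_cluster))
        # The opposite rule "value <= cut" gets exactly the remaining samples right.
        for direction, hits in ((">", above_hits), ("<=", len(values) - above_hits)):
            accuracy = hits / len(values)
            if best is None or accuracy > best[2]:
                best = (cut, direction, accuracy)
    return best
```

Applied to each marker in turn, a search of this kind shows which single variable, if any, reproduces a cluster with a one-line rule such as "marker > threshold".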
KEY WORDS: random forest clustering,
biomarkers, ensemble predictors, random forest distance, random forest
dissimilarity, tree predictor clustering
TECHNICAL REPORT
A technical report for random forest clustering can be found here.
To cite the technical report, please use:
Tao Shi and Steve Horvath (2006). Unsupervised Learning with Random Forest
Predictors. Journal of Computational and Graphical Statistics, Volume 15,
Number 1, March 2006, pp. 118-138.