Random
Tao Shi, Steve Horvath
(http://www.ph.ucla.edu/biostat/people/horvath.htm)
Department of
Here we provide R code and data underlying the following article:
Shi T, Seligson D, Belldegrun AS, Palotie A, Horvath S. (2005) Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma. Mod Pathol. 2005 Apr;18(4):547-57
ABSTRACT
We describe a novel strategy (random forest clustering) for tumor profiling based on tissue microarray data. Random forest clustering is attractive for tissue microarray and other immunohistochemistry data since it handles highly skewed tumor marker expressions well and weighs the contribution of each marker according to its relatedness with other tumor markers. Since the procedure is unsupervised, no clinicopathological data or traditional classifications are used a priori. To facilitate unsupervised learning, an intrinsic dissimilarity measure between the patients was constructed with a random-forest analysis of the tumor markers. A technical report that describes
The RF clustering algorithm has recently been shown to be particularly suitable for tissue microarray (TMA) data, for the following reasons.
First, the clustering results do not change when one or more covariates are monotonically transformed, since the dissimilarity depends only on the feature ranks. This removes the need to symmetrize skewed covariate distributions.
Second, the RF dissimilarity weighs the contribution of each covariate in a natural way: the more closely a covariate is related to the other covariates, the more it affects the definition of the RF dissimilarity.
Third, the RF dissimilarity does not require the user to specify threshold values for dichotomizing tumor marker expressions. External threshold values for dichotomizing expressions in unsupervised analyses may reduce the information content or even bias the results.
We also compared the random forest clustering approach to the standard Euclidean distance based approach. Although there is good overlap between the results of the two algorithms, we find that the random forest clustering method works better for these data (see the supplementary information for Shi et al. 2004).
To visualize the tumor samples, we used classical multidimensional scaling, which takes as input the random forest dissimilarity between the samples and returns a set of points in a two-dimensional space such that the distances between the points approximate the original dissimilarities.
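The construction described above can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the code from the article: it assumes the randomForest package is installed, uses simulated skewed markers in place of the TMA data, relies on the unsupervised mode of randomForest() (no response supplied) to obtain proximities, and converts them to a dissimilarity as sqrt(1 - proximity), the form used in Shi and Horvath (2006). The object names (markers, rfDissim, mdsCoords) are made up for this example.

```r
library(randomForest)  # assumed to be installed

set.seed(1)

# Simulated stand-in for tumor marker data: 60 samples, 4 skewed markers.
# Two groups differ in markers m1 and m2. These are NOT the renal cancer data.
n <- 60
group <- rep(c(1, 2), each = n / 2)
markers <- data.frame(
  m1 = rexp(n, rate = ifelse(group == 1, 1, 0.2)),
  m2 = rexp(n, rate = ifelse(group == 1, 1, 0.2)),
  m3 = rexp(n),
  m4 = rexp(n)
)

# Unsupervised random forest: with no response, randomForest() contrasts the
# observed data with synthetic data and returns proximities between the
# observed samples.
rf <- randomForest(x = markers, ntree = 1000, proximity = TRUE)

# Turn the proximity (a similarity in [0, 1]) into a dissimilarity.
# pmax() guards against tiny negative values from rounding.
rfDissim <- sqrt(pmax(1 - rf$proximity, 0))

# Classical multidimensional scaling down to two dimensions.
mdsCoords <- cmdscale(as.dist(rfDissim), k = 2)

# Plot the samples, colored by the simulated group label.
plot(mdsCoords, col = group, pch = 19,
     xlab = "Scaling dimension 1", ylab = "Scaling dimension 2")
```

Because the trees split on thresholds, replacing a marker by a monotone transform such as log(m1) leaves the ordering of the samples on that marker unchanged, which is why no symmetrizing transformation is applied to the skewed markers here.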
Below we list an R tutorial and a sample data set involving 307 tumor samples and 8 tumor markers. The data were generated by David Seligson from the UCLA tissue array core (http://www.genetics.ucla.edu/tissuearray/).
R SOFTWARE
TUTORIAL: RFclustering applied to Renal Cancer
DEMO CODE
1) To install the R software, go to http://www.R-project.org
2) After installing R, install two additional R packages: randomForest and Hmisc
Open R and go to the menu "Packages\Install package(s) from CRAN", then choose randomForest. R will install the package automatically. When asked "Delete downloaded files (y/N)? ", answer "y". Repeat these steps for Hmisc.
3) Download the zip file containing:
a) R function file: "FunctionsRFclustering.txt", which contains several R functions needed for RF clustering and results assessment
b) A test data file: "testData.csv"
c) MDS coordinate file: "cmd1.csv"
d) The tutorial file: "RFclusteringTutorial.txt"
4) Unzip all the files into the same directory, for example "C:\temp\RFclustering"
5) Open the R software by double clicking its icon.
6) Open the tutorial file "RFclusteringTutorial.txt" in a text editor, e.g. Notepad or Microsoft Word
7) Copy and paste the R commands from the tutorial into the R session. Comments are preceded by "#" and are automatically ignored by R.
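For readers who prefer typing commands to using the menus, the setup steps above can be approximated from the R console as shown below. This is a sketch under assumptions: the CRAN mirror URL and the directory are examples to adapt, and the object names (needed, isInstalled, workDir, dat) are made up here. The file names are the ones listed in step 3. The commands of the tutorial itself are not reproduced.

```r
# Step 2: install the two required packages if they are not present yet.
needed <- c("randomForest", "Hmisc")
isInstalled <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
if (any(!isInstalled)) {
  install.packages(needed[!isInstalled], repos = "https://cloud.r-project.org")
}

# Step 4: directory holding the unzipped files (example path; adjust it).
workDir <- "C:/temp/RFclustering"

if (dir.exists(workDir)) {
  setwd(workDir)
  library(randomForest)
  library(Hmisc)

  # Step 3a: load the helper functions shipped with the tutorial.
  source("FunctionsRFclustering.txt")

  # Step 3b: read the test data and check its dimensions.
  dat <- read.csv("testData.csv")
  print(dim(dat))
} else {
  message("Directory not found: ", workDir)
}
```

Note the forward slashes in the path: inside an R string a single backslash starts an escape sequence, so a Windows path has to be written as "C:/temp/RFclustering" or "C:\\temp\\RFclustering".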
REFERENCES
The following article describes theoretical studies of RF clustering.
Tao Shi and Steve Horvath (2006) Unsupervised Learning with Random Forest Predictors. Journal of Computational and Graphical Statistics. Volume 15, Number 1, March 2006, pp. 118-138(21)
General introduction to random forests
Breiman L. Random forests. Machine Learning 2001;45(1):5-32.
L. Breiman and Adele Cutler’s random forests: http://stat-www.berkeley.edu/users/breiman/RandomForests/
The following reference describes the R implementation of random forests
Liaw A. and Wiener M. Classification and Regression by randomForest. R News, 2(3):18-22, December 2002.
2007-02-27
Please send your suggestions and comments to: shorvath@mednet.ucla.edu