High-dimensional statistical approaches for heterogeneous molecular data in cancer medicine

Frank Dondelinger, MRC Biostatistics Unit

Nicolas Stadler, Frank Dondelinger, Sach Mukherjee

Abstract

Molecular interplay plays a central role in basic and disease biology. Patterns of interplay are thought to differ between biological contexts, such as cell type, tissue type, or disease state. Many high-throughput studies now span multiple such contexts and the data may therefore be heterogeneous with respect to patterns of interplay. This motivates a need for statistical approaches that can cope with molecular data that are heterogeneous in a multivariate sense.

In this work, we exploit recent advances in high-dimensional statistics (Stadler and Mukherjee, 2013b,a) to put forward tools for analysing heterogeneous molecular data in cancer medicine. We model the data using Gaussian graphical models (Rue and Held, 2005), and develop two useful techniques based on estimation of partial correlations using the graphical lasso (Friedman et al., 2008): a two-sample test that captures differences in molecular interplay or networks, and a mixture model clustering approach that simultaneously learns cluster assignments and multivariate network models that are cluster-specific.
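As a minimal illustration of the quantity that a Gaussian graphical model encodes, the sketch below converts a precision (inverse covariance) matrix into partial correlations using the standard identity rho_ij = -omega_ij / sqrt(omega_ii * omega_jj). It assumes the precision matrix has already been estimated, for example by the graphical lasso. The function name and the toy matrix are hypothetical and chosen only for illustration; this is not the implementation used in the work described here.

```python
import math


def partial_correlations(precision):
    """Convert a precision matrix (list of lists) into partial correlations.

    Uses rho_ij = -omega_ij / sqrt(omega_ii * omega_jj) for i != j,
    with the diagonal set to 1.0 by convention.
    """
    p = len(precision)
    result = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i == j:
                result[i][j] = 1.0
            else:
                scale = math.sqrt(precision[i][i] * precision[j][j])
                result[i][j] = -precision[i][j] / scale
    return result


# Hypothetical 3-variable precision matrix: variables 0 and 2 are
# conditionally independent given variable 1 (zero off-diagonal entry).
toy_precision = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
]
toy_partial = partial_correlations(toy_precision)
```

A zero entry in the precision matrix gives a zero partial correlation, which corresponds to a missing edge in the network. Sparse estimators such as the graphical lasso set many such entries exactly to zero.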

We demonstrate the characteristics of our methods using an in-depth simulation study, and proceed to apply them to proteomic data from The Cancer Genome Atlas (TCGA) "pan-cancer" study (Akbani et al., 2014). Our analysis of the TCGA data provides formal statistical evidence that protein networks differ significantly by cancer type. Furthermore, we show how multivariate models can be used to refine cancer subtypes and learn associated networks.

Our results demonstrate the challenges involved in truly multivariate analysis of heterogeneous molecular data and the substantive gains that high-dimensional methods can offer in this setting.

References

Akbani, R., Ng, P. K. S., Werner, H. M., Shahmoradgoli, M., Zhang, F., Ju, Z., Liu, W., Yang, J.-Y., Yoshihara, K., Li, J., et al. (2014). A pan-cancer proteomic perspective on The Cancer Genome Atlas. Nature Communications, 5(3887).

Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 9(3):432–441.

Rue, H. and Held, L. (2005). Gaussian Markov random fields: theory and applications. CRC Press, London.

Stadler, N. and Mukherjee, S. (2013a). Penalized estimation in high-dimensional hidden Markov models with state-specific graphical models. Annals of Applied Statistics, 7:2157–2179.

Stadler, N. and Mukherjee, S. (2013b). Two-sample testing in high-dimensional models. arXiv.org:1210.4584.