Although numerous cancer studies have reported a large pool of gene signatures for diagnosis and prognosis, their poor generalization has prevented their translation into clinical practice.
With the aid of IBM's World Community Grid, the Mapping Cancer Markers (MCM) project aims to systematically survey the landscape of useful cancer gene signatures for multiple cancers. Based on these data, we characterize the distribution of the high-performing signatures in terms of the frequency of individual genes and network patterns, and we identify generalized motif families that give deeper insight into the molecular background of cancers and give rise to more reliable signatures for cancer detection and therapy.
State-of-the-art unsupervised learning methods partition the gene features into clusters of high connectivity. Established frequent-itemset mining algorithms are then used to identify co-occurring terms among these patterns. The most frequent motif families are further evaluated with frequentist and Bayesian methods in combination with performance measures such as the Matthews correlation coefficient (MCC) and the area under the ROC curve (AUC). The discovered clusters and motif families group genes with similar function and localization, as well as genes that share interaction and pathway networks.
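The itemset-mining and scoring steps can be illustrated with a minimal sketch. It assumes that each signature has already been reduced to a set of cluster labels, and it counts co-occurring labels by brute-force enumeration rather than with an established algorithm such as Apriori or FP-growth. The function names, the cluster labels, and the support threshold are hypothetical and are not taken from the MCM pipeline.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def frequent_itemsets(signatures, min_support, max_size=3):
    """Count every label combination of up to max_size items and keep those
    that occur in at least min_support signatures (brute force, illustrative)."""
    counts = Counter()
    for signature in signatures:
        labels = sorted(set(signature))
        for size in range(1, max_size + 1):
            for combo in combinations(labels, size):
                counts[frozenset(combo)] += 1
    return {items: n for items, n in counts.items() if n >= min_support}


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when the denominator is zero, a common convention."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical signatures, each expressed as a set of cluster labels.
example_signatures = [
    {"cluster_A", "cluster_B", "cluster_C"},
    {"cluster_A", "cluster_B"},
    {"cluster_B", "cluster_C"},
    {"cluster_A", "cluster_B", "cluster_D"},
]

# Label combinations that co-occur in at least three of the four signatures.
motifs = frequent_itemsets(example_signatures, min_support=3)
for items, support in sorted(motifs.items(), key=lambda kv: -kv[1]):
    print(sorted(items), support)
```

Enumerating all combinations is exponential in signature size, so a sketch like this is only practical for small signatures and small `max_size`; a dedicated frequent-itemset algorithm would be needed at scale.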