Classification of large DNA methylation datasets for identifying cancer drivers
Fabrizio Celli a, Fabio Cumbo a,b, and Emanuel Weitschek c,a
a Institute of Systems Analysis and Computer Science, National Research Council, Via dei Taurini 19, 00185 Rome, Italy
b Department of Engineering, Roma Tre University, Via della Vasca Navale 79, 00154 Rome, Italy
c Department of Engineering, Uninettuno International University, Corso Vittorio Emanuele II 39, 00186 Rome, Italy
Abstract
DNA methylation is a well-studied epigenetic modification that is crucial to the regulation of genome functioning. Its alterations play an important role in tumorigenesis and tumor suppression. Thus, studying DNA methylation data may help biomarker discovery in cancer. Since public data on DNA methylation are becoming abundant – and considering the high number of methylated sites (features) present in the genome – it is important to have a method for efficiently processing such large datasets. Relying on big data technologies, we propose BIGBIOCL, an algorithm that can apply supervised classification methods to datasets with hundreds of thousands of features. It is designed for the extraction of alternative and equivalent classification models through iterative deletion of selected features. We run experiments on DNA methylation datasets extracted from The Cancer Genome Atlas, focusing on three tumor types: breast, kidney, and thyroid carcinomas. We perform classifications, extracting several methylated sites and their associated genes with accurate performance (accuracy > 97%). Results suggest that BIGBIOCL can perform hundreds of classification iterations on hundreds of thousands of features in a few hours. Moreover, we compare the performance of our method with other state-of-the-art classifiers and with a widespread DNA methylation analysis method based on network analysis. Finally, we are able to efficiently compute multiple alternative classification models and to extract from large DNA methylation datasets a set of candidate genes to be further investigated to determine their active role in cancer. BIGBIOCL, the results of our experiments, and a guide to carrying out new experiments are freely available on GitHub at https://github.com/fcproj/BIGBIOCL.
Keywords: classification; machine learning; DNA methylation; cancer; disease diagnostic predictive models; algorithms and techniques to speed up the analysis of big medical data.
Introduction

DNA methylation is the addition of a methyl group (CH3) to cytosine residues in the sequence. In normal cells, this conversion results in different interaction properties assuring the proper regulation of gene expression and of gene silencing (Baylin et al., 2001). In the haploid human genome there are around 28 million CpG sites in a methylated or unmethylated state (Stevens et al., 2013). It is well known that inactivation of tumor-suppressor genes may occur as a consequence of hyper-methylation within the gene regions, and a large range of cancer-related genes can be silenced by DNA methylation in different types of tumors. Moreover, global hypo-methylation, which induces genomic instability, also contributes to cell transformation (Kulis et al., 2010). Thus, methylation generally corresponds to inactivity, but inactivity of a repressive factor means stimulation. This means that studying DNA methylation data to identify drivers in cancer is challenging. Another challenge is given by the reduction of the cost of data generation that, especially after the adoption of Next-Generation Sequencing technologies (Weitschek et al., 2014), has made available an enormous amount of raw data. The availability of big datasets creates problems for the application of classical algorithms for data mining and analysis (Greene et al., 2014). In this work, we focus on the adoption of big data technologies for the application of classification algorithms to large DNA methylation datasets. Even if there are many different definitions of big data, "Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze" (McKinsey Global Institute, 2012). This definition does not focus on a specific data size, but on the technology adopted to manage those datasets. We want to extract a set of genes that may play a role in a specific tumor by applying supervised learning methods to DNA methylation datasets with a large number of features (450 thousand CpG positions).
We aim to compute many classification models containing genes by applying optimized supervised learning algorithms, like Decision Trees (Quinlan, 1993) and Random Forests (Breiman, 2001; Svetnik et al., 2003). We rely on Apache Spark MLlib (Meng et al., 2016), running in standalone or cluster mode, in order to achieve adequate performance. In fact, the size of the input dataset does not allow analyzing and processing it in an acceptable time with non-big-data technologies. A previous classification study on DNA methylation (Danielsson et al., 2015) proposed MethPed, a tool for the identification of pediatric tumors. Researchers built the classification model behind MethPed from DNA methylation datasets with 450 thousand features. They first applied a large number of regression algorithms to select a subset of features with the highest predictive power; then, they adopted Random Forests to build the classification model. On the contrary, we want to apply classification algorithms to the entire dataset in order to obtain a large number of CpG sites and their associated genomic locations. Another study (Akalin et al., 2012) described methylKit, an R package for the analysis of DNA methylation data. This package adopts an unsupervised machine learning approach, working on unlabeled data. methylKit works in-memory and, even if it is multi-threaded, its execution is limited to a single machine. In contrast, we want to perform supervised machine learning on a cluster of computational nodes, in order to be able to scale with the increasing dimension of the input data. The algorithm proposed in our study is inspired by CAMUR, adapted to be applicable to large input datasets.
CAMUR (Classifier with alternative and multiple rule-based models) is a classification method that iteratively computes a rule-based classification model, eliminates combinations of extracted features from the input dataset, and repeats the classification until a stopping condition is verified (Cestarelli et al., 2016). The result of a CAMUR computation is a set of classification models. CAMUR worked on RNA sequencing cancer datasets with around 20 thousand features. In this work, we design and develop BIGBIOCL, a multiple tree-based classifier, to analyze DNA methylation datasets with more than 450 thousand features (Pidsley et al., 2013). Our goal is to extract candidate methylated sites and their related genes in a few hours.

Methods
In our experiments on the application of big data technologies to the classification of large DNA methylation datasets, we consider three types of cancer: Breast Invasive Carcinoma (BRCA), Thyroid Carcinoma (THCA), and Kidney Renal Papillary Cell Carcinoma (KIRP). We develop BIGBIOCL in order to run an iterative classification algorithm in big data environments, to achieve efficient supervised learning, and to extract multiple classification models. Then, we test our algorithm both on a single machine and on a Hadoop YARN cluster.
Datasets
The Cancer Genome Atlas (TCGA) is a project started in 2005 and maintained by the National Cancer Institute and the National Human Genome Research Institute (Weinstein et al., 2013). The TCGA is a 2.5-petabyte public dataset widely used in scientific research: searching "The Cancer Genome Atlas" on PubMed reveals more than 2,500 articles in the last 5 years. The TCGA dataset contains the genomic characterization of over 30 types of human cancer (Tomczak et al., 2015) from more than 11,000 patients. The dataset includes cancer genome profiles obtained from several NGS methods applied to patient tissues, like RNA sequencing, array-based DNA methylation sequencing, microRNA sequencing, and many others (Hayden, 2014; Weitschek et al., 2014; Shendure et al., 2008). In our work, we focus on DNA methylation data. In particular, we consider profiles obtained using the Illumina Infinium Human DNA Methylation 450 platform (HumanMethylation450), which provides quantitative methylation measurements at the CpG site level (Sandoval et al., 2011). HumanMethylation450 allows assessing the methylation status of more than 450 thousand CpG sites (Dedeurwaerder et al., 2014), producing large datasets to be analyzed and interpreted. Even if HumanMethylation450 datasets can be useful for large-scale DNA methylation profiling, they raise problems of efficient data processing.
Consequently, we decide to explore the adoption of big data technologies and infrastructures to enable the possibility of efficiently applying machine learning algorithms to such large datasets. We rely on the latest TCGA data release available at The Genomic Data Commons data sharing platform (https://gdc.nci.nih.gov/). In our experiments, we use the beta value as an estimate of the DNA methylation level. The beta value (Du et al., 2010) is defined as the ratio of the methylated allele intensity and the overall intensity (i.e., the sum of the methylated and unmethylated allele intensities):

β_n = max(Meth_n, 0) / (max(Meth_n, 0) + max(Unmeth_n, 0) + ε)   (1)

where Meth_n is the n-th methylated allele intensity, Unmeth_n is the n-th unmethylated allele intensity, and ε is a constant offset used to regularize the beta value when both intensities are low. It is worth noting that the beta value is a continuous variable in the range [0, 1], where 0 means no methylation and 1 means full methylation.

Table 1.
Datasets used in this study

Dataset   Number of Samples   Number of Features
BRCA      897                 485,512
THCA      571                 485,512
KIRP      321                 485,512

We focus on three DNA methylation datasets extracted from TCGA: BRCA, THCA, and KIRP (Table 1). For each dataset, we filter the input data matrix to cope with missing values and to exclude control cases (this is important to reduce the classification task to a binary classification, having only tumoral and normal cases). The final data matrix (Table 2) has the following structure: rows represent samples, i.e., the profiles of patient tissues; the first row is the header, containing the column names; the first column contains the IDs of the samples; the last column is the category, specifying whether the sample is "tumoral" or "normal"; all other columns represent CpG sites, and the corresponding cells contain the beta value for the CpG site. We use the Illumina 450k manifest to determine where a CpG site is located and which gene corresponds to it. The manifest is available on the Illumina website (https://support.illumina.com/array/array_kits/infinium_humanmethylation450_beadchip_kit/downloads.html). Missing values are encoded with a question mark.
Table 2.
Structure of the DNA methylation data matrix extracted from TCGA

Sample ID          cg13869341   …   cg00381604   Class
TCGA-A7-A0DC-11    0.971644     …   0.017485     Tumoral
TCGA-BH-A0BV-11A   0.925557     …   ?            Normal
TCGA-BH-A0DZ-11A   0.907020     …   0.019204     Tumoral
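To make Eq. (1) concrete, the following sketch computes the beta value of a single CpG site from its allele intensities. It is illustrative only (BIGBIOCL itself is written in Java on Apache Spark), and the default offset of 100 for ε is our assumption, mirroring a value commonly used for Illumina intensity data.

```python
# Illustrative sketch of Eq. (1): estimating the DNA methylation level
# (beta value) of a CpG site from its allele intensities.
# The default epsilon = 100 is an assumption (a commonly used offset),
# not a value prescribed in this article.

def beta_value(meth, unmeth, epsilon=100):
    """Return the beta value in [0, 1]: 0 = no methylation, 1 = full methylation."""
    m = max(meth, 0)   # negative intensities are clipped to 0
    u = max(unmeth, 0)
    return m / (m + u + epsilon)
```

For example, intensities Meth = 300 and Unmeth = 600 yield a beta value of 0.3, i.e., a mostly unmethylated site.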
Supervised Learning
The goal of our study is to develop an iterative algorithm that can efficiently extract a set of genes from large DNA methylation cancer datasets. The first step is the application of a supervised learning method (Tan et al., 2005; Weitschek et al., 2014). This is possible because the datasets used in this study (Table 1) are labeled, i.e., we know whether each tissue belongs to the 'normal' or 'tumoral' category. Using a labeled dataset (or a part of it) as a training set, the supervised learning algorithm infers a hypothesis from the features and builds a classification model, which is simply a function that assigns a category to a sample. We perform tests with both Decision Trees (Quinlan, 1993) and Random Forests (Breiman, 2001; Svetnik et al., 2003). Then, we extract the CpG sites (features) that appear in the classification model and the corresponding genes. The list of genes extracted from a classification model is part of the output of our algorithm. In fact, as we explain in the next section, our algorithm runs many iterations, and the overall result is the union of the results of each iteration. It is important to highlight that we are not interested in the decision model for classifying new data (even if this would be possible), but in extracting a list of candidate genes that may play a role in cancer. Decision Trees are used for recursive binary partitioning of the feature space. Starting from the root, which contains the entire training dataset, Decision Trees are built by splitting the dataset into distinct nodes, where a node defines the probability of a point being of a certain category. The final prediction is the label of the leaf node reached at the end of the decision process. Decision Trees are easy to understand and they allow validating the model with statistical measures (like entropy or information gain). Unfortunately, it is easy to create a tree that overfits the input data.
In addition, since Decision Trees use a greedy algorithm, the optimal tree is sometimes not found. Random Forests solve many problems of Decision Trees, especially when applied to very large datasets. Random Forests run many Decision Trees in parallel and they fit well with big data technologies and map-reduce algorithms, since the data can be split across different machines. There are two points of randomness that reduce the possibility of overfitting and overgeneralization. First of all, each tree is created from a random selection of N data points from the training set. Then, during the construction of a specific tree, there is a random selection of M features from the global set of features. For all these reasons, while both Decision Trees and Random Forests were explored, the final implementation of BIGBIOCL is based on Random Forests.
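The two points of randomness described above can be sketched as follows. This is hypothetical illustrative code (the function names are ours), not the MLlib implementation used by BIGBIOCL.

```python
import random

# Sketch of the two randomness points of a Random Forest:
# (1) each tree trains on a bootstrap sample of N data points;
# (2) each tree considers only a random subset of M features.

def bootstrap_sample(dataset, rng):
    """Draw N points from the training set uniformly, with replacement."""
    n = len(dataset)
    return [dataset[rng.randrange(n)] for _ in range(n)]

def random_feature_subset(features, m, rng):
    """Draw M distinct features from the global feature set."""
    return rng.sample(features, m)

rng = random.Random(42)
samples = [("sample-%d" % i, "tumoral" if i % 2 else "normal") for i in range(10)]
cpg_sites = ["cg%05d" % i for i in range(100)]

tree_data = bootstrap_sample(samples, rng)                 # same size, repeats allowed
tree_features = random_feature_subset(cpg_sites, 10, rng)  # 10 of 100 CpG sites
```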
BIGBIOCL: a multiple tree-based classifier for big biological data
CAMUR (Cestarelli et al., 2016) is a supervised method that can extract alternative and equivalent classification models from a labeled dataset (Weitschek, 2016). CAMUR adopts an iterative feature elimination technique: it uses the supervised RIPPER algorithm (Cohen, 1995) to compute a rule-based classification model, iteratively eliminates combinations of features that appear in the model from the input dataset, and performs the classification again until a stopping condition is verified. Once a feature is eliminated from the dataset, it can be reinserted in the next iteration (loose execution mode) or discarded forever (strict execution mode). CAMUR has been successfully applied to RNA-sequencing data (Cestarelli et al., 2016) extracted from TCGA, and evaluated on Gene Expression Omnibus (GEO) datasets. The datasets used in CAMUR tasks contained at most 30 thousand features and a thousand samples. When trying to apply CAMUR to DNA methylation datasets, which contain hundreds of thousands of features, the algorithm suffers from memory and execution time problems. In this work we propose BIGBIOCL, a Java command-line software tool that is inspired by CAMUR and enables the efficient management and classification of large datasets. BIGBIOCL adopts big data solutions and introduces many innovations with respect to CAMUR:
Big Data Research

BIGBIOCL is based on MLlib, Apache Spark's scalable machine learning library. The adoption of Apache Spark allows executing the algorithm on a Hadoop YARN cluster (Vavilapalli et al., 2013), with the possibility of parallelizing the machine learning task on several machines. Even if both Decision Trees and Random Forests have been tested, the final implementation of BIGBIOCL is based on Random Forests. One of the reasons is that Random Forests naturally fit parallel computation, since each node of a cluster can compute a different tree of the forest and send the result back to a master node. BIGBIOCL, following the CAMUR method, iteratively computes a Random Forest model. After each iteration, BIGBIOCL permanently removes from the input dataset all features that appear in the computed model, and not only combinations of them. This approach is similar to the CAMUR loose execution mode, but removing all extracted features makes the entire process lighter, since there is no longer a need to compute the power set at each iteration. Obviously, having hundreds of thousands of features guarantees that a relevant number of alternative classification models can still be extracted, as we show in the next section. The BIGBIOCL iterative procedure stops when the reliability of the classification model falls below a given threshold, or when a maximum number of iterations has been reached. Both stopping conditions must be specified by the user as command-line parameters. We use the F-measure to evaluate the accuracy of the classification models. The F-measure is defined as the weighted harmonic mean of precision (P) and recall (R). We decide to weight precision and recall equally, obtaining the formula:
F-measure = 2PR / (P + R)   (2)

It is worth noting that the F-measure is high when both precision and recall are high. Precision and recall are defined in terms of true positives TP (the number of samples that are assigned to a category and that belong to that category), false positives FP (the number of samples not belonging to a category but assigned to that category), and false negatives FN (the number of samples belonging to a category but not assigned to that category):

P = TP / (TP + FP);   R = TP / (TP + FN)   (3)

When the iterative algorithm stops, the software collects the list of features that appear in all computed classification models. Since the features are CpG sites located in different genomic regions, we use a mapping file to discover the gene where a CpG site is located (see section 2.1 for further details). The software can therefore derive a list of candidate genes as the final output of the computation, associating them with the tumor under study. The extracted genes can then be explored and evaluated by biologists to investigate their role in cancer. Obviously, BIGBIOCL can also be applied to different datasets. In fact, it is not limited to DNA methylation data, but works on any input dataset having the structure illustrated in Table 2.

Results
In this section, we discuss the path that led to the Random Forest implementation of BIGBIOCL, providing statistics about the experiments and a discussion of the results. All our experiments refer to the datasets listed in Table 1. First of all, we tried to use CAMUR in strict mode to extract candidate genes from the BRCA dataset. As previously noted, CAMUR works properly with TCGA RNA-sequencing data, where the number of features is around 30 thousand. The BRCA dataset - stored in a 6.5GB text file - includes more than 450 thousand features, and CAMUR cannot manage such an amount of data. The experiment was executed using the workstation described in Table 3, allocating 22GB of RAM and 7 cores to the Java Virtual Machine (JVM). After 16 minutes, CAMUR ran out of memory.
Table 3.
Workstation used for experiments

Parameter        Value
Architecture     x86
CPU              Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Number of CPUs   8
RAM              24GB
OS               CentOS Linux release 7.3.1611
Java Version     Oracle jdk1.8.0_131

Afterwards, we executed several experiments relying on Apache Spark MLlib:
(1) Single iteration of Decision Trees. We ran Decision Trees in Spark local mode in order to evaluate results and performance.
(2) Single iteration of Random Forests. We ran Random Forests in Spark local mode in order to compare results and performance with the Decision Tree experiments.
(3) Execution of Linear Support Vector Machines (SVMs) and Naïve Bayes. We ran SVMs and Naïve Bayes in Spark local mode to compare the accuracy of the Random Forest results with other classification methods.
(4) BIGBIOCL. This is the iterative Random Forest algorithm (with feature deletion) implemented with big data technologies. The algorithm was tested both in Spark local mode and on an Apache Hadoop YARN multi-node cluster.

Apache Spark local mode is a non-distributed single-JVM configuration that allows Spark to run all its execution components (i.e., driver, executor, scheduler, and master) in the same JVM. In local mode, the default parallelism is the number of threads specified as a command-line parameter. Table 4 and Table 5 show the configuration and results of experiments with a single iteration of Decision Trees on the same workstation used for testing CAMUR. In all our experiments we used 70% of randomly sampled input data to build the model (training set), and 30% of the data for the evaluation (test set). Results show that BIGBIOCL can manage large datasets with hundreds of thousands of features.
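The 70%/30% random split used in all experiments can be sketched as follows (illustrative Python, not the Spark implementation; the fixed seed is our assumption, added for reproducibility):

```python
import random

# Sketch of the 70%/30% random train/test split used in the experiments.
# The seed is an arbitrary choice for reproducibility.

def split_dataset(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples and split them into a training and a test set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# e.g., splitting the 897 BRCA sample IDs
train, test = split_dataset(range(897))
```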
Table 4.
Configuration of Decision Tree experiments - Spark local mode (dataset: BRCA)

ID   Memory   Threads   Max Depth   Max Bins   Impurity
1    5 GB     4         5           16         Gini
2    5 GB     4         5           32         Gini
3    12 GB    7         5           32         Gini
4    12 GB    7         10          32         Gini
5    12 GB    7         5           8          Gini
6    18 GB    7         5           128        Gini
Experiments with Decision Trees demonstrate that we were able to classify large datasets, even using only 5 GB of memory. The execution time decreases drastically if the parameter max bins is reduced. Even if the execution time seems acceptable (the algorithm terminates in at most one hour), some observations led us to test (and then adopt) Random Forests. First, we could extract only a few features from each execution of the algorithm; since we are interested in identifying a set of candidate genes for a specific type of cancer, having more features would be preferable. Second, Decision Trees offer few possibilities for parallelization. This is important especially in the context of multiple iterations, where parallelization can reduce the overall execution time. Random Forests, on the other hand, allow splitting the data across many machines, reducing the execution time of each iteration.
Table 5.
Results of Decision Tree experiments described in Table 4

ID   Build Time   Evaluation Time   F-Measure   Extracted Features
3    66.23 min    1.96 min          98.76%      4
4    67.96 min    1.92 min          99.20%      4
5    9.6 min      1.92 min          98.03%      3
6    OOM          -                 -           -
This table shows the execution time and results of a single iteration of Decision Trees. The configuration adopted for each experiment is provided in Table 4. "ID" is the unique identifier of an experiment. "Build Time" is the time needed to build the classification model, while "Evaluation Time" is the time for the evaluation of the model on the test data (30% of the input data). The accuracy of the model is given by the F-measure.
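The F-measure values reported in these tables combine precision and recall as in Eqs. (2) and (3); a minimal illustrative sketch of the computation:

```python
# Evaluation measures of Eqs. (2) and (3), computed from true positives
# (TP), false positives (FP), and false negatives (FN).

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, with equal weights."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```

For instance, TP = 90, FP = 10, and FN = 10 give P = R = 0.9 and an F-measure of 90%.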
Table 6 and Table 7 show the results of some experiments with Random Forests. Overall, a single execution of Random Forests performs considerably better than a single execution of Decision Trees. Even if experiment 7 produced a result in more than an hour and a half, experiment 8 shows that increasing the memory from 5 GB to 12 GB dramatically improves the execution time. To build the model, experiment 8 required 38.5% of the time of the equivalent experiment with Decision Trees (ID=3). In addition, Random Forests produce more features, which is important to identify more genes that may play a role in cancer.
Table 6.
Configuration of Random Forest experiments - Spark local mode (dataset: BRCA)

ID   Memory   Threads   Max Depth   Max Bins
Table 7.
Results of Random Forest experiments described in Table 6

ID   Build Time   Evaluation Time   F-Measure
Table 8.
Configuration of SVM experiments - Spark local mode (dataset: BRCA)

ID   Memory   Threads   Regularization Method   Regularization Parameter
Table 9.
At each iteration, an Apache MLlib Random Forest model is computed on the working dataset S. On a cluster, the trees of the Random Forest can be computed in parallel on different nodes. At iteration 0, S is equal to the input dataset. After each iteration, the set of features F that appear in the computed model is removed from S. Thus, the next iteration runs on the dataset S − F. Once eliminated, features are never reintegrated into the working dataset S. The algorithm terminates when the F-measure on the test data is below a threshold MF (a parameter provided by the user) or when the number of iterations exceeds a threshold MI (in the rest of this article we consider MI = 1000). Tables 11 and 12 show the results of experiments running Spark in local mode. The input dataset is BRCA. As we can see, setting the F-measure threshold to 99%, BIGBIOCL ran 2 iterations in around one hour, extracting 224 candidate genes. Relaxing that constraint, we obtained more iterations and more candidate genes. When the F-measure threshold was set to 97%, BIGBIOCL executed 96 iterations, computing 5072 genes in less than 2 days. Experiments on the Hadoop YARN cluster are summarized in Tables 13 and 14. They were useful to evaluate how performance improves with parallelization on multiple computational nodes. The results are promising. Experiment 16 (on Hadoop) corresponds to experiment 12 (Spark local mode), and its execution time was 22% of that of experiment 12. It is also interesting to note (experiments 16 and 17) that by increasing the number of working nodes of the cluster (and so also the total number of CPUs) we obtained more iterations and more genes. We wish to highlight that all the extracted genes related to each tumor are available in supplementary data S1. Additionally, a comprehensive description of the experimentation is provided in the wiki of BIGBIOCL on GitHub.
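The iteration scheme just described can be summarized in a few lines of code. This is a high-level hypothetical sketch, where train_and_evaluate stands in for the Spark MLlib Random Forest training and evaluation step (all names are ours):

```python
# High-level sketch of the BIGBIOCL iteration scheme: train on the working
# feature set S, permanently remove the used features F, and stop when the
# F-measure drops below MF or after MI iterations.

def iterative_feature_elimination(all_features, train_and_evaluate, mf=0.97, mi=1000):
    """train_and_evaluate(features) -> (f_measure, used_features)."""
    working = set(all_features)          # S: features still available
    models = []
    for _ in range(mi):                  # at most MI iterations
        f, used = train_and_evaluate(working)
        if f < mf:                       # reliability below threshold MF
            break
        models.append(used)              # keep this alternative model
        working -= set(used)             # S <- S - F, never reinserted
    return models

# Demo with a stand-in trainer: it always "uses" the first 3 features and
# reports a high F-measure while more than 10 features remain.
def _demo_trainer(features):
    used = sorted(features)[:3]
    return (0.99 if len(features) > 10 else 0.5), used

models = iterative_feature_elimination(["cg%02d" % i for i in range(20)], _demo_trainer)
```

Each element of models is the feature set of one alternative classification model; the union of these sets gives the extracted CpG sites, which the mapping file then translates into candidate genes.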
Furthermore, if we compare experiment 18 (on Hadoop) with experiment 13 (Spark local mode), we can again see that execution on a cluster outperforms Spark local mode, both in terms of execution time and in terms of the number of features extracted. On average, running BIGBIOCL in Spark local mode requires around 1500 seconds to generate the classification model at each iteration, while using 3 of PICO's nodes on the Hadoop YARN cluster the average time to build a classification model is 330 seconds.
Table 10.
PICO's hardware, used for experiments with the Hadoop cluster

Parameter        Value
Total Nodes      66
CPU              Intel Xeon E5 2670 v2 @ 2.5GHz
Cores per node   20
RAM per node     128 GB

Tables 15 and 16 show the results of some experiments with the THCA and KIRP datasets. The experiments refer to the execution of BIGBIOCL in Spark local mode, using the workstation described in Table 3. On average, the time to build a classification model for the KIRP dataset is 340 seconds, while for THCA this number increases to 945 seconds. This result is expected, since THCA contains 571 samples, while KIRP contains only 321. What is interesting to note is that on THCA the algorithm stops after 7 iterations, while on KIRP it stops after 34 iterations, even if the KIRP dataset contains fewer samples. This depends on the different distribution of beta values in the two datasets. To estimate the execution time of a sequential implementation of our algorithm, we can consider experiment number 17 (Tables 13 and 14), whose execution time was 13 h 30 min. If we ran the same number of iterations (i.e., 116) with the sequential Random Forest implementation of the Weka software package, the execution time would be at least 30 h (not taking into account potential overhead).
Table 11.
Configuration of BIGBIOCL experiments - Spark local mode (dataset: BRCA)

ID   Memory   Threads   Max Depth   Max Bins
Table 12.
Results of BIGBIOCL experiments described in Table 11

ID   Overall Time

"Overall Time" is the time to execute all iterations. For each iteration, the execution time includes the time to build the model, the time to evaluate the model on the test data, and the time to evaluate the model on the training data.
Table 13.
Configuration of BIGBIOCL experiments - Hadoop YARN Cluster (dataset: BRCA)

ID
Table 14.
Results of BIGBIOCL experiments described in Table 13

ID   Overall Time
Table 15.
Configuration of BIGBIOCL experiments - Spark local mode (datasets: THCA and KIRP)

ID   Memory   Threads   Max Depth   Max Bins
Table 16.
Results of BIGBIOCL experiments described in Table 15

ID   Overall Time
Figure 1. Inferred correlation network for the BRCA tumor
Figure 2. Inferred correlation network for the KIRP tumor
Figure 3. Inferred correlation network for the THCA tumor
Table 17.
Computational time and inferred nodes and edges with the DNA methylation network correlation analysis (Bartlett et al., 2014)

Discussion
Our experiments demonstrate that BIGBIOCL can compute multiple classification models for datasets with hundreds of thousands of features in a few hours. In addition, thanks to the possibility of executing the software on a Hadoop cluster, the execution time can be reduced by as much as 75% compared to Spark local mode. Obviously, the high level of parallelism reached by the software allows adding computational nodes to the cluster when the size of the input dataset explodes. The first parameter that can be tuned to improve parallelism and performance is the number of trees of the Random Forest. This number should be increased only when there is an increase in the size of the input dataset. Increasing the number of trees causes an increase in the training time, which can be contained by adding more computational nodes to the cluster (in fact, trees can be computed in parallel on different nodes). We compared BIGBIOCL with standard DNA methylation network analysis and with other supervised machine learning methods (i.e., SVM and Naïve Bayes), obtaining new knowledge in terms of extracted CpG sites and related genes. In fact, BIGBIOCL represents a novel approach to DNA methylation data classification. BIGBIOCL performs classification using the entire set of features in the input dataset, even when the features number in the hundreds of thousands. This is made possible by the adoption of big data technologies for the computation of the classification model. Other tools work with a smaller set of features (Cestarelli et al., 2016), or reduce the number of features by applying regression algorithms (Danielsson et al., 2015). The datasets used in our experiments were extracted from TCGA and obtained using the HumanMethylation450 platform. This platform provides beta values for more than 485,000 CpG loci. Even if there are more than 28 million CpG loci in the human genome, data from HumanMethylation450 cover 99% of RefSeq genes, so it is a good starting point to identify drivers for cancer.
In our work, we have provided a methodology and a software tool to analyze HumanMethylation450 data and even bigger datasets. The genes extracted from the execution of BIGBIOCL (available in supplementary data S1) can then be used by biologists to determine their relevance in a given type of cancer. If we consider that there are around 25 thousand genes in human DNA, limiting their number allows focusing the attention of the researcher. Analyzing the results of the experiments on BRCA data, we can find some genes that are well known in the literature for their role in breast cancer. For example, mutations of the tumor suppressor gene TP53 and of PIK3CA have often been associated with BRCA (Kim et al., 2017). In addition, both inherited and de novo mutations of BRCA1 and BRCA2 - which mainly cause inactivity of these genes - have been associated with patients with breast cancer (King et al., 2013; Antonucci et al., 2017). A recent study (Tsai et al., 2017) argues that up-regulation of the BDNF signaling pathway can be associated with triple-negative breast cancer cells (i.e., cells that test negative for HER2, estrogen receptors, and progesterone receptors). We obtained BDNF as a result of several experiments (IDs 12, 15, 16, and 17). Furthermore, other genes that are considered high-confidence oncogenic candidates (Zheng et al., 2016) have been extracted with BIGBIOCL, such as ALDH3A1, CLDN15, SFN, and ENDOD1.

Conclusion
In conclusion, BIGBIOCL can efficiently manage large datasets, iteratively building equivalent classification models, extracting features (genes in our experiments, where the features are CpG loci, but the algorithm can potentially be used with other data), and scaling up with the size of the input dataset. The results then need to be further validated. The algorithm can be improved: it currently builds the classification model on 70% of the input data, using the remaining 30% as test data (the F-measure on the test data is used as the stopping condition for the iterations). This choice was important during the development and testing of BIGBIOCL. In order to get more precise results and to avoid losing information, the algorithm could build classification models on 100% of the input data. In addition, as already noted, BIGBIOCL can be applied to other types of data, including other NGS experiments and even bigger datasets. Lastly, BIGBIOCL can be used as a component of a pipeline to make sense of raw data, reducing the entropy and focusing the attention on a smaller set of dimensions.
Acknowledgements
We wish to thank the Cineca consortium for assigning us supercomputer resources [grant number HP10CTJZAM] that made the big data computations possible and for providing the PICO infrastructure to carry out the experiments with the Hadoop YARN cluster. This work was partially supported by the ERC Advanced Grant GeCo (data-driven Genomic Computing) [grant number 693174], the MoDiag Regione Lazio Project [grant number A0112-2016-13363], and the SysBioNet Italian Roadmap Research Infrastructures. Additionally, we wish to thank Paola Bertolazzi for supporting this work. Finally, we wish to thank Giulia Fiscon and Federica Conte for the patience demonstrated during the local experimentation phase.
Conflict of Interest: none declared.
References
Akalin, A. et al. (2012) methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biology, 13, R87.
Antonucci, I. et al. (2017) A new case of "de novo" BRCA1 mutation in a patient with early-onset breast cancer. Clin Case Rep, 5(3), 238-240.
Bartlett, T.E. et al. (2014) A DNA methylation network interaction measure, and detection of network oncomarkers. PLoS One.
Baylin, S.B. et al. (2001) Aberrant patterns of DNA methylation, chromatin formation and gene expression in cancer. Human Molecular Genetics.
Baylin, S.B. (2005) DNA methylation and gene silencing in cancer. Nat Clin Pract Oncol, 2(S1), S4.
Bird, A. (2002) DNA methylation patterns and epigenetic memory. Genes Dev, 16(1), 6-21.
Breiman, L. (2001) Random Forests. Machine Learning.
Cestarelli, V. et al. (2016) CAMUR: Knowledge extraction from RNA-seq cancer data through equivalent classification rules. Bioinformatics, 32(5), 697-704.
Cohen, W.W. (1995) Fast effective rule induction. Proceedings of the Twelfth International Conference on Machine Learning, 115-123.
Danielsson, A. et al. (2015) MethPed: a DNA methylation classifier tool for the identification of pediatric brain tumor subtypes. Clinical Epigenetics, 7, 62.
De Carvalho, D.D. et al. (2012) DNA methylation screening identifies driver epigenetic events of cancer cell survival. Cancer Cell, 21, 655-667.
Dedeurwaerder, S. et al. (2014) A comprehensive overview of Infinium HumanMethylation450 data processing. Brief Bioinform, 15(6), 929-941.
Du, P. et al. (2010) Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics, 11, 587.
Feinberg, A.P. et al. (2006) The epigenetic progenitor origin of human cancer. Nat Rev Genet, 7(1), 21-33.
Figueroa, A.P. et al. (2008) An Integrative Genomic and Epigenomic Approach for the Study of Transcriptional Regulation. PLoS One, 3(3), e1882.
Greene, C.S. et al. (2014) Big Data Bioinformatics. J. Cell. Physiol., 229, 1896-1900.
Hall, M. et al. (2009) The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1).
Hayden, E.C. (2014) The $1,000 genome. Nature, 507(7492), 294.
Kim, J.Y. et al. (2017) Clinical implications of genomic profiles in metastatic breast cancer with a focus on TP53 and PIK3CA, the most frequently mutated genes. Oncotarget, 8(17), 27997-28007.
King, M.C. et al. (2003) Breast and ovarian cancer risks due to inherited mutations in BRCA1 and BRCA2. Science, 302(5645), 643-646.
Kulis, M. et al. (2010) DNA methylation and cancer. Advances in Genetics, 70, 27-56.
McKinsey Global Institute (2012) Big Data: The next frontier for innovation, competition, and productivity. Report.
Meng, X. et al. (2016) MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17(34), 1-7.
Pidsley, R. et al. (2013) A data-driven approach to preprocessing Illumina 450K methylation array data. BMC Genomics, 14(1), 293.
Quinlan, R. (1993) C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Sandoval, J. et al. (2011) Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome. Epigenetics, 6(6), 692-702.
Shendure, J. et al. (2008) Next-generation DNA sequencing. Nature Biotechnology.
Stevens, M. et al. (2013) Estimating absolute methylation levels at single-CpG resolution from methylation enrichment and restriction enzyme sequencing methods. Genome Res, 23(9), 1541-1553.
Svetnik, V. et al. (2003) Random Forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comp Sci, 43(6), 1947-1958.
Tan, P. et al. (2005) Introduction to Data Mining. Addison Wesley.
Tomczak, K. et al. (2015) The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemporary Oncology.
Tsai et al. (2017) Brain-derived neurotrophic factor (BDNF)-TrkB signaling modulates cancer-endothelial cells interaction and affects the outcomes of triple negative breast cancer. PLoS One.
Vavilapalli, V.K. et al. (2013) Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, ACM, 1-5.
Weinstein, J.N. et al. (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet., 45, 1113-1120.
Weitschek, E. et al. (2014) Next generation sequencing reads comparison with an alignment-free distance. BMC Research Notes.
Weitschek, E. et al. (2014) Supervised DNA barcodes species classification: analysis, comparisons and results. BioData Min., 7, 4.
Weitschek, E. et al. (2016) Genomic Data Integration: A Case Study on Next Generation Sequencing of Cancer.
Zhao, Q. et al. (2015) Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA. Brief Bioinform.
Zheng, Y. et al. (2016) Genome-wide DNA methylation analysis identifies candidate epigenetic markers and drivers of hepatocellular carcinoma. Brief Bioinform, bbw094.
Zhuang, J. et al. (2012) A comparison of feature selection and classification methods in DNA methylation studies using the Illumina Infinium platform. BMC Bioinformatics, 13, 59.