Publication


Featured research published by Frank Lin.


PLOS ONE | 2011

Computational bacterial genome-wide analysis of phylogenetic profiles reveals potential virulence genes of Streptococcus agalactiae.

Frank Lin; Ruiting Lan; Vitali Sintchenko; Gwendolyn L. Gilbert; Fanrong Kong; Enrico Coiera

The phylogenetic profile of a gene is a reflection of its evolutionary history and can be defined as the differential presence or absence of a gene in a set of reference genomes. It has been employed to facilitate the prediction of gene functions. However, the hypothesis that the application of this concept can also facilitate the discovery of bacterial virulence factors has not been fully examined. In this paper, we test this hypothesis and report a computational pipeline designed to identify previously unknown bacterial virulence genes using group B streptococcus (GBS) as an example. Phylogenetic profiles of all GBS genes across 467 bacterial reference genomes were determined by candidate-against-all BLAST searches, which were then used to identify candidate virulence genes by machine learning models. Evaluation experiments with known GBS virulence genes suggested good functional and model consistency in cross-validation analyses (areas under ROC curve, 0.80 and 0.98 respectively). Inspection of the top-10 genes in each of the 15 virulence functional groups revealed at least 15 (of 119) homologous genes implicated in virulence in other human pathogens but previously unrecognized as potential virulence genes in GBS. Among these highly-ranked genes, many encode hypothetical proteins with possible roles in GBS virulence. Thus, our approach has identified a set of genes that may affect the virulence of GBS and are candidates for further in vitro and in vivo investigation. This computational pipeline can also be extended to in silico analysis of virulence determinants of other bacterial pathogens.
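
The definition of a phylogenetic profile given in this abstract (presence or absence of a gene across a set of reference genomes) can be illustrated with a minimal sketch. The function name `build_profiles`, the genome labels, and the toy hit sets are all hypothetical; in the paper the hits come from BLAST searches against 467 reference genomes, which this sketch does not perform.

```python
from typing import Dict, List, Set


def build_profiles(hits: Dict[str, Set[str]], genomes: List[str]) -> Dict[str, List[int]]:
    """Return one 0/1 presence vector per gene, ordered like `genomes`."""
    return {
        gene: [1 if genome in found else 0 for genome in genomes]
        for gene, found in hits.items()
    }


# Hypothetical toy input: the reference genomes in which each gene had a hit.
genomes = ["genome_A", "genome_B", "genome_C", "genome_D"]
hits = {
    "gene1": {"genome_A", "genome_C"},
    "gene2": {"genome_A", "genome_B", "genome_C", "genome_D"},
    "gene3": set(),
}

profiles = build_profiles(hits, genomes)
for gene, vector in profiles.items():
    print(gene, vector)
```

Each vector could then serve as the feature row for a gene in a downstream classifier; the choice of classifier is not specified here and is left out of the sketch.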


PLOS ONE | 2010

A PubMed-wide associational study of infectious diseases.

Vitali Sintchenko; Stephen Anthony; Xuan Hieu Phan; Frank Lin; Enrico Coiera

Background: Computational discovery is playing an ever-greater role in supporting the processes of knowledge synthesis. A significant proportion of the more than 18 million manuscripts indexed in the PubMed database describe infectious disease syndromes and various infectious agents. This study is the first attempt to integrate online repositories of text-based publications and microbial genome databases in order to explore the dynamics of relationships between pathogens and infectious diseases. Methodology/Principal Findings: Herein we demonstrate how the knowledge space of infectious diseases can be computationally represented, quantified, and tracked over time. The knowledge space is explored by mapping the infectious disease literature, looking at the dynamics of literature deposition, zooming in from pathogen to genome level and searching for new associations. Syndromic signatures for different pathogens can be created to enable a new and clinically focussed reclassification of the microbial world. Examples of syndrome and pathogen networks illustrate how multilevel network representations of the relationships between infectious syndromes, pathogens and pathogen genomes can illuminate unexpected biological similarities in disease pathogenesis and epidemiology. Conclusions/Significance: This new approach based on text and data mining can support the discovery of previously hidden associations between diseases and microbial pathogens and the clinically relevant reclassification of pathogenic microorganisms, and can accelerate the translational research enterprise.


Pathology | 2009

Commonly used molecular epidemiology markers of Streptococcus agalactiae do not appear to predict virulence

Frank Lin; Vitali Sintchenko; Fanrong Kong; Gwendolyn L. Gilbert; Enrico Coiera

Aims: Several virulent clones of group B streptococcus (GBS) are known to be associated with certain serotypes and molecular epidemiological markers. It is unclear, however, whether the clinical significance of GBS can be predicted based solely on such molecular markers. The aim of this study was to test the hypothesis that GBS virulence can be predicted by using the molecular epidemiology markers. Methods: We examined 912 human GBS isolates in which 18 distinct molecular markers (including virulence-associated mobile genetic elements, polysaccharide capsule determinants, variants of a surface antigen and invasin, and antibiotic resistance-related genes) were characterised using a multiplex PCR-based reverse line blot assay. All strains were classified into clinically relevant invasive and colonising categories. Relationships between molecular markers and clinical phenotypes were tested using statistical and machine learning analyses. Classifier performance was evaluated by the area under the receiver operating characteristic curve (AUC). Results: The distribution of serotypes was comparable with those in previous reports (Ia, 22.1%; III, 34.7%; V, 17.7%). From single marker analyses, only alp3 (which encodes a surface protein antigen, commonly associated with serotype V) showed an increased association with invasive diseases (OR = 2.93, p = 0.0003). Molecular serotype (MS) II (OR = 10.0, p = 0.0007) had a significant association with early-onset neonatal disease when compared with late-onset diseases. Predictive analysis with logistic regression and machine learning classifiers, however, only yielded weak predictive power (AUC 0.56–0.71, stratified 10-fold cross-validation) across all the subgroups. Conclusion: While some molecular epidemiological markers are important in defining GBS clusters, a definitive predictive relationship between the molecular markers and clinical outcomes may be lacking.
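
Several abstracts on this page report classifier performance as AUC. As a reminder of what that number means, here is a minimal sketch that computes it from its rank interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The function name `auc` and the example scores are illustrative assumptions, not data from any of the studies.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve from pairwise positive/negative comparisons."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: one of the four positive/negative pairs is mis-ordered.
example = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(example)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the 0.56 to 0.71 range reported above sits close to chance.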


Scientific Reports | 2017

TEPAPA: a novel in silico feature learning pipeline for mining prognostic and associative factors from text-based electronic medical records

Frank Lin; Adrian Pokorny; Christina Teng; Richard J. Epstein

Vast amounts of clinically relevant text-based variables lie undiscovered and unexploited in electronic medical records (EMR). To exploit this untapped resource, and thus facilitate the discovery of informative covariates from unstructured clinical narratives, we have built a novel computational pipeline termed Text-based Exploratory Pattern Analyser for Prognosticator and Associator discovery (TEPAPA). This pipeline combines semantic-free natural language processing (NLP), regular expression induction, and statistical association testing to identify conserved text patterns associated with outcome variables of clinical interest. When we applied TEPAPA to a cohort of head and neck squamous cell carcinoma patients, plausible concepts known to be correlated with human papilloma virus (HPV) status were identified from the EMR text, including site of primary disease, tumour stage, pathologic characteristics, and treatment modalities. Similarly, correlates of other variables (including gender, nodal status, recurrent disease, smoking and alcohol status) were also reliably recovered. Using highly-associated patterns as covariates, a patient’s HPV status was classifiable using a bootstrap analysis with a mean area under the ROC curve of 0.861, suggesting its predictive utility in supporting EMR-based phenotyping tasks. These data support using this integrative approach to efficiently identify disease-associated factors from unstructured EMR narratives, and thus to efficiently generate testable hypotheses.
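
The idea of testing a text pattern for association with an outcome variable can be sketched as follows. This is not the TEPAPA implementation: the pattern is written by hand rather than induced, the clinical notes and outcome flags are invented toy data, and the association measure (an odds ratio with a 0.5 continuity correction) is an assumption chosen for simplicity.

```python
import re
from typing import List, Tuple


def contingency(pattern: str, notes: List[str], outcomes: List[bool]) -> Tuple[int, int, int, int]:
    """Count (match & outcome, match & no outcome, no match & outcome, neither)."""
    rx = re.compile(pattern, re.IGNORECASE)
    a = b = c = d = 0
    for text, outcome in zip(notes, outcomes):
        matched = rx.search(text) is not None
        if matched and outcome:
            a += 1
        elif matched:
            b += 1
        elif outcome:
            c += 1
        else:
            d += 1
    return a, b, c, d


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio with 0.5 added to every cell so zero cells do not divide by zero."""
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


# Invented toy notes and outcome flags (True = outcome of interest present).
notes = [
    "primary site: tonsil, T2 N2b",
    "SCC of the oral tongue",
    "base of tongue tumour, p16 positive",
    "laryngeal carcinoma, heavy smoker",
]
outcomes = [True, False, True, False]

table = contingency(r"tonsil|base of tongue", notes, outcomes)
print(table, odds_ratio(*table))
```

In a real analysis the table would feed a formal significance test and a multiple-testing correction, since many candidate patterns are evaluated at once.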


Journal of Clinical Oncology | 2017

The Cancer Care Treatment Outcomes Ontology (CCTO): A computable ontology for profiling treatment outcomes of patients with solid tumors.

Frank Lin; Tudor Groza; Simon Kocbek; Erick Antezana; Richard J. Epstein

PURPOSE There is as yet no computer-processable resource to describe treatment end points in cancer, hindering our ability to systematically capture and share outcomes data to inform better patient care. To address these unmet needs, we have built an ontology, the Cancer Care Treatment Outcome Ontology (CCTOO), to organize high-level concepts of treatment end points with structured knowledge representation to facilitate standardized sharing of real-world data. METHODS End points were extracted from oncology trials in ClinicalTrials.gov that were retrieved using the keyword "cancer", and were then appraised by experts. Synonyms and relevant terms were imported from the National Cancer Institute Thesaurus and Common Terminology Criteria for Adverse Events. Logical relationships among concepts were manually represented by production rules. The applicability of 1,847 rules was tested in an index case. RESULTS After removing duplicated terms from 54,705 trial entries, an ontology holding 1,133 terms was built. CCTOO organized concepts into four domains (cancer treatment, health services, physical, and psychosocial health-related concepts), 13 subgroups (including efficacy, safety, and quality of life), and two (taxonomic and evaluative) concept hierarchies. This ontology has comprehensive term coverage of the cancer trial literature: at least one term was mentioned in 98% of MEDLINE abstracts of phase I to III trials, whereas concepts about efficacy were mentioned in 7,208 (79%) phase I, 15,051 (92%) phase II, and 3,884 (86%) phase III trials. The event sequence of the index case was readily convertible to a comprehensive profile incorporating response, treatment toxicity, and survival by applying the set of production rules curated in the CCTOO. CONCLUSION CCTOO categorizes high-level treatment end points used in oncology and provides a mechanism for profiling individual patient data by outcomes to facilitate translational analysis.


BMC Bioinformatics | 2009

In silico prioritisation of candidate genes for prokaryotic gene function discovery: an application of phylogenetic profiles

Frank Lin; Enrico Coiera; Ruiting Lan; Vitali Sintchenko

Background: In silico candidate gene prioritisation (CGP) aids the discovery of gene functions by ranking genes according to an objective relevance score. While several CGP methods have been described for identifying human disease genes, corresponding methods for prokaryotic gene function discovery are lacking. Here we present two prokaryotic CGP methods, based on phylogenetic profiles, to assist with this task. Results: Using gene occurrence patterns in sample genomes, we developed two CGP methods (statistical and inductive CGP) to assist with the discovery of bacterial gene functions. Statistical CGP exploits the differences in gene frequency against phenotypic groups, while inductive CGP applies supervised machine learning to identify gene occurrence patterns across genomes. Three rediscovery experiments were designed to evaluate the CGP frameworks. The first experiment attempted to rediscover peptidoglycan genes with 417 published genome sequences. Both CGP methods achieved best areas under receiver operating characteristic curve (AUC) of 0.911 in Escherichia coli K-12 (EC-K12) and 0.978 in Streptococcus agalactiae 2603 (SA-2603) genomes, with an average improvement in precision of >3.2-fold and a maximum of >27-fold using statistical CGP. A median AUC of >0.95 could still be achieved with as few as 10 genome examples in each group of genome examples in the rediscovery of the peptidoglycan metabolism genes. In the second experiment, a maximum of 109-fold improvement in precision was achieved in the rediscovery of anaerobic fermentation genes in EC-K12. The last experiment attempted to rediscover genes from 31 metabolic pathways in SA-2603, where 14 pathways achieved AUC >0.9 and 28 pathways achieved AUC >0.8 with the best inductive CGP algorithms. Conclusion: Our results demonstrate that the two CGP methods can assist with the study of functionally uncategorised genomic regions and discovery of bacterial gene-function relationships. Our rediscovery experiments also provide a set of standard tasks against which future methods may be compared.


BMC Cancer | 2016

Computational prediction of multidisciplinary team decision-making for adjuvant breast cancer drug therapies: a machine learning approach

Frank Lin; Adrian Pokorny; Christina Teng; Rachel F Dear; Richard J. Epstein

Background: Multidisciplinary team (MDT) meetings are used to optimise expert decision-making about treatment options, but such expertise is not digitally transferable between centres. To help standardise medical decision-making, we developed a machine learning model designed to predict MDT decisions about adjuvant breast cancer treatments. Methods: We analysed MDT decisions regarding adjuvant systemic therapy for 1065 breast cancer cases over eight years. Machine learning classifiers with and without bootstrap aggregation were correlated with MDT decisions (recommended, not recommended, or discussable) regarding adjuvant cytotoxic, endocrine and biologic/targeted therapies, then tested for predictability using stratified ten-fold cross-validations. The predictions so derived were then compared with those based on published (ESMO and NCCN) cancer guidelines. Results: Machine learning more accurately predicted adjuvant chemotherapy MDT decisions than did simple application of guidelines. No differences were found between MDT- vs. ESMO/NCCN-based decisions to prescribe either adjuvant endocrine (97%, p = 0.44/0.74) or biologic/targeted therapies (98%, p = 0.82/0.59). In contrast, significant discrepancies were evident between MDT- and guideline-based decisions to prescribe chemotherapy (87%, p < 0.01, representing 43% and 53% variations from ESMO/NCCN guidelines, respectively). Using ten-fold cross-validation, the best classifiers achieved areas under the receiver operating characteristic curve (AUC) of 0.940 for chemotherapy (95% C.I., 0.922–0.958), 0.899 for endocrine therapy (95% C.I., 0.880–0.918), and 0.977 for trastuzumab therapy (95% C.I., 0.955–0.999), respectively. Overall, bootstrap aggregated classifiers performed better among all evaluated machine learning models. Conclusions: A machine learning approach based on clinicopathologic characteristics can predict MDT decisions about adjuvant breast cancer drug therapies. The discrepancy between MDT- and guideline-based decisions regarding adjuvant chemotherapy implies that certain non-clinicopathologic criteria, such as patient preference and resource availability, are factored into clinical decision-making by local experts but not captured by guidelines.
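
Stratified ten-fold cross-validation, used in this study and in several others listed here, splits the cases into folds that each preserve the overall class proportions. The sketch below shows one simple way to assign fold indices; the function name `stratified_folds` and the toy labels are hypothetical, and the assignment is deterministic (no shuffling), which a real evaluation would normally add.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def stratified_folds(labels: Sequence[Hashable], k: int = 10) -> List[List[int]]:
    """Deal the indices of each class round-robin into k folds."""
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical toy labels: 20 "recommended" and 10 "not recommended" decisions.
labels = ["recommended"] * 20 + ["not recommended"] * 10
folds = stratified_folds(labels, k=10)
for number, fold in enumerate(folds):
    print(number, [labels[i] for i in fold])
```

Each fold is held out once for testing while the remaining nine are used for training, and the per-fold results are then pooled or averaged. When class sizes are not multiples of `k`, this simple dealing scheme gives the earlier folds the extra cases.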


Journal of Biomedical Informatics | 2014

Gene-disease association with literature based enrichment

Guy Tsafnat; Dennis Jasch; Agam Misra; Miew Keen Choong; Frank Lin; Enrico Coiera

MOTIVATION Gene set enrichment analysis (GSEA) annotates gene microarray data with functional information from the biomedical literature to improve gene-disease association prediction. We hypothesize that supplementing GSEA with comprehensive gene function catalogs built automatically using information extracted from the scientific literature will significantly enhance GSEA prediction quality. METHODS Gold standard gene sets for breast cancer (BrCa) and colorectal cancer (CRC) were derived from the literature. Two gene function catalogs (CMeSH and CUMLS) were automatically generated in two steps: (1) Entrez Gene was used to associate all recorded human genes with PubMed article IDs; (2) the genes mentioned in each PubMed article were associated with the article's MeSH terms (in CMeSH) and extracted UMLS concepts (in CUMLS). Microarray data from the Gene Expression Omnibus for BrCa and CRC were then annotated using CMeSH and CUMLS and, for comparison, also with several pre-existing catalogs (C2, C4 and C5 from the Molecular Signatures Database). Ranking was done using a standard GSEA implementation (GSEA-p). Gene function predictions for enriched array data were evaluated against the gold standard by measuring area under the receiver operating characteristic curve (AUC). RESULTS Comparison of rankings obtained with the literature enrichment catalogs, the pre-existing catalogs and five randomly generated catalogs shows that the literature-derived enrichment catalogs are more effective. The AUC for BrCa using the unenriched gene expression dataset was 0.43, increasing to 0.89 after gene set enrichment with CUMLS. The AUC for CRC using the unenriched gene expression dataset was 0.54, increasing to 0.9 after enrichment with CMeSH. C2 increased AUC (BrCa 0.76, CRC 0.71) but C4 and C5 performed poorly (between 0.35 and 0.5). The randomly generated catalogs also performed poorly, equivalent to random guessing. DISCUSSION Gene set enrichment significantly improved prediction of gene-disease association. Selection of enrichment catalog had a substantial effect on prediction accuracy. The literature-based catalogs performed better than the MSigDB catalogs, possibly because they are more recent. Catalogs generated automatically from the literature can be kept up to date. CONCLUSION Prediction of gene-disease association is a fundamental task in biomedical research. GSEA provides a promising method when using literature-based enrichment catalogs. AVAILABILITY The literature-based catalogs generated and used in this study are available from http://www2.chi.unsw.edu.au/literature-enrichment.


BMC Bioinformatics | 2011

BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs.

Frank Lin; Stephen Anthony; Thomas M. Polasek; Guy Tsafnat; Matthew P. Doogue

Background: The identification of drug characteristics is a clinically important task, but it requires much expert knowledge and consumes substantial resources. We have developed a statistical text-mining approach (BInary Characteristics Extractor and biomedical Properties Predictor: BICEPP) to help experts screen drugs that may have important clinical characteristics of interest. Results: BICEPP first retrieves MEDLINE abstracts containing drug names, then selects tokens that best predict the list of drugs which represents the characteristic of interest. Machine learning is then used to classify drugs using a document frequency-based measure. Evaluation experiments were performed to validate BICEPP's performance on 484 characteristics of 857 drugs, identified from the Australian Medicines Handbook (AMH) and the PharmacoKinetic Interaction Screening (PKIS) database. Stratified cross-validations revealed that BICEPP was able to classify drugs into all 20 major therapeutic classes (100%) and 157 (of 197) minor drug classes (80%) with areas under the receiver operating characteristic curve (AUC) > 0.80. Similarly, AUC > 0.80 could be obtained in the classification of 173 (of 238) adverse events (73%), up to 12 (of 15) groups of clinically significant cytochrome P450 enzyme (CYP) inducers or inhibitors (80%), and up to 11 (of 14) groups of narrow therapeutic index drugs (79%). Interestingly, it was observed that the keywords used to describe a drug characteristic were not necessarily the most predictive ones for the classification task. Conclusions: BICEPP has sufficient classification power to automatically distinguish a wide range of clinical properties of drugs. This may be used in pharmacovigilance applications to assist with rapid screening of large drug databases to identify important characteristics for further evaluation.


British Journal of Clinical Pharmacology | 2011

Perpetrators of pharmacokinetic drug–drug interactions arising from altered cytochrome P450 activity: a criteria-based assessment

Thomas M. Polasek; Frank Lin; John O. Miners; Matthew P. Doogue

Collaboration


Top Co-Authors

Richard J. Epstein

St. Vincent's Health System

Stephen Anthony

University of New South Wales

Ruiting Lan

University of New South Wales