Poster: Mining PubMed for Biomarker-Disease Associations to Guide Discovery

Below is the poster I presented at the 2012 Molecular Med Tri-Con (MMTC) in San Francisco last week. A copy has also been deposited at F1000 Posters, Nature Precedings and FigShare; you can also download a PDF of the poster.

The project began with two simple questions: (1) Which therapeutic areas are seeing the most research for the discovery, refinement or application of biomarkers? (2) Which diseases are most frequently associated with the term “biomarker”? I hope to expand upon this work in a collaboration later this year.


Biomedical knowledge is growing exponentially; however, meta-knowledge around the data is often lacking. PubMed is a database comprising more than 21 million citations for biomedical literature from MEDLINE and additional life science journals dating back to the 1950s. To explore the use and frequency of biomarkers across human disease, we mined PubMed for biomarker-disease associations. We then ranked the top 100 linked diseases by relevance and mapped them to medical subject headings (MeSH) and, subsequently, to the Disease Ontology. To identify biomarkers for each disease, we queried Covance BioPathways, an online data resource that maps commercial biomarker assays to biological and disease pathways. We then integrated pathways-based information to describe both known and potential biomarkers as well as disease-associated genes/proteins for select diseases. This approach identifies therapeutic areas with candidate or validated biomarkers, and highlights those areas where a paucity of biomarkers exists.

Materials and Methods

Text mining was performed using PolySearch, a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites [1]. The MeSH Browser (2012 MeSH) was used to map disease associations to MeSH IDs. Once MeSH IDs were assigned, the Disease Ontology [2] was used to map each disease to a Disease Ontology ID (DOID). Interaction networks were constructed in GeneGo MetaCore [3] using the Auto expand algorithm, which gradually expands sub-networks around every object from the seed object list based on interactions identified in the literature. At every step, preference is given to objects with more connectivity to the initial object, and expansion halts when the sub-networks intersect or when the overall network size reaches a predefined limit. Genes/proteins for which validated commercial assays exist were identified using Covance BioPathways and are indicated with a red dot. These genes/proteins can be considered potential biomarkers.
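The expansion logic described above can be illustrated with a toy sketch. This is not MetaCore's implementation, which is proprietary; it is one possible reading of the description, assuming an undirected graph stored as an adjacency dictionary, one node added per seed per round, and "connectivity" measured as the number of links into the growing sub-network. The function name `auto_expand`, the `max_nodes` parameter and the node names `G1` to `G5` are all hypothetical.

```python
def auto_expand(graph, seeds, max_nodes=50):
    """Toy seed-based network expansion (illustrative only, not MetaCore's code).

    graph: dict mapping each node to an iterable of its neighbours (undirected).
    seeds: list of seed nodes; each seed grows its own sub-network.
    Returns the union of all sub-networks once two of them intersect, the
    size limit is reached, or no sub-network can grow any further.
    """
    subnets = {seed: {seed} for seed in seeds}
    while True:
        grew = False
        for seed, members in subnets.items():
            # Candidates: neighbours of this sub-network that are not yet in it.
            candidates = {n for m in members for n in graph.get(m, ())} - members
            if not candidates:
                continue
            # Prefer the candidate with the most links into this sub-network;
            # the node name breaks ties so the result is deterministic.
            best = min(
                candidates,
                key=lambda n: (-len(members & set(graph.get(n, ()))), n),
            )
            members.add(best)
            grew = True
            others = set().union(*(s for k, s in subnets.items() if k != seed))
            total = set().union(*subnets.values())
            # Halt when sub-networks intersect or the size limit is reached.
            if best in others or len(total) >= max_nodes:
                return total
        if not grew:
            return set().union(*subnets.values())


# Hypothetical five-node chain: G1 - G2 - G3 - G4 - G5.
toy_graph = {
    "G1": ["G2"],
    "G2": ["G1", "G3"],
    "G3": ["G2", "G4"],
    "G4": ["G3", "G5"],
    "G5": ["G4"],
}
network = auto_expand(toy_graph, ["G1", "G5"])
small_network = auto_expand(toy_graph, ["G1", "G5"], max_nodes=3)
```

In the toy chain, the two sub-networks grow toward each other from `G1` and `G5` and stop once both claim `G3`; with `max_nodes=3`, the size limit halts expansion first.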


Data Extraction and Curation

In June 2011, we mined PubMed for term(‘biomarker’)-disease associations and identified a total of 1,181 disease associations (Table 1). We then curated the top 100 disease associations from the list, mapping each result to both a medical subject heading (MeSH) ID and a Disease Ontology ID (DOID), and subsequently queried the GeneGo diseases ontology for associated biomarkers (Table 2). Of the 100 results, 62 map to both a MeSH ID and a DOID and are shown below.
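The curation step amounts to keeping only those top-ranked diseases that resolve to both identifiers. A minimal sketch follows, assuming the MeSH and DOID lookups are already available as plain dictionaries; in the actual work the mapping was done with the MeSH Browser and the Disease Ontology. The function name `curate`, the disease names and every ID shown are placeholders, not real data.

```python
def curate(ranked_diseases, mesh_ids, doids, top_n=100):
    """Keep top-ranked diseases that map to both a MeSH ID and a DOID."""
    curated = []
    for name in ranked_diseases[:top_n]:
        mesh_id = mesh_ids.get(name)
        doid = doids.get(name)
        # Drop results that lack either identifier.
        if mesh_id and doid:
            curated.append({"disease": name, "mesh_id": mesh_id, "doid": doid})
    return curated


# Placeholder inputs; none of these names or IDs are real.
ranked = ["disease A", "disease B", "disease C"]
mesh_lookup = {"disease A": "MESH-PLACEHOLDER-1", "disease B": "MESH-PLACEHOLDER-2"}
doid_lookup = {"disease A": "DOID-PLACEHOLDER-1", "disease C": "DOID-PLACEHOLDER-3"}

curated = curate(ranked, mesh_lookup, doid_lookup)
```

Here only "disease A" survives, because it is the only entry present in both lookups.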

Table 1. A representative list of term(‘biomarker’)-disease associations mined from PubMed in June 2011. The top 100 disease associations were ranked by Z-score. The Z-score indicates the number of standard deviations that the relevancy score is above the mean; larger Z-scores denote stronger associations. The top 100 data set is available under the Open Data Commons Attribution License.

Table 1
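For readers unfamiliar with the ranking metric, the Z-score used in Table 1 can be sketched as follows. The example assumes the population standard deviation; the exact normalization used by PolySearch is not stated here, so treat that choice as an assumption. The function name `z_scores` and the relevancy values are made up for illustration.

```python
from statistics import mean, pstdev


def z_scores(relevancy):
    """Return {disease: z} where z = (score - mean) / standard deviation."""
    mu = mean(relevancy.values())
    sigma = pstdev(relevancy.values())  # assumption: population std deviation
    if sigma == 0:
        # All scores identical: no disease stands out from the mean.
        return {name: 0.0 for name in relevancy}
    return {name: (score - mu) / sigma for name, score in relevancy.items()}


# Made-up relevancy scores for three placeholder diseases.
scores = {"disease A": 10, "disease B": 20, "disease C": 30}
z = z_scores(scores)

# Rank diseases from strongest to weakest association.
ranking = sorted(z, key=z.get, reverse=True)
```

A disease whose relevancy score equals the mean gets a Z-score of zero, and higher scores sort first in `ranking`.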

Table 2. The curated list of disease associations mined from PubMed and organized by high-level Disease Ontology. Each specific disease association has a unique MeSH ID, DOID and number of associated genes as defined in the GeneGo MetaCore knowledgebase.

Table 2

Disease Interaction Network and Biomarker Assay Identification

For illustrative purposes, we constructed an interaction network around disease-associated genes for two diseases – one with few associated genes (atherosclerosis) and one with many associated genes (asthma) – using a network building algorithm in GeneGo MetaCore. For each interaction network gene set, we then queried Covance BioPathways, a publicly accessible, web-based data source that integrates biological and disease pathway maps with validated Covance assays and antibody products, to identify commercially available biomarker assays.
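Conceptually, the assay identification step is a membership check of each network gene against a catalog of genes with validated assays. The sketch below assumes both are available as simple collections of gene symbols; Covance BioPathways was queried through its web interface, and no programmatic API is implied. The function name `flag_assayed` and the gene symbols are placeholders.

```python
def flag_assayed(network_genes, assay_catalog):
    """Map each network gene to True if a validated assay is listed for it."""
    catalog = set(assay_catalog)
    return {gene: gene in catalog for gene in sorted(network_genes)}


# Placeholder gene symbols, not taken from the poster's networks.
network_genes = {"GENE_A", "GENE_B", "GENE_C"}
assay_catalog = {"GENE_B", "GENE_C", "GENE_Z"}

flags = flag_assayed(network_genes, assay_catalog)

# Genes that would carry a red dot in the network figures.
potential_biomarkers = [gene for gene, has_assay in flags.items() if has_assay]
```

Genes flagged `True` correspond to the nodes marked with a red dot, that is, candidates for which a commercial assay already exists.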

Figure 1. Atherosclerosis interaction network. Disease-associated genes are indicated with blue halos; genes without a halo were included by the network building algorithm. Biomarkers that have commercially validated assays are indicated with a red dot; they are either known or can be considered potential atherosclerosis biomarkers.

Figure 1


Figure 2. Asthma interaction network. All nodes shown are disease-associated genes. Biomarkers that have commercially validated assays are indicated with a red dot; they are either known or can be considered potential asthma biomarkers.

Figure 2


Given the molecular interdependencies within a cell, a disease is rarely a consequence of a single gene abnormality but instead reflects the perturbation of a complex network of biological and signaling pathways. The approach presented here detects and ranks human diseases based on the research and clinical activity surrounding biomarkers. It also enables the identification of therapeutic areas with candidate or validated biomarkers. The strategy takes an integrative approach to identify candidate disease biomarkers by combining disease-associated genes/proteins with commercially validated assays for known biomarkers. We first constructed a system-level model of disease that incorporates molecular interactions across biological and signaling pathways. We then identified each gene/protein in the model that has an existing commercially validated assay. This research offers an alternative, comprehensive view of key relationships and pathway perturbations that may identify biomarkers of disease emergence or progression.


  1. Cheng et al. PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res. 2008 Jul 1;36(Web Server issue):W399-405. Epub 2008 May 16.
  2. Schriml et al. Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Res. 2012 Jan;40(Database issue):D940-6. Epub 2011 Nov 12.
  3. Ekins et al. Pathway mapping tools for analysis of high content data. Methods Mol Biol. 2007;356:319-50.

Walter Jessen is a digital strategist, writer, web developer and data scientist. You can typically find him behind the screen of something with an internet connection.
