Visualizing Gene Ontologies
In my scientific research, one of the tasks I frequently perform is the interpretation of genomic data — a qualitative assessment of the biological processes and pathways that are altered in disease. This is a multi-stage process that involves first measuring the expression level of all the genes in the genome. Today, this is often done using DNA microarrays. DNA microarray technology enables researchers to measure tens of thousands of genes simultaneously. Multiple probes are used per gene to provide measurement reproducibility within a single experiment, and approximately one million data points are collected per sample.
Typically, differentially expressed genes are categorized by their expression profiles, such as genes up- or down-regulated in diseased cells relative to a non-diseased (i.e. normal) control. Such identification is called “expression profiling” or “transcriptome analysis.” The end result of such analysis is one or more lists of differentially expressed genes that meet a predetermined level of statistical significance.
Then, to understand the biological phenomena involved, each differentially expressed gene is evaluated for biological associations. Traditionally, this was done one gene at a time. However, with the establishment of a gene annotation system called Gene Ontology (GO), we can take a more systematic approach to the identification of gene associations. This is done using statistical tests that identify GO categories or terms whose associated genes appear in the gene list more often than would be expected by chance alone.
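To make the idea of enrichment concrete, here is a minimal sketch of one common test, the one-sided hypergeometric test. It is an illustration rather than the exact method used in my analyses, and the function name and the gene counts in the usage lines are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes by chance.

    population: total number of genes measured (assumed background)
    annotated:  genes in the background that carry the GO term
    sample:     size of the differentially expressed gene list
    hits:       genes in the list that carry the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 genes measured, 150 carry the term,
# 300 genes are differentially expressed, and 12 of those carry the term.
p_value = hypergeom_enrichment_p(20000, 150, 300, 12)
print(f"enrichment p-value: {p_value:.3g}")
```

In practice the p-values from many GO terms would also need a multiple-testing correction, which this sketch leaves out.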
The Gene Ontology (GO) project is a major bioinformatics initiative to standardize the representation of gene attributes across species and databases. Gene attributes are described using three structured controlled vocabularies that can be applied to all organisms: cellular component, biological process and molecular function. The Gene Ontology allows biologists to make queries across a large number of genes without researching each one individually.
The relationships used in GO are directed: each one points from a more specific term to a more general one. The ontology structure can be represented as a graph, where the nodes (concepts) are connected by edges that represent the relationships between concepts. The graph is also acyclic, meaning that following the relationships can never lead back to the starting term. The ontologies resemble a hierarchy; child terms are more specific and parent terms are less specific. However, unlike a strict hierarchy, a child term may have more than one parent term. These terms are often visualized as a hierarchical directed acyclic graph (DAG).
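As a small sketch of this structure, the toy ontology below stores each term with a list of its parents, so a term can have more than one. The term names are made up for illustration and are not real GO terms.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "signaling": ["root"],
    # A child with two parents, which a strict tree would not allow.
    "lipid signaling": ["metabolism", "signaling"],
}

def ancestors(term, parents):
    """Collect every term reachable by walking child-to-parent edges."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

print(sorted(ancestors("lipid signaling", PARENTS)))
```

Because the graph is acyclic, a term never appears among its own ancestors, and the walk always ends at the root.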
From an analytical perspective, it’s often challenging to assess GO category over-representation because there can be hundreds of significantly associated GO terms from a given gene list, many of which are child terms of the same parent. The key to such analysis is the identification of a GO term that has a high level of information content and, at the same time, has a large enough number of associated genes.
Think of it this way: parent GO terms have a low information content but tens or hundreds of genes associated with them. Thus, there are lots of genes but they aren’t telling us very much. Child terms that have very high information content (i.e. the biological theme is very specific) often have only one or two genes associated with them. The problem here is that we’ve only identified one or two genes. From a biological point of view, it’s hard to rationalize a process being altered if only one gene has changed. In my analyses, I endeavor to select GO terms that are specific but have 10 to 20 genes associated with them.
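The trade-off can be sketched in code. The example below assumes one common definition of information content, the negative log of the fraction of all annotated genes that carry the term, and keeps only terms with 10 to 20 genes from the list. The function names, term labels and counts are hypothetical.

```python
from math import log2

def information_content(n_genes_term, n_genes_total):
    """Assumed definition: rarer terms are more informative."""
    return -log2(n_genes_term / n_genes_total)

def select_terms(hits_per_term, annotated_per_term, n_genes_total,
                 min_hits=10, max_hits=20):
    """Keep terms with enough (but not too many) genes from the list,
    ranked from most to least specific."""
    selected = []
    for term, hits in hits_per_term.items():
        if min_hits <= hits <= max_hits:
            ic = information_content(annotated_per_term[term], n_genes_total)
            selected.append((term, hits, ic))
    return sorted(selected, key=lambda row: row[2], reverse=True)

# Hypothetical counts: genes from my list per term, and genes
# annotated to each term across a 20,000-gene background.
hits = {"broad term": 150, "mid-level term": 12, "narrow term": 2}
annotated = {"broad term": 5000, "mid-level term": 80, "narrow term": 5}

for term, n, ic in select_terms(hits, annotated, 20000):
    print(f"{term}: {n} genes, information content {ic:.2f}")
```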
One way I’ve found useful for distilling GO terms is to view the directed acyclic graph (DAG). Unfortunately, I haven’t been able to find software that allows me to input a list of GO terms I’ve identified and view a DAG. Instead, I have to manually look up each term across the entire GO relationship space using a tree browser. With tens of thousands of terms making up the GO process ontology, manually looking up fifty GO terms using a tree browser isn’t really an efficient use of time.
A week or so ago, I found a Java program called GODAG that allows a user to input and visualize select GO terms by DAG [1]. I managed to get it working on my MacBook (Mac OS X 10.5, Leopard). It was a bit challenging since the native Java SQLite driver wasn’t working. Compiling SQLite didn’t work either. The trick was to use SQLite drivers compiled for use on Intel Macs by Angus Hardie. I can now easily visualize ontological structures and rapidly identify GO term hierarchies.
Here is some actual data I’m working on. GO terms are represented by squares. Green squares represent GO terms associated with genes over-expressed in benign peripheral nerve tumors (neurofibroma). Black squares are the necessary linking categories in the paths. High level (i.e. low information content) GO terms are located in the upper region of the DAG and lower level terms that are more specific are located in the lower region. Note the 8 or 9 “branches” that extend down the DAG. These are the biological themes I’m most interested in capturing. Clicking on a square shows the GO ID and term name below the DAG.
The tool also allows for a comparison of two lists of GO terms. Here, the green squares again represent GO terms associated with genes over-expressed in neurofibroma and the red squares represent GO terms associated with genes over-expressed in malignant peripheral nerve sheath tumors (MPNST). The blue squares indicate those categories that overlap the two groups, showing potential biological processes that are altered in benign tumors and maintained with malignancy.
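The coloring logic of such a comparison can be sketched as follows. This is my own illustration of the idea, not GODAG’s actual code: the two input lists are expanded with every ancestor needed to connect them to the root, and each node is then colored by which list it came from. The toy ontology and term names are hypothetical.

```python
def with_ancestors(terms, parents):
    """Return the given terms plus all ancestors that link them to the root."""
    needed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in needed:
            needed.add(term)
            stack.extend(parents.get(term, []))
    return needed

def color_nodes(list_a, list_b, parents):
    """Green: only in list A. Red: only in list B. Blue: in both.
    Black: linking terms that appear in neither list."""
    a, b = set(list_a), set(list_b)
    colors = {}
    for term in with_ancestors(a | b, parents):
        if term in a and term in b:
            colors[term] = "blue"
        elif term in a:
            colors[term] = "green"
        elif term in b:
            colors[term] = "red"
        else:
            colors[term] = "black"
    return colors

# Hypothetical toy ontology (child -> direct parents) and term lists.
parents = {
    "root": [],
    "adhesion": ["root"],
    "cell cycle": ["root"],
    "cell-matrix adhesion": ["adhesion"],
    "mitosis": ["cell cycle"],
}
benign = ["cell-matrix adhesion", "adhesion"]
malignant = ["adhesion", "mitosis"]

for term, color in sorted(color_nodes(benign, malignant, parents).items()):
    print(f"{color:>5}  {term}")
```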
I’m really excited about the utility of this tool for functional genomic analyses. However, there are a number of issues: the image can’t be resized, it isn’t possible to zoom in on a region, groups can’t be hidden and unhidden, gene number can’t be included, and it’s impossible to save and reload the map without entering all the GO terms again. Despite these issues, the program is currently functional enough to help immensely in organizing and annotating ontology associations. Any Java developers interested in a collaboration?
References
[1] Zhu et al. GO-2D: identifying 2-dimensional cellular-localized functional modules in Gene Ontology. BMC Genomics. 2007 Jan 24;8:30.

