Rahul Sharma (Editor)

Cancer Genome Anatomy Project


The Cancer Genome Anatomy Project (CGAP), created by the National Cancer Institute (NCI) in 1997 and introduced by Al Gore, is an online database on normal, pre-cancerous and cancerous genomes. It also provides tools for viewing and analysing the data, allowing identification of genes involved in various aspects of tumor progression. The goal of CGAP is to characterize cancer at a molecular level by providing a platform with readily accessible, up-to-date data and a set of tools with which researchers can easily relate their findings to existing knowledge. There is also a focus on developing software tools that improve the usability of large and complex datasets. The project is directed by Daniela S. Gerhard and includes several sub-projects or initiatives, notably the Cancer Chromosome Aberration Project (CCAP) and the Genetic Annotation Initiative (GAI). CGAP contributes to many databases, and organisations such as the NCBI contribute to CGAP's databases.

The eventual outcomes of CGAP include establishing a correlation between a particular cancer's progression and its therapeutic outcome, improved evaluation of treatment, and development of novel techniques for prevention, detection and treatment. This is achieved by characterising the mRNA products of biological tissue.

Background

The fundamental cause of cancer is the inability of a cell to regulate its gene expression. To characterise a specific type of cancer, either the proteins produced from the altered gene expression or their mRNA precursors can be examined. CGAP works to associate a particular cell's expression profile (also called its molecular signature or transcriptome, essentially the cell's fingerprint) with the cell's phenotype. Expression profiles are therefore recorded with respect to cancer type and stage of progression.

Sequencing

CGAP's initial goal was to establish a Tumor Gene Index (TGI) to store the expression profiles, contributing to both new and existing databases. This work fed two types of libraries, dbEST and later dbSAGE. The libraries were prepared in a series of steps:

  • Cell contents are washed over plates coated with poly-T sequences. These bind the poly-A tails that exist only on mRNA molecules, thereby selectively retaining mRNA.
  • The isolated mRNA is processed into a cDNA transcript through reverse transcription and DNA polymerisation reactions.
  • The resulting double-stranded DNA is then incorporated into E. coli plasmids. Each bacterium now contains one unique cDNA and is replicated to produce clones with the same genetic information. The resulting collection is termed a cDNA library.
  • The library can then be sequenced by high-throughput sequencing techniques. This characterises both the different genes expressed by the original cell and the level of expression of each gene.
The TGI focused on prostate, breast, ovarian, lung and colon cancers at first, and CGAP later extended its research to other cancers. In practice, issues arose that CGAP addressed as new technologies became available.

Many cancers occur in tissues with multiple cell types. Traditional techniques took the whole tissue sample and produced bulk tissue cDNA libraries. This cellular heterogeneity made the gene expression information less accurate for the purposes of cancer biology. An example is prostate cancer tissue, where epithelial cells, which have been shown to be the only cell type that gives rise to cancer, make up only 10% of the cell count. This led to the development of laser capture microdissection (LCM), a technique that can isolate individual cell types or even individual cells, which gave rise to cDNA libraries of specific cell types.

Sequencing a cDNA in full yields the entire sequence of the mRNA transcript that generated it. In practice, only part of the sequence is required to uniquely identify the mRNA or its associated protein. This part of the sequence is termed the expressed sequence tag (EST) and is always at the end of the sequence close to the poly-A tail. EST data are stored in a database called dbEST. ESTs only need to be around 400 bases long, but with NGS techniques this will still produce low-quality reads. Therefore, an improved method called serial analysis of gene expression (SAGE) is also used. For each cDNA transcript molecule produced from a cell's gene expression, this method identifies a region only 10-14 bases long, anywhere along the read sequence, that is sufficient to uniquely identify that cDNA transcript. These short regions are cut out and linked together, then incorporated into bacterial plasmids as described above. SAGE libraries have better read quality and generate a larger amount of data when sequenced. Since transcripts are compared in absolute rather than relative levels, SAGE also has the advantage of requiring no normalisation of data via comparison with a reference.
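The difference between an EST and a SAGE tag can be illustrated with a short Python sketch. It is only a toy model under stated assumptions: the sequences are invented, the function names are hypothetical, and the tag is taken next to the 3'-most CATG site (the recognition site of NlaIII, an anchoring enzyme commonly used in SAGE protocols), which is a detail the text above does not specify.

```python
from collections import Counter
from typing import Iterable, Optional

EST_LENGTH = 400       # approximate EST length mentioned in the text
SAGE_TAG_LENGTH = 10   # lower end of the 10-14 base range mentioned in the text
ANCHOR_SITE = "CATG"   # assumption: NlaIII site; not specified in the article


def strip_poly_a(cdna: str) -> str:
    """Remove the run of trailing A bases (the poly-A tail) from a sense-strand cDNA."""
    return cdna.rstrip("A")


def est_from_cdna(cdna: str, length: int = EST_LENGTH) -> str:
    """Return up to `length` bases from the 3' end, adjacent to the poly-A tail."""
    return strip_poly_a(cdna)[-length:]


def sage_tag(cdna: str, tag_length: int = SAGE_TAG_LENGTH) -> Optional[str]:
    """Return the bases directly after the 3'-most anchor site, or None if unavailable."""
    body = strip_poly_a(cdna)
    site = body.rfind(ANCHOR_SITE)
    if site == -1:
        return None  # no anchor site, so no tag can be taken
    start = site + len(ANCHOR_SITE)
    tag = body[start:start + tag_length]
    # Discard tags that run off the end of the transcript.
    return tag if len(tag) == tag_length else None


def count_tags(cdnas: Iterable[str]) -> Counter:
    """Count each SAGE tag across a library; counts act as absolute expression levels."""
    tags = (sage_tag(c) for c in cdnas)
    return Counter(t for t in tags if t is not None)


# Invented toy transcripts: gene_a is present three times, gene_b once.
gene_a = "GGCCATGTTGACCTGAAGGCTCAAAAAAAA"
gene_b = "TTACATGCCGTAGGATCCTTGAAAAA"
library = [gene_a, gene_a, gene_a, gene_b]

profile = count_tags(library)
for tag, count in profile.most_common():
    print(tag, count)
```

In this toy library each transcript is reduced to a 10-base tag, and the tag counts stand in for the expression profile. Real SAGE data additionally require mapping each tag back to a gene, which is omitted here.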

Outcomes and Future

CGAP is now a centralised location for several genomics tools and genetic databases and is employed widely in cancer and molecular biology research. The databases established by CGAP continue to contribute to knowledge of cancers in terms of their pathways and progression. The transcriptome databases can also be used in non-cancer-related research, as they contain information that can be used to quickly and easily identify particular sequenced genes. The data also have clinical impact, as cDNAs can be used to create microarrays for diagnosis and treatment comparison purposes. CGAP has been used in many studies, with examples including:

  • Characterising differences in normal and cancerous endothelial cell gene expression
  • Identifying irregular gene expression as markers for glioblastomas and ovarian cancer
  • Identifying gene expression specific to prostate tissue
  • Comparison of proteins expressed in normal and cancerous reproductive tissue
In addition, the vast amount of data generated by CGAP has prompted improvements in data analysis and mining techniques, with examples including:

  • Comparison of gene expression from multiple cDNA libraries
  • Improved techniques for mining EST libraries
  • Integral, large scale studies of human transcriptome analysis