Announced in 2008, shortly after the human 1000 Genomes Project, the 1000 Plant Genomes Project is a similarly large-scale genomics endeavour that takes advantage of the speed and efficiency of next-generation DNA sequencing. Headed by Dr. Gane Ka-Shu Wong and Dr. Michael Deyholos of the University of Alberta, the project aims to obtain the transcriptome (expressed genes) of 1000 different plant species over the next few years.
- Goals of the Project
- Evolutionary Relationships
- Biotechnology applications
- Project Approach
- Species selection
- Transcriptome vs genome sequencing
- Transcriptome shotgun sequencing
- Plant tissue sampling
- Potential limitations
- Related projects
Recent advances in DNA sequencing technologies have dramatically reduced the cost and time needed to sequence an organism's entire genome, and large-scale sequencing projects involving many organisms have been and are currently being undertaken as a result. The recently started 1000 Genomes Project, for example, aims to obtain high genome coverage of 1000 individual people to better understand human genetic variation, which genomic sequence is best suited to assess.
Goals of the Project
As of 2002, the number of classified green plant species was around 370,000, and there are probably many thousands more still unclassified. Despite this number, very few of these species have detailed DNA sequence information to date: 125,426 species were represented in GenBank as of 11 April 2012, but most (>95%) have DNA sequence for only one or two genes. "…almost none of the roughly half million plant species known to humanity has been touched by genomics at any level". The 1000 Plant Genomes Project will produce roughly a 100x increase in the number of species with broad genome sequence available.
Evolutionary Relationships
There have been efforts to determine the evolutionary relationships between the known plant species, but phylogenies (or phylogenetic trees) built solely from morphological data, cellular structures, single enzymes, or only a few sequences (like rRNA) can be prone to error. Morphological features are especially misleading in two situations: when two species look physically similar although they are not closely related (as a result of convergent evolution, for example), and when two closely related species look very different because, for example, they change readily in response to their environment. Both situations are very common in the plant kingdom. An alternative method for constructing evolutionary relationships is to compare changes in the DNA sequence of many genes between the different species, which is often more robust to the problem of similar-appearing species. With the amount of genomic sequence produced by this project, many predicted evolutionary relationships can be better tested by sequence alignment (figure 1) to improve their certainty.
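As a rough illustration of comparing many genes at once, the sketch below pools mismatches over several pre-aligned genes and reports the least-divergent pair of species. It is a toy example, not the project's analysis method: the species names, the sequences, and the helper functions `p_distance`, `pooled_distance` and `closest_pair` are all hypothetical, and a real phylogenetic analysis would use a substitution model and tree-building software rather than a raw mismatch fraction.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned, gap-free positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # skip alignment gaps
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no gap-free positions to compare")
    return mismatches / compared


def pooled_distance(genes_a, genes_b) -> float:
    """Distance over several aligned genes, pooled by concatenation."""
    return p_distance("".join(genes_a), "".join(genes_b))


def closest_pair(alignments):
    """Return the pair of species names with the smallest pooled distance."""
    return min(
        combinations(sorted(alignments), 2),
        key=lambda pair: pooled_distance(alignments[pair[0]], alignments[pair[1]]),
    )


# Hypothetical toy data: two aligned genes per species.
toy_alignments = {
    "species_A": ["ATGGCCATTG", "ATGAAACCCG"],
    "species_B": ["ATGGCCATTG", "ATGAAACCTG"],
    "species_C": ["ATGGTCAATG", "ATG-AACGTG"],
}

print(closest_pair(toy_alignments))
```

Pooling many genes means a single unusual gene has less influence on the estimate, which is the intuition behind using thousands of transcript sequences per species.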
Biotechnology applications
The list of plant genomes to be sequenced in the project is not random; instead, the project will focus on plants that produce valuable chemicals or other products (secondary metabolites in many cases), in the hope that characterizing the genes involved will allow the underlying biosynthetic processes to be used or modified. For example, many plants are known to produce oils (like olives), and the oils from certain plants, such as the oil palm and hydrocarbon-producing species, bear a strong chemical resemblance to petroleum products. If these plant mechanisms could be used to produce mass quantities of industrially useful oil, or modified so that they do, they would be of great value. Knowing the sequence of the plant's genes involved in the metabolic pathway that produces the oil is a large first step toward such use. A recent example of engineering a natural biochemical pathway is Golden rice, whose pathway has been genetically modified so that a precursor to vitamin A is produced in large quantities, making the rice a potential solution for vitamin A deficiency. This concept of engineering plants to do "work" is popular, and its potential would increase dramatically with gene information on 1000 plant species. Biosynthetic pathways could also be used for mass production of medicinal compounds in plants, rather than through the manual organic chemical reactions by which most are currently made.
Project Approach
Sequencing will be carried out on the 28 Illumina Genome Analyzer next-generation DNA sequencing machines at the Beijing Genomics Institute (BGI – Shenzhen, China). Each machine has a capacity of 3 Gb per run (3 billion base pairs per experiment), which will enable fast and accurate sequencing of the plant samples.
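The capacity figures above can be turned into a back-of-the-envelope throughput estimate. In the sketch below, the machine count, per-run capacity and species target come from the text, while the amount of sequence allotted to each sample (`ASSUMED_GB_PER_SAMPLE`) is a purely hypothetical value chosen for illustration, as is the simplification that every machine runs once per "round".

```python
import math

MACHINES = 28          # from the text
GB_PER_RUN = 3         # from the text: 3 Gb per run per machine
TARGET_SPECIES = 1000  # from the text

# Hypothetical assumption, not a project figure: sequence budget per sample.
ASSUMED_GB_PER_SAMPLE = 2


def gb_per_round(machines: int = MACHINES, gb_per_run: int = GB_PER_RUN) -> int:
    """Total output when every machine completes one run."""
    return machines * gb_per_run


def rounds_needed(samples: int, gb_per_sample: int) -> int:
    """Number of all-machine rounds needed to sequence every sample once."""
    samples_per_round = gb_per_round() // gb_per_sample
    if samples_per_round < 1:
        raise ValueError("one sample would need more than a full round of runs")
    return math.ceil(samples / samples_per_round)


print(gb_per_round())                                        # Gb per round
print(rounds_needed(TARGET_SPECIES, ASSUMED_GB_PER_SAMPLE))  # rounds for 1000 samples
```

Under these assumptions one round yields 84 Gb, enough for 42 samples, so 1000 samples would take 24 rounds. Changing the assumed per-sample budget changes the answer proportionally.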
Species selection
The list of plant species to be sequenced has nearly been compiled through an international collaboration of the various funding agencies and research groups that have expressed interest in certain plants. The focus has been on plant species known to have useful biosynthetic capacity, to support the biotechnology goals of the project, with other species selected to fill gaps and resolve unknown evolutionary relationships in the current plant phylogeny. In addition to species with the capacity to synthesize industrial compounds, plant species known or suspected to produce medically active chemicals (such as poppies producing opiates) were assigned a high priority in order to better understand the synthesis process, explore commercial production potential, and discover new pharmaceutical options. A large number of plant species with medicinal properties have been selected from traditional Chinese medicine (TCM). The largely completed list of selected species can be publicly viewed at [www.onekp.com/samples/list.php].
Transcriptome vs. genome sequencing
Rather than sequencing the entire genome (all DNA sequence) of the various plant species, the project will sequence only the transcriptome: those regions of the genome that produce a protein product (coding genes). This approach is justified on two grounds. First, the project focuses on biochemical pathways, where only the genes producing the proteins involved are needed to understand the synthetic mechanism. Second, these thousands of sequences provide enough detail to construct very robust evolutionary relationships through sequence comparison. The number of coding genes varies considerably between plant species, but all have tens of thousands or more, making the transcriptome a large collection of information. Even so, non-coding sequence makes up the majority (>90%) of genome content. Although this approach is conceptually similar to expressed sequence tags (ESTs), it is fundamentally different in that the entire sequence of each gene will be acquired with high coverage, rather than just the small portion of the gene sequence that an EST provides. To distinguish the two, the non-EST method is known as “transcriptome shotgun sequencing”.
Transcriptome shotgun sequencing
mRNA (messenger RNA) is collected from a sample, converted to cDNA by a reverse transcriptase enzyme, and then fragmented so that it can be sequenced. Besides transcriptome shotgun sequencing, this technique has been called RNA-seq and whole transcriptome shotgun sequencing (WTSS). Once the cDNA fragments are sequenced, they will be assembled de novo (without aligning to a reference genome sequence) during the data analysis phase, with all of the fragments from a given gene combined to reconstruct its complete sequence.
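The idea of reassembling a gene from overlapping fragments without a reference can be sketched with a toy greedy overlap merger. This is an illustration only and not the assembler used by the project: the fragment sequences and the functions `overlap` and `greedy_assemble` are hypothetical, sequencing errors and reverse-complement reads are ignored, and production assemblers generally rely on more scalable graph-based methods.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments that are wholly contained in another fragment.
    contigs = [f for f in contigs if not any(f != g and f in g for g in contigs)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical fragments of one short transcript, given out of order.
toy_fragments = ["TTAGCCTGA", "ATGGCGTAC", "GTACGTTAG"]
print(greedy_assemble(toy_fragments))
```

Here the three fragments overlap by four bases each and merge into a single contig, `ATGGCGTACGTTAGCCTGA`. Fragments that share no sufficiently long overlap are returned as separate contigs.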
Plant tissue sampling
The samples will come from around the world, with a number of particularly rare species being supplied by botanical gardens such as the Fairy Lake Garden (Shenzhen, China). The type of tissue collected will be determined by the expected location of biosynthetic activity; for example, if an interesting process or chemical is known to exist primarily in the leaves, the sample will be taken from the leaves.
Potential limitations
Since only the transcriptome is being sequenced, the project will not reveal information about gene regulatory sequence, non-coding RNAs, repetitive DNA elements, or other genomic features that are not part of the coding sequence. Based on the few whole plant genomes sequenced so far, these non-coding regions make up the majority of the genome, and the non-coding DNA may actually be the primary driver of the trait differences seen between species.
Since mRNA is the starting material, the amount of sequence representation for a given gene will be based on the expression level (how many mRNA molecules it produces). This means that highly expressed genes get better coverage because there is more sequence to work from. The result, then, is that some important genes may not be reliably detected by the project if they are expressed at a low level yet still have important biochemical functions.
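The dependence of coverage on expression level can be made concrete with a small expected-value calculation. The sketch below assumes an idealized model in which reads are drawn in proportion to each transcript's share of total mRNA bases; the gene names, lengths, copy numbers, read count and read length are all hypothetical, and real library preparation adds biases that are not modeled here.

```python
def expected_coverage(transcripts, total_reads: int, read_length: int):
    """Expected per-base coverage for each transcript.

    `transcripts` maps a gene name to (length in bases, number of mRNA copies).
    Reads are assumed to be sampled in proportion to length * copies.
    """
    total_mass = sum(length * copies for length, copies in transcripts.values())
    coverage = {}
    for name, (length, copies) in transcripts.items():
        reads = total_reads * (length * copies) / total_mass
        coverage[name] = reads * read_length / length
    return coverage


# Hypothetical sample: two genes of equal length, very different expression.
toy_transcripts = {
    "abundant_gene": (1000, 990),
    "rare_gene": (1000, 10),
}

toy_coverage = expected_coverage(toy_transcripts, total_reads=10_000, read_length=100)
for gene, depth in toy_coverage.items():
    print(f"{gene}: about {depth:.0f}x expected coverage")
```

In this toy case the abundant gene is expected to receive about 990x coverage and the rare gene about 10x, mirroring their 99:1 ratio of mRNA copies. A gene expressed at still lower levels could fall below the depth needed for reliable assembly.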
Many plant species (especially agriculturally manipulated ones) are known to have undergone large genome-wide changes through duplication of the whole genome. The rice and wheat genomes are examples; wheat can have 4-6 copies of its whole genome, whereas animals typically have only 2 (diploidy). The resulting duplicated genes may pose a problem for the de novo assembly of sequence fragments, because repeated sequences confuse the computer programs that try to put the fragments together, and such genes can be difficult to track through evolution.
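One way to see why duplicated genes confuse assembly is to look for short subsequences (k-mers) that are followed by different bases in different gene copies. At such a point an assembler cannot tell which continuation belongs to which copy. The sketch below is a simplified illustration: the two near-identical gene copies and the function `branching_kmers` are hypothetical, and the choice of k = 4 is arbitrary.

```python
from collections import defaultdict


def branching_kmers(sequences, k: int):
    """Return k-mers that are followed by more than one distinct base.

    Each such k-mer marks a point where an assembler extending a contig
    would face an ambiguous choice.
    """
    successors = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k):
            successors[seq[i:i + k]].add(seq[i + k])
    return {kmer: sorted(nxt) for kmer, nxt in successors.items() if len(nxt) > 1}


# Hypothetical duplicated gene: two copies differing at a single base.
copy_a = "ATGGCGTACGTTAGC"
copy_b = "ATGGCGTACGATAGC"

print(branching_kmers([copy_a], k=4))          # one copy: no ambiguity
print(branching_kmers([copy_a, copy_b], k=4))  # two copies: an ambiguous extension
```

With a single copy there are no branching k-mers, while with both copies present the k-mer `TACG` can be extended by either `A` or `T`. Real duplicated genes can produce many such branch points.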
Related projects
Just as the Beijing Genomics Institute in Shenzhen, China is one of the major genomics centers involved in the 1000 Genomes Project, the institute is the site of sequencing for the 1000 Plant Genomes Project. Both projects are large-scale efforts to obtain detailed DNA sequence information to improve our understanding of the organisms, and both projects will utilize next-generation sequencing to facilitate a timely completion.
The goals of the two projects are significantly different. While the 1000 Genomes Project focuses on genetic variation in a single species, the 1000 Plant Genomes Project looks at the evolutionary relationships and genes of 1000 different plant species.
While the 1000 Genomes Project was initially estimated to cost up to $50 million USD, the 1000 Plant Genomes Project will likely not be as expensive; the difference in cost comes from the target sequence in the genomes. Since the 1000 Plant Genomes Project will sequence only the transcriptome, whereas the human project will sequence as much of the genome as is deemed feasible, this more specific approach requires much less sequencing effort. While this means that there will be less overall sequence output than in the 1000 Genomes Project, the non-coding portions of the genomes excluded from the 1000 Plant Genomes Project are not as important to its goals as they are to those of the human project. The more focused approach of the 1000 Plant Genomes Project therefore minimizes cost while still achieving its goals.
The project is funded by Alberta Innovates - Technology Futures (a merger that included iCORE), the Alberta Agricultural Research Institute (AARI), Genome Alberta, the University of Alberta, the Beijing Genomics Institute (BGI), and Musea Ventures (a USA-based private investment firm). To date, the project has received $1.5 million CAD from the Alberta Government and another $0.5 million from Musea Ventures. An additional $2.5 million CAD will be contributed by the Alberta government over the next 3 years. In January 2010, BGI announced that it would be contributing $100 million to large-scale sequencing projects of plants and animals (including the 1000 Plant Genomes Project).