Computational and Statistical Genetics

The interdisciplinary research field of Computational and Statistical Genetics uses approaches from genomics, quantitative genetics, computational science, bioinformatics and statistics to develop and apply computationally efficient and statistically robust methods for sorting through increasingly rich and massive genome-wide data sets. Its aims include identifying complex genetic patterns, gene functions and interactions, and disease and phenotype associations across the genomes of various organisms. The field is also often referred to as computational genomics, and it is an important discipline within the umbrella field of computational biology.

Haplotype Phasing

Over the last two decades there has been great interest in understanding the genetic and genomic makeup of various species, including humans, aided primarily by rapidly developing genome sequencing technologies. These technologies are still limited, however, so computational and statistical methods are essential to detect and correct errors and to assemble the pieces of partial information produced by sequencing and genotyping platforms.

A haplotype is defined as the sequence of nucleotides (A, G, T, C) along a single chromosome. Humans have 23 pairs of chromosomes; maize, another diploid, has 10 pairs. With current technology, however, it is difficult to separate the two chromosomes within a pair, and assays report only their combined signal at each nucleotide, called the genotype. The objective of haplotype phasing is to recover the two constituent haplotypes from this combined genotype information. Knowledge of the haplotypes is extremely important: it not only gives a complete picture of an individual's genome, but also aids other computational genomic processes such as imputation, among many other significant biological motivations.

For diploid organisms such as humans and maize, each individual carries two copies of every chromosome, one inherited from each parent, and the two copies are highly similar to each other. The haplotype phasing problem therefore focuses on the nucleotide sites at which the two homologous chromosomes differ. Computationally, a genomic region with K heterozygous sites admits 2^(K-1) possible haplotype pairs, so the phasing problem amounts to efficiently finding the most probable pair of haplotypes given an observed genotype. For more information, see Haplotype.
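
To make the combinatorics concrete, the sketch below (a hypothetical helper, not part of any standard phasing tool) enumerates every haplotype pair consistent with a small unphased genotype. With three heterozygous sites it prints the expected 2^(3-1) = 4 distinct pairs, which illustrates why exhaustive enumeration is hopeless genome-wide and statistical phasing methods are needed.

```python
from itertools import product

def enumerate_phasings(genotype):
    """List all haplotype pairs consistent with an unphased genotype.

    `genotype` is a list of per-site allele pairs, e.g. [('A', 'G'), ('C', 'C')];
    heterozygous sites are those whose two alleles differ.
    """
    het_sites = [i for i, (a, b) in enumerate(genotype) if a != b]
    seen, pairs = set(), []
    # Try every assignment of the heterozygous alleles to haplotype 1 or 2.
    for choice in product((0, 1), repeat=len(het_sites)):
        flips = iter(choice)
        h1, h2 = [], []
        for a, b in genotype:
            c = next(flips) if a != b else 0
            h1.append((a, b)[c])
            h2.append((a, b)[1 - c])
        key = frozenset((tuple(h1), tuple(h2)))
        if key not in seen:          # unordered pairs: (h1, h2) == (h2, h1)
            seen.add(key)
            pairs.append((tuple(h1), tuple(h2)))
    return pairs

# Three heterozygous sites -> 2**(3 - 1) = 4 distinct phasings.
genotype = [('A', 'G'), ('C', 'C'), ('T', 'A'), ('G', 'C')]
for h1, h2 in enumerate_phasings(genotype):
    print(''.join(h1), '/', ''.join(h2))
```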

Prediction of SNP genotypes by Imputation

Although the genome of a higher organism (a eukaryote) contains millions of single-nucleotide polymorphisms (SNPs), genotyping arrays are pre-designed to detect only a subset of these markers; the missing markers are then predicted by imputation analysis. Imputation of un-genotyped markers has become an essential part of genetic and genomic studies. It exploits knowledge of linkage disequilibrium (LD) among haplotypes in a known reference panel (for example, HapMap and the 1000 Genomes Project) to predict genotypes at the missing or un-genotyped markers, allowing scientists to accurately analyse both the genotyped polymorphic markers and the computationally predicted ones. Downstream studies have been shown to benefit substantially from imputation in the form of improved power to detect disease-associated loci.

Another crucial contribution of imputation is that it facilitates combining genetic and genomic studies that used different genotyping platforms for their experiments. For example, although some 415 million common and rare genetic variants exist in the human genome, current genotyping arrays such as the Affymetrix and Illumina microarrays can assay only up to about 2.5 million SNPs. Imputation analysis is therefore an important research direction, and it is important to identify methods and platforms that impute high-quality genotype data from existing genotypes and publicly available reference panels such as the International HapMap Project and the 1000 Genomes Project. In humans, such analyses have successfully generated predicted genotypes in many populations, including Europeans and African Americans; for other species, such as maize, imputation against reference panels is an ongoing effort.
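
As a rough illustration of borrowing information from a reference panel, the toy sketch below fills untyped sites by greedily picking the two panel haplotypes most consistent with an individual's typed genotypes. The function, the panel and all values are hypothetical; real imputation tools work probabilistically rather than greedily.

```python
import numpy as np

def impute_from_panel(genotype, panel, missing=-1):
    """Toy reference-panel imputation; a greedy sketch, not a real tool.

    `genotype` is a length-M vector of 0/1/2 allele dosages with `missing`
    at untyped SNPs; `panel` is an (H, M) array of 0/1 reference haplotypes
    covering all M SNPs. The two panel haplotypes most consistent with the
    typed sites are summed to fill in the untyped ones.
    """
    typed = genotype != missing
    g, h = genotype[typed], panel[:, typed]
    # A haplotype allele conflicts with the genotype wherever the genotype
    # is homozygous for the other allele.
    conflicts = ((g == 0) & (h == 1)) | ((g == 2) & (h == 0))
    best = np.argsort(conflicts.sum(axis=1))[:2]
    imputed = genotype.copy()
    imputed[~typed] = panel[best[0], ~typed] + panel[best[1], ~typed]
    return imputed

# Hypothetical panel of four haplotypes over five SNPs; SNPs 1 and 3 untyped.
panel = np.array([[0, 0, 1, 0, 1],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 1, 1],
                  [1, 0, 0, 0, 0]])
print(impute_from_panel(np.array([1, -1, 1, -1, 1]), panel))
```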

A number of different methods exist for genotype imputation. The three most widely used are MaCH, IMPUTE2 and Beagle. All three use hidden Markov models as the underlying basis for estimating the distribution of haplotype frequencies, though MaCH and IMPUTE2 are more computationally intensive than Beagle. MaCH and IMPUTE2 are based on different implementations of the product of approximate conditionals (PAC) model. Beagle groups the reference-panel haplotypes into clusters at each SNP to form a localized haplotype-cluster model, which lets it vary the number of clusters dynamically at each SNP and makes it computationally faster than MaCH and IMPUTE2.

For more information, see Imputation (genetics).
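
The haplotype-copying idea underlying MaCH and IMPUTE2 can be sketched with a toy forward pass of a Li and Stephens-style hidden Markov model, in which the hidden state at each SNP is the reference haplotype currently being copied. The parameter values and data below are arbitrary illustrations, not the tools' actual implementations.

```python
import numpy as np

def ls_forward_loglik(obs, panel, theta=0.01, rho=0.05):
    """Forward pass of a toy Li & Stephens-style haplotype-copying HMM.

    Hidden states are the H panel haplotypes; the observed allele at each
    SNP is a possibly mutated copy (error rate `theta`) of the current
    template, and the template switches with probability `rho` between
    adjacent SNPs (a stand-in for recombination). Returns the log-likelihood
    of the observed haplotype under the model.
    """
    H, M = panel.shape
    f = np.full(H, 1.0 / H)                  # state probabilities, normalized
    loglik = 0.0
    for m in range(M):
        emit = np.where(panel[:, m] == obs[m], 1.0 - theta, theta)
        f = emit * ((1.0 - rho) * f + rho / H)   # stay on template or switch
        s = f.sum()
        loglik += np.log(s)
        f /= s
    return loglik

panel = np.array([[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]])
print(ls_forward_loglik(np.array([0, 1, 1, 1]), panel))
```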

Genome-wide Association Analysis

Over the past few years, genome-wide association studies (GWAS) have become a powerful tool for investigating the genetic basis of common diseases and have improved our understanding of the genetic basis of many complex traits. Traditional single-SNP (single-nucleotide polymorphism) GWAS is the most commonly used method for finding trait-associated DNA sequence variants: associations between variants and one or more phenotypes of interest are investigated by studying individuals with different phenotypes and examining their genotypes at each SNP individually. SNPs for which one variant is statistically more common in individuals belonging to one phenotypic group are then reported as associated with the phenotype.

However, most complex common diseases involve small population-level contributions from multiple genomic loci. To detect such small effects at genome-wide significance, traditional GWAS rely on increased sample size; for example, to detect an effect that accounts for 0.1% of the total variance, a traditional GWAS needs to sample almost 30,000 individuals. Although the development of high-throughput SNP genotyping technologies has lowered the cost and improved the efficiency of genotyping, performing such a large-scale study still costs considerable money and time.

Recently, association analysis methods based on gene-based tests have been proposed, motivated by the fact that variations in protein-coding and adjacent regulatory regions are more likely to have functional relevance. These methods can account for multiple independent functional variants within a gene, with the potential to greatly increase the power to identify disease- or trait-associated genes. In addition, imputation of ungenotyped markers from known reference panels (e.g. HapMap and the 1000 Genomes Project) predicts genotypes at the missing or untyped markers, allowing the evidence for association to be evaluated accurately at genetic markers that are not directly genotyped (in addition to the typed markers), and has been shown to improve the power of GWAS to detect disease-associated loci.

For more information, see Genome-wide association study.
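
A minimal sketch of the traditional single-SNP scan described above, assuming a quantitative trait and simulated data: each SNP is tested one at a time with a simple linear regression. Real analyses add covariates, corrections for population structure, and a stringent genome-wide significance threshold to cope with testing millions of markers.

```python
import numpy as np
from scipy import stats

def single_snp_scan(genotypes, phenotype):
    """Test each SNP for association with a quantitative trait, one at a time.

    `genotypes` is an (n_samples, n_snps) 0/1/2 dosage matrix. Each column is
    regressed against the phenotype and its p-value recorded; genome-wide
    significance is conventionally declared at p < 5e-8 to account for the
    huge number of tests.
    """
    pvals = []
    for j in range(genotypes.shape[1]):
        result = stats.linregress(genotypes[:, j], phenotype)
        pvals.append(result.pvalue)
    return np.array(pvals)

# Simulated toy data: 500 individuals, 100 SNPs, a real effect at SNP 0.
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(500, 100))
y = 0.4 * G[:, 0] + rng.normal(size=500)
p = single_snp_scan(G, y)
print("strongest association at SNP", p.argmin(), "with p =", p.min())
```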

In this era of large genetic and genomic data sets, accurate representation and identification of statistical interactions in biological, genetic and genomic data is a vital basis for designing interventions and curative solutions for many complex diseases. Variations in the human genome have long been known to make us susceptible to many diseases, and as we move toward personal genomics and personalized medicine, accurate predictions of the disease risk posed by predisposing genetic factors are required. Computational and statistical methods for identifying these genetic variations, and for building them into intelligent models for genome-wide disease-association and interaction analysis, are therefore urgently needed across many disease areas. The principal challenges are: (1) most complex diseases involve small or weak contributions from multiple genetic factors that explain only a minuscule fraction of the population variation attributed to genetic factors; and (2) biological data are inherently extremely noisy, so the underlying complexities of biological systems (such as linkage disequilibrium and genetic heterogeneity) need to be incorporated into the statistical models for disease-association studies. The risk of developing many common diseases, such as cancer, autoimmune diseases and cardiovascular diseases, involves complex interactions between multiple genes and several endogenous and exogenous environmental agents or covariates. Many previous disease-association studies could not produce significant results because their mathematical models for disease outcome did not incorporate statistical interactions, and consequently much of the genetic risk underlying several diseases and disorders remains unknown. Computational methods that model and identify the genetic and genomic variations underlying disease risk have great potential to improve the prediction of disease outcomes, to illuminate the interactions involved, and to support the design of better therapeutic methods based on them.
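
As an illustration of incorporating a statistical interaction into a disease model, the sketch below fits logistic regressions of case/control status with and without a SNP x SNP product term and compares them with a likelihood-ratio test. The data are simulated and the setup is deliberately minimal, not a production analysis.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def interaction_pvalue(snp_a, snp_b, disease):
    """Likelihood-ratio test for a SNP x SNP statistical interaction.

    Fits logistic models of case/control status with and without the
    product term; a small p-value suggests the two loci interact beyond
    their additive effects. Real studies would also adjust for covariates.
    """
    base = sm.add_constant(np.column_stack([snp_a, snp_b]))
    full = sm.add_constant(np.column_stack([snp_a, snp_b, snp_a * snp_b]))
    m0 = sm.Logit(disease, base).fit(disp=False)
    m1 = sm.Logit(disease, full).fit(disp=False)
    lr = 2.0 * (m1.llf - m0.llf)           # likelihood-ratio statistic, 1 df
    return stats.chi2.sf(lr, df=1)

# Simulated toy data in which the two SNPs interact on the logit scale.
rng = np.random.default_rng(1)
a = rng.integers(0, 3, 2000)
b = rng.integers(0, 3, 2000)
logit = -1.0 + 0.1 * a + 0.1 * b + 0.35 * a * b
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
print("interaction p-value:", interaction_pvalue(a, b, y))
```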
