Samiksha Jaiswal (Editor)

CpG site


CpG sites (or CG sites) are regions of DNA where a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases along its 5' → 3' direction. CpG is shorthand for 5'-C-phosphate-G-3', that is, cytosine and guanine separated by only one phosphate; phosphate links any two nucleosides together in DNA. The CpG notation is used to distinguish this single-stranded linear sequence from the CG base-pairing of cytosine and guanine in double-stranded sequences. The notation therefore indicates that the cytosine is 5' to the guanine base. CpG should not be confused with GpC, the latter meaning that a guanine is followed by a cytosine in the 5' → 3' direction of a single-stranded sequence.
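As an illustration of the CpG versus GpC distinction, the following minimal Python sketch scans a single strand written 5' → 3' and reports where each dinucleotide occurs. The function name `find_dinucleotide` and the example sequence are hypothetical, chosen only for this illustration.

```python
def find_dinucleotide(seq: str, first: str, second: str) -> list[int]:
    """Return 0-based positions where `first` is immediately followed by
    `second`, reading the sequence in the 5' -> 3' direction."""
    s = seq.upper()
    return [i for i in range(len(s) - 1) if s[i] == first and s[i + 1] == second]


# Hypothetical example strand, written 5' -> 3'.
strand = "ACGTGCCGA"

cpg_positions = find_dinucleotide(strand, "C", "G")  # cytosine 5' to guanine
gpc_positions = find_dinucleotide(strand, "G", "C")  # guanine 5' to cytosine

print("CpG at:", cpg_positions)  # [1, 6]
print("GpC at:", gpc_positions)  # [4]
```

The same strand contains both dinucleotides at different positions, which is why the direction of reading matters.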

Cytosines in CpG dinucleotides can be methylated to form 5-methylcytosine. In mammals, methylating the cytosine within a gene can change its expression, a mechanism studied in epigenetics, the broader field concerned with gene regulation. Enzymes that add a methyl group are called DNA methyltransferases.

In mammals, 70% to 80% of CpG cytosines are methylated.

Unmethylated CpG dinucleotide sites can be detected by Toll-like receptor 9 (TLR 9) on plasmacytoid dendritic cells, monocytes, natural killer (NK) cells, and B cells in humans. This is used to detect intracellular viral infection.

CpG dinucleotides have long been observed to occur with a much lower frequency in the sequence of vertebrate genomes than would be expected due to random chance. For example, in the human genome, which has a 42% GC content (about 21% cytosine and 21% guanine), a pair of nucleotides consisting of cytosine followed by guanine would be expected to occur 0.21 × 0.21 = 4.41% of the time. The frequency of CpG dinucleotides in human genomes is 1%, less than one-quarter of the expected frequency. Scarano et al. proposed that the CpG deficiency is due to an increased vulnerability of methylcytosines to spontaneously deaminate to thymine in genomes with CpG cytosine methylation. The total number of CpG sites in humans is 28 million.
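The arithmetic in the paragraph above can be reproduced with a short Python sketch. It assumes, as the text does, that cytosine and guanine are equally frequent and that neighbouring bases are independent; the function name `expected_cpg_fraction` is a hypothetical helper.

```python
def expected_cpg_fraction(gc_content: float) -> float:
    """Expected fraction of dinucleotides that are CpG, assuming C and G each
    make up half of the GC content and adjacent bases are independent."""
    c_fraction = gc_content / 2
    g_fraction = gc_content / 2
    return c_fraction * g_fraction


expected = expected_cpg_fraction(0.42)  # 0.21 * 0.21
observed = 0.01                         # observed CpG frequency quoted in the text

print(f"expected: {expected:.4f}")
print(f"observed / expected: {observed / expected:.3f}")
```

Under these assumptions the observed-to-expected ratio comes out below 0.25, matching the "less than one-quarter" statement.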

CpG islands

CpG islands (or CG islands) are regions with a high frequency of CpG sites, though objective definitions for CpG islands are limited. The usual formal definition of a CpG island is a region of at least 200 bp with a GC percentage greater than 50% and an observed-to-expected CpG ratio greater than 60%. The "observed-to-expected CpG ratio" can be derived where the observed is calculated as:

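\[ \text{observed} = \text{number of CpGs} \]

and the expected as:

\[ \text{expected} = \frac{\text{number of C} \times \text{number of G}}{\text{length of sequence}} \]

or

\[ \text{expected} = \frac{\left( (\text{number of C} + \text{number of G}) / 2 \right)^{2}}{\text{length of sequence}} \]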
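The observed-to-expected CpG ratio defined above can be sketched in Python as follows. Both variants of the expected count are included; the function name `cpg_obs_exp` and the `method` labels "product" and "mean" are hypothetical names introduced here, not standard terminology.

```python
def cpg_obs_exp(seq: str, method: str = "product") -> float:
    """Observed-to-expected CpG ratio for one strand written 5' -> 3'.

    method="product": expected = (number of C * number of G) / length
    method="mean":    expected = ((number of C + number of G) / 2) ** 2 / length
    """
    s = seq.upper()
    length = len(s)
    if length == 0:
        return 0.0

    n_c = s.count("C")
    n_g = s.count("G")
    # "CG" cannot overlap with itself, so str.count finds every CpG.
    observed = s.count("CG")

    if method == "product":
        expected = n_c * n_g / length
    elif method == "mean":
        expected = ((n_c + n_g) / 2) ** 2 / length
    else:
        raise ValueError(f"unknown method: {method!r}")

    # A sequence with no C or G has no expected CpGs; report 0 instead of dividing by zero.
    return observed / expected if expected else 0.0


# Hypothetical toy sequence: 3 C, 1 G, 1 CpG, length 6.
print(cpg_obs_exp("ACGTCC", "product"))  # 1 / (3 * 1 / 6) = 2.0
print(cpg_obs_exp("ACGTCC", "mean"))     # 1 / (2 ** 2 / 6) = 1.5
```

The two variants agree when the counts of C and G are equal and diverge as the counts become unbalanced, as the toy sequence shows.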

Many genes in mammalian genomes have CpG islands associated with the start of the gene (promoter regions). Because of this, the presence of a CpG island is used to help in the prediction and annotation of genes.

In mammalian genomes, CpG islands are typically 300-3,000 base pairs in length, and have been found in or near approximately 40% of promoters of mammalian genes. About 70% of human promoters have a high CpG content. Given the frequency of GC two-nucleotide sequences, the number of CpG dinucleotides is much lower than would be expected.

A 2002 study revised the rules of CpG island prediction to exclude other GC-rich genomic sequences such as Alu repeats. Based on an extensive search on the complete sequences of human chromosomes 21 and 22, DNA regions greater than 500 bp were found more likely to be the "true" CpG islands associated with the 5' regions of genes if they had a GC content greater than 55%, and an observed-to-expected CpG ratio of 65%.
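As a rough sketch, the two sets of thresholds mentioned in this section (the usual formal definition and the 2002 revision) can be written as a parameterised check in Python. The names `IslandCriteria`, `CLASSIC`, `REVISED_2002`, and `meets_criteria` are hypothetical. Treating the length and ratio thresholds as inclusive lower bounds and the GC threshold as strict is an assumption of this sketch. Real island finders also scan a genome with sliding windows, which is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IslandCriteria:
    min_length: int     # minimum region length in bp
    min_gc: float       # GC fraction must exceed this value
    min_obs_exp: float  # observed-to-expected CpG ratio must reach this value


# Thresholds as quoted in the text; the inclusive/strict choices are assumptions.
CLASSIC = IslandCriteria(min_length=200, min_gc=0.50, min_obs_exp=0.60)
REVISED_2002 = IslandCriteria(min_length=500, min_gc=0.55, min_obs_exp=0.65)


def meets_criteria(seq: str, criteria: IslandCriteria) -> bool:
    """Check whether one candidate region satisfies the given thresholds."""
    s = seq.upper()
    length = len(s)
    if length == 0 or length < criteria.min_length:
        return False

    n_c = s.count("C")
    n_g = s.count("G")
    gc_fraction = (n_c + n_g) / length

    expected = n_c * n_g / length
    obs_exp = s.count("CG") / expected if expected else 0.0

    return gc_fraction > criteria.min_gc and obs_exp >= criteria.min_obs_exp


# Hypothetical 300 bp region made only of CpGs: long enough for the
# classic definition, too short for the 2002 revision.
region = "CG" * 150
print(meets_criteria(region, CLASSIC))
print(meets_criteria(region, REVISED_2002))
```

Keeping the thresholds in a small data object makes it easy to compare how many candidate regions survive under each definition.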

CpG islands are characterized by CpG dinucleotide content of at least 60% of that which would be statistically expected (~4–6%), whereas the rest of the genome has much lower CpG frequency (~1%), a phenomenon called CG suppression. Unlike CpG sites in the coding region of a gene, in most instances the CpG sites in the CpG islands of promoters are unmethylated if the genes are expressed. This observation led to the speculation that methylation of CpG sites in the promoter of a gene may inhibit gene expression. Methylation, along with histone modification, is central to imprinting. Most of the methylation differences between tissues, or between normal and cancer samples, occur a short distance from the CpG islands (at "CpG island shores") rather than in the islands themselves.

CpG islands typically occur at or near the transcription start site of genes, particularly housekeeping genes, in vertebrates. A C (cytosine) base followed immediately by a G (guanine) base (a CpG) is rare in vertebrate DNA because the cytosines in such an arrangement tend to be methylated. This methylation helps distinguish the newly synthesized DNA strand from the parent strand, which aids in the final stages of DNA proofreading after duplication. However, over time methylated cytosines tend to turn into thymines because of spontaneous deamination. There is a special enzyme in humans (thymine-DNA glycosylase, or TDG) that specifically replaces the thymines in T/G mismatches. However, due to the rarity of CpGs, it is theorised to be insufficiently effective in preventing a possibly rapid mutation of the dinucleotides. The existence of CpG islands is usually explained by the existence of selective forces for relatively high CpG content, or low levels of methylation in that genomic area, perhaps having to do with the regulation of gene expression. One study, however, found that most CpG islands are a result of non-selective forces.
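The deamination step described above can be illustrated with a toy Python sketch: a methylated cytosine in a CpG is replaced by thymine, turning the CpG into TpG and lowering the CpG count. The function name and example sequence are hypothetical, and the sketch ignores repair by TDG and anything happening on the opposite strand.

```python
def deaminate_methylated_cpgs(seq: str, methylated_positions: set[int]) -> str:
    """Replace the cytosine at each given 0-based position with thymine.

    Every position must point at the C of a CpG, since only CpG cytosines
    are treated as methylated in this toy model.
    """
    bases = list(seq.upper())
    for i in sorted(methylated_positions):
        if i < 0 or i + 1 >= len(bases) or bases[i] != "C" or bases[i + 1] != "G":
            raise ValueError(f"position {i} is not the cytosine of a CpG")
        bases[i] = "T"  # 5-methylcytosine deaminates to thymine
    return "".join(bases)


before = "ACGTACGA"                              # hypothetical strand with two CpGs
after = deaminate_methylated_cpgs(before, {1})   # first CpG is methylated and deaminates

print(before, before.count("CG"))  # ACGTACGA 2
print(after, after.count("CG"))    # ATGTACGA 1
```

Repeated over many generations without complete repair, this kind of substitution would deplete CpGs wherever they are methylated, which is consistent with the CG suppression described in this section.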

CpG islands in promoters

In humans, about 70% of promoters located near the transcription start site of a gene (proximal promoters) contain a CpG island.

Promoters located at a distance from the transcription start site of a gene also frequently contain CpG islands. An example is the promoter of the DNA repair gene ERCC1, where the CpG island-containing promoter is located about 5,400 nucleotides upstream of the coding region of the ERCC1 gene. CpG islands also occur frequently in promoters for functional noncoding RNAs such as microRNAs.

Methylation of CpG islands stably silences genes

In humans, DNA methylation occurs at the 5 position of the pyrimidine ring of the cytosine residues within CpG sites to form 5-methylcytosines. The presence of multiple methylated CpG sites in CpG islands of promoters causes stable silencing of genes. Silencing of a gene may be initiated by other mechanisms, but this is often followed by methylation of CpG sites in the promoter CpG island to cause the stable silencing of the gene.

Promoter CpG hyper/hypo-methylation in cancer

In cancers, loss of expression of genes occurs about 10 times more frequently by hypermethylation of promoter CpG islands than by mutations. As Vogelstein et al. point out, in a colorectal cancer there are usually about 3 to 6 driver mutations and 33 to 66 hitchhiker or passenger mutations. In contrast, in one study of colon tumors compared to adjacent normal-appearing colonic mucosa, 1,734 CpG islands were heavily methylated in tumors whereas these CpG islands were not methylated in the adjacent mucosa. Half of the CpG islands were in promoters of annotated protein coding genes, suggesting that about 867 genes in a colon tumor have lost expression due to CpG island methylation. A separate study found an average of 1,549 differentially methylated regions (hypermethylated or hypomethylated) in the genomes of six colon cancers (compared to adjacent mucosa), of which 629 were in known promoter regions of genes. A third study found more than 2,000 genes differentially methylated between colon cancers and adjacent mucosa. Using gene set enrichment analysis, 569 out of 938 gene sets were hypermethylated and 369 were hypomethylated in cancers. Hypomethylation of CpG islands in promoters results in overexpression of the genes or gene sets affected.

One 2012 study listed 147 specific genes with colon cancer-associated hypermethylated promoters, along with the frequency with which these hypermethylations were found in colon cancers. At least 10 of those genes had hypermethylated promoters in nearly 100% of colon cancers. The study also identified 11 microRNAs whose promoters were hypermethylated in colon cancers at frequencies between 50% and 100% of cancers. MicroRNAs (miRNAs) are small endogenous RNAs that pair with sequences in messenger RNAs to direct post-transcriptional repression. On average, each microRNA represses several hundred target genes. Thus microRNAs with hypermethylated promoters may be allowing over-expression of hundreds to thousands of genes in a cancer.

The information above shows that, in cancers, promoter CpG hyper/hypo-methylation of genes and of microRNAs causes loss of expression (or sometimes increased expression) of far more genes than does mutation.

DNA repair genes with hyper/hypo-methylated promoters in cancers

DNA repair genes are frequently repressed in cancers due to hypermethylation of CpG islands within their promoters. In head and neck squamous cell carcinomas at least 15 DNA repair genes have frequently hypermethylated promoters; these genes are XRCC1, MLH3, PMS1, RAD51B, XRCC3, RAD54B, BRCA1, SHFM1, GEN1, FANCE, FAAP20, SPRTN, SETMAR, HUS1, and PER1. About seventeen types of cancer are frequently deficient in one or more DNA repair genes due to hypermethylation of their promoters. As an example, promoter hypermethylation of the DNA repair gene MGMT occurs in 93% of bladder cancers, 88% of stomach cancers, 74% of thyroid cancers, 40%-90% of colorectal cancers and 50% of brain cancers. Promoter hypermethylation of LIG4 occurs in 82% of colorectal cancers. Promoter hypermethylation of NEIL1 occurs in 62% of head and neck cancers and in 42% of non-small-cell lung cancers. Promoter hypermethylation of ATM occurs in 47% of non-small-cell lung cancers. Promoter hypermethylation of MLH1 occurs in 48% of non-small-cell lung cancer squamous cell carcinomas. Promoter hypermethylation of FANCB occurs in 46% of head and neck cancers.

On the other hand, the promoters of two genes, PARP1 and FEN1, were hypomethylated and these genes were over-expressed in numerous cancers. PARP1 and FEN1 are essential genes in the error-prone and mutagenic DNA repair pathway microhomology-mediated end joining. If this pathway is over-expressed, the excess mutations it causes can lead to cancer. PARP1 is over-expressed in tyrosine kinase-activated leukemias, in neuroblastoma, in testicular and other germ cell tumors, and in Ewing's sarcoma. FEN1 is over-expressed in the majority of breast, prostate, stomach, pancreatic, and lung cancers and in neuroblastomas.

DNA damage appears to be the primary underlying cause of cancer. If accurate DNA repair is deficient, DNA damage tends to accumulate. Such excess DNA damage can increase mutational errors during DNA replication due to error-prone translesion synthesis. Excess DNA damage can also increase epigenetic alterations due to errors during DNA repair. Such mutations and epigenetic alterations can give rise to cancer (see malignant neoplasms). Thus, CpG island hyper/hypo-methylation in the promoters of DNA repair genes is likely central to progression to cancer.

Since age has a strong effect on DNA methylation levels on tens of thousands of CpG sites, one can define a highly accurate biological clock (referred to as epigenetic clock or DNA methylation age) in humans and chimpanzees.

References

CpG site Wikipedia