Rahul Sharma (Editor)

Pseudo amino acid composition


Pseudo amino acid composition, or PseAA composition, was introduced by Kuo-Chen Chou in 2001 as a way to represent protein samples, with the aim of improving the prediction of protein subcellular localization and membrane protein type.

Background

To predict the subcellular localization of proteins and other attributes based on their sequence, two kinds of models are generally used to represent protein samples: (1) the sequential model, and (2) the non-sequential model or discrete model.

The most typical sequential representation of a protein sample is its entire amino acid (AA) sequence, which retains all of the information in the sample. This is the obvious advantage of the sequential model. Prediction is then usually carried out with tools based on sequence-similarity search. However, this approach fails when a query protein has no significant homology to any known protein. For this reason, various discrete models that do not rely on sequence order were proposed.

The simplest discrete model uses the amino acid composition (AAC) to represent protein samples. It is formulated as follows. Consider a protein sequence P with L amino acid residues, i.e.,

$$P = [R_1 R_2 R_3 R_4 R_5 R_6 R_7 \cdots R_L] \tag{1}$$

where $R_1$ represents the 1st residue of the protein P, $R_2$ the 2nd residue, and so forth. According to the amino acid composition (AAC) model, the protein P of Eq. 1 can be expressed by

$$P = [f_1 \; f_2 \; \cdots \; f_{20}]^{\mathrm{T}} \tag{2}$$

where $f_u$ $(u = 1, 2, \ldots, 20)$ are the normalized occurrence frequencies of the 20 native amino acids in P, and T is the transposing operator. Accordingly, the amino acid composition of a protein can be easily derived once the protein sequencing information is known.

Owing to its simplicity, the amino acid composition (AAC) model was widely used in many earlier statistical methods for predicting protein attributes. Its main shortcoming is that all sequence-order information is lost.
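As a minimal sketch of Eq. 2, the following Python computes the AAC vector and checks that two different sequences with the same residue counts map to the same vector. The function name `amino_acid_composition`, the example sequences, and the alphabetical ordering of the one-letter codes are assumptions made for illustration; the text does not prescribe an ordering.

```python
from collections import Counter

# The 20 native amino acids as one-letter codes.
# Alphabetical order is an assumption; the text does not prescribe one.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return [f_1, ..., f_20], the normalized occurrence frequencies (Eq. 2)."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acid")
    # Non-standard symbols (e.g. X) are ignored in this sketch.
    return [counts[aa] / total for aa in AMINO_ACIDS]


aac = amino_acid_composition("MKVLAAGK")
reversed_aac = amino_acid_composition("KGAALVKM")  # same residues, reversed order
print(aac == reversed_aac)  # prints True: the order of residues is not captured
```

The two inputs are different sequences, yet their AAC vectors are identical, which is the loss of sequence-order information described above.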

Concept

To avoid completely losing the sequence-order information, the concept of PseAA (pseudo amino acid) composition was proposed. The conventional amino acid composition contains 20 components, each reflecting the occurrence frequency of one of the 20 native amino acids in a protein. The PseAA composition instead contains more than 20 discrete factors: the first 20 represent the components of the conventional AA composition, while the additional factors incorporate some sequence-order information via various modes.

The additional factors are a series of rank-different correlation factors along a protein chain, but they can also be any combination of other factors, so long as they reflect some sequence-order effect. The essence of PseAA composition is therefore that it covers the AA composition while also carrying information beyond it, and hence can better reflect the features of a protein sequence through a discrete model.

Various modes of formulating the PseAA composition have also been developed, as summarized in a review.

Algorithm

According to the PseAA composition model, the protein P of Eq. 1 can be formulated as

$$P = [p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda}]^{\mathrm{T}}, \quad (\lambda < L) \tag{3}$$

where the $(20 + \lambda)$ components are given by

$$p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & (1 \le u \le 20) \\[2ex] \dfrac{w \, \tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & (20 + 1 \le u \le 20 + \lambda) \end{cases} \tag{4}$$

where $w$ is the weight factor, and $\tau_k$ is the $k$-th tier correlation factor that reflects the sequence-order correlation between all the $k$-th most contiguous residues, as formulated by

$$\tau_k = \frac{1}{L - k} \sum_{i=1}^{L-k} J_{i,i+k}, \quad (k < L) \tag{5}$$

with

$$J_{i,i+k} = \frac{1}{\Gamma} \sum_{q=1}^{\Gamma} \left[ \Phi_q(R_{i+k}) - \Phi_q(R_i) \right]^2 \tag{6}$$

where $\Phi_q(R_i)$ is the $q$-th function of the amino acid $R_i$, and $\Gamma$ is the total number of functions considered. For example, in the original paper by Chou, $\Phi_1(R_i)$, $\Phi_2(R_i)$ and $\Phi_3(R_i)$ are respectively the hydrophobicity value, hydrophilicity value, and side chain mass of amino acid $R_i$, while $\Phi_1(R_{i+1})$, $\Phi_2(R_{i+1})$ and $\Phi_3(R_{i+1})$ are the corresponding values for the amino acid $R_{i+1}$. The total number of functions considered there is therefore $\Gamma = 3$.

It can be seen from Eq. 3 that the first 20 components, i.e. $p_1, p_2, \ldots, p_{20}$, are associated with the conventional AA composition of the protein, while the remaining components $p_{20+1}, \ldots, p_{20+\lambda}$ are the correlation factors that reflect the 1st tier, 2nd tier, …, and the $\lambda$-th tier sequence-order correlation patterns (Figure 1). It is through these additional $\lambda$ factors that some important sequence-order effects are incorporated.
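Eqs. 3 to 6 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the function names, the alphabetical residue ordering, and the `toy_phi` table are hypothetical, and `toy_phi` is not a real physicochemical scale. Any rescaling of real property values is left to the caller, and the sequence is assumed to contain only the 20 standard one-letter codes.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering is an assumption


def correlation_factor(sequence, k, properties):
    """k-th tier correlation factor tau_k (Eq. 5).

    `properties` is a non-empty list of dicts, one per function Phi_q,
    each mapping a one-letter residue code to a numeric value.
    """
    length = len(sequence)
    if not properties:
        raise ValueError("at least one property function is required")
    if not 0 < k < length:
        raise ValueError("k must satisfy 0 < k < L")
    total = 0.0
    for i in range(length - k):
        a, b = sequence[i], sequence[i + k]
        # J_{i,i+k} (Eq. 6): mean squared difference over the Gamma functions.
        total += sum((phi[b] - phi[a]) ** 2 for phi in properties) / len(properties)
    return total / (length - k)


def pseaa_composition(sequence, lam, weight, properties):
    """Return the (20 + lam)-component vector of Eq. 3, built with Eq. 4."""
    length = len(sequence)
    if not 0 <= lam < length:
        raise ValueError("lambda must satisfy 0 <= lambda < L")
    freqs = [sequence.count(aa) / length for aa in AMINO_ACIDS]
    taus = [correlation_factor(sequence, k, properties) for k in range(1, lam + 1)]
    # Shared denominator of both branches of Eq. 4.
    denominator = sum(freqs) + weight * sum(taus)
    return [f / denominator for f in freqs] + [weight * t / denominator for t in taus]


# Hypothetical toy property, NOT a real scale: 1 for alanine, 0 otherwise.
toy_phi = {aa: (1.0 if aa == "A" else 0.0) for aa in AMINO_ACIDS}
vector = pseaa_composition("ACAC", lam=2, weight=0.5, properties=[toy_phi])
print(len(vector))  # 22 components: 20 composition terms plus lambda = 2
print(vector[20:])  # the two sequence-order components
```

In the toy sequence ACAC every adjacent pair differs, so the 1st tier factor is 1, while residues two positions apart are identical, so the 2nd tier factor is 0. Because both branches of Eq. 4 share one denominator, the components of the resulting vector sum to 1.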

The parameter λ in Eq. 3 is an integer; choosing a different value of λ leads to a PseAA composition of a different dimension, 20 + λ.
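The constraint λ < L in Eq. 3 has a practical consequence when a whole dataset is encoded with one shared λ so that every vector has the same length: λ can be no larger than the shortest sequence length minus one. The helper names and the example sequences below are hypothetical, and using a single λ per dataset is an assumption of this sketch.

```python
def pseaa_dimension(lam):
    """Number of components in Eq. 3 for a given lambda."""
    return 20 + lam


def max_lambda(sequences):
    """Largest lambda that satisfies lambda < L for every sequence."""
    return min(len(seq) for seq in sequences) - 1


dataset = ["MKVLAAGK", "ACDEFGHIKL", "MKT"]  # made-up example sequences
lam = max_lambda(dataset)
print(lam)                   # 2, because the shortest sequence has 3 residues
print(pseaa_dimension(lam))  # 22
```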

Using Eq. 6 is just one of the modes for deriving the correlation factors or PseAA components. Others, such as the physicochemical distance mode and the amphiphilic pattern mode, can also be used to derive different types of PseAA composition, as summarized in a review paper.

Applications

Since PseAA composition was introduced, it has been widely used to predict various attributes of proteins, such as structural classes of proteins, enzyme family classes and subfamily classes, GABA(A) receptor proteins, protein folding rates, cyclin proteins, supersecondary structure, subcellular location of proteins, subnuclear location of proteins, apoptosis protein subcellular localization, submitochondria localization, protein quaternary structure, bacterial secreted proteins, conotoxin superfamily and family classification, protease types, GPCR types, human papillomaviruses, outer membrane proteins, membrane protein types, protein secondary structural contents, metalloproteinase family, subcellular localization of mycobacterial proteins, antibacterial peptides, lipase types, allergenic proteins, DNA-binding proteins, essential proteins, cell wall lytic enzymes, and cofactors of oxidoreductases, among many other protein attributes and protein-related features (see, e.g., the review paper by Gonzalez-Diaz et al. as well as the relevant references cited therein).

PseAA composition has also been used to incorporate protein domain or FunD (functional domain) information and GO (gene ontology) information, in order to improve the prediction quality for the subcellular localization of proteins as well as their other attributes.

Meanwhile, the concept of PseAA composition has also stimulated the development of pseudo-folding topological indices and pseudo-folding lattice networks.

Recently, two openly accessible tools were established to generate various modes of Chou’s pseudo amino acid composition.

References

Pseudo amino acid composition, Wikipedia