RANGANATHAN LAB

## Note 109: A summary of SCA calculations
(Dated: January 21, 2012)
This document provides a summary of the current implementation of the statistical coupling analysis (SCA) method. The conceptual goal of SCA is to quantitatively parameterize the statistical patterns encoded in ensembles of proteins that share a common evolutionary origin. The idea is that this statistical analysis will provide a necessary foundation for understanding the physical mechanism and evolutionary origins of natural proteins.

## I. Preliminaries - Multiple Sequence Alignment and Frequencies

A multiple sequence alignment of M sequences of length L is represented by a binary array $x_{si}^a$, with $x_{si}^a = 1$ if sequence s carries amino acid a at position i and $x_{si}^a = 0$ otherwise (a = 1, ..., 20 runs over the amino acids). The frequency of amino acid a at position i, and the joint frequency of amino acid a at position i and amino acid b at position j, are then

$$f_i^a = \frac{1}{M} \sum_s x_{si}^a, \qquad f_{ij}^{ab} = \frac{1}{M} \sum_s x_{si}^a x_{sj}^b.$$
As described below in section II, the conservation of each position in the multiple sequence alignment is measured by the divergence of the observed frequency $f_i^a$ from a background probability $q^a$, the expected frequency of amino acid a in the absence of any position-specific selective constraint (in practice, the mean frequency of amino acid a in a large non-redundant database of protein sequences). Some calculations also require introducing a background probability for gaps. If γ represents the fraction of gaps in the alignment, a background probability distribution over the 21 states (gap plus 20 amino acids) can be taken as

$$\tilde q^{\,0} = \gamma, \qquad \tilde q^{\,a} = (1 - \gamma)\, q^a \quad (a = 1, \dots, 20).$$
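To make these preliminaries concrete, the following sketch builds the binary array, the observed frequencies, and a gap-adjusted background for a toy alignment. It is illustrative only: a reduced 3-state alphabet and a made-up background distribution are assumed here, whereas the SCA Toolbox itself is MATLAB code operating on the 20 amino acids plus gap.

```python
import numpy as np

# Toy alignment: M = 4 sequences, L = 3 positions, states 0..2
# (0 = gap; 1 and 2 stand in for two amino acids; real SCA uses 21 states).
aln = np.array([[1, 2, 1],
                [1, 2, 0],
                [1, 1, 1],
                [2, 2, 1]])
M, L = aln.shape
n_states = 3

# Binary array x[s, i, a] = 1 if sequence s has state a at position i.
x = (aln[:, :, None] == np.arange(n_states)).astype(float)

# Observed frequencies f[i, a] = (1/M) * sum_s x[s, i, a].
f = x.mean(axis=0)

# Gap-adjusted background: the gap state gets the gap fraction gamma,
# and each amino acid background is scaled by (1 - gamma).
gamma = f[:, 0].mean()            # overall fraction of gaps
q_aa = np.array([0.6, 0.4])       # hypothetical background for the 2 "amino acids"
q = np.concatenate(([gamma], (1 - gamma) * q_aa))

print(f)        # per-position frequencies; each row sums to 1
print(q.sum())  # adjusted background still sums to 1
```

The one-hot encoding makes the frequency calculation a simple mean over sequences, which is also the form used in the correlation definitions below.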
## II. Position-specific conservation - first order statistics

The conservation of amino acid a at position i, considered independently of other positions, is measured by the statistical quantity $D_i^a$, derived from the probability $P(f_i^a \,|\, q^a, M)$ of observing the frequency $f_i^a$ in an alignment of M sequences if amino acid a appeared independently at each position with background probability $q^a$:

$$P(f_i^a \,|\, q^a, M) = \binom{M}{M f_i^a} (q^a)^{M f_i^a} (1 - q^a)^{M (1 - f_i^a)}.$$

When M is large (the relevant limit for SCA), the Stirling formula leads to the approximation

$$D_i^a \equiv -\frac{1}{M} \ln P(f_i^a \,|\, q^a, M) \simeq f_i^a \ln \frac{f_i^a}{q^a} + (1 - f_i^a) \ln \frac{1 - f_i^a}{1 - q^a}. \tag{4}$$

The value of $D_i^a$ indicates how unlikely the observed frequency of amino acid a at position i would be if a occurred at its background frequency: the larger $D_i^a$, the more conserved the amino acid. What is an appropriate number of sequences to carry out SCA? A more precise relation between the probability $P(f_i^a \,|\, q^a, M)$ and $D_i^a$, retaining the sub-leading terms of the Stirling expansion, can be used to assess when the large-M approximation of Eq. (4) is accurate.

The values of $D_i^a$ in Equation 4 give the conservation of each amino acid a at each position i. An overall positional conservation $D_i$ can be defined as the Kullback-Leibler relative entropy of the full distribution of observed frequencies at position i with respect to the background distribution,

$$D_i = \sum_{a=0}^{20} f_i^a \ln \frac{f_i^a}{\tilde q^a},$$

where $D_i$ aggregates the contributions of the 20 amino acids and the gap state (a = 0).
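These conservation measures can be sketched numerically as follows. This is a minimal illustration in Python rather than the MATLAB Toolbox; the handling of zero frequencies (terms with f = 0 contribute zero) is a convention of this sketch.

```python
import numpy as np

def D_binary(f, q):
    """Per-amino-acid conservation (large-M limit): relative entropy
    between the observed frequency f and the background q (0 < f < 1)."""
    return f * np.log(f / q) + (1 - f) * np.log((1 - f) / (1 - q))

def D_position(f_i, q_tilde):
    """Overall positional conservation: Kullback-Leibler divergence of
    the full per-position distribution f_i from the background q_tilde.
    Terms with f = 0 contribute 0 (lim f->0 of f ln f = 0)."""
    f_i = np.asarray(f_i, dtype=float)
    mask = f_i > 0
    return np.sum(f_i[mask] * np.log(f_i[mask] / q_tilde[mask]))

# A frequency matching the background carries no conservation signal,
# while strong deviation from background gives a large positive value.
print(D_binary(0.05, 0.05))   # 0.0
print(D_binary(0.9, 0.05))    # strongly conserved: large positive value
```

Because the measure is a relative entropy, it is zero exactly when the observed distribution equals the background, and positive otherwise.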
## III. Correlated conservation - second order statistics
## A. General Principles

Given an alignment $x_{si}^a$, the raw correlation of amino acid a at position i with amino acid b at position j is measured by

$$C_{ij}^{ab} = f_{ij}^{ab} - f_i^a f_j^b,$$

where $f_{ij}^{ab}$ is the joint frequency defined in section I. The central quantity of SCA is a conservation-weighted correlation tensor

$$\tilde C_{ij}^{ab} = \phi_i^a \, \phi_j^b \, C_{ij}^{ab}, \tag{9}$$

where ϕ is a weighting function that scales the raw correlations according to the conservation of the amino acids involved, implementing the principle that the functional relevance of a correlation should increase with the conservation of the correlated amino acids.
## B. Choice of weights

Equation (9) gives the general definition of the SCA correlation tensor, but what specific form should the weighting function ϕ take? The approach taken in SCA (versions 3.0 and greater) is to consider the effect on the conservation of each position i upon removing each sequence s. The idea is that this "perturbation" will provide an estimate of the significance of each amino acid at each position in the alignment by its impact on the measure of conservation used (here, the relative entropy D). To develop this formally, let $M - 1$ be the number of sequences upon removing sequence s, so that the perturbed frequency of amino acid a at position i is

$$f_i^{a (\setminus s)} = \frac{M f_i^a - x_{si}^a}{M - 1},$$

where we remind that $x_{si}^a = 1$ if sequence s has amino acid a at position i and 0 otherwise. To first order in 1/M, the resulting change in conservation is proportional to the gradient of $D_i^a$ with respect to the frequency, which motivates the weighting function

$$\phi_i^a = \left| \frac{\partial D_i^a}{\partial f_i^a} \right| = \left| \ln \left[ \frac{f_i^a (1 - q^a)}{(1 - f_i^a) \, q^a} \right] \right|.$$

The absolute value of the gradient is taken to ensure positive weights.
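The gradient-of-entropy weight can be sketched directly; the clipping of frequencies away from 0 and 1 to keep the logarithm finite is a regularization choice of this sketch, not a statement about the Toolbox implementation.

```python
import numpy as np

def phi(f, q, eps=1e-9):
    """SCA weight: absolute gradient of the relative entropy D,
    |dD/df| = |ln[ f (1 - q) / ((1 - f) q) ]|.
    Frequencies are clipped away from 0 and 1 to keep the log finite
    (an assumption of this sketch)."""
    f = np.clip(f, eps, 1 - eps)
    return np.abs(np.log(f * (1 - q) / ((1 - f) * q)))

# The weight vanishes where the observed frequency matches the background
# (no conservation signal) and grows as f departs from q.
print(phi(0.05, 0.05))  # ~0
print(phi(0.9, 0.05))   # large weight for a strongly conserved amino acid
```

Note that the weight is zero exactly at f = q, so correlations carried by amino acids at background frequency are suppressed in the tensor of Eq. (9).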
It is important to understand the role of the function ϕ in controlling the patterns of correlations emerging from the weighted alignment. In general, the problem of applying more complex conservation functions and associated weights is deeply connected with fundamentally understanding the nature of the evolutionary process that generates the observed amino acid distributions, and the nature of our sampling of these distributions in the databases of available sequences. These are important future research goals; in the absence of such deeper understanding, the current approach in SCA of using the Kullback-Leibler relative entropy as a measure of conservation, and gradients of this entropy function as weights, represents a simple and analytically well-defined implementation of the general principles of this method.

Distinct from the principle that the relevance of correlations should scale with conservation, the weighting function also contributes to separating signal (the true evolutionary constraints) from correlation noise due to finite and biased sampling in practical sequence alignments. For example, one way that weakly conserved amino acids are expected to show strong correlations in $\tilde C_{ij}^{ab}$ is through statistical noise arising from limited and phylogenetically biased sampling of sequences rather than through functional constraint; scaling correlations by conservation suppresses this contribution.

## C. Reduction to positional correlations

The first step in analysis of the SCA correlation tensor is to reduce the four-dimensional array of L positions × L positions × 20 amino acids × 20 amino acids to an L × L matrix of positional correlations. Indeed, the goal in SCA is to identify collectively evolving groups of positions ("protein sectors") whose correlations are properties of the whole family (or of functional subfamilies, see below), regardless of the amino acids by which their correlation is identified.

How should we carry out the dimension reduction of the SCA correlation tensor to a matrix of positional correlations? One empirical property of the tensor provides a straightforward approach: analysis of the 20 × 20 amino acid correlation matrices for fixed pairs of positions (i, j) shows that these matrices have approximately rank one (O. Rivoire and R. Ranganathan, in preparation). That is, the information content in these matrices can be captured in a single scalar value. To explain this, we carry out the so-called singular value decomposition of the 20 × 20 matrix $\tilde C_{ij}^{ab}$ for each fixed pair (i, j):

$$\tilde C_{ij} = U_{ij} \, \Lambda_{ij} \, V_{ij}^{\mathsf T}.$$

In this decomposition, each 20 × 20 amino acid correlation matrix for each (i, j) is written as a product of three 20 × 20 matrices (equivalently, a sum of rank-one contributions): $\Lambda_{ij}$, a diagonal matrix of singular values that indicate the quantity of variance in $\tilde C_{ij}$ captured by each mode, and $U_{ij}$ and $V_{ij}$, orthogonal matrices whose columns give the associated combinations of amino acids at positions i and j. Approximate rank one means that the first singular value $\lambda_{ij}^{(1)}$ dominates all the others. Thus, a matrix of positional correlations can be defined by retaining only this top singular value for each pair of positions:

$$\tilde C_{ij}^{\,pos} = \lambda_{ij}^{(1)}.$$
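The rank-one reduction can be illustrated on a synthetic stand-in for one amino acid correlation matrix; the near-rank-one structure is planted here by construction, mimicking the empirical property described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the weighted amino acid correlation matrix at one fixed
# pair of positions (i, j): a nearly rank-one 20 x 20 matrix.
u = rng.normal(size=20)
v = rng.normal(size=20)
C_ij = np.outer(u, v) + 0.01 * rng.normal(size=(20, 20))  # rank one + noise

# Singular value decomposition: C_ij = U diag(s) V^T, with s returned
# in decreasing order by numpy's convention.
U, s, Vt = np.linalg.svd(C_ij)

# The positional correlation for this pair is the top singular value.
C_pos = s[0]

# Rank-one dominance: the top singular value carries nearly all of the
# squared content of the matrix.
frac = s[0]**2 / np.sum(s**2)
print(C_pos, frac)   # frac close to 1
```

When `frac` is close to 1, discarding all but the top singular value loses almost no information, which is what justifies the scalar reduction.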
This is one definition of the SCA positional correlation matrix, and is essentially identical to that computed in versions of the SCA Toolbox from 2.0 to 4.5 (for slight technical differences from these earlier versions, see Appendix C). Below, we will show another empirical property of the SCA correlation tensor that will permit definition of a projected alignment from which positional correlations can be computed directly (section D). Further technical notes on this dimension reduction step are given in the Appendix.
## D. The alignment projection approach to SCA

In section C above, we indicated how, for each pair of positions (i, j), we can reduce the 20 × 20 matrix of amino acid correlations by singular value decomposition to just the top singular value. The singular vectors corresponding to the top singular value ($P_{ij}$ for position i paired with position j) show a second empirical property: for a given position i, the top singular vector is essentially independent of the partner position j (O. Rivoire and R. Ranganathan, in preparation).

In practice, the projection matrix can be obtained directly from the weighted frequencies of amino acids at positions in the alignment. This makes sense; the finding that the top singular vectors of amino acid correlations for position i are independent of j implies that the average singular vector should be just a property of the amino acid distribution at site i. The projection matrix can be written as:

$$P_i^a = \frac{\phi_i^a f_i^a}{\sqrt{\sum_b \left( \phi_i^b f_i^b \right)^2}}, \tag{18}$$

and applying it to the binary alignment yields the projected (weighted) alignment

$$\tilde x_{si} = \sum_a P_i^a \, x_{si}^a.$$
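A sketch of the projection step, assuming the normalized weighted-frequency form of the projection matrix described above; the toy 3-state alphabet and background are hypothetical.

```python
import numpy as np

def projection_and_projected_aln(x, q):
    """Sketch of the alignment projection step: the projection vector at
    each position is the phi-weighted frequency, normalized to unit
    length, and the projected alignment is its dot product with the
    binary alignment. x has shape (M, L, A); q has shape (A,)."""
    f = x.mean(axis=0)                                   # (L, A) frequencies
    eps = 1e-9
    fc = np.clip(f, eps, 1 - eps)                        # keep logs finite
    phi = np.abs(np.log(fc * (1 - q) / ((1 - fc) * q)))  # weights (L, A)
    w = phi * f                                          # weighted frequencies
    P = w / np.linalg.norm(w, axis=1, keepdims=True)     # (L, A), unit rows
    x_proj = np.einsum('sia,ia->si', x, P)               # (M, L) projection
    return P, x_proj

# Toy data: M = 5 sequences, L = 4 positions, A = 3 states.
rng = np.random.default_rng(1)
aln = rng.integers(0, 3, size=(5, 4))
x = (aln[:, :, None] == np.arange(3)).astype(float)
q = np.array([0.5, 0.3, 0.2])

P, x_proj = projection_and_projected_aln(x, q)
print(np.linalg.norm(P, axis=1))  # each row of P has unit norm
print(x_proj.shape)               # M x L projected alignment
```

The output is a single M × L real-valued matrix in place of the M × L × 20 binary array, which is what makes the correlation computations of the next step direct matrix products.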
Three quantities can be computed directly from the projected alignment matrix:

(1) an SCA positional correlation matrix (written as $\tilde C_{ij}$), the magnitude of the covariance of the projected alignment columns across sequences:

$$\tilde C_{ij} = \left| \frac{1}{M} \sum_s \tilde x_{si} \tilde x_{sj} - \left( \frac{1}{M} \sum_s \tilde x_{si} \right) \left( \frac{1}{M} \sum_s \tilde x_{sj} \right) \right|; \tag{19}$$

(2) an SCA sequence correlation matrix ($\tilde S_{st}$), computed analogously as inner products of the projected alignment rows across positions:

$$\tilde S_{st} = \frac{1}{L} \sum_i \tilde x_{si} \tilde x_{ti}. \tag{20}$$

It is important to note that this sequence correlation matrix represents relationships between sequences in which positions are weighted by the conservation-based weighting function ϕ; this mapping of sequence space is therefore more likely to reveal functional distinctions between sequence subfamilies than purely historical relationships.

(3) a matrix (Π) that provides a mapping between these two spaces. To explain the Π matrix, we note that the projected alignment, viewed as an M × L matrix, admits a singular value decomposition $\tilde x = U \Sigma V^{\mathsf T}$, whose left singular vectors (columns of U) are eigenvectors of the sequence correlation matrix and whose right singular vectors (columns of V) are eigenvectors of the positional correlation matrix; Π is built from these paired singular vectors (Eq. 21). That is, if we detect collectively evolving groups of amino acid positions (sectors) by analysis of the positional correlation matrix, the Π matrix allows us to identify the sequences that contribute most to their correlation (and, conversely, to map groups of sequences to the positions that statistically distinguish them).
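The two correlation matrices and the SVD linking the two spaces can be sketched as follows; the random matrix standing in for a projected alignment, and the reading of the Π mapping as the paired singular vectors, are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
M, L = 6, 4
x_proj = rng.normal(size=(M, L))   # stand-in for a projected alignment

# (1) Positional correlation matrix (L x L): absolute covariance of the
# projected alignment across sequences.
mean = x_proj.mean(axis=0)
C_pos = np.abs(x_proj.T @ x_proj / M - np.outer(mean, mean))

# (2) Sequence correlation matrix (M x M): inner products of projected
# sequences across positions.
S_seq = x_proj @ x_proj.T / L

# (3) The SVD of the projected alignment links the two spaces: left
# singular vectors live in sequence space, right singular vectors in
# position space (the basis of the Pi mapping described in the text).
U, s, Vt = np.linalg.svd(x_proj, full_matrices=False)

print(C_pos.shape, S_seq.shape)   # (L, L) and (M, M)
```

Both matrices derive from the same M × L array, which is why a sector found in position space can be traced back to the sequences that generate it.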
## E. Spectral decomposition

The process of identifying sectors from the $\tilde C_{ij}$ matrix begins with its spectral (eigenvalue) decomposition,

$$\tilde C = \tilde V \tilde \Lambda \tilde V^{\mathsf T},$$

where $\tilde \Lambda$ is a diagonal matrix of so-called eigenvalues and Ṽ is a matrix whose columns contain the associated eigenvectors. The eigenvectors contain the weights for linearly combining amino acid positions into new variables ("eigenmodes") that are now de-correlated, and the associated eigenvalues indicate the magnitude of the information in each eigenmode. The essence of sector identification is to study the pattern of positional contributions to the statistically significant top eigenmodes of the $\tilde C$ matrix.

However, different eigenvectors are not expected to directly represent statistically independent sectors. Instead, if independent sectors exist for a particular protein family, they will generally correspond to groups of positions emerging along combinations of eigenvectors. The reason is that decorrelation of positions (by diagonalizing the SCA correlation matrix, the essence of eigenvalue decomposition) is a weaker criterion than statistical independence, which requires the absence not only of pairwise correlations but of all higher-order statistical couplings. In other words, if the non-independence of a set of variables is not completely captured by their pairwise correlations, then the linear combinations of these variables indicated by eigenvectors of the correlation matrix cannot be assumed to produce statistically independent transformed variables.
## F. Independent component analysis - ICA

Independent component analysis (ICA), an extension of spectral decomposition, is a heuristic method designed to transform the k statistically significant top eigenmodes of a correlation matrix into k maximally independent components through an iterative optimization process. In this process, the k top eigenvectors of $\tilde C$ are linearly recombined by a k × k "unmixing" matrix W, learned iteratively, such that the resulting components are as statistically independent as possible.

We call this process "posICA" to indicate that ICA is carried out on the eigenvectors of the SCA positional correlation matrix. In principle, this linear transformation of eigenvectors should help to better define independent sectors (if such exist in the protein family under study) as groups of positions now projecting along the transformed axes - the independent components (ICs) of position space ($\tilde V^{ica} = W \tilde V$).

It is also possible to apply ICA to the top eigenvectors of the SCA sequence correlation matrix ($\tilde S$). We call this process "seqICA" to indicate that ICA is carried out on the eigenvectors of the SCA sequence correlation matrix. This transformation should provide a better description of sequence subfamilies as groups of sequences emerging largely orthogonally along the independent components of sequence space ($\tilde U^{ica}$).

Upon ICA rotation, sector positions should correspond to positions i contributing significantly to one of the independent components. For example, the set of positions i for which the weight $\langle i | \tilde V^{ica}_k \rangle$ exceeds an empirically chosen significance threshold defines the positions comprising sector k.
## G. Mapping between sequence and positional correlations

As described in Eq. (21), the Π matrix provides a mapping between the space of positional correlations in $\tilde C$ and the space of sequence correlations in $\tilde S$. If independent sectors are identified in $\tilde V^{ica}$, the Π matrix maps each sector to the group of sequences that contributes most to its correlations. We can also make the inverse mapping, in which we use the Π matrix following seqICA to make a mapping from the sequence independent components $\tilde U^{ica}$ to the corresponding groups of positions. If analysis of $\tilde U^{ica}$ reveals distinct sequence subfamilies, this inverse mapping identifies the positions whose correlations statistically define those subfamilies. Note that we describe the usage of the Π matrix to map between the independent components of the positional ($\tilde V^{ica}$) and sequence ($\tilde U^{ica}$) correlation matrices; the same construction applies to the untransformed eigenvectors.
## IV. Appendix

This section mostly provides information about relationships with previous descriptions of the SCA approach. The original implementation of SCA defined conserved correlations through a specific type of perturbation analysis on the sequence alignment (MATLAB SCA Toolbox 1.5, Sec. B), and this and more recent implementations described dimension reduction of the SCA correlation tensor through a formally different (though practically near-identical) type of matrix norm. Here, we explain these technical differences, and provide a more detailed explanation of the calculation of the projection matrix (equation 18).
## A. Equivalence with previous definitions of conservation

$D_i^a$ as defined in Eq. (4) corresponds, up to the pre-factor $-1/M$, to the logarithm of the binomial probability of the observed frequency used in earlier descriptions of SCA:

$$D_i^a \simeq -\frac{1}{M} \ln P(f_i^a \,|\, q^a, M).$$

The pre-factor $-1/M$ scales the positional conservation parameter for alignments of different size, and represents the statistical unit of conservation symbolically indicated by $kT^\ast$ in earlier work (4).
## B. The original SCA method

The implementation of the SCA method introduced originally in Ref. (4) was based on a perturbation to the amino acid distribution at one test site i to measure the difference in position-specific conservation of each amino acid at a second site j. In general, the perturbation consisted of restricting the test site to its most prevalent amino acid $a^\ast$ and recomputing the frequencies at site j in the restricted subalignment:

$$f_j^{b \,|\, i} = \frac{f_{ij}^{a^\ast b}}{f_i^{a^\ast}},$$

where $f_{ij}^{a^\ast b}$ is the joint frequency of $a^\ast$ at position i and b at position j. The difference from the unrestricted frequency can be written as

$$f_j^{b \,|\, i} - f_j^b = \frac{C_{ij}^{a^\ast b}}{f_i^{a^\ast}},$$

with $C_{ij}^{a^\ast b} = f_{ij}^{a^\ast b} - f_i^{a^\ast} f_j^b$ the raw correlation defined in section III A. To first order, the associated change in conservation at site j is this frequency difference multiplied by the gradient $\phi_j^b$ of the conservation measure. This leads to

$$\Delta\Delta G_{ij}^{stat} \propto \left[ \sum_b \left( \phi_j^b \, \frac{C_{ij}^{a^\ast b}}{f_i^{a^\ast}} \right)^2 \right]^{1/2},$$
which shows that the perturbation procedure also represents a weighted procedure for correlations that is fully consistent with the general principle of SCA outlined in equation 9.
## C. Dimensional reduction

In previous implementations of the SCA method, a reduced L × L matrix of positional correlations was obtained by taking, for each pair of positions (i, j), the quantity

$$\tilde C_{ij} = \sqrt{\sum_{a,b} \left( \tilde C_{ij}^{ab} \right)^2}. \tag{32}$$

This is known as the Frobenius norm of the 20 × 20 matrix $\tilde C_{ij}^{ab}$. Since the Frobenius norm equals the root sum of squares of the singular values, it is nearly identical to the top singular value used in Sec. III C whenever the matrix is approximately rank one.
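The near-identity of the two reductions for rank-one matrices is easy to check numerically; the near-rank-one test matrix below is synthetic, planted to have the structure the text describes.

```python
import numpy as np

rng = np.random.default_rng(4)

# For a matrix of approximately rank one, the Frobenius norm
# ||C||_F = sqrt(sum_k s_k^2) and the top singular value s_1 are
# nearly identical, since s_1 dominates the sum.
u = rng.normal(size=20)
v = rng.normal(size=20)
C_ij = np.outer(u, v) + 0.01 * rng.normal(size=(20, 20))

frob = np.sqrt(np.sum(C_ij**2))                         # Frobenius norm
top_sv = np.linalg.svd(C_ij, compute_uv=False)[0]       # top singular value

print(frob, top_sv)   # nearly equal for a near-rank-one matrix
```

This is why the Frobenius-norm reduction of earlier Toolbox versions and the top-singular-value reduction of Sec. III C give practically identical positional correlation matrices.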
## D. Binary approximation

In Ref. (2), we make use of a so-called "binary approximation" of the full alignment in which we consider only the most frequent amino acid $a_i^\ast$ at each position, ignoring the identities of all the others. The L × M binary array $x_{si} \equiv x_{si}^{a_i^\ast}$ then leads to a reduced positional correlation matrix

$$\tilde C_{ij} = \phi_i \, \phi_j \, |C_{ij}|,$$

where $C_{ij} = f_{ij} - f_i f_j$ is the raw correlation of the most prevalent amino acids at positions i and j, with $f_i \equiv f_i^{a_i^\ast}$ and $\phi_i \equiv \phi_i^{a_i^\ast}$.
## E. Cleaning of the first mode

In Ref. (2), we "cleaned" the first mode of the SCA positional correlation matrix to remove correlations shared globally by all positions, which largely reflect the phylogenetic relationships among sequences rather than functional constraints. Implementing this principle requires estimating the component $\hat C$ of the correlation matrix attributable to this global mode and subtracting it from $\tilde C$.

This procedure admits an equivalent spectral formulation when the first eigenvalue of $\tilde C$ is well separated from the rest of the spectrum: in that limit, subtracting $\hat C$ approximately corresponds to discarding the first eigenvalue and eigenvector. In other words, subtracting $\hat C$ amounts to removing the top eigenmode of the matrix. This simple procedure, however, rests on the assumption that the sequences in the alignment are subject to essentially the same functional constraints, which does not represent the general case. The procedure presented in SCAv5.0 (Sec. III), which does not involve cleaning the first mode, is a more general solution for alignments with strong heterogeneities in the distribution of sequences.

## F. Projection matrix

In section D we described the empirical finding that for each position i, the top singular vector of amino acid correlations with all other positions j is essentially independent of j. This led to the key concept of a projection matrix $P_i^a$ (Eq. 18) that reduces the full binary alignment to the projected alignment $\tilde x_{si}$, from which positional and sequence correlations are computed directly.
## G. Implementation of ICA

As introduced above, ICA involves rotation of the k top eigenvectors of a correlation matrix by a k × k so-called "unmixing" matrix W to yield k maximally independent components. How do we technically obtain the W matrices? Various implementations of ICA can be used that apply different measures of independence and different algorithms for optimizing them. We use one of the simplest implementations of ICA, proposed in Ref. (9) with modifications introduced in Ref. (10) (the results should however be robustly recovered when using other algorithms for ICA). For posICA, the input of the algorithm is the k × L matrix Z whose rows correspond to the k top eigenvectors of $\tilde C$. Starting from the identity, W is updated iteratively according to

$$W \leftarrow W + \epsilon \left( I + g(y) \, y^{\mathsf T} \right) W, \qquad y = W Z, \quad g(y) = 1 - \frac{2}{1 + e^{-y}},$$

where g is applied element-wise. The parameter ϵ is a learning rate that has to be sufficiently small for the iterations to converge. The iterations lead to W converging to the unmixing matrix whose rows define the k maximally independent components $W Z$.
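The update rule above can be sketched as follows. This is a Python illustration of the infomax-style rule of Refs. (9, 10) as commonly implemented, not the Toolbox code itself; the two-source demixing demo (Laplace sources, the particular mixing matrix, the learning rate, and the 1/N batch scaling) is a hypothetical setup for illustration.

```python
import numpy as np

def basic_ica(Z, learnrate=0.05, iterations=8000):
    """Infomax-style ICA (after Refs. 9 and 10): iteratively update a
    k x k unmixing matrix W by
        W <- W + eps * (I + g(y) y^T) W,   y = W Z,
    with the element-wise nonlinearity g(y) = 1 - 2/(1 + exp(-y)).
    Z is the k x N matrix whose rows are the input signals
    (for posICA, the k top eigenvectors)."""
    k, N = Z.shape
    W = np.eye(k)
    for _ in range(iterations):
        y = W @ Z
        g = 1.0 - 2.0 / (1.0 + np.exp(-y))
        W = W + (learnrate / N) * (np.eye(k) + g @ y.T) @ W
    return W

# Hypothetical demo: two independent non-Gaussian sources, linearly mixed.
rng = np.random.default_rng(5)
S = rng.laplace(size=(2, 2000))             # independent sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])      # mixing matrix
Z = A @ S                                   # "observed" mixed signals

W = basic_ica(Z)
y = W @ Z                                   # recovered components

# Recovered components should be decorrelated (independence is stronger,
# but decorrelation is an easy sanity check).
corr = np.corrcoef(y)
print(abs(corr[0, 1]))   # small off-diagonal correlation
```

As the text notes, the learning rate must be small enough for the iterations to converge; other ICA algorithms should recover essentially the same components.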
## V. SCA Toolboxes
## A. Distributions

Distributions are MATLAB Toolboxes and contain various accessory codes for data formatting, display, and analysis. Previous versions include:

(1) SCA Toolbox 1.5: The original SCA method as specified in Lockless and Ranganathan (4), with one modification that was used in all subsequent papers: the division of binomial probabilities by the mean probability of amino acids in the alignment is removed. This version is no longer in active use.

(2) SCA Toolbox 2.5: The bootstrap-based approach for SCA. Position-specific conservation calculated as in Eq. (4) and correlations calculated as in Eq. (9). Matrix reduction per Eq. (32).

(3) SCA Toolbox 3.0: The analytical calculation of correlations weighted by gradients of relative entropy. Position-specific conservation calculated as in Eq. (4) and correlations calculated as in Eqs. (9)-(33). For non-binarized alignments, matrix reduction is per Eq. (32).

(4) SCA Toolbox 4.0: Analytical calculations as in v3.0, but now including sector identification methods as described in Ref. (2).

(5) SCA Toolbox 5.0: Calculation of positional and sequence correlation matrices by the alignment projection method as per Eq. (19) and Eq. (20), and calculation of the mapping between them (Eq. (21)). Includes methods for sector identification and for exploring relationships between positional and sequence correlations.
## References
[1] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, 1991.

[2] N. Halabi, O. Rivoire, S. Leibler, and R. Ranganathan. Protein sectors: evolutionary units of three-dimensional structure. Cell, 138(4):774-86, 2009.

[3] R. G. Smock, O. Rivoire, W. P. Russ, J. F. Swain, S. Leibler, R. Ranganathan, and L. M. Gierasch. An interdomain sector mediating allostery in Hsp70 molecular chaperones. Mol. Syst. Biol., 6:414, 2010.

[4] S. W. Lockless and R. Ranganathan. Evolutionarily conserved pathways of energetic connectivity in protein families. Science, 286(5438):295-9, 1999.

[5] G. M. Süel, S. W. Lockless, M. A. Wall, and R. Ranganathan. Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat. Struct. Biol., 10(1):59-69, 2003.

[6] M. E. Hatley, S. W. Lockless, S. K. Gibson, A. G. Gilman, and R. Ranganathan. Allosteric determinants in guanine nucleotide-binding proteins. Proc. Natl. Acad. Sci. USA, 100(24):14445-50, 2003.

[7] A. I. Shulman, C. Larson, D. J. Mangelsdorf, and R. Ranganathan. Structural determinants of allosteric ligand activation in RXR heterodimers. Cell, 116(3):417-29, 2004.

[8] M. Socolich, S. W. Lockless, W. P. Russ, H. Lee, K. H. Gardner, and R. Ranganathan. Evolutionary information for specifying a protein fold. Nature, 437(7058):512-8, 2005.

[9] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind source separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995.

[10] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 757-763. MIT Press, Cambridge MA, 1996.