Note 109: A summary of SCA calculations

Rama Ranganathan1 and Olivier Rivoire2

1The Green Center for Systems Biology, University of Texas Southwestern Medical Center, Dallas, Texas, USA.

2Laboratoire Interdisciplinaire de Physique, CNRS & Université Joseph Fourier, Grenoble, France.

(Dated: January 21, 2012)

This document provides a summary of the current implementation of the statistical coupling analysis (SCA) method. The conceptual goal of SCA is to quantitatively parameterize the statistical patterns encoded in ensembles of proteins that share a common evolutionary origin. The idea is that this statistical analysis will provide a necessary foundation for understanding the physical mechanism and evolutionary origins of natural proteins.

I. Preliminaries - Multiple Sequence Alignment and Frequencies

A multiple sequence alignment of M sequences of length L is represented by a binary array x_{i,s}^{(a)}, where x_{i,s}^{(a)} = 1 if sequence s has amino acid a at position i, and 0 otherwise (s = 1, ..., M indexes sequences, i = 1, ..., L indexes positions, and a = 1, ..., 20 indexes amino acids). The frequency f_i^{(a)} of an amino acid a at position i is the number of sequences in the alignment having amino acid a at position i (M_i^{(a)}), divided by the total number of sequences, or equivalently, the average value of x_{i,s}^{(a)} over all sequences s:

f_i^{(a)} = \frac{M_i^{(a)}}{M} = \langle x_{i,s}^{(a)} \rangle_s .
(1)
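To make the bookkeeping concrete, here is a minimal Python sketch (not part of the SCA Toolbox, which is MATLAB) of the binary array and the frequencies of Eq. (1); the toy alignment, the variable names, and the gap-truncation threshold applied at the end are illustrative.

```python
# A minimal sketch: the binary alignment array x[i, s, a] of Sec. I and the
# positional frequencies of Eq. (1). Sequences and names are illustrative.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"          # alphabetic order of one-letter codes
aa_index = {aa: k for k, aa in enumerate(AA)}

def binary_tensor(seqs):
    """x[i, s, a] = 1 if sequence s has amino acid a at position i, else 0."""
    M, L = len(seqs), len(seqs[0])
    x = np.zeros((L, M, 20))
    for s, seq in enumerate(seqs):
        for i, aa in enumerate(seq):
            if aa in aa_index:       # a gap ('-') leaves the row all zero
                x[i, s, aa_index[aa]] = 1.0
    return x

seqs = ["ACD-", "ACDE", "AGDE"]      # toy alignment: M = 3 sequences, L = 4
x = binary_tensor(seqs)
f = x.mean(axis=1)                   # f[i, a] = <x_{i,s}^(a)>_s, Eq. (1)
gap_freq = 1.0 - f.sum(axis=1)       # f_i^(0), fraction of gaps at position i
keep = gap_freq <= 0.2               # truncation rule suggested in Sec. I
```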

As described below in section II, the conservation of each position in the multiple sequence alignment is measured by the divergence of the observed frequency fi(a) of amino acid a at position i from the background probability q(a) of amino acid a. This background probability is computed from the mean frequency of amino acid a in all proteins in the non-redundant database. Specifically,

q =  (0.073,0.025,0.050,0.061,0.042,0.072,0.023,0.053,0.064,0.089,
     0.023,0.043,0.052,0.040,0.052,0.073,0.056,0.063,0.013,0.033),
where amino acids are ordered according to the alphabetic order of their standard one-letter abbreviation.

Some calculations also require introducing a background probability for gaps. If γ represents the fraction of gaps in the alignment, a background probability distribution can be taken as \bar{q}^{(0)} = γ for gaps, and \bar{q}^{(a)} = (1 - γ) q^{(a)} for the 20 amino acids. A practical strategy is to truncate alignments to sequence positions with a frequency of gaps f_i^{(0)} no greater than 0.2; this prevents trivial over-representation of gaps in a sequence alignment and ensures calculations are made only at largely non-gapped sequence positions. Alternatively, one can truncate the alignment to the positions present in an atomic structure of a representative member of the protein family.

 

II. Position-specific conservation - first order statistics

The conservation of amino acid a at position i, considered independently of other positions, is measured by the statistical quantity D_i^{(a)}, the so-called Kullback-Leibler relative entropy (1) of f_i^{(a)} given q^{(a)}. This definition is derived from the probability P_M[f_i^{(a)}] of observing f_i^{(a)} in an alignment of M sequences given a background probability q^{(a)}:

P_M[f_i^{(a)}] = \frac{M!}{(M f_i^{(a)})! \, (M(1 - f_i^{(a)}))!} \, (q^{(a)})^{M f_i^{(a)}} \, (1 - q^{(a)})^{M(1 - f_i^{(a)})} .
(2)

When M is large (the relevant limit for SCA), the Stirling formula leads to the approximation

P_M[f_i^{(a)}] \simeq e^{-M D_i^{(a)}}, \quad \text{with}
(3)

 

D_i^{(a)} = f_i^{(a)} \ln \frac{f_i^{(a)}}{q^{(a)}} + (1 - f_i^{(a)}) \ln \frac{1 - f_i^{(a)}}{1 - q^{(a)}} .
(4)

The value of D_i^{(a)} indicates how unlikely the observed frequency of amino acid a at position i would be if a occurred randomly with probability q^{(a)}, a definition of position-specific conservation. Note that D_i^{(a)} = 0 only when f_i^{(a)} = q^{(a)}, and it increases more and more steeply as f_i^{(a)} deviates from q^{(a)}, consistent with the intuition that a measure of conservation should non-linearly represent the divergence of the observed distribution of amino acids from the randomly expected one.
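As a sketch, Eq. (4) can be evaluated directly from the frequencies computed above; the clipping of frequencies away from 0 and 1 is an added regularization to avoid logarithms of zero, not part of the text.

```python
# A sketch of Eq. (4): the per-amino-acid relative entropy D_i^(a), given the
# frequencies `f` from the sketch above and the background q of Sec. I.
import numpy as np

q = np.array([0.073, 0.025, 0.050, 0.061, 0.042, 0.072, 0.023, 0.053, 0.064,
              0.089, 0.023, 0.043, 0.052, 0.040, 0.052, 0.073, 0.056, 0.063,
              0.013, 0.033])

def rel_entropy(f, q):
    """Elementwise D = f ln(f/q) + (1 - f) ln((1 - f)/(1 - q)), Eq. (4)."""
    f = np.clip(f, 1e-6, 1 - 1e-6)   # assumed regularization, avoids log(0)
    return f * np.log(f / q) + (1 - f) * np.log((1 - f) / (1 - q))

D = rel_entropy(f, q)                # shape (L, 20)
```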

What is an appropriate number of sequences to carry out SCA? A more precise relation between the probability P_M[f_i^{(a)}] and the relative entropy D_i^{(a)} is

-\frac{1}{M} \ln P_M[f_i^{(a)}] = D_i^{(a)} + \frac{\ln M}{2M} + O\!\left(\frac{1}{M}\right) .
(5)

The values of D_i^{(a)} are typically of order 1-3 (the scale is set by ln 20 ≈ 3), so the corrective term ln M/(2M) can be neglected when M is of order 100 sequences or greater (M = 100 corresponds to ln M/(2M) ≈ 0.02). This gives a lower bound on the size of alignments appropriate for SCA studies.

Equation (4) gives the conservation of each amino acid a at each position i. An overall positional conservation D_i taking into account the frequencies of all 20 amino acids can also be defined, but requires introducing a background probability for gaps (see Sec. I). Denoting by f_i^{(0)} = 1 - \sum_{a=1}^{20} f_i^{(a)} the fraction of gaps at position i, we can write the probability of jointly observing the frequencies (f_i^{(1)}, \ldots, f_i^{(20)}) of each of the 20 possible amino acids at position i as

P_M[f_i^{(1)}, \ldots, f_i^{(20)}] = \frac{M!}{(M f_i^{(0)})! \cdots (M f_i^{(20)})!} \, (\bar{q}^{(0)})^{M f_i^{(0)}} \cdots (\bar{q}^{(20)})^{M f_i^{(20)}} \simeq e^{-M D_i},
(6)

where D_i = \sum_{a=0}^{20} f_i^{(a)} \ln \left( f_i^{(a)} / \bar{q}^{(a)} \right) defines the overall conservation at position i.
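A corresponding sketch for the overall conservation D_i of Eq. (6), with the gap-extended background \bar{q} of Sec. I; estimating γ from the alignment itself is an assumption made for the example.

```python
# A sketch of Eq. (6)'s D_i; `f`, `gap_freq`, `q` are from the sketches above.
import numpy as np

gamma = gap_freq.mean()                            # overall gap fraction (assumed estimate)
qbar = np.concatenate(([gamma], (1 - gamma) * q))  # qbar^(0), qbar^(1..20)
f_full = np.column_stack([gap_freq, f])            # f_i^(0), f_i^(1..20)

with np.errstate(divide="ignore", invalid="ignore"):
    terms = f_full * np.log(f_full / qbar)
Di = np.nansum(terms, axis=1)                      # 0 ln 0 treated as 0
```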

 

III. Correlated conservation - second order statistics

 

A. General Principles

Given an alignment x_{i,s}^{(a)}, a covariance matrix reporting pairwise correlations between amino acids at pairs of positions is defined as

C_{ij}^{(ab)} = \langle x_{i,s}^{(a)} x_{j,s}^{(b)} \rangle_s - \langle x_{i,s}^{(a)} \rangle_s \langle x_{j,s}^{(b)} \rangle_s = f_{ij}^{(ab)} - f_i^{(a)} f_j^{(b)},
(7)
(7)

where f_{ij}^{(ab)} = \langle x_{i,s}^{(a)} x_{j,s}^{(b)} \rangle_s represents the joint frequency of having a at position i and b at position j. However, a fundamental principle of SCA is to compute correlations not for the raw alignment x_{i,s}^{(a)}, but for a weighted alignment

\tilde{x}_{i,s}^{(a)} = \phi_i^{(a)} x_{i,s}^{(a)},
(8)

where \phi_i^{(a)} is a functional of the positional conservation D_i^{(a)}. That is, the alignment \tilde{x}_{i,s}^{(a)} now contains entries that are not merely binary (0 or 1), but that quantitatively represent the significance of observing each amino acid in each sequence of the alignment. Substituting \tilde{x}_{i,s}^{(a)} for x_{i,s}^{(a)} in Eq. (7), it is straightforward to see that the result is a weighted correlation matrix that reports the significance of correlations as judged by the degree of conservation of the underlying amino acids:

\tilde{C}_{ij}^{(ab)} = \phi_i^{(a)} \phi_j^{(b)} C_{ij}^{(ab)} .
(9)

 

B. Choice of weights

Equation (9) gives the general definition of the SCA correlation tensor, but what specific form should the weighting function \phi take? The approach taken in SCA (versions 3.0 and greater) is to consider the effect on the conservation of each position i of removing each sequence s. The idea is that this "perturbation" will provide an estimate of the significance of each amino acid at each position in the alignment, through its impact on the measure of conservation used (here, the relative entropy D). To develop this formally, let M_i^{(a)} be the number of sequences with amino acid a at position i, and M the total number of sequences. When sequence s is left out, the frequency f_i^{(a)} = M_i^{(a)}/M becomes

f_{i,s}^{(a)} = \frac{M_i^{(a)} - x_{i,s}^{(a)}}{M - 1} = \left(1 + \frac{1}{M}\right) f_i^{(a)} - \frac{x_{i,s}^{(a)}}{M} + O\!\left(\frac{1}{M^2}\right),
(10)

where we recall that x_{i,s}^{(a)} = 1 if sequence s has amino acid a at position i, and 0 otherwise. In the limit of a large number of sequences M, expanding D_i^{(a)}, viewed as a function of f_{i,s}^{(a)}, to first order in 1/M leads to

D_{i,s}^{(a)} \approx \hat{D}_i^{(a)} - \frac{x_{i,s}^{(a)}}{M} \frac{\partial D_i^{(a)}}{\partial f_i^{(a)}},
(11)

where \hat{D}_i^{(a)} is the relative entropy D_i^{(a)} with f_i^{(a)} replaced by (1 + 1/M) f_i^{(a)}. Ignoring the scaling factor of 1/M (or, equivalently, rescaling the perturbation in conservation D_{i,s}^{(a)} - \hat{D}_i^{(a)} by M to be independent of alignment size), we find that this perturbation approach indicates a weighting function \phi for the alignment that is the gradient of the relative entropy:

\phi_i^{(a)} = \left| \frac{\partial D_i^{(a)}}{\partial f_i^{(a)}} \right| = \left| \ln\!\left[ \frac{f_i^{(a)} (1 - q^{(a)})}{(1 - f_i^{(a)}) q^{(a)}} \right] \right| .
(12)

The absolute value of the gradient is taken to ensure positive weights. Applying these weights to the general definition given in Eq. (9), we have the specific form used for the SCA correlation tensor in versions 3.0 and greater of the SCA toolbox:

\tilde{C}_{ij}^{(ab)} = \left| \frac{\partial D_i^{(a)}}{\partial f_i^{(a)}} \right| \left| \frac{\partial D_j^{(b)}}{\partial f_j^{(b)}} \right| C_{ij}^{(ab)} .
(13)
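The weighting of Eqs. (12) and (13) can be sketched as follows, reusing x, f, and q from the sketches above; the frequency clipping is again an added regularization, and the dense einsum construction of the full tensor is a choice for clarity, not efficiency.

```python
# A sketch of Eqs. (12)-(13): phi is the absolute gradient of the relative
# entropy, and the weighted tensor rescales the raw covariances of Eq. (7).
import numpy as np

fc = np.clip(f, 1e-6, 1 - 1e-6)                       # assumed regularization
phi = np.abs(np.log(fc * (1 - q) / ((1 - fc) * q)))   # Eq. (12), shape (L, 20)

Lp, Mp, _ = x.shape
fij = np.einsum("isa,jsb->ijab", x, x) / Mp           # joint frequencies f_ij^(ab)
C = fij - np.einsum("ia,jb->ijab", f, f)              # raw covariances, Eq. (7)
Ct = np.einsum("ia,jb,ijab->ijab", phi, phi, C)       # weighted tensor, Eq. (13)
```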

It is important to understand the role of the function \phi in controlling the patterns of correlations emerging from the weighted alignment \tilde{x}_{i,s}^{(a)}. The weights chosen by the sequence perturbation approach described above (a version of so-called "jackknife resampling") have the property of rising steeply as the frequencies of amino acids f_i^{(a)} approach one. As a consequence, these weights damp correlations in C_{ij}^{(ab)} arising from weakly conserved amino acids (the gradient of D_i^{(a)} approaches zero as f_i^{(a)} → q^{(a)}), and emphasize conserved correlations. In essence, this function imposes a particular mathematical form on the basic underlying principle of SCA that the functional relevance of correlations should scale with their conservation. In this regard, different weighting functions are possible (1) if mathematical formalisms other than the Kullback-Leibler entropy are used for defining positional conservation, or (2) if approaches other than the particular sequence perturbation analysis described here are used for determining weights. For example, the SCA v5.0 MATLAB toolbox includes the possibility of using Rényi relative entropies (a generalization of the Kullback-Leibler entropy), or indeed arbitrary user-defined functions of f_i^{(a)} and q^{(a)}, as a measure of conservation. In addition, as shown in section IV.B, the original implementation of SCA (versions 1-2) involved a different perturbation technique that leads to a somewhat different weighting function. Regardless of these differences, the salient point is that all of these approaches are variations on the general principle in SCA of a weighted correlation (Eq. (9)).

In general, the problem of applying more complex conservation functions and associated weights is deeply connected with understanding the nature of the evolutionary process that generates the observed amino acid distributions, and with the nature of our sampling of these distributions in databases of available sequences. These are important future research goals, but in the absence of such deeper understanding, the current approach in SCA, using the Kullback-Leibler entropy as a measure of conservation and gradients of this entropy as weights, represents a simple and analytically well-defined implementation of the general principles of the method.

Distinct from the principle that the relevance of correlations should scale with conservation, the weighting function also contributes to separating signal (the true evolutionary constraints) from correlation noise due to finite and biased sampling in practical sequence alignments. For example, one way that weakly conserved amino acids are expected to show strong correlations in C_{ij}^{(ab)} is the existence of small clades of historically related sequences (bias in sampling), in which correlations are due more to the lack of sufficient time to diverge than to functional constraints. In addition, when the number of sequences M is of the same order as the number of positions L, we expect significant correlation noise due to limitations in sampling. The non-linear weighting function used in SCA has the effect of minimizing the influence of these noise sources on the interpretation of correlations.

C. Reduction to positional correlations

The first step in the analysis of the SCA correlation tensor is to reduce the four-dimensional array of L positions × L positions × 20 amino acids × 20 amino acids to an L × L matrix of positional correlations. Indeed, the goal in SCA is to identify collectively evolving groups of positions ("protein sectors") whose correlations are properties of the whole family (or of functional subfamilies, see below), regardless of the amino acids by which their correlation is identified.

How should we carry out the dimension reduction of the SCA correlation tensor to a matrix of positional correlations? One empirical property of the tensor provides a straightforward approach: analysis of the 20 × 20 amino acid correlation matrices for fixed pairs of positions (i,j) shows that these matrices are approximately of rank one (O. Rivoire and R. Ranganathan, in preparation). That is, the information content in these matrices can be captured by a single scalar value. To see this, we carry out the singular value decomposition of \tilde{C}_{ij}^{(ab)} for each (i,j):

\tilde{C}_{ij}^{(ab)} = \sum_{c=1}^{20} P_{ij}^{(ac)} \lambda_{ij}^{c} Q_{ij}^{(cb)} .
(14)

In this decomposition, each 20 × 20 amino acid correlation matrix for each (i,j) is written as a product of three 20 × 20 matrices: λ, a diagonal matrix of singular values that indicate the quantity of variance in \tilde{C}_{ij}^{(ab)} captured, and P and Q, matrices of singular vectors whose columns give weights for the combinations of amino acids at positions i and j associated with each singular value. Interestingly, the singular value decomposition of \tilde{C}_{ij}^{(ab)} for each (i,j) has the property that \lambda_{ij}^{1} \gg \lambda_{ij}^{c} for c ≠ 1. That is, the information in the amino acid correlation matrix for each pair of positions can be represented by one number, the top singular value (also known as the "spectral norm"):

\tilde{C}_{ij}^{(ab)} \simeq P_{ij}^{(a1)} \lambda_{ij}^{1} Q_{ij}^{(1b)} .
(15)

Thus, a matrix of positional correlations \tilde{C}_{ij} can be defined simply by taking the spectral norm of \tilde{C}_{ij}^{(ab)} for each pair (i,j) of positions:

\tilde{C}_{ij} = \lambda_{ij}^{1} .
(16)

This is one definition of the SCA positional correlation matrix, and is essentially identical to that computed in versions 2.0 to 4.5 of the SCA Toolbox (for slight technical differences from these earlier versions, see section IV.C). Below, we show another empirical property of the SCA correlation tensor that permits the definition of a \tilde{C}_{ij} matrix using a different mathematical approach, used in SCA v5.0.

Two further notes with regard to this dimension reduction step: (1) Since \tilde{C}_{ij}^{(ab)} = \tilde{C}_{ji}^{(ba)}, Q_{ij}^{(cb)} = P_{ji}^{(bc)}, and we can simplify Eq. (15) to \tilde{C}_{ij}^{(ab)} \simeq P_{ij}^{(a1)} \lambda_{ij}^{1} P_{ji}^{(b1)}. (2) Also, since (P_{ij}^{(a1)})^{-1} = P_{ij}^{(1a)}, this reduction corresponds to \tilde{C}_{ij} \simeq \sum_{a,b} P_{ij}^{(1a)} \tilde{C}_{ij}^{(ab)} P_{ji}^{(1b)}.
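A sketch of this reduction, Eqs. (14)-(16): each 20 × 20 amino acid block of the weighted tensor Ct (from the sketch above) is collapsed to its top singular value, the spectral norm; plain loops are used for readability.

```python
# A sketch of the dimension reduction of Eqs. (14)-(16).
import numpy as np

Lp = Ct.shape[0]
C_pos = np.zeros((Lp, Lp))
for i in range(Lp):
    for j in range(Lp):
        # ord=2 gives the largest singular value of the (i, j) block, Eq. (16)
        C_pos[i, j] = np.linalg.norm(Ct[i, j], ord=2)
```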

 

D. The alignment projection approach to SCA

In section C above, we indicated how, for each pair of positions (i,j), we can reduce the 20 × 20 matrix of amino acid correlations by singular value decomposition to just the top singular value. The singular vectors corresponding to the top singular value (P_{ij}^{(a1)} and P_{ji}^{(b1)} for positions i and j, respectively) contain the weights for the amino acids at these positions that contribute to the top singular value for each (i,j). Interestingly, study of these top singular vectors reveals another empirical property of the SCA correlation tensor: for a given position i, P_{ij}^{(a1)} is approximately independent of j. That is, the pattern of amino acids by which a position i makes correlations with other positions j is nearly the same for all j, and therefore is essentially a property of position i taken independently. This is a non-trivial finding, and it suggests a simple but powerful approach for SCA in which we reduce the alignment itself directly from the M × L × 20 three-dimensional weighted tensor \tilde{x}_{i,s}^{(a)} introduced in Eq. (8) to an M × L two-dimensional weighted matrix \tilde{x}_{i,s}. That is, we can use the collection of top singular vectors for each position i, averaged over all j (\bar{P}_i^{(a)}), as a "projection matrix" to reduce the amino acid dimension of the alignment (see appendix, section F):

\tilde{x}_{i,s} = \sum_a \bar{P}_i^{(a)} \tilde{x}_{i,s}^{(a)}
(17)

In practice, the projection matrix can be obtained directly from the weighted frequencies of amino acids at each position in the alignment. This makes sense: the finding that the top singular vectors of amino acid correlations for position i are independent of j implies that the average singular vector should be a property of the amino acid distribution at site i alone. The projection matrix can be written as:

\bar{P}_i^{(a)} = \frac{\langle \tilde{x}_{i,s}^{(a)} \rangle_s}{\left( \sum_b \langle \tilde{x}_{i,s}^{(b)} \rangle_s^2 \right)^{1/2}} = \frac{\phi_i^{(a)} f_i^{(a)}}{\left( \sum_b (\phi_i^{(b)} f_i^{(b)})^2 \right)^{1/2}} .
(18)
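Eqs. (17) and (18) then amount to a normalization and a contraction over the amino acid index, sketched below; the guard against zero norms (fully gapped positions) is an added assumption.

```python
# A sketch of Eqs. (17)-(18), reusing `phi`, `f`, `x` from the sketches above.
import numpy as np

w = phi * f                                            # phi_i^(a) f_i^(a)
norms = np.linalg.norm(w, axis=1, keepdims=True)
Pbar = w / np.maximum(norms, 1e-12)                    # Eq. (18), assumed guard
xw = phi[:, None, :] * x                               # weighted alignment, Eq. (8)
xtilde = np.einsum("ia,isa->is", Pbar, xw)             # Eq. (17), shape (L, M)
```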

Three quantities can be computed directly from the projected alignment matrix \tilde{x}_{i,s}:

(1) an SCA positional correlation matrix (written \tilde{C}_{ij}^{P} to distinguish it formally from \tilde{C}_{ij} defined above):

\tilde{C}_{ij}^{P} = \left| \frac{1}{M} \sum_s \tilde{x}_{i,s} \tilde{x}_{j,s} - \frac{1}{M^2} \left( \sum_s \tilde{x}_{i,s} \right) \left( \sum_s \tilde{x}_{j,s} \right) \right| ,
(19)

(2) an SCA sequence correlation matrix (\tilde{C}_{st}^{S}) that represents the statistical grouping of sequences into subfamilies (if any):

\tilde{C}_{st}^{S} = \left| \frac{1}{L} \sum_i \tilde{x}_{i,s} \tilde{x}_{i,t} - \frac{1}{L^2} \left( \sum_i \tilde{x}_{i,s} \right) \left( \sum_i \tilde{x}_{i,t} \right) \right| .
(20)
(20)

It is important to note that this sequence correlation matrix represents relationships between sequences in which positions are weighted by the conservation-based weighting function \phi; this mapping of sequence space is therefore more likely to reveal functional distinctions between sequence subfamilies than purely historical relationships.

(3) a matrix (Π) that provides a mapping between these two spaces. To explain the Π matrix, we note that the projected alignment \tilde{x}_{i,s} can be written by singular value decomposition as \tilde{x} = \sum_{n=1}^{\min(L,M)} |U_n\rangle \lambda_n \langle V_n|. This provides a mapping from the space of positional correlations to the space of sequence correlations:

\Pi = \sum_n |U_n\rangle \langle V_n|
(21)

That is, if we detect collectively evolving groups of amino acid positions (sectors) by analysis of the \tilde{C}_{ij}^{P} matrix, then Π provides a mapping that can test whether these sectors are associated with the divergence of specific subfamilies in the alignment. In addition, the matrix Π provides the inverse mapping, from the space of sequence correlations to the space of positional correlations. With this inverse mapping, it is possible to use prior knowledge of functional divergence in the sequence alignment to target the identification of a sector responsible for this divergence. We show the technical details of these mappings below in section G.
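The three quantities of Eqs. (19)-(21) follow directly from the projected alignment; a sketch, with Π obtained from the numpy SVD of \tilde{x}:

```python
# A sketch of Eqs. (19)-(21) from the projected alignment `xtilde` (L x M).
import numpy as np

Lp, Mp = xtilde.shape
mi = xtilde.mean(axis=1, keepdims=True)            # <xtilde_{i,s}>_s, per position
ms = xtilde.mean(axis=0, keepdims=True)            # <xtilde_{i,s}>_i, per sequence

Cp = np.abs(xtilde @ xtilde.T / Mp - mi @ mi.T)    # Eq. (19), L x L
Cs = np.abs(xtilde.T @ xtilde / Lp - ms.T @ ms)    # Eq. (20), M x M

U, lam, Vt = np.linalg.svd(xtilde, full_matrices=False)  # xtilde = U diag(lam) V^T
Pi = U @ Vt                                        # Eq. (21): sum_n |U_n><V_n|
```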

 

E. Spectral decomposition

The process of identifying sectors from the \tilde{C}_{ij}^{P} matrix begins with spectral (or eigenvalue) decomposition. The motivation is that the presence of significant correlations between positions in the \tilde{C}_{ij}^{P} matrix indicates that treating individual amino acid positions as the basic functional units of proteins is not the most informative representation. Instead, we should seek a reparameterization of the protein in which the units are collective groups of amino acids that coevolve per the positional correlation matrix (the "sectors"). Eigenvalue decomposition is the simplest first step in achieving this reparameterization. This decomposition is always available for a square positive semi-definite matrix (such as \tilde{C}_{ij}^{P}) and represents the matrix as a product of three matrices:

\tilde{C}^{P} = \tilde{V} \tilde{\Delta} \tilde{V}^{\top} ,
(22)

where \tilde{\Delta} is a diagonal matrix of so-called eigenvalues and \tilde{V} is a matrix whose columns contain the associated eigenvectors. The eigenvectors contain the weights for linearly combining amino acid positions into new variables ("eigenmodes") that are de-correlated, and the associated eigenvalues indicate the magnitude of the information in \tilde{C}_{ij}^{P} that each mode captures.

The essence of sector identification is to study the pattern of positional contributions to the statistically significant top eigenmodes of the \tilde{C}_{ij}^{P} matrix. To determine the number k of significant eigenmodes, we compare the spectrum of \tilde{C}_{ij}^{P} with that derived from randomized alignments in which the amino acids are scrambled independently at each position. In such randomized alignments, the eigenmodes reflect only the spurious correlations that arise from finite sampling in the alignment, and they provide a basis for a significance cutoff on eigenvalue magnitudes. As pointed out earlier, when the number of sequences M in the alignment is not large compared to the number of positions L, we expect the majority of weak correlations to be accounted for by finite sampling; indeed, we find that typically just the top few eigenmodes of the \tilde{C}_{ij}^{P} matrix are significant. These top modes can be examined for distinct groups of correlated amino acid positions to define sectors.
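A sketch of this significance test; `sca_positional_matrix` is a hypothetical helper standing in for the full pipeline above (weighting, projection, Eq. (19)), and the number of randomization trials is illustrative.

```python
# A sketch: compare the spectrum of Cp with spectra from position-scrambled
# alignments. `sca_positional_matrix` is an assumed helper, not toolbox API.
import numpy as np

evals, evecs = np.linalg.eigh(Cp)
evals, evecs = evals[::-1], evecs[:, ::-1]         # sort top eigenmodes first

rng = np.random.default_rng(0)
null_max = []
for _ in range(10):                                # illustrative trial count
    xr = x.copy()
    for i in range(xr.shape[0]):                   # scramble each position independently
        xr[i] = xr[i, rng.permutation(xr.shape[1])]
    null_max.append(np.linalg.eigvalsh(sca_positional_matrix(xr)).max())

k = int(np.sum(evals > max(null_max)))             # number of significant eigenmodes
```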

However, different eigenvectors are not expected to directly represent statistically independent sectors. Instead, if independent sectors exist for a particular protein family, they will generally correspond to groups of positions emerging along combinations of eigenvectors. The reason is that decorrelation of positions (by diagonalizing the SCA correlation matrix, the essence of eigenvalue decomposition) is a weaker criterion than statistical independence (which requires the absence not only of pairwise correlations, but of any higher-order statistical couplings). In other words, if the non-independence of a set of variables is not completely captured by their pairwise correlations, then the linear combinations of these variables indicated by the eigenvectors of the correlation matrix cannot be assumed to produce statistically independent transformed variables.

 

F. Independent component analysis - ICA

Independent component analysis (ICA), an extension of spectral decomposition, is a heuristic method designed to transform the k statistically significant top eigenmodes of a correlation matrix into k maximally independent components through an iterative optimization process. In this process, the k top eigenvectors of \tilde{C}^{P}, written |\tilde{V}_1\rangle, \ldots, |\tilde{V}_k\rangle, are rotated by a k × k so-called "unmixing" matrix W^{P} to yield k maximally independent components |\tilde{V}_1^{p}\rangle, \ldots, |\tilde{V}_k^{p}\rangle:

\tilde{V}^{p}_{1 \ldots k} = W^{P} \tilde{V}_{1 \ldots k}
(23)

We call this process "posICA" to indicate that ICA is carried out on the eigenvectors of the SCA positional correlation matrix. In principle, this linear transformation of eigenvectors should help to better define independent sectors (if such exist in the protein family under study) as groups of positions now projecting along the transformed axes, the independent components (ICs) of position space (\tilde{V}^{p}).

It is also possible to apply ICA to the top eigenvectors of the SCA sequence correlation matrix \tilde{C}^{S} (Eq. (20)). Given a spectral decomposition \tilde{C}^{S} = \tilde{U} \tilde{\Delta}^{S} \tilde{U}^{\top}, we can use ICA to derive a different unmixing matrix W^{S} that rotates the k top eigenvectors of \tilde{C}^{S}, written |\tilde{U}_1\rangle, \ldots, |\tilde{U}_k\rangle, to yield k maximally independent components |\tilde{U}_1^{s}\rangle, \ldots, |\tilde{U}_k^{s}\rangle:

\tilde{U}^{s}_{1 \ldots k} = W^{S} \tilde{U}_{1 \ldots k}
(24)

We call this process "seqICA" to indicate that ICA is carried out on the eigenvectors of the SCA sequence correlation matrix. This transformation should provide a better description of sequence subfamilies as groups of sequences emerging largely orthogonally along the independent components of sequence space (\tilde{U}^{s}).

Upon ICA rotation, sector positions should correspond to positions i contributing significantly to one of the independent components, that is, \langle i | \tilde{V}_j^{p} \rangle > \eta_i, where \tilde{V}_j^{p} is the j-th independent component and \eta_i is a threshold of significance obtained empirically. Similarly, sequence subfamilies should correspond to sequences s contributing significantly to one of the independent components, \langle s | \tilde{U}_n^{s} \rangle > \eta_s, where \eta_s represents an empirical cutoff.

 

G. Mapping between sequence and positional correlations

As described in Eq. (21), the Π matrix provides a mapping between the space of positional correlations in \tilde{C}^{P} (which defines sectors) and the space of sequence correlations in \tilde{C}^{S} (which defines functional subfamilies). We can apply Π following posICA to the positional independent components |\tilde{V}_1^{p}\rangle, \ldots, |\tilde{V}_k^{p}\rangle to produce |\tilde{U}_1^{p}\rangle, \ldots, |\tilde{U}_k^{p}\rangle, the sequence space mapped onto the positional ICs:

\tilde{U}^{p}_{1 \ldots k} = \Pi \tilde{V}^{p}_{1 \ldots k}
(25)

If independent sectors are identified in \tilde{V}^{p}, then this mapping can indicate how sectors control relationships between sequences in \tilde{U}^{p}. For example, in the S1A serine proteases (2), we find evidence for three quasi-independent sectors from study of \tilde{V}^{p}, and show that each of these sectors is associated with an independent functional classification of sequences.

We can also make the inverse mapping, in which we use the Π matrix following seqICA to map the sequence independent components |\tilde{U}_1^{s}\rangle, \ldots, |\tilde{U}_k^{s}\rangle to |\tilde{V}_1^{s}\rangle, \ldots, |\tilde{V}_k^{s}\rangle, the positional correlation space mapped onto the sequence ICs:

\tilde{V}^{s}_{1 \ldots k} = \Pi^{\top} \tilde{U}^{s}_{1 \ldots k} .
(26)

If analysis of \tilde{U}^{s} shows interesting functional separations of the sequences comprising the protein family, then this mapping can indicate whether a distinct sector is responsible for this functional divergence. For example, in the Hsp70/110 family of molecular chaperones, study of \tilde{U}^{s} indicates a clear separation of the allosteric Hsp70 proteins and the non-allosteric Hsp110 proteins along one independent component (3). Interestingly, the corresponding component in \tilde{V}^{s} reveals the sector associated with this functional divergence.

Note that we describe the usage of the Π matrix to map between the independent components of the positional (\tilde{C}^{P}) and sequence (\tilde{C}^{S}) correlation matrices, but the same mapping can also be applied directly to the eigenvectors of \tilde{C}^{P} and \tilde{C}^{S}. Indeed, in cases where multiple independent sectors are not evident, sector identification and sequence space mappings are conducted directly from the eigenvalue decompositions, without application of ICA. In this case, we define \tilde{U} = \Pi \tilde{V} as the sequence space mapping from the eigenvectors of the \tilde{C}^{P} matrix (Eq. (22)), and \tilde{V} = \Pi^{\top} \tilde{U} as the position space mapping from the eigenvectors of the \tilde{C}^{S} matrix.

 

IV. Appendix

This section mostly provides information about relationships with previous descriptions of the SCA approach. The original implementation of SCA defined conserved correlations through a specific type of perturbation analysis of the sequence alignment (MATLAB SCA Toolbox 1.5, Sec. B), and this and more recent implementations performed dimension reduction of the SCA correlation tensor through a formally different (though practically near-identical) type of matrix norm. Here we explain these technical differences, and provide a more detailed explanation of the calculation of the projection matrix (Eq. (18)).

 

A. Equivalence with previous definitions of conservation

D_i^{(a)} is equivalent to measures of positional conservation introduced in previous reports of the SCA method. In essence, D_i^{(a)} is the large-M asymptotic limit of \Delta G_i^{stat,a} (MATLAB SCA Toolbox v1.0, as reported in Refs. (4-7)) and of \Delta E_i^{stat,a} (SCA Toolbox v1.5, as reported in Ref. (8)):

\Delta G_i^{stat,a} = \Delta E_i^{stat,a} = -\frac{1}{M} \ln P_M[f_i^{(a)}] \simeq D_i^{(a)} .
(27)

The pre-factor -1/M scales the positional conservation parameter for alignments of different sizes, and represents the statistical unit of conservation symbolically indicated by kT* or γ* in previous works.

 

B. The original SCA method

The implementation of the SCA method introduced originally in Ref. (4) was based on a perturbation to the amino acid distribution at one test site i, measuring the resulting difference in position-specific conservation of each amino acid at a second site j. In general, the perturbation consisted of restricting the test site to its most prevalent amino acid a_i, a manipulation that extracts a sub-alignment of size f_i^{(a_i)} M. For test sites at which sub-alignments retained sufficient size and diversity to be globally representative of the full alignment (i.e., f_i^{(a_i)} M > 100 sequences), a difference conservation value was calculated:

\Delta\Delta G_{j,i}^{stat,b,a_i} = \Delta\Delta E_{j,i}^{stat,b,a_i} = -\frac{1}{M} \left[ \ln\!\left( P_M\!\left[ f_j^{(b)} \right] \right) - \ln\!\left( P_M\!\left[ f_{j|i}^{(b)|a_i} \right] \right) \right],
(28)

where f_{j|i}^{(b)|a_i} is the frequency of amino acid b at position j in the sub-alignment obtained by retaining only the sequences having amino acid a_i at position i. \Delta\Delta G_{j,i}^{stat,b,a_i} represents the change in the conservation of amino acid b at position j due to the perturbation introduced at position i, a measure of their correlation (the term was renamed \Delta\Delta E in subsequent publications, and the notation is now dropped entirely, to avoid confusion with Gibbs energies). The first term on the right-hand side, -\frac{1}{M} \ln \left( P_M \left[ f_j^{(b)} \right] \right), corresponds to D_j^{(b)}. A basic tenet of the original SCA approach was that perturbations lead to sub-alignments that are representative of the full alignment, a condition typically satisfied by only the most frequent amino acid at a subset of positions. Under this assumption, f_{j|i}^{(b)|a_i} \approx f_j^{(b)} for most amino acids b at positions j. We may therefore expand the second term, -\frac{1}{M} \ln \left( P_M \left[ f_{j|i}^{(b)|a_i} \right] \right), by writing

f_{j|i}^{(b)|a_i} = \frac{f_{ij}^{(a_i b)}}{f_i^{(a_i)}} = f_j^{(b)} + \frac{f_{ij}^{(a_i b)} - f_i^{(a_i)} f_j^{(b)}}{f_i^{(a_i)}} = f_j^{(b)} + \frac{C_{ij}^{(a_i b)}}{f_i^{(a_i)}}
(29)

with C_{ij}^{(a_i b)} defined as in Eq. (7), so that

-\frac{1}{M} \ln\!\left( P_M\!\left[ f_{j|i}^{(b)|a_i} \right] \right) \approx D_j^{(b)} + \frac{C_{ij}^{(a_i b)}}{f_i^{(a_i)}} \frac{\partial D_j^{(b)}}{\partial f_j^{(b)}} .
(30)

This leads to

\Delta\Delta G_{j,i}^{stat,b,a_i} \approx -\frac{1}{f_i^{(a_i)}} \frac{\partial D_j^{(b)}}{\partial f_j^{(b)}} \, C_{ij}^{(a_i b)},
(31)

which shows that the perturbation procedure also represents a weighting procedure for correlations that is fully consistent with the general principle of SCA outlined in Eq. (9).

 

C. Dimensional reduction

In previous implementations of the SCA method, a reduced matrix \bar{C}_{ij} was defined from \tilde{C}_{ij}^{(ab)} by

\bar{C}_{ij} = \left( \sum_{a,b} \left( \tilde{C}_{ij}^{(ab)} \right)^2 \right)^{1/2} .
(32)
(32)

This is known as the Frobenius norm of the 20 × 20 matrix \tilde{C}_{ij}^{(ab)}, and can be expressed in terms of the singular values \lambda_{ij}^{c} of this matrix as \bar{C}_{ij} = \left( \sum_c (\lambda_{ij}^{c})^2 \right)^{1/2}. Since \lambda_{ij}^{1} \gg \lambda_{ij}^{c} for c ≠ 1, the Frobenius norm of \tilde{C}_{ij}^{(ab)} is well-approximated by the spectral norm of the matrix, \bar{C}_{ij} \approx \tilde{C}_{ij}^{sp} = \lambda_{ij}^{1}.

 

D. Binary approximation

In Ref. (2), we make use of a so-called "binary approximation" of the full alignment, in which we consider only the most frequent amino acid a_i at each position i. The alignment is then represented by a binary array x_{i,s}, where x_{i,s} = 1 if sequence s contains the most frequent amino acid at position i, and 0 otherwise (i.e., x_{i,s} = x_{i,s}^{(a_i)}). As a consequence of the non-linear dependence of D_i^{(a)}, \bar{D}_i^{(a)}, and D_i with respect to f_i^{(a)}, the overall conservation D_i is well approximated by the conservation of the prevalent amino acid (D_i^{(a_i)} or \bar{D}_i^{(a_i)}), a result that justifies the use of the binary approximation.

The L × M binary array x_{i,s} is formally obtained from the L × M × 20 array x_{i,s}^{(a)} by application of a "projector" B_i^{(a)}, defined by B_i^{(a)} = 1 if a = a_i and 0 otherwise: x_{i,s} = \sum_a B_i^{(a)} x_{i,s}^{(a)}. From the observation that \phi_i^{(a_i)} B_i^{(a)} is an approximation of \phi_i^{(a)} \bar{P}_i^{(a)}, where \bar{P}_i^{(a)} is defined in Eq. (18), it results that

\tilde{C}_{ij}^{(bin)} = \phi_i^{(a_i)} \phi_j^{(a_j)} \, \left| C_{ij}^{(bin)} \right| ,
(33)

where C_{ij}^{(bin)} = \langle x_{i,s} x_{j,s} \rangle_s - \langle x_{i,s} \rangle_s \langle x_{j,s} \rangle_s = f_{ij}^{(a_i a_j)} - f_i^{(a_i)} f_j^{(a_j)}. The matrix \tilde{C}_{ij}^{(bin)} corresponds to the matrix of positional correlations considered in Ref. (2).

 

E. Cleaning of the first mode

In Ref. (2), we "cleaned" the first mode of \tilde{C}_{ij}^{(bin)} and identified the sectors based on the subsequent eigenvectors of this matrix. This procedure rests on the observation that statistical correlations between positions can arise from a combination of phylogenetic bias and functional constraints acting independently on the positions, but that such statistical correlations should directly depend on the degree of conservation of the positions. From this point of view, if four equally conserved positions i, j, k, \ell satisfy \tilde{C}_{ij}, \tilde{C}_{k\ell} \gg \tilde{C}_{ik}, \tilde{C}_{i\ell}, \tilde{C}_{jk}, \tilde{C}_{j\ell}, the absence of strong inter-correlation between the pairs (i,j) and (k,\ell), contrasted with the presence of strong intra-correlations within them, can be attributed to correlations of functional origin.

Implementing this principle requires estimating C_{ij}^{*}, the (weighted) correlation expected from phylogeny alone for a pair of positions with the same degree of conservation as i and j. Given C_{ij}^{*}, functionally significant correlations can then be identified as those for which \tilde{C}_{ij} > C_{ij}^{*}. An approximate estimation of C_{ij}^{*} follows from the premise that most correlations are actually not functionally significant: by averaging \tilde{C}_{ij} over the positions j, we obtain an estimate of the degree of correlation to be expected from a position with the degree of conservation of position i. This suggests estimating C_{ij}^{*} by

\hat{C}_{ij}^{*} = \frac{\langle \tilde{C}_{i\ell} \rangle_{\ell} \, \langle \tilde{C}_{kj} \rangle_{k}}{\langle \tilde{C}_{k\ell} \rangle_{k\ell}} .
(34)

This procedure admits an equivalent spectral formulation when the first eigenvalue of \tilde{C}_{ij} is separated by a gap from the rest of the spectrum. Standard perturbation theory (see SI of Ref. (2)) indeed indicates that in such a case

\hat{C}_{ij}^{*} \simeq \tilde{\lambda}_1 \langle i | \tilde{V}_1 \rangle \langle \tilde{V}_1 | j \rangle .
(35)

In other words, subtracting \hat{C}^{*} from \tilde{C} is equivalent to subtracting the first mode from \tilde{C}.
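Both routes to the cleaning can be sketched in a few lines, with the rank-one estimate of Eq. (34) computed from row means (the matrix is symmetric) and Eq. (35) from the top eigenmode; `Cmat` is an assumed name for a positional correlation matrix such as \tilde{C}^{(bin)}.

```python
# A sketch of the first-mode cleaning of Eqs. (34)-(35).
import numpy as np

row_mean = Cmat.mean(axis=1)                         # <C_il>_l for each i
C_star = np.outer(row_mean, row_mean) / Cmat.mean()  # Eq. (34)

w_, v_ = np.linalg.eigh(Cmat)
first_mode = w_[-1] * np.outer(v_[:, -1], v_[:, -1]) # lambda_1 |V1><V1|, Eq. (35)
C_clean = Cmat - first_mode                          # subtract the first mode
```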

This simple procedure, however, rests on the assumption that the sequences in the alignment are subject to essentially the same functional constraints, which is not the general case. The procedure presented in SCA v5.0 (Sec. III), which does not involve cleaning the first mode, is a more general solution for alignments with strong heterogeneities in the distribution of sequences.

F. Projection matrix

In section D we described the empirical finding that for each position i, the top singular vector of amino acid correlations with all other positions j is essentially independent of j. This led to the key concept of a projection matrix \bar{P}_i^{(a)} (based on averaging P_{ij}^{(a1)} over all j), which can be used to reduce the weighted alignment tensor \tilde{x}_{i,s}^{(a)} to a weighted alignment matrix \tilde{x}_{i,s}. But there is one technical complication that leads to the specific form of the projection matrix given in Eq. (18). To make a proper matrix of projection vectors \bar{P}_i^{(a)} by averaging P_{ij}^{(a1)} over all j, it is essential that the signs of the singular vectors P_{ij}^{(a1)} all be the same. However, since \tilde{C}_{ij}^{(ab)} = \phi_i^{(a)} \phi_j^{(b)} (f_{ij}^{(ab)} - f_i^{(a)} f_j^{(b)}) can have both positive and negative values, the signs of the singular vectors are arbitrarily fixed in computing the singular value decomposition, and averaging over these vectors is not meaningful. To avoid this, we can force the SCA correlation tensor to have only positive values by computing \hat{C}_{ij}^{(ab)} = \phi_i^{(a)} \phi_j^{(b)} f_{ij}^{(ab)}, an SCA correlation tensor computed without subtracting the randomly expected correlations (the random expectation is subtracted at a later step, in defining the \tilde{C}^{P} and \tilde{C}^{S} matrices, Eqs. (19) and (20)). Since \hat{C}_{ij}^{(ab)} has only non-negative entries, the elements of P_{ij}^{(a1)} all have the same sign (Perron-Frobenius theorem), and we can compute \bar{P}_i^{(a)} as the average of P_{ij}^{(a1)} over j. The calculation in Eq. (18) is an excellent approximation to this \bar{P}_i^{(a)}.

 

G. Implementation of ICA

As introduced above, ICA involves rotation of the k top eigenvectors of a correlation matrix by a k × k so-called "unmixing" matrix W to yield k maximally independent components. How do we obtain the W matrices in practice? Various implementations of ICA can be used, applying different measures of independence and different algorithms for optimizing them. We use one of the simplest implementations of ICA, proposed in Ref. (9), with modifications introduced in Ref. (10) (the results should, however, be robustly recovered when using other ICA algorithms). For posICA the input of the algorithm is the k × L matrix Z whose rows correspond to |\tilde{V}_1\rangle, \ldots, |\tilde{V}_k\rangle, while for seqICA it is the k × M matrix Z whose rows correspond to |\tilde{U}_1\rangle, \ldots, |\tilde{U}_k\rangle. The algorithm iteratively updates the unmixing matrix W, starting from the k × k identity matrix W = I_k, with increments ΔW given by

\Delta W = \epsilon \left( I_k + \left( 1 - \frac{2}{1 + \exp(-WZ)} \right) (WZ)^{\top} \right) W .
(36)

The parameter ε is a learning rate that has to be sufficiently small for the iterations to converge. The iterations lead to W^{s} for seqICA and W^{p} for posICA. The vectors |\tilde{V}_j^{p}\rangle are obtained by applying W^{p} to |\tilde{V}_j\rangle, and the vectors |\tilde{U}_n^{s}\rangle by applying W^{s} to |\tilde{U}_n\rangle; these vectors are normalized to unit norm.
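A sketch of this iteration; the learning rate, iteration count, and the reuse of `evecs` and `k` from the spectral analysis sketched in Sec. III.E are illustrative assumptions.

```python
# A sketch of the infomax-style ICA iteration of Eq. (36) (Refs. 9, 10).
import numpy as np

def ica(Z, eps=1e-4, n_iter=20000):
    """Iterate the unmixing matrix W from the identity per Eq. (36)."""
    k = Z.shape[0]
    W = np.eye(k)
    for _ in range(n_iter):
        Y = W @ Z                                  # current source estimates
        g = 1.0 / (1.0 + np.exp(-Y))               # logistic nonlinearity
        W += eps * (np.eye(k) + (1.0 - 2.0 * g) @ Y.T) @ W   # Eq. (36)
    return W

Z = evecs[:, :k].T                                 # posICA input: k x L matrix
Wp = ica(Z)
ICs = Wp @ Z                                       # independent components, Eq. (23)
ICs /= np.linalg.norm(ICs, axis=1, keepdims=True)  # normalize to unit norm
```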

 

V. SCA Toolboxes

 

A. Distributions

Distributions are MATLAB Toolboxes and contain various accessory codes for data formatting, display, and analysis. Previous versions include:

(1) SCA Toolbox 1.5: The original SCA method as specified in Lockless and Ranganathan (4), with one modification that was used in all subsequent papers: the division of binomial probabilities by the mean probability of amino acids in the alignment is removed. This version is no longer in active use.

(2) SCA Toolbox 2.5: The bootstrap-based approach for SCA. Position-specific conservation calculated as in Eq. (4) and correlations calculated as in Eq. (9). Matrix reduction per Eq. (32).

(3) SCA Toolbox 3.0: The analytical calculation of correlations weighted by gradients of relative entropy. Position-specific conservation calculated as in Eq. (4) and correlations calculated as in Eqs. (13) and (33). For non-binarized alignments, matrix reduction is per Eq. (32).

(4) SCA Toolbox 4.0: Analytical calculations as in v3.0, but now including sector identification methods as described in Ref. (2).

(5) SCA Toolbox 5.0: Calculation of positional and sequence correlation matrices by the alignment projection method, per Eqs. (19) and (20), and calculation of the mapping between them (Eq. (21)). Includes methods for sector identification and for exploring relationships between positional and sequence correlations.

 

References

 

[1]    T. M. Cover and J. A. Thomas. Elements of information theory. Wiley-Interscience, New-York, 1991.

[2]    N. Halabi, O. Rivoire, S. Leibler, and R. Ranganathan. Protein sectors: evolutionary units of three-dimensional structure. Cell, 138(4):774–86, 2009.

[3]    R. G. Smock, O. Rivoire, W. P. Russ, J. F. Swain, S. Leibler, R. Ranganathan, and L. M. Gierasch. An interdomain sector mediating allostery in Hsp70 molecular chaperones. Mol. Syst. Biol., 6:414, 2010.

[4]    S. W. Lockless and R. Ranganathan. Evolutionarily conserved pathways of energetic connectivity in protein families. Science, 286(5438):295–9, 1999.

[5]    G. M. Süel, S. W. Lockless, M. A. Wall, and R. Ranganathan. Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat Struct Biol, 10(1):59–69, 2003.

[6]    M. E. Hatley, S. W. Lockless, S. K. Gibson, A. G. Gilman, and R. Ranganathan. Allosteric determinants in guanine nucleotide-binding proteins. Proc Natl Acad Sci USA, 100(24):14445–50, 2003.

[7]    A. I. Shulman, C. Larson, D. J. Mangelsdorf, and R. Ranganathan. Structural determinants of allosteric ligand activation in RXR heterodimers. Cell, 116(3):417–29, 2004.

[8]    M. Socolich, S. W. Lockless, W. P. Russ, H. Lee, K. H. Gardner, and R. Ranganathan. Evolutionary information for specifying a protein fold. Nature, 437(7058):512–8, 2005.

[9]    A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind source separation and blind deconvolution. Neural Computation, 7:1129–1159, 1995.

[10]    S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in neural information processing systems, volume 8, pages 757–763, Cambridge MA, 1996. MIT Press.