Overview

The statistical coupling analysis (SCA) is an approach for characterizing the pattern of evolutionary constraints on and between amino acid positions in a protein family. Given a representative multiple sequence alignment of the family, the analysis provides methods for quantitatively measuring the overall functional constraint at each sequence position (the position-specific, or "first-order" analysis of conservation), and for measuring and analyzing the coupled functional constraint on all pairs of sequence positions (the pairwise-correlated, or "second-order" analysis of conservation). The premise is that extending the traditional definition of conservation to include correlations between positions will contribute to defining the architecture of functional interactions between amino acids, and more importantly, help define the basic physical principles underlying protein structure, function, and evolution.

A brief, non-technical description of SCA is given below in the context of one protein family. In-depth tutorials that describe both the process and the logic of the SCA methods are available on the SCA Tutorials page and are included in the distribution of the SCA codes. We plan to keep adding explanatory notes and examples to the tutorial page as our work progresses. In addition, a technical note describing the mathematical details of the method is available for viewing and download.

 

I. Requests for materials

SCA Download

SCA is available as a MATLAB toolbox, and can be obtained online for academic (non-profit) usage from the SCA download page. The code continues to be updated and modified as we make progress; the current version is SCA 5.0. For questions regarding downloads or for earlier versions of the SCA toolbox, please contact Rama Ranganathan. In addition to the core calculations, the toolbox includes various codes for data formatting, display, and analysis through hierarchical clustering, spectral decomposition, and singular value decomposition. Four tutorials with sample alignments that illustrate the analytic process are provided. A number of multiple sequence alignments of protein families that we have worked on are available for download as well.

 

II. SCA calculations - an example in the PDZ domain


We provide an illustration of the SCA calculations using the PDZ domain family of protein interaction modules. A script for reproducing this analysis is provided in the SCA toolbox. The multiple sequence alignment contains 233 eukaryotic PDZ domains, and is trimmed to the 92 positions that show gap frequency of less than 20% (A). Numbering is (arbitrarily) per the structure of the PSD95 third PDZ domain (pdb code 1BE9).

[Figure: PDZ_msa]
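The trimming step described above can be sketched in a few lines of Python. This is only an illustration of filtering alignment columns by gap frequency, not the toolbox's MATLAB code; the function name `trim_gapped_columns`, the gap characters, and the toy sequences are all assumptions made for the example.

```python
def trim_gapped_columns(alignment, max_gap_fraction=0.2, gap_chars="-."):
    """Keep only columns whose gap frequency is below max_gap_fraction.

    alignment: list of equal-length aligned sequences (strings).
    Returns the trimmed sequences and the indices of the retained columns.
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    kept = []
    for col in range(n_cols):
        gaps = sum(1 for seq in alignment if seq[col] in gap_chars)
        if gaps / n_seqs < max_gap_fraction:
            kept.append(col)

    trimmed = ["".join(seq[c] for c in kept) for seq in alignment]
    return trimmed, kept


# Toy alignment, made up for illustration (not real PDZ sequences).
toy_msa = [
    "AC-DE",
    "AC-DE",
    "ACKD-",
    "GC-DE",
    "AC-DE",
    "AC-NE",
]
trimmed_msa, kept_columns = trim_gapped_columns(toy_msa)
print(kept_columns)
print(trimmed_msa)
```

Returning the retained column indices alongside the trimmed sequences makes it possible to map results back to a reference numbering, such as the PSD95 numbering used here.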

Position-specific conservation: In SCA, the overall conservation of each position i in the multiple sequence alignment, taken independently, is measured by a statistical parameter known as the relative entropy Di (A, and see the technical note for more details). Di is a non-linear function of amino acid frequency, rising more steeply as frequencies approach one. For the PDZ domain, we find a pattern of Di that follows a heavy-tailed distribution (B). Coloring in red the ~40% most highly conserved residues (Di > 1) on a slice through the core of a representative structure of the PDZ domain, we find a simple structural pattern: more conserved residues tend to occur within the protein core and at functional surfaces, and less conserved residues tend to occur on the remainder of the protein surface (C).
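As a rough illustration of how a relative entropy behaves, the sketch below assumes a binary form of the Kullback-Leibler divergence: the frequency f of the most frequent amino acid at a position is compared with a background frequency q, and everything else is lumped together. The exact definition used in the toolbox is given in the technical note; the flat background of 0.05 per amino acid, the function names, and the toy column are assumptions for this example.

```python
import math
from collections import Counter


def relative_entropy(f, q):
    """Binary Kullback-Leibler divergence D(f || q), in nats."""
    if not 0.0 < q < 1.0:
        raise ValueError("background frequency q must lie strictly between 0 and 1")
    total = 0.0
    if f > 0.0:
        total += f * math.log(f / q)
    if f < 1.0:
        total += (1.0 - f) * math.log((1.0 - f) / (1.0 - q))
    return total


def column_conservation(column, background, gap_chars="-."):
    """Most frequent amino acid in a column, its frequency, and its relative entropy.

    Gaps count toward the column size, so they lower the frequency f.
    """
    residues = [aa for aa in column if aa not in gap_chars]
    if not residues:
        raise ValueError("column contains only gaps")
    top_aa, count = Counter(residues).most_common(1)[0]
    f = count / len(column)
    return top_aa, f, relative_entropy(f, background[top_aa])


# Assumption: a flat background of 1/20 per amino acid, for illustration only.
flat_background = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}

toy_column = "LLLLLLLLIV"  # made-up column: 8 of 10 sequences carry Leu
top_aa, freq, d_i = column_conservation(toy_column, flat_background)
print(top_aa, freq, round(d_i, 3))

# The increments grow as f approaches one, which is the non-linearity noted above.
for f in (0.5, 0.7, 0.9, 1.0):
    print(f, round(relative_entropy(f, 0.05), 3))
```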

Correlated conservation: The basic concept of SCA is a generalization of sequence conservation to include pairwise (or in principle, even higher-order) correlations between sequence positions. If properly parameterized to reduce the effect of noise introduced by both limited sampling (finite number of sequences) and biased sampling (historical relationships between sequences), these correlations should represent the statistical signature of conserved functional interactions between amino acid residues. In SCA, a correlation matrix is calculated for all pairs of sequence positions by weighting the raw covariance of amino acids at each pair of positions by a non-linear function of their position-specific conservation (technical note). That is, SCA measures conserved correlations between sequence positions.
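The idea of a conservation-weighted covariance can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from this page: each position is reduced to a binary variable (most frequent amino acid present or not), the weight is taken to be the derivative of the binary relative entropy with respect to frequency, the background frequency is a flat 0.05, and frequencies are clipped by a small `eps` to keep the weights finite. The technical note defines the weighting actually used.

```python
import math
from collections import Counter


def binarize(alignment, gap_char="-"):
    """For each column, mark which sequences carry the most frequent amino acid."""
    n_cols = len(alignment[0])
    columns = []
    for c in range(n_cols):
        col = [seq[c] for seq in alignment]
        top = Counter(aa for aa in col if aa != gap_char).most_common(1)[0][0]
        columns.append([1.0 if aa == top else 0.0 for aa in col])
    return columns  # columns[i][s] for position i, sequence s


def weighted_correlation(alignment, q=0.05, eps=1e-3):
    """Absolute covariance between positions, weighted by a conservation term.

    Assumed weight: phi_i = ln[f_i (1 - q) / ((1 - f_i) q)], with f_i clipped
    to [eps, 1 - eps] so that fully conserved positions do not give infinity.
    """
    x = binarize(alignment)
    n_seqs = len(alignment)
    n_pos = len(x)
    f = [sum(col) / n_seqs for col in x]

    phi = []
    for fi in f:
        fc = min(max(fi, eps), 1.0 - eps)
        phi.append(math.log(fc * (1.0 - q) / ((1.0 - fc) * q)))

    matrix = [[0.0] * n_pos for _ in range(n_pos)]
    for i in range(n_pos):
        for j in range(n_pos):
            f_ij = sum(a * b for a, b in zip(x[i], x[j])) / n_seqs
            raw_cov = f_ij - f[i] * f[j]
            matrix[i][j] = abs(phi[i] * phi[j] * raw_cov)
    return matrix


# Toy alignment, made up for illustration: column 0 is invariant,
# columns 1 and 2 vary together, column 3 partly follows them.
toy_msa = ["AKLE", "AKLE", "AKLE", "ARVE", "ARVD", "ARVD"]
sca_like = weighted_correlation(toy_msa)
for row in sca_like:
    print([round(v, 3) for v in row])
```

In this toy case the invariant column has zero covariance with every other column, so it contributes no correlation regardless of its weight, while the two columns that co-vary perfectly give the largest off-diagonal entry.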

For the PDZ family, panel D shows the 92 position by 92 position SCA correlation matrix. Inspection of this matrix shows that correlations are not simply dominated by proximity in primary or secondary structure; many positions show weak correlation to neighboring positions but significant correlation to positions distant along the sequence. A study of the pattern of correlations for one position in the peptide binding pocket (the first position of the α2 helix, 372 in PSD95pdz3) illustrates some of the basic results of the SCA. As for all correlations in the SCA matrix (not shown), the correlations for position 372 follow a heavy-tailed distribution that can be fit to a log-normal function (E). Mapping the positions whose correlations exceed a cutoff of 2 standard deviations reveals a heterogeneous pattern that connects the environment of position 372 through substrate peptide to a distant surface site on the α1 helix (F). In the PDZ domain from the Par-6 cell polarity protein, this surface serves as an allosteric control site through which binding of the small GTPase cdc42 regulates substrate affinity. Panel G shows the same view, but in space-filling representation to show that the two functional surfaces are connected through a few intermediary residues that are located in the protein core.

[Figure: Cij_372]
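One way to turn a log-normal fit into a cutoff is sketched below. It assumes that the fit is done by taking the mean and standard deviation of the log-transformed correlations and that the cutoff sits 2 standard deviations above the mean in log space; the page does not specify either detail, so both are assumptions. The position labels and correlation values are made up and are not PDZ results.

```python
import math
import statistics


def lognormal_fit(values):
    """Fit a log-normal by moments of the logs: returns (mu, sigma) of ln(value)."""
    logs = [math.log(v) for v in values if v > 0.0]
    return statistics.mean(logs), statistics.pstdev(logs)


def positions_above_cutoff(correlations, n_sigma=2.0):
    """Select positions whose correlation exceeds exp(mu + n_sigma * sigma).

    correlations: dict mapping a position label to its correlation with the
    reference position.
    """
    mu, sigma = lognormal_fit(correlations.values())
    cutoff = math.exp(mu + n_sigma * sigma)
    selected = sorted(p for p, v in correlations.items() if v > cutoff)
    return selected, cutoff


# Hypothetical correlations of twelve positions with one reference position.
toy_correlations = {
    1: 0.08, 2: 0.10, 3: 0.12, 4: 0.09, 5: 0.11, 6: 0.10,
    7: 0.13, 8: 0.07, 9: 0.10, 10: 0.09, 11: 0.11, 12: 2.5,
}
selected_positions, cutoff_value = positions_above_cutoff(toy_correlations)
print(selected_positions, round(cutoff_value, 3))
```

Working in log space keeps the threshold meaningful for a heavy-tailed distribution, where a mean and standard deviation computed on the raw values would be dominated by the few largest correlations.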

Methods for more globally analyzing the SCA correlation matrix to extract patterns of statistically correlated amino acid positions are a subject of ongoing active investigation. One simple approach introduced previously is hierarchical clustering, in which positions are re-ordered by their profile of correlations to other positions. For the PDZ domain, this analysis reveals two main features: (1) most positions show weak overall correlations with other positions and do not cluster well, and (2) a subset of positions show stronger correlations and seem to be organized into clusters. Though qualitative, this approach has provided insight into biologically important networks of amino acids in several protein families, especially when alignment size and quality are such that clusters of correlated positions robustly emerge.
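The re-ordering of positions by their correlation profiles can be illustrated with a small agglomerative clustering routine. The choices here (Euclidean distance between rows of the matrix, average linkage, and the function names) are assumptions for the sketch and are not taken from the toolbox; the toy matrix is made up.

```python
import math


def profile_distance(row_a, row_b):
    """Euclidean distance between two positions' correlation profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_a, row_b)))


def cluster_order(matrix):
    """Order positions by average-linkage agglomerative clustering of their rows."""
    n = len(matrix)
    dist = [[profile_distance(matrix[i], matrix[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0]


def reorder(matrix, order):
    """Permute rows and columns of a square matrix by the given order."""
    return [[matrix[i][j] for j in order] for i in order]


# Toy symmetric matrix: positions 0, 3 and 4 are mutually correlated,
# positions 1 and 2 are only weakly correlated with anything.
toy_matrix = [
    [1.0, 0.1, 0.0, 0.9, 0.8],
    [0.1, 0.2, 0.1, 0.0, 0.1],
    [0.0, 0.1, 0.2, 0.1, 0.0],
    [0.9, 0.0, 0.1, 1.0, 0.9],
    [0.8, 0.1, 0.0, 0.9, 1.0],
]
order = cluster_order(toy_matrix)
clustered = reorder(toy_matrix, order)
print(order)
```

After re-ordering, the strongly correlated positions sit next to each other, so they appear as a block along the diagonal of the clustered matrix.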

More generally, the essence of the problem of interpreting the SCA matrix is to develop methods for separating the functionally significant correlations (the "signal" reflecting evolutionary constraint) from correlations that could arise due to limited sampling (statistical noise) and biased sampling (historical noise). Analysis of the pattern of remaining significant correlations provides a basis for decomposing the protein sequence into functional units (termed "protein sectors") that represent groups of co-evolving amino acid positions. We have described the use of methods such as spectral (or eigenvalue) decomposition and, more generally, singular value decomposition of the SCA matrix to systematically identify protein sectors, and these are described in the tutorials that accompany the SCA codes. In the PDZ domain, this analysis demonstrates the existence of a protein sector that links the ligand binding site to two distantly positioned surface sites through a network of physically contiguous amino acids located in the protein core. Further detail on the analytical methods will be given on this page shortly.
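The mechanics of extracting an eigenmode from a correlation matrix can be sketched with power iteration, which finds the leading eigenvalue and eigenvector of a symmetric matrix with non-negative entries. This only illustrates the linear algebra: which modes define a sector and how significance against sampling noise is assessed are covered in the tutorials, and the threshold of 1/sqrt(n) used to pick contributing positions is an arbitrary assumption, as are the function names and the toy matrix.

```python
import math


def top_eigenmode(matrix, n_iter=500, tol=1e-12):
    """Leading eigenvalue and unit eigenvector of a symmetric matrix
    with non-negative entries, computed by power iteration."""
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    value = 0.0
    for _ in range(n_iter):
        new = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0.0:
            raise ValueError("matrix maps the start vector to zero")
        new = [v / norm for v in new]
        # Rayleigh quotient of the normalized vector estimates the eigenvalue.
        new_value = sum(
            new[i] * sum(matrix[i][j] * new[j] for j in range(n)) for i in range(n)
        )
        converged = abs(new_value - value) < tol
        vec, value = new, new_value
        if converged:
            break
    return value, vec


def contributing_positions(vec, threshold=None):
    """Positions whose eigenvector weight exceeds the threshold.

    Assumption: the default threshold is the weight each position would have
    if the mode were spread uniformly over all positions.
    """
    if threshold is None:
        threshold = 1.0 / math.sqrt(len(vec))
    return [i for i, w in enumerate(vec) if abs(w) > threshold]


# Toy symmetric matrix: positions 0, 3 and 4 form a correlated group.
toy_matrix = [
    [1.0, 0.1, 0.0, 0.9, 0.8],
    [0.1, 0.2, 0.1, 0.0, 0.1],
    [0.0, 0.1, 0.2, 0.1, 0.0],
    [0.9, 0.0, 0.1, 1.0, 0.9],
    [0.8, 0.1, 0.0, 0.9, 1.0],
]
eigenvalue, eigenvector = top_eigenmode(toy_matrix)
group = contributing_positions(eigenvector)
print(round(eigenvalue, 3), group)
```

For a full decomposition with several modes, a symmetric eigensolver from a numerical library would be the usual choice; power iteration is used here only to keep the sketch free of dependencies.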

 

III. Notes on other implementations....coming soon.