USING EVOLUTIONARY INFORMATION
TO STUDY PROTEIN STRUCTURE
Richard A. Goldstein
Siena Biotech, via Fiorentina 1,
53100 Siena, Italy
Received: 26th
July 2002 / Published: 15th
May 2003
Abstract
The genomic data available to computational biologists represents the product
of the complex processes of evolution. In particular, the forces of mutation,
duplication, and selection have acted to sculpt modern protein sequence and
structure in the context of changing functional requirements. Just as crystallographers
are able to determine protein structures through an analysis of X-ray diffraction
patterns, we wish to read the evolutionary history of proteins in order to
understand their structures, functions, and interactions. To this end, we
have been developing models of natural site substitutions that are informed
by the protein structure and function and the resulting variations in selective
pressures, even when the structure and function of the protein are unknown.
By phrasing the substitution process in terms of the underlying properties
of the constituent amino acids we can build models that are both much more
accurate and more interpretable. The model is applied to a large set of globular
proteins as well as a set of G-protein coupled receptors, identifying general
structural and functional features of these biomolecules.
Introduction
The various genome projects have produced a plethora of gene sequences encoding
proteins for which we have little information. While there are extensive experimental
efforts to characterize the structure, function, and other characteristics
of these proteins, there still remains a substantial backlog. In addition,
many proteins of major interest are resistant to many of these experimental
techniques. This has helped to spur the development of techniques to predict
the characteristics of these proteins based only on sequence information.
Often we have multiple sequences of related proteins from different organisms.
It has been long recognized that these multiple sequences provide us with
a valuable opportunity, that a set of related sequences convey more information
than just a single example. The challenge has been to extract meaningful information
from these multiple sets.
Given an alignment, it is possible to identify the conserved residues, to
characterize the amount of sequence variation at each location using such
concepts as "sequence entropy", and to look for correlated changes between
different locations in the protein. These approaches generally treat the observed
protein sequences as a random sampling from the space of all possible sequences.
This, of course, is false. One obvious problem is the uneven distribution
of proteins among different organisms, depending upon the relative importance
of the species to individual scientific investigators. More insidiously, homologous
proteins are related by a phylogenetic structure that can induce confounding
tendencies in the data. For example, Figure 1 shows a phylogenetic pattern
where two substitutions occurred in different branches of the tree. In this
simple example, there is a complete correlation between the third and seventh
positions, even though this does not represent the effect of compensatory
substitutions. One approach to handling these complications is to model the
evolutionary process explicitly. This is the approach that we take here.
Figure
1. Example of an evolutionary trajectory producing an artificial
correlation between sequence locations.
The standard method to model the site substitutions that occur during evolution
is through a "substitution matrix", a 20 x 20 matrix representing the probability
that one amino acid would be replaced by another in a given length of evolutionary
time. Standard approaches generally use a single substitution matrix for all
locations in the protein, implicitly assuming that all locations in the protein
can be represented by the same model, that is, are under similar selective
pressure. This is, of course, unrealistic. It has been shown that substitution
rates vary with surface accessibility, secondary structure, and functional
significance. One method to approach this problem is to subdivide the various
locations in proteins according to their local structure, constructing and
using structure-dependent substitution matrices (1, 2, 3, 4). This approach
still assumes that all locations with the same local structure are under similar
selective pressure, ignoring differences based on the inevitable "coarse graining"
of the structural classifications as well as selective pressure due to function.
Recently we developed an approach, which we call a Hidden States Model (HSM),
for dealing with these differences in selective pressure (5, 6, 7, 8). In
this model, each location is assumed to belong to one of a set of possible
"site classes", each corresponding to a single substitution matrix. The various
substitution matrices are unknown, as is the site class to which each location
belongs. Instead, each location in the protein has the same set of a priori
probabilities for belonging to each site class. The a priori probabilities
as well as the set of substitution matrices are determined based on a set
of related proteins through a maximum likelihood formulation. The result of
this procedure is a set of site classes with corresponding substitution matrices,
as well as the ability to calculate the a posteriori probability that any
given location is a member of any particular site class. This then provides
us with information regarding 1) which locations are under related selective
pressure, 2) what is the nature of this selective pressure, and 3) when is
the selective pressure different for different subsets of proteins.
The central challenge in this approach is the total number of parameters that
must be adjusted in the optimization process. We deal with this situation
by representing the entire substitution matrix with a biologically-inspired
reduced set of parameters. In general, we consider the local propensity, or
fitness, of each amino acid for any location described by a given site class,
and then represent the probability of substitution of one amino acid for another
in terms of the differences in these local fitnesses. The functional forms
of this representation can be quite general, with additional parameters that
can be optimized based on the observed data.
In this paper, we first investigate the nature of the substitution process
at different types of locations in a set of globular proteins. We then demonstrate
the application of these models for understanding the selective pressure acting
on one particular set of proteins, G-protein coupled receptors (GPCRs).
Methods
We first recap our model for site substitutions, as described elsewhere (5,
7, 8). We first consider that there are a number of different site classes,
which characterize locations in the protein under similar selective pressure.
As described above, the model does not assign locations to site classes; instead
we define an unknown prior probability P(k),
that any given location belongs to site class k.
As all locations must belong to a site class, SkP(k)
= 1.
We need to reduce the number of adjustable parameters that characterize each
particular substitution matrix. In order to do this, we first consider that
there is a relative fitness Fk(Ai)
of amino acid Ai
for any location described by a particular site class k,
related to the logarithm of the probability of finding such an amino acid
at this location described by this site class. The instantaneous rate of substitution
Qkij
from amino acid Ai
to Aj
at site class k
is then reflected by the relative changes in fitness. In this paper, we use
a few different models. In our analysis of a general set of globular proteins,
we use so-called Metropolis kinetics, where advantageous substitutions (DF
½
Fk(Aj)
- Fk(Ai)
0) are accepted at a maximum site-class dependent rate nk
, while disadvantageous substitutions (DF
< 0) are accepted with a probability that decreases exponentially with
the resulting change in fitness.
The Metropolis scheme is the only kinetics scheme ensuring detailed balance,
and where a favorable substitution is always accepted at the maximum rate.
In addition, as we were most interested in modeling the general nature of
the selective pressure at different locations, we further parameterized the
fitness of each amino acid at a given site class as a function of the physical-chemical
properties of the amino acids:
where
ql(
Ai)
represents the value of physical-chemical property
l
of amino acid
Ai,
and

and

represent site-class specific adjustable parameters. In this study, we used
the four orthogonal property indices developed by Scheraga and coworkers (9).
The first property is positively correlated with turn propensity and negatively
correlated with
a-helix
propensity; the second is positively correlated with size and bulk, the third
is positively correlated with
b-sheet
propensity, and the last is negatively correlated with hydrophobicity, meaning
hydrophilic residues have high positive values in this index.
For the analysis of the G-Protein Coupled Receptors, we used a more general
function of the form:
where nk
again characterizes the overall substitution rate for site class k,
and lk
and bk
are parameters of the function. Note that this model is equivalent to the
Metropolis scheme under the conditions lk
= 0 and bk
» 1. In contrast to the case for the general set of globular proteins, we
left the values of Fk(Aj)
as independently adjustable parameters.
To determine the substitution matrix M,
representing the possible substitutions from amino acid Ai
to Aj
for any particular amount of evolutionary time t,
the Q
matrix is exponentiated:
The model involves a large number of adjustable parameters. We will notate
the parameters for site class k,
including the prior probability P(k),
as {q}k.
For the study of the large set of globular proteins, this includes P(k),
nk,
and alk
and f
lk
for the four different physical-chemistry parameters (the values of ql(Ai)
for the twenty amino acids are measured, not adjustable, parameters). For
the GPCR study, these parameters include P(k),
nk,
lk
and bk,
and the twenty fitness parameters of the amino acids Fk(Ai)
(as the fitness values are relative, one of these parameters is set to zero).
We will notate the entire set of all parameters, including the parameters
for all of the site classes, as Q.
These parameters are adjusted in order to maximize the log likelihood, that
is, the log of the conditional probability that the observed data would result
if the model were correct. At each location l,
we first calculate the probability P(Dl|
qk,
T)
of the observed amino acids at that location, Dl,
resulting from the evolutionary dynamics if the location was assigned to site
class k
with model parameters qk.
given the evolutionary tree topology and branch lengths T.
Since each location can be represented by any of the site classes and each
site class has distinct parameters qk
we have to sum over all possible site classes to calculate the total likelihood
for that location, Ll:
The log-likelihood for the entire set of proteins is calculated as the sum
of the log of this likelihood at each location in the alignment.
While we do not know to which site class a location belongs a
priori, following optimization of the model we can calculate a
posteriori probabilities. The conditional probability that a location
l
belongs to site class k
is given by:
Datasets
A general protein data set was constructed by selecting 42 proteins of length
greater than 80 residues from the list constructed by Hobohm and Sander (10),
all with 6 to 11 homologs of 30% or greater sequence identity listed in the
HSSP database (11). The average number of homologs for each protein was 10.5.
A multiple alignment and phylogenetic tree was created for each set using
the program ClustalV (12). The sequence, structure, and surface accessibilities
were found by use of the DSSP program on the corresponding PDB files (13,
14). Residues were considered exposed if greater than 18% of their surface
area was exposed to solvent.
Models with two site classes were optimized where Fk(Ai)
was a function of all four of Scheraga's orthogonal indices. Separate
analyses were performed for buried and exposed residues. In each case, we
calculated how much each physical chemical parameter contributed to the variance
of the fitness values of the different amino acids for each of the site classes.
For the GPCR project, we selected a group of 185 amine-binding proteins, obtaining
the multiple alignment from GPCRdb (15). We used PHYLIP (16), which uses a
parsimony approach to calculate the best tree from a given set of data. Resulting
trees were optimized for their branch lengths using PAML (17). A model consisting
of 5 site classes was optimized. We then calculated the posterior probabilities
of the site classes for each location In order to interpret the selective
pressure described by each site class, we calculated the correlation coefficient
between the fitness values and physicochemical properties of amino acids.
These properties were derived from the AAindex database (18), which contains
434 different amino acid indices. We avoided indices related to spectroscopic
methods and selected 145 physicochemical indices (see supplement for AAindex
database codes of the used indices). Solvent accessibility calculations for
rhodopsin were done using the publicly available software GETAREA 1.1 (19).
Results
General properties for globular proteins
Representations for the selective constraints on exposed locations is shown
in Figure 2. The optimization resulted in two distinct site classes, one site
class representing the majority of sites (represented by the relative size
of the pie charts for the two site classes), with a faster rate of variation
(larger nk),
with the fitness of the amino acids primarily determined, unsurprisingly,
by the hydrophilicity. In addition, there was a preference for small residues
as well as a slight preference for residues with high turn propensity. The
less common site class, conversely, had a slower rate of variation (smaller
nk),
and had a strong preference for hydrophobic residues.
Figure
2. Pie charts representing the various contributions to the
selective pressures acting on surface locations belonging to the two site
classes. The relative sizes of the charts represents the percentage of the
surface locations assigned to these classes. The color scheme represents the
various Scheraga factors, including hydrophilicity (red), a-helix
or turn propensity (blue), bulk (green), and ß-sheet propensity (magenta).
Solid colors represent a positive correlation with the Scheraga factor, while
a striped pattern represents a negative correlation.
Helical propensity was also important, with smaller preferences for bulkier
residues and residues with a larger sheet propensity.
The situation for buried residues is portrayed in Figure 3. The faster-varying
locations actually occupied a minority of the buried locations, and predictably
preferred hydrophobic residues, although this preference was less strong than
the tendency for faster-varying exposed locations for hydrophilic residues.
Equally strong was a tendency for residues with propensity for b
sheets, as well as a moderate preference for residues with a-helical
preferences, as well as large residues. The larger group of locations were
in a slower-varying site class with a strong tendency towards small residues,
and smaller preferences for hydrophilicity, turn propensity, and b-sheet
propensity.
Figure
3. Pie charts representing the various contributions to the
selective pressures acting on buried locations belonging to the two site classes.
The relative sizes of the charts represents the percentage of the surface
locations assigned to these classes. The color scheme is as for Figure 2.
G-Protein Coupled Receptors
The various parameters for the five site-class model obtained for the amine-binding
GPCRs are listed in Table 1.
By mapping the locations in the amine-binding proteins to the known structure
of Bovine bacteriorhodopsin, we can identify which locations are assigned
to different site classes. Locations which are likely to reside in the membrane
are largely assigned to site classes 1, 2, and 3, while loop locations are
almost entirely assigned to classes 4 and 5. In addition, transmembrane locations
in site classes 3 and 4 are generally in areas exposed to the membrane, while
locations in site classes 1 and 2 generally face into the interior of the
protein.
Table
1. Overall substitution rates and properties of preferred amino
acids for the five site-class model optimized on the set of amine-binding
GPCRs
Discussion
Both the buried and exposed locations can be divided into two different site
classes, a faster-varying set of locations and a slower-varying set. It is
not surprising that the faster varying locations on the exterior prefer hydrophilic
residues, while the faster-varying locations on the interior prefer hydrophobic.
It is surprising, however, that for the slower-varying sites in both contexts
these preferences are reversed. It is likely that the faster-varying sites
are under less purifying selective pressure than the sites that vary more
slowly. While most locations in the inside would be under some selective pressure
to remain hydrophobic, the other, more specialized forms of selective pressure
acting on some locations may favor conservation of such things as particular
hydrogen bond or ionic bond formations. In these locations, this specialized
selective pressure may well favor hydrophilic residues in a way that "trumps"
the more general forms of selective pressure felt by more average locations
in the protein. These locations would have slower substitution rates as well
as more complex forms of the selective pressure. Similarly, these locations
may be involved in specific packing or aromatic stacking interactions, so
the preference for larger residues might be explainable. Conversely, the locations
on the protein exterior that change slowly might be under more specific forms
of selective pressure that prefer hydrophobic residues. In both instances,
the needs of stabilizing a specific conformation may result in a tendency
for specific locations to have selective pressure opposite in form to that
of other, seemingly similar locations. One interesting point to note is that,
for buried locations, the majority of sites are slower varying. Another observation
is that the dominant hydrophobic selective pressure is on faster-varying external
locations to remain hydrophilic. This provides further evidence for the reverse
hydrophobic effect, that is, the need to avoid stabilizing alternative conformations
where these particular locations are buried (20).
The analysis of the GPCRs demonstrate that we can obtain specific structural
information from sets of aligned sequences, even identifying trans-membrane
residues facing into the protein interior or out into the lipid membrane.
All this information is gathered as a result of the optimization procedure,
with no a
priori knowledge about structure or function. As such, it is a powerful
way of generating important information about the new proteins whose sequences
are becoming available.
Acknowledgments
This contribution represents the work of a number of different investigators
in my lab, including Jeffrey Koshi, Darin Taverna, Matthew Dimmic, and Orkun
Soyer, as well as a collaborator, Richard Neubig. Financial support was provided
by NIH Grants GM08270 and LM0577, NSF equipment grant BIR9512955, and a grant
from the University of Michigan Program in Bioinformatics.
References
[1] Wako, H. & Blundell, T. (1994). Use of amino acid environment-dependent
substitution tables and conformational propensities in structure prediction
from aligned sequences of homologous proteins. I. Solvent accessibility classes.
J. Mol. Biol. 238:682-692.
[2] Wako, H. & Blundell, T. (1994). Use of amino acid environment-dependent substitution
tables and conformational propensities in structure prediction from aligned
sequences of homologous proteins. II. Secondary structures. J. Mol. Biol.
238:693-708.
[3] Koshi, J. M. & Goldstein, R. A. (1995). Context-dependent optimal substitution
matrices. Protein Engineering 8:641-645.
[4] Goldman, N., Thorne, J. L., Jones, D. T. (1996). Using evolutionary trees
in protein secondary structure prediction and other comparative sequence analyses,
J. Mol. Evol. 263:196-208.
[5] Koshi, J. M. & Goldstein, R. A. (1998). Models of natural mutations including
site heterogeneity. Proteins 32:289-295.
[6] Koshi, J. M. & Goldstein, R. A. (2001). Analyzing site heterogeneity during
protein evolution. Pacific Symposium on Biocomputing 6:191-202.
[7] Dimmic, M. W. & Goldstein, R. A. (2000). Modeling evolution at the amino acid
level using a general fitness model. In Pacific Symposium on Biocomputing
2000. (Altman, Dunker, Hunger, Lauderdale, and Klein, eds), World Scientific,
Singapore, 18-29.
[8] Soyer, O., Dimmic, M. W., Neubig, R. R., Goldstein, R. A. (2002). Modeling protein
evolution with applications to the understanding of G-Protein Coupled Receptors.
Proteins, submitted.
[9] Kidera, A., Konishi, Y., Oka, M., Ooi, T., Scheraga, H. A. (1985). Statistical
analysis of the physical properties of the 20 naturally occurring amino acids.
J. Prot. Chem. 4:23-55.
[10] Hobohm, U. & Sander, C. (1994). Enlarged representative set of protein structures.
Protein Sci. 3:522-524.
[11] Sander, C. & Schneider, R (1991). Database of homology-derived protein structures
and the structural meaning of sequence alignment. Proteins 9:56-68.
[12] Higgins, D. G., Bleasby, A. J., Fuchs, R. (1992). Clustal V: Improved software
for multiple sequence alignment. CABIOS 8:189-191.
[13] Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure:
Pattern recognition of hydrogen-bonded and geometrical features. Biopoly.
22:2577-2637.
[14] Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F. Jr.,
Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T., Tasumi, M. (1997).
The protein data bank: A computer-based archival file for macromolecular structures.
J. Mol. Biol. 112:535-542.
[15] Horn, F., Weare, J., Beukers, M. W., Horsch, S., Bairoch, A., Chen, W.,
Edvardsen, O., Campagne, F., Vriend, G. (1998). GPCRDB: an information system for
G protein coupled receptors. Nucl. Acid Res. 26:275-279.
[16] Felsenstein, J. (1993). PHYLIP - Phylogeny Inference Package. Cladistics 5:164-66.
[17] Yang, Z. (1994). Maximum Likelihood Phylogenetic Estimation from DNA Sequences
with Variable Rates over Sites: Approximate Methods. J. Molec. Evol. 39:306-314.
[18] Kawashima, S., Ogata, H., Kanehisa M. (1999). AAindex: Amino acid index database.
Nucleic Acid Res. 27:368-369.
[19] Fraczkiewicz, R. & Braun, W. (1998). Exact and efficient analytical calculation
of the accessible surface areas and their gradients for macromolecules. J.
Comp. Chem. 19:319-333.
[20] Koshi, J. M. & Goldstein, R. A. (1997). Mutation matrices: correlations and
implications. Proteins 27:336-344.
Published
in "Molecular Informatics: Confronting Complexity", Martin G. Hicks
& Carsten Kettner (Eds.), Proceedings of the Beilstein-Institut Workshop,
May 13th
- 16th
2002, Bozen, Italy