Data-driven methods in protein engineering: new ways to utilize sequence and structures of proteins

Mathematical Biology and Ecology Seminar
Wednesday, October 22, 2008 - 11:00
1 hour (actually 50 minutes)
Skiles 255
School of Chemistry & Biochemistry, Georgia Tech
After rational protein design and combinatorial protein engineering (directed evolution), data-driven protein engineering is emerging as a third generation of techniques for improving protein properties, and it relies heavily on mathematical algorithms. In the first example, we developed a method for predicting the positions in the amino acid sequence that are critical for the catalytic activity of a protein. Using nucleotide sequences of both functional and non-functional variants and a Support Vector Machine (SVM) learning algorithm, we set out to narrow the relevant sequence space of proteins, i.e., to find the truly relevant positions. Variants of TEM-1 β-lactamase were created in silico by simulating both mutagenesis and recombination protocols. The algorithm was able to predict critical positions that tolerate up to two amino acids. Some pairs of amino acid residues are known to lead to inactive sequences unless they are mutated jointly. In the second example, we use SVM, Boolean learning (BL), and a combination of the two (BLSVM) to find such interacting residues. Results on interacting residues in two fluorescent proteins, Discosoma Red Fluorescent Protein (Ds-Red) and monomeric Red Fluorescent Protein (mRFP), will be presented.
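As a rough illustration of the first example, the idea of learning critical positions from functional and non-functional variants can be sketched with a linear SVM on one-hot encoded sequences. This is not the speaker's implementation. The 4-letter alphabet, the sequence length, the critical position, the tolerated residues, and the exhaustive enumeration of variants are all hypothetical toy choices. The per-position score (spread of the learned weights within a position) is likewise an assumption made for illustration, and the SVM is trained by plain subgradient descent rather than a library solver.

```python
from itertools import product

ALPHABET = "AGST"        # toy alphabet (assumption; real proteins use 20 residues)
SEQ_LEN = 4              # toy sequence length (assumption)
CRITICAL_POS = 2         # hypothetical critical position, 0-based
TOLERATED = {"S", "T"}   # hypothetical residues tolerated at that position


def make_toy_variants():
    """Enumerate every toy sequence; label +1 if functional, else -1."""
    seqs = ["".join(p) for p in product(ALPHABET, repeat=SEQ_LEN)]
    labels = [1 if s[CRITICAL_POS] in TOLERATED else -1 for s in seqs]
    return seqs, labels


def active_features(seq):
    """Indices of the one-hot features switched on by a sequence."""
    k = len(ALPHABET)
    return [pos * k + ALPHABET.index(aa) for pos, aa in enumerate(seq)]


def train_linear_svm(seqs, labels, lr=0.1, lam=0.01, epochs=200):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    n = len(seqs)
    feats = [active_features(s) for s in seqs]
    w = [0.0] * (len(seqs[0]) * len(ALPHABET))
    b = 0.0
    for _ in range(epochs):
        step_w = [0.0] * len(w)   # negative hinge subgradient, summed over samples
        step_b = 0.0
        for idx, y in zip(feats, labels):
            margin = y * (b + sum(w[j] for j in idx))
            if margin < 1:        # only margin violators contribute
                for j in idx:
                    step_w[j] += y
                step_b += y
        # Shrink weights (L2 penalty), then move along the averaged step.
        w = [(1 - lr * lam) * wj + lr * sj / n for wj, sj in zip(w, step_w)]
        b += lr * step_b / n
    return w, b


def predict(w, b, seq):
    """Classify a sequence as functional (+1) or non-functional (-1)."""
    return 1 if b + sum(w[j] for j in active_features(seq)) >= 0 else -1


def position_scores(w, seq_len):
    """Score each position by the spread of its residue weights."""
    k = len(ALPHABET)
    return [max(w[p * k:(p + 1) * k]) - min(w[p * k:(p + 1) * k])
            for p in range(seq_len)]


seqs, labels = make_toy_variants()
w, b = train_linear_svm(seqs, labels)
scores = position_scores(w, SEQ_LEN)
ranking = sorted(range(SEQ_LEN), key=lambda p: scores[p], reverse=True)
print("positions ranked by importance:", ranking)
```

In this toy setting, a position where the choice of residue does not affect the label ends up with identical weights for all residues, so its spread is zero, while the position that determines activity separates tolerated from non-tolerated residues. Real data would be noisy and far larger, so any such ranking would need validation.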