What is protein structure prediction: Anfinsen's dogma, machine learning, and the first reliable algorithm RaptorX


This article traces the history of protein structure prediction, from Anfinsen's dogma to machine learning for predicting secondary structure, and on to Jinbo Xu's development of the first reliable deep residual neural network, which computes contact maps and predicts the complete 3D structure of proteins.

Anfinsen's Dogma and Protein Structure Prediction

Although proteins carry out almost all life activities, and do so with high efficiency, each protein can perform only one task or one class of tasks, which is why proteins are so varied. Their primary structure is a linear chain composed of up to 20 types of amino acids. However, proteins exert their biological functions only after folding into a specific three-dimensional structure.

In 1961, American biochemist Christian Anfinsen discovered that certain chemicals targeting hydrogen bonds and disulfide bonds caused RNase to lose its structure and biological activity. When these chemicals were removed, the denatured RNase returned to its original state. This repeated denaturation and renaturation led Anfinsen to propose a hypothesis: under appropriate conditions, an amino acid chain automatically folds into its lowest free-energy state, which remains stable under slight disturbances in the surrounding environment; this lowest free-energy state lies in a valley with no other low-energy states nearby, so the free-energy landscape looks more like a steep funnel than a flat basin. (A few proteins do misfold stably, such as the prions associated with mad cow disease.) The hypothesis is therefore also known as Anfinsen's dogma, or the thermodynamic hypothesis.

This suggests that the three-dimensional information of a protein is already contained in its amino acid sequence. Since the principle that structure determines function is particularly significant in the life sciences, we can in turn infer a protein's function directly from its amino acid sequence, and even replace certain amino acids to design proteins at will.

Machine learning was long unreliable for predicting protein structure

To date, three kinds of experimental instruments have been invented to determine protein three-dimensional structures: nuclear magnetic resonance, X-ray diffraction, and cryo-electron microscopy. But they are expensive and inefficient: by 2010, only around 100,000 protein structures had been determined. These solved structures serve as homologous references that improve AI prediction of protein structures.

The problem is so challenging that scientists resorted to predicting protein secondary structure instead. They calculated the probability of each amino acid appearing in α-helices, β-sheets, turns, and random coils. If a region of the sequence contains many amino acids likely to form an α-helix, that region is predicted to be an α-helix. This approach ignores interactions between amino acids, and its accuracy is only about 50%. The most representative example is the Chou-Fasman method of the 1970s. The later GOR method considers not only these probabilities but also the effects of 16 neighboring amino acids on the structure. However, it still ignores amino acids that are far apart in the sequence and is limited to a single sequence, so its accuracy does not exceed 65%.
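The propensity idea above can be sketched in a few lines. This is a minimal illustration of a Chou-Fasman-style predictor, not the published method: the propensity values and the window size are illustrative assumptions, and only a handful of amino acids are listed.

```python
# Illustrative helix/sheet propensities (NOT the published Chou-Fasman
# table); values > 1.0 mean the residue favors that structure.
HELIX = {"A": 1.42, "E": 1.51, "L": 1.21, "K": 1.16, "V": 1.06, "G": 0.57, "P": 0.57}
SHEET = {"A": 0.83, "E": 0.37, "L": 1.30, "K": 0.74, "V": 1.70, "G": 0.75, "P": 0.55}

def predict_ss(seq, window=6):
    """Label each residue H (helix), E (sheet), or C (coil) by the
    average propensity of a window centred on it."""
    labels = []
    for i in range(len(seq)):
        w = seq[max(0, i - window // 2): i + window // 2 + 1]
        h = sum(HELIX.get(a, 1.0) for a in w) / len(w)
        e = sum(SHEET.get(a, 1.0) for a in w) / len(w)
        if h > 1.0 and h >= e:
            labels.append("H")
        elif e > 1.0:
            labels.append("E")
        else:
            labels.append("C")
    return "".join(labels)

print(predict_ss("AEALKVVGLE"))
```

Because each label depends only on a short local window, residues far apart in the chain can never influence each other here, which is exactly the limitation the article describes.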

By the early 1990s, several thousand protein structures had been precisely determined. It turned out that higher-level structures are more conserved than primary structures: if the similarity between two amino acid sequences exceeds 30%, they are considered homologous, with similar structure and function; if it exceeds 60%, there is reason to believe they share the same structure. The PHD algorithm takes homologous sequences into account: features extracted from the sequence (such as probabilities, local subsequences, and physicochemical properties) are fed into a BP (backpropagation) neural network, which analyzes the relationship between these features and the protein's secondary structure much as one would process the context of a sentence. A 3-layer BP network is not deep learning, however, and PHD still does not consider global information, so its accuracy is only around 70%.
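The 30% and 60% thresholds above refer to percent sequence identity. As a minimal sketch, assuming the two sequences have already been aligned (with gaps written as "-"), identity can be computed like this; the sequences shown are made-up fragments:

```python
def percent_identity(a, b):
    """Percent identity between two pre-aligned, equal-length sequences.
    Gaps are written as '-'; identity is counted over positions where
    both sequences have a residue."""
    assert len(a) == len(b), "sequences must be aligned to equal length"
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    aligned = sum(1 for x, y in zip(a, b) if x != "-" and y != "-")
    return 100.0 * matches / aligned if aligned else 0.0

# Two toy aligned fragments: 6 identical residues out of 8 aligned positions
print(percent_identity("MKT-AYIAK", "MKTQAYVAR"))  # → 75.0
```

Real homology detection uses alignment algorithms and substitution matrices rather than exact matching, but the thresholding logic is the same: above ~30% identity, similar structure is assumed.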

RaptorX: Jinbo Xu built the first reliable AI for protein structure prediction

Before RaptorX, the first reliable artificial-intelligence algorithm for this task, no algorithm was trustworthy for protein structure prediction: secondary-structure accuracy was only about 70%, let alone predicting three-dimensional structure.

From 2001 to 2006, the tool Jinbo Xu used was energy minimization, whose core idea is that an amino acid chain folds into its lowest-energy state automatically, just as a ball pushed from a mountaintop rolls straight into the valley. The various interactions between side chains, such as van der Waals forces, electrostatic forces, hydrogen bonds, and hydrophobic interactions, are described by energy functions. Although a computer can in theory find the minimum potential energy, this method depends heavily on physical understanding and careful model building; accounting for every factor is difficult, and large amounts of computational resources are consumed. It works well for small molecules, but for complex molecules, especially proteins containing thousands of atoms or more, its predictions fall far from experimental data. Jinbo Xu therefore concluded that energy minimization had no future.
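The "ball rolling downhill" picture can be made concrete with a toy example: gradient descent on a single Lennard-Jones pair potential (one of the van der Waals terms mentioned above). A real force field sums many such terms, plus electrostatics and hydrogen bonds, over thousands of atoms, which is where the cost and modeling difficulty come from.

```python
# Toy energy minimization: gradient descent on a Lennard-Jones pair
# potential U(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6).
def lj_energy(r, eps=1.0, sigma=1.0):
    return 4 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)

def lj_gradient(r, eps=1.0, sigma=1.0):
    # Analytic derivative dU/dr
    return 4 * eps * (-12 * sigma**12 / r**13 + 6 * sigma**6 / r**7)

r = 1.5                       # starting separation between two atoms
for _ in range(2000):         # "roll the ball downhill"
    r -= 0.01 * lj_gradient(r)

print(round(r, 3))            # approaches the true minimum at 2**(1/6) ≈ 1.122
```

With one variable the valley is easy to find; a protein's energy landscape has thousands of coupled variables and countless local valleys, which is why naive minimization breaks down there.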

Consequently, Jinbo Xu turned to machine learning. If energy minimization requires humans to guide the computer step by step, machine learning instead lets computers learn from known protein sequences and structures on their own, discovering patterns that can be applied to unknown structures. He tried various methods, including deep learning, which was already highly effective in image recognition, but the results were poor. The decade from 2006 to 2016 was the darkest period in the field: people concluded that protein structure prediction was unsolvable and left; research funding became scarcer; fewer teams entered the CASP competition.

The breakthrough came in 2016. "For a sequence with 300 amino acids, previous deep learning only used local information to predict its structure," Jinbo Xu said. "The key is to make the AI use all global information, from the first to the 300th amino acid." All this information is summarized into a matrix and fed into a deep residual neural network, which estimates the distance between every pair of amino acids and generates a matrix called a contact map. (This is a binary matrix: if the distance between two amino acids is less than a threshold, such as 8 Å, they are considered to be interacting and the element is set to 1; otherwise it is set to 0.) The contact map is then input into commercial protein-design software to obtain the 3D structure.
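The thresholding step described in parentheses is simple to write down. A minimal sketch, assuming one representative coordinate per residue (commonly a Cβ atom, in ångströms):

```python
import numpy as np

# Turn per-residue coordinates (shape (N, 3), in ångströms) into a
# binary contact map: 1 if two residues are closer than the threshold.
def contact_map(coords, threshold=8.0):
    diff = coords[:, None, :] - coords[None, :, :]   # (N, N, 3) pairwise vectors
    dist = np.sqrt((diff ** 2).sum(-1))              # (N, N) distance matrix
    return (dist < threshold).astype(int)

# Three toy "residues": the first two are close, the third is far away
coords = np.array([[0.0, 0.0, 0.0],
                   [5.0, 0.0, 0.0],
                   [20.0, 0.0, 0.0]])
print(contact_map(coords))
# [[1 1 0]
#  [1 1 0]
#  [0 0 1]]
```

In prediction the network works in the other direction: it outputs this N×N matrix from the sequence, and structure-building software then searches for 3D coordinates consistent with it.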

Jinbo Xu used his specially designed residual neural network to predict the structure of a membrane protein with more than 200 amino acids and found the error was only about 0.2 nanometers, roughly the width of two atoms. Why choose membrane proteins? They must be embedded in biological membranes to maintain their shapes, so they are hard to extract and are therefore scarce in databases. RaptorX, built on deep learning, predicted the structure well even without similar PDB structures to train on. This suggests that the deep neural network truly captured some underlying regularities once it understood global information.

Frequently Asked Questions

Why was academia completely defeated by Google in the field of protein structure prediction?

In 2014, Jinbo Xu didn't even have a GPU and could only rely on CPUs to train small deep neural networks. By 2016, although he used a 12 GB GPU to build a 60-layer neural network, he could not add too many parameters, or training would fail due to insufficient memory. He applied for research funding to set up a computer with four GPUs. His small team had to write every line of code themselves; the total number of team members and collaborators never exceeded five.

This is in stark contrast to the AlphaFold team, backed by Google's abundant computing power and talent. The AlphaFold team assembled about 30 professionals, including several AI engineers dedicated to code optimization, and Google developed TPUs specifically to replace GPUs for all of its AI training. Struggling to obtain comparable funding, the academic community has fallen further and further behind AlphaFold in protein structure prediction.