1. Introduction
In recent years, computational biology has expanded rapidly, fundamentally reshaping how we understand and manipulate biological systems. The impact of computational approaches is especially visible in protein engineering and molecular design, where they have transformed our capacity to create and optimize proteins with novel functions. Integrating computational methodologies with conventional experimental techniques has opened new avenues in biotechnology, medicine, and related disciplines, yielding more precise and targeted strategies for engineering proteins, discovering drugs, and designing biomolecules with enhanced properties.
Computational methods have become essential for tailoring proteins to diverse biotechnological applications, and new tools and methodologies are developed and refined each year to keep pace with the growing demands of protein engineering [1]. Progress in machine learning and artificial intelligence has markedly improved the accuracy of protein structure prediction and the identification of functional regions, enabling more precise manipulation of protein activity [2]. Computational approaches have also profoundly influenced enzyme design, enabling the development of proteins with enhanced catalytic efficiency and novel functionality [3]. For example, machine learning models that forecast protein stability and interactions have streamlined the design procedure, allowing the rapid creation and production of proteins without the limitations of living cells.
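The kind of stability model mentioned above can be illustrated with a deliberately simple sketch: a one-feature ordinary-least-squares fit relating a sequence-derived feature to a measured stability value. The feature, data points, and units below are invented placeholders, not real measurements or any published model.

```python
# Toy sketch: one-feature linear regression (closed-form least squares)
# relating a sequence feature to protein stability. All numbers are
# synthetic placeholders for illustration only.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical feature: hydrophobic residue fraction; target: melting temp (deg C).
hydro_frac = [0.2, 0.3, 0.4, 0.5]
melting_temp = [45.0, 50.0, 55.0, 60.0]

slope, intercept = fit_line(hydro_frac, melting_temp)
predicted_tm = slope * 0.35 + intercept  # estimate for an unseen protein
```

Real stability predictors use far richer structural and sequence features and nonlinear models, but the fit-then-predict loop is the same.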
The combination of computational and experimental methods has accelerated the design process by enabling the construction of targeted libraries for laboratory evolution, reducing the vast sequence space that must be sampled [4]. Platforms such as Mutexa illustrate efforts to build intelligent ecosystems that integrate high-performance computing with bioinformatics and quantum chemistry, streamlining the identification of promising protein variants [5]. However, obstacles remain in scaling these technologies and making them accessible to a broader research community, a prerequisite for fully realizing their potential against global challenges such as sustainable development and healthcare [6].
Computational methods have also become central to drug development, driven by recent progress in deep learning and artificial intelligence that enables the rapid identification of a wide range of potent and selective ligands. These advances promise to make drug discovery more broadly accessible and open new routes to the efficient design of safer, more effective small-molecule medicines. The continued development of computational tools and their integration with experimental approaches is paving the way for remarkable innovation in protein design within synthetic biology.
The continuing progress of computational biology heralds a new era of protein engineering and molecular design marked by greater accuracy, efficiency, and creativity. As the field matures, integrating computational and experimental approaches will be essential to overcome current hurdles and fully realize its promise for biotechnology and pharmaceuticals. This review presents a comprehensive summary of recent developments in computational approaches to protein engineering and molecular design, emphasizing their significant influence on the field.
Figure 1.
Development and application of AI algorithms in biotechnology. (A), (B) Various AI algorithms contribute significantly to the development of biotechnology. Representatively, CNNs (Convolutional Neural Networks) are utilized for protein structure prediction through the prediction of distances and contact maps between residues. Additionally, RNNs (Recurrent Neural Networks) play a crucial role in sequence optimization through temporal-relationship and sequential-pattern modeling. (C) Recently, algorithms such as GANs (Generative Adversarial Networks), RL (Reinforcement Learning), transfer learning, and few-shot learning have demonstrated their efficiency in modeling protein structures and interactions. These advanced algorithms are being utilized to overcome limitations in collecting the data required for model training as well as limitations in designing new proteins. (D) Explainable AI (XAI) provides transparency and insight into modeling results by elucidating the decision-making process behind the opaque “black box” judgment criteria of existing AI-based predictive models. Advances in AI algorithms have significantly progressed protein engineering; however, they still require experimental validation. The integration of domain expertise with AI-based methodologies, also known as informed AI, can potentially enhance model efficiency and reliability and provide more accurate insights consistent with validated domain knowledge.
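To make the contact-map idea in panels (A), (B) concrete, here is a minimal sketch in plain Python: a single hand-written 3×3 mean filter slides over a synthetic residue distance map, and the smoothed map is thresholded into a binary contact map. A real network learns many such filters from data; the matrix, filter weights, and 6 Å cutoff here are toy assumptions.

```python
# Toy sketch of CNN-style processing of a residue distance map: slide a
# 3x3 averaging filter over the matrix, then threshold the smoothed map
# into contacts. Filter weights and cutoff are illustrative, not learned.

def conv2d(matrix, kernel):
    """Valid-mode 2D cross-correlation implemented with plain loops."""
    n, m, k = len(matrix), len(matrix[0]), len(kernel)
    out = []
    for i in range(n - k + 1):
        row = []
        for j in range(m - k + 1):
            row.append(sum(matrix[i + a][j + b] * kernel[a][b]
                           for a in range(k) for b in range(k)))
        out.append(row)
    return out

# Synthetic 5x5 "distance map" (angstroms): small values near the diagonal.
dist = [[abs(i - j) * 4.0 for j in range(5)] for i in range(5)]

smooth = conv2d(dist, [[1 / 9.0] * 3 for _ in range(3)])  # 3x3 mean filter
contacts = [[1 if d < 6.0 else 0 for d in row] for row in smooth]  # toy cutoff
```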
Figure 2.
This figure illustrates the advanced computational techniques used in protein structure prediction, ligand-protein interaction modeling, and enzyme engineering. (A) Homology modeling (left image) infers the structure of a protein with an unknown structure from the structure of a related sequence, based on the observation that proteins with similar sequences tend to have similar structures. Threading techniques (right image) predict a new structure by scoring the alignment of the target sequence against a template library of protein folds when no structurally similar sequences are available. Both methods are used for protein structure prediction in the absence of experimental data. (B) Quantum mechanics is used to predict the interactions between a ligand and a protein, while molecular mechanics models the interactions between a protein and its surrounding environment. The combined use of these two approaches, known as a hybrid method, has been enhanced by recent advancements in parallel computing, overcoming previous limitations and contributing to the development of high-success-rate drugs. (C) The diagram on the left illustrates the process of aligning multiple protein sequences, enabling researchers to extract information more efficiently from refined sequences. Phylogenetic analysis allows the determination of relative distances between elements, and by integrating MSA (Multiple Sequence Alignment) with phylogenetic approaches, information can be analyzed more effectively. (D) Structure-based design methods (left) are used for protein-ligand binding; the panel lists examples of the underlying analytical techniques. Sequence-based design methods (right) are primarily applied to protein-protein interactions and can be broadly categorized into gene and protein sequence analysis.
(E) Applying machine learning to enzyme engineering allows for predicting enzyme activity based on library data, improving enzyme stability, and facilitating enzyme development. It also helps explore methods to enhance the efficiency of catalysts or assists in selecting the appropriate catalyst. (F) The development of deep learning software such as AlphaFold has enabled rapid results in high-throughput virtual screening without the need for experimental procedures. Additionally, such software can significantly contribute to understanding enzyme-protein interactions within enzyme libraries, particularly in terms of stability, activity, and selectivity.
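The MSA analysis sketched in panel (C) can be illustrated in a few lines: after sequences are aligned, per-column Shannon entropy distinguishes conserved positions (low entropy) from variable ones. The four-sequence alignment below is invented for illustration.

```python
# Toy sketch: column-wise conservation of a multiple sequence alignment,
# scored as Shannon entropy in bits (0 = perfectly conserved column).
import math

msa = [  # invented, pre-aligned sequences
    "ACDEG",
    "ACDFG",
    "ACDEG",
    "TCDEG",
]

def column_entropy(column):
    """Shannon entropy (bits) of residue frequencies in one column."""
    counts = {}
    for aa in column:
        counts[aa] = counts.get(aa, 0) + 1
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

entropies = [column_entropy(col) for col in zip(*msa)]
# Columns 2, 3, and 5 (C, D, G) are perfectly conserved: entropy 0.
```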
Figure 3.
This figure illustrates various computational techniques used to enhance sampling efficiency and reduce computational resources in biomolecular simulations, highlighting their distinct approaches and applications. (A) Diagram of replica exchange molecular dynamics (left). This method forms multiple replicas and allows efficient simulation sampling through periodic exchanges of components between these replicas. It is particularly suitable for scenarios involving high energy barriers in biomolecular interactions and can be conducted at different temperatures. Diagram illustrating the difference between metadynamics and adaptive sampling methods in terms of stochastic reset (right). Stochastic reset refers to the model probabilistically reverting to a previous state; metadynamics prevents this by introducing a bias potential, while adaptive sampling intentionally restarts the model at specific locations to enhance the sampling method. (B) Diagram of the MARTINI model and its advantages (left). The MARTINI model simplifies molecular systems by grouping multiple elements (primarily atoms) into larger entities called beads, rather than treating each element individually. This simplification reduces the degrees of freedom, significantly lowering the computational resources required and enabling longer simulations with limited resources. Schematic of Elastic Network Models (ENMs) (right). ENMs represent the forces between biomolecules in large simulation environments using a spring model, where each node typically represents an alpha carbon. The further a node is displaced from its equilibrium position, the stronger the restoring force, allowing the possible conformations of biomolecules upon deformation to be inferred from this model. (C) Neural network potentials, such as TorchMD, enable 3D modeling and high-energy-barrier calculations through machine learning. When combined with enhanced sampling techniques or experimental data, neural network potentials can achieve greater accuracy and efficiency.
(D) An integrated model utilizing machine learning tools such as dimensionality reduction, regression, and clustering enables the modeling of complex biomolecular systems, such as detecting protein-ligand interactions.
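The coarse-graining idea behind the MARTINI model in panel (B) reduces degrees of freedom by mapping groups of atoms onto beads. Below is a minimal sketch using invented coordinates and an arbitrary 4-to-1 mapping, not real MARTINI bead definitions or parameters.

```python
# Toy sketch of coarse-graining: several atoms are mapped onto one bead
# placed at their geometric center, shrinking the number of degrees of
# freedom. Coordinates and the 4-to-1 mapping are invented for illustration.

def coarse_grain(atoms, beads_map):
    """atoms: list of (x, y, z); beads_map: list of atom-index lists per bead."""
    beads = []
    for idx in beads_map:
        n = len(idx)
        beads.append(tuple(sum(atoms[i][d] for i in idx) / n for d in range(3)))
    return beads

atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0),
         (4.0, 0.0, 0.0), (5.0, 0.0, 0.0), (5.0, 1.0, 0.0), (4.0, 1.0, 0.0)]

beads = coarse_grain(atoms, [[0, 1, 2, 3], [4, 5, 6, 7]])
# 8 atoms (24 degrees of freedom) become 2 beads (6 degrees of freedom).
```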
Figure 4.
This figure highlights various approaches that enhance the accuracy and reliability of drug discovery processes by integrating computational models, experimental data, and deep learning methods. It showcases how combining these elements can improve prediction performance, structural accuracy, and lead compound optimization. (A) A model integrating output data from various software improves prediction performance, generates new evaluation metrics, and provides more reliable information during the virtual screening stage. Input parameters include docking scores, molecular (or component) poses, and representations of complexes. (B) Experimental data-based libraries enable the use of various software tools. These libraries compile 3D structures obtained through methods such as X-ray crystallography, electron microscopy (EM), and NMR spectroscopy. By leveraging actual data, software like AlphaFold and HADDOCK can achieve highly accurate structural predictions, ultimately contributing to the drug development process. (C) A deep learning model for simulating the binding of lead compound candidates to target proteins can achieve superior performance by integrating structure-activity relationship data with experimental data. Experimental data can be sourced from databases like PDB, which mainly include data obtained from X-ray crystallography, electron microscopy (EM), and NMR spectroscopy. Ultimately, the integrated deep learning model enhances selectivity and affinity during the lead compound optimization stage, improving efficiency and accuracy at every step.
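The score-integration idea in panel (A) can be sketched as simple rank-based consensus scoring: per-tool docking scores are converted to ranks and averaged, so that no single program's score scale dominates. The compound names and scores below are invented for illustration.

```python
# Hedged sketch of consensus scoring in virtual screening: docking scores
# from two hypothetical programs are converted to ranks (1 = best, lower
# score = better binding) and rank-averaged. All values are invented.

def rank_scores(scores, lower_is_better=True):
    """Map each compound name to its rank (1 = best) under one scheme."""
    ordered = sorted(scores, key=scores.get, reverse=not lower_is_better)
    return {name: r + 1 for r, name in enumerate(ordered)}

tool_a = {"cmpd1": -9.2, "cmpd2": -7.5, "cmpd3": -8.1}  # kcal/mol-like scale
tool_b = {"cmpd1": -8.5, "cmpd2": -8.9, "cmpd3": -6.0}  # different scale

ranks_a = rank_scores(tool_a)
ranks_b = rank_scores(tool_b)
consensus = {c: (ranks_a[c] + ranks_b[c]) / 2 for c in tool_a}
best = min(consensus, key=consensus.get)  # lowest average rank wins
```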
Figure 5.
Enhanced functionalities of proteins through computational protein design and development. (A) Advancements in computational techniques, including deep learning models like RFdiffusion, AlphaFold2, and ProteinMPNN, have significantly improved de novo protein design. Zernike polynomials, Molecular Surface Interaction Fingerprinting (MaSIF), and molecular dynamics techniques help optimize protein-protein interactions. (B) ThermoMPNN is a computational tool that uses a deep neural network trained to predict stability changes for point mutations of a given protein with an initial structure. DeepEvo is an AI-based protein engineering strategy that uses a protein language model to predict thermostability variants. (C) Allosteric transition simulations using multiscale modeling and Markov state models can predict protein functions, enabling the creation of customized allosteric regulatory proteins and the development of new protein functions. (D) Deep learning-based computational tools like Rosetta precisely modify protein structures to enhance binding capabilities, enabling de novo protein design with customized binding properties. (E) Computational design of domain-fusion and chimeric proteins uses structural databases and computational technologies such as machine learning to generate multifunctional proteins.
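The point-mutation stability scanning described in panel (B) follows a simple loop, sketched below with a deliberately fake, deterministic scoring function standing in for a trained predictor such as ThermoMPNN (a real model's scores are learned from structures, not arithmetic).

```python
# Toy sketch of in-silico saturation mutagenesis: score every single-point
# mutation of a sequence and keep the lowest (most "stabilizing") predicted
# ddG values. fake_ddg is an arbitrary placeholder, NOT a real predictor.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fake_ddg(seq, pos, new_aa):
    """Deterministic stand-in for a learned ddG prediction."""
    old_i = AMINO_ACIDS.index(seq[pos])
    new_i = AMINO_ACIDS.index(new_aa)
    return (new_i - old_i) * 0.1 + pos * 0.01

def scan_mutations(seq):
    """Return (position, old_aa, new_aa, score) for all point mutations."""
    results = [(pos, old, new, fake_ddg(seq, pos, new))
               for pos, old in enumerate(seq)
               for new in AMINO_ACIDS if new != old]
    return sorted(results, key=lambda r: r[3])  # most stabilizing first

top = scan_mutations("WDK")[:3]  # best three candidates for a toy tripeptide
```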
Figure 6.
Protein engineering applications using computational approaches in biotechnology and pharmaceuticals. (A) High-throughput sequencing data and geometric deep learning can enhance antibody binding prediction capabilities. Computational technologies such as deep learning enable sequence-based antibody design, providing advanced approaches to antibody engineering. (B) Computational and structural methods, such as deep learning and quantum mechanical molecular dynamics simulations, have enabled the prediction of atomic-level movements of biomolecules, leading to improvements in the applicability, accuracy, and specificity of protein-based biosensors. (C) Advancements in computational technologies such as machine learning, combined with high-throughput screening, have enabled improved enzyme engineering with enhanced catalytic properties, leading to increased stability, activity, and selectivity of enzymes. (D) Computational technologies play a crucial role in therapeutic protein design, particularly in predicting peptide-MHC binding affinity. These methods not only advance personalized medicine but also accelerate the clinical application of protein therapeutics.
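The peptide-MHC affinity prediction highlighted in panel (D) is often built on position-specific scoring. A minimal sketch uses a position weight matrix (PWM) with invented weights; real predictors learn such weights, or far richer models, from large binding datasets.

```python
# Toy sketch of PWM scoring for peptide-MHC binding: each position
# contributes a log-odds-like weight for the residue observed there.
# The 3-position matrix and all weights are invented for illustration.

pwm = [
    {"A": 1.2, "L": 0.8, "G": -0.5},   # anchor-like position 1
    {"A": 0.1, "L": 0.2, "G": 0.0},    # permissive position 2
    {"A": -0.3, "L": 1.5, "G": -1.0},  # anchor-like position 3
]

def score_peptide(peptide, pwm, default=-0.2):
    """Sum per-position weights; unseen residues get a small penalty."""
    return sum(pos.get(aa, default) for pos, aa in zip(pwm, peptide))

# Rank invented tripeptides from strongest to weakest predicted binder.
binders = sorted(["ALL", "GGG", "LAL"],
                 key=lambda p: score_peptide(p, pwm), reverse=True)
```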
Figure 7.
Challenges and future perspectives in computational approaches to protein engineering applications. (A) Current force fields have limitations in accurately capturing changes in electrostatic interactions, which impacts the accuracy and reliability of simulations. Integrating computational tools with experimental validation is essential for enhancing the accuracy and efficiency of protein design. Ethical issues related to bias, transparency, and accountability arise in the application of AI in protein engineering. (B) The integration of multi-scale modeling approaches is essential for understanding the complex dynamics of protein systems and developing proteins with new functions, and the advancement of these models holds great potential in the field of computational protein design. The combination of computational protein design and synthetic biology enables the development of innovative proteins.