TY - JOUR
T1 - Enzyme structure correlates with variant effect predictability
AU - van der Flier, Floris
AU - Estell, Dave
AU - Pricelius, Sina
AU - Dankmeyer, Lydia
AU - van Stigt Thans, Sander
AU - Mulder, Harm
AU - Otsuka, Rei
AU - Goedegebuur, Frits
AU - Lammerts, Laurens
AU - Staphorst, Diego
AU - van Dijk, Aalt D.J.
AU - de Ridder, Dick
AU - Redestig, Henning
PY - 2024/12
Y1 - 2024/12
N2 - Protein engineering increasingly relies on machine learning models to computationally pre-screen promising novel candidates. Although machine learning approaches have proven effective, their performance on prospective screening data leaves room for improvement; prediction accuracy can vary greatly from one protein variant to the next. So far, it is unclear what characterizes variants that are associated with large prediction error. In order to establish whether structural characteristics influence predictability, we created a novel high-order combinatorial dataset for an enzyme spanning 3,706 variants, that can be partitioned into subsets of variants with mutations at positions exclusively belonging to a particular structural class. By training four different supervised variant effect prediction (VEP) models on structurally partitioned subsets of our data, we found that predictability strongly depended on all four structural characteristics we tested; buriedness, number of contact residues, proximity to the active site and presence of secondary structure elements. These dependencies were also found in several single mutation enzyme variant datasets, albeit with dataset specific directions. Most importantly, we found that these dependencies were similar for all four models we tested, indicating that there are specific structure and function determinants that are insufficiently accounted for by current machine learning algorithms. Overall, our findings suggest that improvements can be made to VEP models by exploring new inductive biases and by leveraging different data modalities of protein variants, and that stratified dataset design can highlight areas of improvement for machine learning guided protein engineering.
AB - Protein engineering increasingly relies on machine learning models to computationally pre-screen promising novel candidates. Although machine learning approaches have proven effective, their performance on prospective screening data leaves room for improvement; prediction accuracy can vary greatly from one protein variant to the next. So far, it is unclear what characterizes variants that are associated with large prediction error. In order to establish whether structural characteristics influence predictability, we created a novel high-order combinatorial dataset for an enzyme spanning 3,706 variants, that can be partitioned into subsets of variants with mutations at positions exclusively belonging to a particular structural class. By training four different supervised variant effect prediction (VEP) models on structurally partitioned subsets of our data, we found that predictability strongly depended on all four structural characteristics we tested; buriedness, number of contact residues, proximity to the active site and presence of secondary structure elements. These dependencies were also found in several single mutation enzyme variant datasets, albeit with dataset specific directions. Most importantly, we found that these dependencies were similar for all four models we tested, indicating that there are specific structure and function determinants that are insufficiently accounted for by current machine learning algorithms. Overall, our findings suggest that improvements can be made to VEP models by exploring new inductive biases and by leveraging different data modalities of protein variants, and that stratified dataset design can highlight areas of improvement for machine learning guided protein engineering.
KW - Machine learning
KW - Predictability
KW - Protein engineering
U2 - 10.1016/j.csbj.2024.09.007
DO - 10.1016/j.csbj.2024.09.007
M3 - Article
AN - SCOPUS:85205933762
SN - 2001-0370
VL - 23
SP - 3489
EP - 3497
JO - Computational and Structural Biotechnology Journal
JF - Computational and Structural Biotechnology Journal
ER -