Mechanistically Interpretable QSAR and Machine Learning Models for Predicting Ames TA98 + S9 Mutagenicity of Aromatic and Heteroaromatic Amines
Lihui Xin
Co-Presenters: Individual Presentation
College: Hennings College of Science Mathematics and Technology
Major: BA.BIOLOGY
Faculty Research Mentor: Kar, Supratik
Abstract:
Chemical mutagenicity remains a major bottleneck in pharmaceutical discovery and environmental risk assessment because experimental testing is slow, costly, and dependent on biological materials. This study focuses on the Ames TA98 + S9 endpoint (Salmonella typhimurium TA98 with metabolic activation), a critical assay for identifying frameshift mutagens among aromatic and heteroaromatic amines that require bioactivation. Using a curated dataset of 305 compounds enriched in aromatic/heteroaromatic amines and their derivatives, we developed mechanistically interpretable regression-based QSAR and machine learning (ML) models to support screening, prioritization, and confidence-guided decision-making. The best-performing QSAR model was a multiple linear regression equation comprising nine descriptors that capture (i) N-O topological patterns, (ii) detour-matrix connectivity, (iii) secondary/tertiary nitrogen edge-distance relationships, and (iv) polarizability-weighted autocorrelations. These features provide interpretable links to metabolic activation potential and DNA interaction propensity. Internal validation showed good fit and robustness (R² = 0.75; Q²LOO = 0.73). External testing remained predictive (R² = 0.72; Q²F1/R²pred = 0.72), and the model satisfied the standard Golbraikh–Tropsha criteria, while an MAE-based diagnostic indicated scope to further reduce error (MAE95 = 0.60). The combined QSAR and ML models were subsequently applied to a large external library of heteroaromatic amines to fill mutagenicity data gaps, prioritize candidates for confirmatory testing, and support safer chemical design and regulatory screening. The next step is to deploy the models as an open-source application with applicability-domain flags and batch-prediction capabilities for community use.
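For readers unfamiliar with the validation statistics quoted above, the following is a minimal sketch of how such metrics are commonly computed for a multiple linear regression model. It is not the authors' code. The data are synthetic (random descriptors and responses), the 244/61 train/test split is an assumption, and all function names (fit_mlr, predict_mlr, r2, q2_loo, q2_f1, mae95) are hypothetical. MAE95 is assumed here to mean the mean absolute error after discarding the 5% of test compounds with the largest absolute residuals; the abstract does not define it.

```python
import numpy as np

def fit_mlr(X, y):
    """Fit ordinary least squares with an intercept; return the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Predict responses for descriptor matrix X using fitted coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def q2_loo(X, y):
    """Leave-one-out Q2: refit without each compound, then predict it."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        coef = fit_mlr(X[mask], y[mask])
        press += (y[i] - predict_mlr(coef, X[~mask])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def q2_f1(y_train, y_test, y_test_hat):
    """External Q2_F1: test-set PRESS scaled by deviations from the training mean."""
    press = np.sum((y_test - y_test_hat) ** 2)
    return 1.0 - press / np.sum((y_test - y_train.mean()) ** 2)

def mae95(y_test, y_test_hat):
    """MAE after dropping the 5% of test compounds with the largest |residual|
    (assumed definition, see the note above)."""
    abs_err = np.sort(np.abs(y_test - y_test_hat))
    n_drop = (5 * len(abs_err)) // 100
    return abs_err[: len(abs_err) - n_drop].mean()

# Synthetic stand-in data: 305 compounds, 9 descriptors (NOT the study dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(305, 9))
y = X @ rng.normal(size=9) + rng.normal(scale=1.0, size=305)

# Assumed random 244/61 split into training and external test sets.
idx = rng.permutation(305)
train, test = idx[:244], idx[244:]
X_train, y_train, X_test, y_test = X[train], y[train], X[test], y[test]

coef = fit_mlr(X_train, y_train)
y_test_hat = predict_mlr(coef, X_test)

print("R2 (train):", round(r2(y_train, predict_mlr(coef, X_train)), 3))
print("Q2 LOO    :", round(q2_loo(X_train, y_train), 3))
print("Q2 F1     :", round(q2_f1(y_train, y_test, y_test_hat), 3))
print("MAE95     :", round(mae95(y_test, y_test_hat), 3))
```

The printed numbers describe only the synthetic data and will not match the values reported in the abstract. Leave-one-out is written as an explicit refit loop for readability; for ordinary least squares, Q²LOO can never exceed the training R², which gives a quick sanity check on any implementation.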