Quantitative definition of chemical synthetic pathway complexity of organic compounds

(1) Dublin High School, Dublin, CA, (2) Saratoga High School, Saratoga, CA, (3) Leigh High School, San Jose, CA, (4) Monta Vista High School, Cupertino, CA, (5) Westlake High School, Austin, TX, (6) BASIS Independent Silicon Valley, San Jose, CA, (7) Milpitas High School, Milpitas, CA, (8) Department of Computer Science & Engineering, Aspiring Scholars Directed Research Program, Fremont, CA, (9) Department of Chemistry, Biochemistry, & Physics, Aspiring Scholars Directed Research Program, Fremont, CA

https://doi.org/10.59720/22-009

Irrespective of a molecule's final application, synthetic accessibility is the rate-determining step in discovering and developing novel entities. However, synthetic complexity is challenging to quantify as a single metric because it is a composite of several measurable factors, including cost, safety, and availability. Moreover, defining a single synthetic accessibility metric for both natural and non-natural products poses a further challenge, given the structural distinctions between these two classes of compounds. Here, we propose a model for the synthetic accessibility of all chemical compounds, inspired by the Central Limit Theorem, and devise a novel synthetic accessibility metric, fitted to a Gaussian distribution, that assesses the overall feasibility of making a chemical compound. Our approach uses a Gaussian mixture model (GMM) and an autoencoder to rank synthetic complexity for natural products. This model can inform total synthesis of natural products, process chemistry in pharmaceutical contexts, materials science, and chemical engineering. Based on our findings, we conclude that the autoencoder is better suited than the GMM to model the true probability distribution of synthetic complexity for natural products.
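
As a rough illustration of the mixture-model idea named in the abstract, the following minimal sketch fits a two-component, one-dimensional Gaussian mixture by expectation-maximization and scores values by their log-likelihood under the fitted mixture. It is not the authors' implementation: the input scores, the choice of two components, the function names `fit_gmm_1d` and `log_likelihood`, and the use of log-likelihood as a ranking signal are all assumptions made only for this example.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1D normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_1d(xs, n_iter=100):
    """Fit a two-component 1D Gaussian mixture with plain EM.

    Hypothetical helper for illustration; returns (weights, means, variances).
    """
    n = len(xs)
    overall_mean = sum(xs) / n
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / n
    means = [min(xs), max(xs)]            # deterministic initialisation
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, 1e-6)   # floor avoids a collapsed component
    return weights, means, variances

def log_likelihood(x, weights, means, variances):
    """Log-density of x under the fitted mixture (floored to avoid log(0))."""
    density = sum(w * gaussian_pdf(x, m, v)
                  for w, m, v in zip(weights, means, variances))
    return math.log(max(density, 1e-300))

# Made-up descriptor scores forming two clusters (assumption, not paper data).
scores = [1.0, 1.2, 0.9, 1.1, 4.0, 4.2, 3.9, 4.1]
weights, means, variances = fit_gmm_1d(scores)

# Rank hypothetical query values from most to least typical under the mixture.
queries = [1.05, 2.5, 4.05]
ranked = sorted(queries,
                key=lambda q: log_likelihood(q, weights, means, variances),
                reverse=True)
```

In this toy setting, a value that falls between the two clusters receives a lower log-likelihood than a value near either cluster centre. Whether low likelihood should be read as high synthetic complexity is a modelling choice that this sketch does not settle.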
