The Syntax of Matter

Synthesis Planning as the Foundation of Generative Chemistry

Anton Morgunov, Yu Shee, Alexander V. Soudackov, Victor S. Batista · Department of Chemistry, Yale University
ChemRxiv


1 Introduction

Artificial intelligence in chemistry has historically focused on predicting scalar molecular properties from molecular structure, a field established as quantitative structure-activity relationship (QSAR) modeling by the pioneering work of Corwin Hansch. [1] While foundational, early QSAR models were often limited by assumptions of linearity, a constraint that prompted the shift toward classical machine learning (ML) techniques in the 1990s and 2000s, such as support vector machines (SVMs) [2] and random forests [3], which were applied to expert-designed features like molecular fingerprints. [4-5] The deep learning era ushered in end-to-end learning with graph neural networks (GNNs) [6-8] and transformers, [9] which automatically extract hierarchical features from raw molecular data. While architectures such as graph convolutional networks (GCNs) [10-12], message passing neural networks (MPNNs) [13-14], and Kolmogorov-Arnold graph neural networks (KA-GNNs) [15-16] have progressively improved predictive accuracy on benchmarks [17-19], the efficiency of small-molecule drug discovery has not seen a commensurate improvement. Comprehensive surveys indicate that despite the proliferation of complex multimodal architectures, generalization to unseen chemical domains remains limited. [20-23]

We propose that this fragility (manifesting in both the proposal of inaccessible candidates and the misprediction of activity cliffs) may partly reflect a mismatch of priorities: many models optimize for the semantics of function (what a molecule does) before learning the syntax of construction (how a molecule is made). This review argues that synthesis planning is a strong candidate for the foundational pre-training objective for generative chemistry. By analogy with large language models, which acquired generalizable reasoning capabilities in part by mastering the structural grammar of text, we hypothesize that artificial chemical intelligence (ACI) may acquire more robust physical reasoning by training on the causal logic of molecular transformation.

This review analyzes the literature on multistep synthesis planning from January 2020 to February 2026, a period of rapid progress built on decades of foundational work. While early expert systems proved that retrosynthesis could be computationally formalized, [24-29] their reliance on hand-curated rules limited their scalability. The transition to learnable models began with data-driven approaches to single-step prediction [30-33], setting the stage for the first generation of modern, multistep planners. Foundational methods from 2018–2019, including Monte Carlo tree search (MCTS) [34], DFPN-E [35], and the Molecular Transformer [36], receive detailed treatment as the architectural precursors against which subsequent progress is measured. We characterize the period from 2018 to roughly 2023 as the Era of Navigability, where the primary scientific objective was to demonstrate that algorithms could effectively navigate the combinatorial explosion of the retrosynthetic tree. By the primary metric of this era, stock-termination rate (STR), modern planners now routinely achieve success rates exceeding 99% against large inventories. We argue that this saturation signals the end of the navigability phase and necessitates a transition to the Era of Validity, where evaluation must pivot from finding a path to verifying the chemical correctness of the proposed route.

In this review, we center on multistep topological planning: identification of structures of starting materials and intermediates involved in the synthetic procedure. We distinguish this from quantitative planning, which involves prediction of specific reaction conditions (catalysts, solvents, temperature) and outcomes (yield). While quantitative variables are critical for experimental success, this review focuses on the topological problem, which most current generative frameworks prioritize. The frameworks discussed may require domain-specific adaptation for other areas such as catalysis, materials chemistry, or biocatalytic synthesis. [37-39]

The review is organized as follows. Section 3 establishes the conceptual foundation by examining failures in static structure-property mapping and motivating synthesis planning as a pre-training objective. Section 4 highlights the limitations of scalar accessibility proxies, and Section 5 formalizes the planning problem; within it, Section 5.2.1 introduces the solvability hierarchy (Solv-N), a new framework that separates topological connectivity (Solv-1) from selectivity (Solv-2) and executability (Solv-3). Section 6 analyzes the architectural distinction between explicit graph search and direct sequence generation. Section 7 critically evaluates navigability-era benchmarks, exposing how stock set inflation and conditioned target selection obscure planning failures. Section 9 briefly reviews advances in quantitative planning. Finally, Section 10 outlines the path toward a chemical foundation model that integrates physical constraints into the generative process, a necessary transition summarized in Table 2.

Domain | Pre-Training Objective | Ground Truth Data | Gold Standard Benchmark | Result
Natural Language | Next-token prediction | Internet-scale text | GLUE/MMLU | Emergent reasoning
Structural Biology | 3D structure prediction | PDB & evolutionary constraints | CASP | Zero-shot folding & design
Chemistry (status quo) | SMILES masking; static graph reconstruction | Static molecular graphs (ZINC, GDB) | MoleculeNet, TDC | Struggles to generalize (activity cliffs)
Chemistry (blueprint) | Multistep synthesis planning | Causal reaction trajectories (experimental, QM-filtered) | Solv-N hierarchy | Artificial Chemical Intelligence

:::caption{#tab-paradigm-shift}[Table 2.] Paradigm comparison across domains illustrating the proposed blueprint for artificial chemical intelligence (ACI). Multistep synthesis planning is proposed as the foundational pre-training objective for chemistry, evaluated via the Solv-N hierarchy. :::

TL;DR

This review is two things at once. First, it is a comprehensive review of multistep synthesis planning methods (Sections 5-8). Second, Sections 3 and 4 make the case for why you should be interested in synthesis planning at all.

2 The Review at a Glance

This review examines multistep data-driven retrosynthetic planning as a central component of artificial chemical intelligence, with a primary focus on small-molecule organic synthesis as represented in contemporary reaction corpora and benchmark settings.[40] Rather than revisiting static structure-property modeling in detail, we focus on how multistep planning systems construct routes from commercially available starting materials to target molecules and how this process can inform chemistry-aware representation learning.

A recurring theme is synthetic accessibility. Across a range of benchmarks [15, 20-23], modern GNN- and transformer-based models have improved scalar property prediction while still struggling to generalize reliably to new chemical domains, often proposing candidates whose experimental realization is unclear. Therefore, we treat multistep synthesis planning not only as a downstream application but as a candidate organizing objective for aligning generative models with the causal logic of molecular transformations (Table 2).

Within this scope, the main objectives of this review are:

  • formalization of the multistep retrosynthetic planning problem and clarification of the distinction between topological planning (identifying reactants, intermediates, and overall route structure) and quantitative planning (conditions, yields, and related variables);

  • overview of the major planning architectures, including search-based systems such as Monte Carlo tree search (MCTS) [34] and neural value-guided alternatives like Retro* [41], and the emerging direct sequence generation approaches exemplified by the Molecular Transformer and DirectMultiStep; [36, 42]

  • analysis of evaluation practices, with an emphasis on how stock-set design, target selection, and benchmark construction can inflate apparent performance in the Era of Navigability, where stock-termination rate is near saturation.

To organize existing and emerging methods, we introduce the solvability hierarchy (Solv-N). Solv-1 focuses on topological reachability (existence of a complete route from stock to target); Solv-2 incorporates basic chemical plausibility and selectivity; and Solv-3 addresses executability, including conditions and outcomes where data permit. We use this hierarchy to structure the transition from the navigability-centric phase of method development toward an Era of Validity, in which evaluation is increasingly tied to chemical realism, and to articulate how multistep synthesis planning may function as a foundational pre-training task for artificial chemical intelligence.

3 The Limitations of Static Structure-Activity Mapping

3.1 Out-of-Distribution Failure in Molecular Property Prediction

To motivate synthesis planning as a candidate pre-training objective, we first examine a recurring limitation in models that learn exclusively from static structure-property correlations: a sharp failure to generalize beyond the training distribution. This issue is particularly well-documented in bioactivity prediction, where the objective is to map molecular topology (structure) to functional outcomes such as binding affinity, inhibitory potency, or toxicity. Despite the proliferation of increasingly sophisticated deep learning architectures, rigorous benchmarking studies[22] indicate that improvements in performance on familiar chemical space do not carry over when models are tested on structurally novel molecules outside of the distribution of the training set.

For instance, several benchmarking studies evaluating graph neural networks and transformers find that, despite reporting state-of-the-art performance on global metrics like root mean square error (RMSE), these models frequently fail to outperform classical machine learning baselines under rigorous scaffold splitting. [43-45] Gaussian processes and support vector machines utilizing fixed fingerprints often match or exceed the predictive accuracy of more complex neural architectures, suggesting that increases in model complexity do not reliably translate into improved generalization to novel chemical space. The disparity is most acute at activity cliffs: cases where minor structural modifications lead to disproportionate shifts in biological potency.

van Tilborg et al. demonstrated that on these critical edge cases, deep learning models often exhibit comparable or worse predictive accuracy than descriptor-based methods [46]. This failure likely has a mechanistic origin: standard optimization objectives (e.g., RMSE) encourage models to smooth the structure-activity landscape, effectively treating the sharp discontinuities characteristic of specific molecular recognition events as noise rather than signal [47]. Consequently, the subtle electronic or steric syntax that distinguishes a therapeutic from a toxic analogue is often lost in the learned representation. [48-49]
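The smoothing argument can be made concrete with a toy calculation (the activity values below are hypothetical, chosen only to caricature a cliff): whenever a model's smoothness forces two near-identical inputs to share one prediction, the MSE-optimal choice is their mean, so both sides of the cliff incur large errors.

```python
# Toy numbers (hypothetical): an activity cliff as two near-identical
# analogues with very different potencies.
activities = [1.0, 100.0]

# If smoothness forces one shared prediction for both, the MSE-optimal
# value is the arithmetic mean of the targets.
shared_pred = sum(activities) / len(activities)      # 50.5

# The smooth model is badly wrong on BOTH sides of the cliff...
errors = [abs(a - shared_pred) for a in activities]  # [49.5, 49.5]

# ...whereas a 1-nearest-neighbour "memorizer" has zero training error
# here, which is one reason descriptor/kNN baselines can look stronger
# exactly at cliffs.
print(shared_pred, errors)
```

The point is not that the mean is a good model, but that any objective rewarding smoothness treats the cliff itself as residual noise.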

The inadequacy of purely topological learning is further evidenced by the architectural adaptations developed to mitigate these failures. Recent state-of-the-art frameworks frequently augment end-to-end learning with explicit handcrafted features or auxiliary supervision. For example, multi-level fusion graph neural network (MLF-GNN) [50] fuses learned graph representations with traditional Morgan fingerprints; activity cliff explanation supervised GNN (ACES-GNN) [51] incorporates auxiliary attention constraints derived from manually curated activity cliff pairs; and GraphCliff [49] employs gating mechanisms to preserve local features. Similarly, Shi et al. apply Group Lasso regularization to enforce explicit separation of scaffold from decoration substructures [52], while activity cliff-awareness network (ACANet) [53] applies a contrastive triplet loss that pulls together the representations of molecules with similar activity and pushes apart those with dissimilar activity, shaping the learned latent space to separate active from inactive compounds. Collectively, these design choices imply that standard end-to-end training on static topology is insufficient for robust navigation of the activity landscape.

This fragility extends to structure-based drug design, where models explicitly incorporate the three-dimensional geometry of the target protein. Multimodal architectures demonstrate measurable improvements over ligand-only baselines[54-55], but remain vulnerable to the same generalization failures. Zhang et al. documented a narrow evaluation trap defined by benchmark memorization rather than physical generalization [56], and Wang et al. showed empirically that complex 3D convolutional networks frequently fail to enrich active binders in realistic decoy scenarios [57]. Even alternative pre-training strategies have not resolved this deficit. Standard self-supervised objectives such as motif prediction or masked atom reconstruction can induce detrimental bias on activity cliffs by encouraging models to memorize global scaffold patterns at the expense of local functional group sensitivity. [58-59] When data leakage is removed via strict structural splitting, the performance of many deep learning models drops substantially, in some cases to levels approaching simple nearest-neighbor heuristics. [60] Taken together, these persistent limitations across diverse architectures suggest that the bottleneck is not a lack of model complexity, but the fundamental insufficiency of learning chemical reasoning solely through static structure-property correlations.

3.2 Parallels in Natural Language and Computer Vision

The recent history of natural language processing (NLP) and computer vision offers a compelling parallel. Early NLP systems mirrored the current state of QSAR, relying on supervised learning for specific scalar tasks, such as sentiment classification or entailment. These models functioned as narrow experts, achieving high performance within their training distribution but failing when faced with novel phrasing or context. [61] A fundamental shift occurred when the field adopted a generative objective: predicting the next token in a sequence. By optimizing for the grammar of text rather than a specific label, models acquired internal representations capable of generalized reasoning, eventually outperforming supervised baselines on tasks they were never explicitly trained to solve. [61-63]

A similar progression occurred in computer vision, which transitioned from classifiers trained on discrete categories (e.g., ImageNet classes) to foundation models trained on the correspondence between images and text. [64-65] Cherti et al. found that the robustness of visual models scales with this form of pre-training, decoupling the learned representation from any single classification task [66]. In both domains, the move from specialized label prediction to broad structural learning resulted in representations that were more durable under distribution shift.

By analogy, current chemical AI appears to occupy a developmental stage similar to pre-generative NLP. While the term foundation model is frequently applied in chemistry, many architectures lack the defining characteristic of their linguistic counterparts: the emergent capacity to transfer knowledge to entirely new tasks without any task-specific training, a property known as zero-shot generalization. [63, 67] The implication is that the field must identify its chemical equivalent of next-token prediction: an objective that forces models to internalize the rules of molecular transformation rather than merely correlating static graphs with properties.

3.3 Synthesis Planning as a Pre-training Objective

Identifying the chemical equivalent of next-token prediction requires distinguishing between the syntax of notation and the syntax of matter. Early chemical language models treated the string representation of molecules (SMILES [68]) as the grammar to be learned. However, models trained on static masked language modeling (MLM) frequently fail to outperform simple regression baselines on downstream property tasks. [69-70] While larger models show improved prediction of simple physicochemical properties such as lipophilicity, their performance on complex bioactivity tasks tends to plateau, suggesting that learning the grammar of molecular notation is not sufficient for genuine chemical understanding. [71] Furthermore, it has been demonstrated [72] that scaling SMILES-based pre-training even to 1.1 billion molecules yields diminishing returns, consistent with models learning statistical regularities of the notation rather than internalized chemical rules.

We propose that the true syntax of chemistry is the transformation of matter through reactivity. Consequently, the chemical analogue of next-token prediction is synthesis planning: the step-by-step prediction of how a molecule is constructed. This objective bifurcates into forward synthesis (reactants to products) and retrosynthesis (products to reactants). We propose that multistep retrosynthesis is the candidate for foundational pre-training most likely to yield robust physical reasoning.

This preference is grounded in both data quality and computational tractability. Retrosynthesis can be supervised directly by the vast accumulated data of published multistep routes, providing training trajectories grounded in successful experimental execution. In contrast, multistep forward training typically relies on algorithmically generated routes, which must assume perfect separability of byproducts and lack experimental validation. Furthermore, the search space for retrosynthesis is bounded by the structural complexity of the target, whereas forward search branches with the size of the starting material inventory, rendering unconstrained exhaustive exploration computationally prohibitive.

Recent literature provides preliminary support for this reactivity-centric hypothesis. Chen et al. found that framing activity prediction as conditional structure generation improved performance on activity cliffs relative to scalar regression. [73] Similarly, architectures like REMO [74] and HiCLR [75] demonstrate that pre-training objectives requiring the reconstruction of reaction centers or reactants yield representations that capture functional group nuances better than static baselines. By requiring the model to predict missing atoms from their chemical environment, these training objectives encourage the model to internalize local reactivity rules rather than simply memorize structural patterns.

While encouraging, these results represent only an initial validation of the hypothesis. Current reactivity-aware models demonstrate improved transfer to related tasks, but they have not yet exhibited the emergent zero-shot reasoning on unrelated problems that characterizes foundation models in NLP. We postulate that scaling multistep retrosynthesis forces models to internalize electronic constraints, functional group compatibility, and selectivity boundaries, providing the necessary inductive bias to bridge this gap. Whether this approach is capable of yielding a true chemical reasoner remains a critical empirical question for the field.

4 Evaluating Synthetic Accessibility

Generative models that lack explicit synthesis constraints frequently propose structures that are topologically valid but chemically unsynthesizable. [76] Assessing whether a candidate molecule can be made is therefore a prerequisite for practical molecular design. For the past decade, the field has relied on heuristic scores to estimate this feasibility without the computational cost of full synthesis planning. This section reviews the growing evidence that these heuristic approximations diverge from experimental reality, motivating the transition toward explicit route generation.

4.1 Intrinsic Limitations of Heuristic Accessibility Scores

The prevailing evaluation paradigm treats synthetic accessibility as an intrinsic molecular property. Real-world feasibility, by contrast, is highly context-dependent: it relies on the specific inventory of starting materials, the operational scope of available reactions, and purification constraints. Compressing this context-dependent feasibility into a single numerical score creates a metric that often correlates poorly with experimental success.

This reductionist approach originated with the synthetic accessibility score (SAscore) [77], which estimates difficulty by quantifying the statistical prevalence of substructures in public databases and applying penalties for complexity features such as non-standard ring fusions or stereocenters. Although this simple approach was a reasonable practical compromise when full retrosynthetic planning was computationally intractable in 2009, it rests on the flawed assumption that visual structural complexity is a reliable proxy for synthetic effort. A complex scaffold may be accessible via a single transformation like a Diels-Alder cycloaddition, whereas a simple structure may be elusive due to subtle stereochemical constraints [77]. Recent evaluations confirm that SAscore frequently penalizes valid complex structures, such as PROTACs, while failing to flag difficult chiral centers [78]. On realistic datasets, Li and Chen demonstrated that SAscore performance collapses to near-random guessing [79], and Liu et al. found the mean scores for feasible and infeasible candidates to be statistically indistinguishable [80]. Topological complexity metrics, however rigorously formalized, often fail to capture synthetic difficulty when they ignore reaction-specific constraints; for instance, Flamm et al. showed that assembly theory metrics assign optimal complexity scores to pathways that delay ring closure until the final step—a strategy that is topologically efficient but synthetically implausible [81].

Learned replacements for these heuristics have demonstrated similar limitations. Coley et al. introduced synthetic complexity score (SCScore) [82] to derive complexity directly from reaction data, aiming to capture the “synthetic gradient” from simple reactants to complex products. However, the model functions primarily as a measure of reactant popularity rather than mechanistic difficulty; Parrot et al. quantified this by showing that SCScore exhibits effectively zero correlation with the solvability determinations of explicit retrosynthetic search. [83] Addressing the out-of-distribution fragility of neural models, Voršilák et al. proposed SYBA [84], a Bayesian classifier based on substructure frequency differences between synthesized and unsynthesized molecules. Despite this statistical grounding, blind evaluations indicate that SYBA similarly fails to reliably distinguish solvable targets from impossible ones. [85] Even retrosynthetic accessibility score (RAscore), which is explicitly supervised by the outcomes of retrosynthetic planning software [86], retains the artifacts of structural pattern matching. Chen and Jung observe that RAscore frequently assigns high accessibility probabilities to unsynthesizable analogs solely due to their topological similarity to training examples. [87]

Beyond structural insensitivity, heuristic scores cannot account for shifting inventory constraints. Calvi et al. demonstrate that static scorers assign identical values regardless of whether key intermediates are commercially available. [88] This invariance to supply chain realities means that optimizing for general synthesizability yields a fundamentally different chemical space than optimizing for in-house inventories. [89] While heuristics correlate with solvability in drug-like space, this relationship collapses for functional materials such as organic semiconductors. [90]
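The inventory-invariance point can be shown in a few lines (all names and scores below are hypothetical; `static_score` is a deliberate caricature, not SAscore): a structure-only heuristic has no stock argument at all, so it returns the same number whether or not a key intermediate is purchasable, while even a one-step route check distinguishes the two scenarios.

```python
# Toy demonstration: static accessibility scores are blind to inventory.
# `static_score` is a hypothetical caricature of a structure-only heuristic.

def static_score(smiles):
    """Structure-only heuristic: longer, more branched strings look 'harder'."""
    return 10.0 - 0.1 * len(smiles) - 0.5 * smiles.count("(")

TARGET = "CC(C)(O)c1ccccc1"            # 2-phenyl-2-propanol
KEY_INTERMEDIATE = "Br[Mg]c1ccccc1"    # phenylmagnesium bromide
PRECURSORS = [KEY_INTERMEDIATE, "CC(C)=O"]

stock_a = {KEY_INTERMEDIATE, "CC(C)=O"}  # intermediate is purchasable
stock_b = {"CC(C)=O"}                    # intermediate discontinued

# The heuristic never sees the stock set, so its output is fixed...
print(static_score(TARGET))

# ...while a route-based check (here reduced to a one-step precursor
# lookup) flips as soon as the intermediate leaves the inventory.
def accessible_one_step(precursors, stock):
    return all(p in stock for p in precursors)

print(accessible_one_step(PRECURSORS, stock_a))  # True
print(accessible_one_step(PRECURSORS, stock_b))  # False
```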

The consequences of this divergence are most acute in generative optimization. When accessibility is defined by a proxy, reinforcement learning agents exemplify Goodhart’s law [91]. Gao and Coley identified that unconstrained generators routinely assign high feasibility scores to impossible structures. [76] Gao et al. subsequently showed that agents optimizing SCScore generate simple long-chain structures to minimize complexity penalties, while those optimizing SAscore produce repetitive fused scaffolds. [92] Seo et al. quantified this failure: fragment-based models trained to maximize SAscore achieved a 0.00% success rate when validated by a rigorous retrosynthesis oracle. [93] Similarly, Koziarski et al. found that maximizing SAScore improved the proxy metric without improving actual synthesizability [94], and Gao et al. observed genetic algorithms drifting almost exclusively into unsynthesizable chemical space [95].

These limitations reflect the historical necessity of estimating feasibility when explicit planning was computationally intractable. While heuristic scores retain utility as coarse pre-filters for high-throughput screening, they are unreliable as primary evaluation metrics for generative chemistry.

4.2 The Necessity of Explicit Route Generation

To address the disconnect between heuristic estimation and experimental reality, Parrot et al. argue that the most operationally reliable metric of synthesizability is the explicit construction of a route terminating in available starting materials. [83] Empirical support for this definition is found in the performance of reaction-based generative models. By constructing molecules via explicit reaction templates rather than atom-by-atom assembly, these architectures inherently constrain the output to the logic of available chemistry. In direct comparisons, reaction-based models achieve synthesis validation rates between 56% and 100%, whereas shape-first models relying on post-hoc heuristic filtering achieve rates as low as 23%. [93-94, 96]

We adopt this perspective as the foundation for the subsequent analysis, with the additional requirement that every transformation should satisfy selectivity constraints (Tier 2, Section 5.2.1). By enforcing explicit route generation, this framework also resolves the ambiguity of inventory constraints, as a molecule is deemed accessible only if the planner can connect it to the defined stock set. This results in a shift in objective: from training classifiers to estimate synthesizability toward developing planners that demonstrate it.

5 Problem Formulation and Definitions

5.1 Formalizing Retrosynthetic Logic

Retrosynthetic analysis, introduced in 1963 by Vleduts [97] and formalized by E. J. Corey in 1969 [98], is the logical deconstruction of a target molecule into progressively simpler precursors. The fundamental operation is the disconnection: a conceptual cleavage of a strategic bond that implies a forward chemical reaction capable of forming it. This operation transforms the target structure into a set of immediate precursors or synthons. The analysis is recursive; each precursor becomes a subsequent target for disconnection, generating a branching tree of potential pathways. This process terminates only when a branch reaches a starting material: a compound present in the chemist’s available inventory. Early computational implementations, such as LHASA [24] and SECS [25], established this logic but relied on hand-coded heuristics that could not scale to the full diversity of organic chemistry [27].

Mathematically, retrosynthesis can be formulated as a search over a directed bipartite AND/OR graph $G = (V_M \cup V_R, E)$, in which two types of nodes alternate. Molecule nodes $V_M$ represent OR choices: the planner selects one disconnection from several reactions that could produce the molecule in question. Reaction nodes $V_R$ represent AND constraints: once a reaction is selected, all of its required precursors must be obtained, with each becoming a new molecule node to be solved. A solved synthetic route is a subgraph of this graph with every terminal node belonging to the available starting material stock set $\mathcal{S}_\text{stock}$.
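As a minimal sketch of this formalism (the molecule names, stock set, and two-step graph below are illustrative, not drawn from any benchmark), the OR/AND semantics reduce to a short recursive check over an explicit graph:

```python
# Minimal sketch of the AND/OR solvability check. All molecule and
# reaction entries below are illustrative. A molecule node (OR) is solved
# if it is in stock or ANY reaction producing it is solved; a reaction
# node (AND) is solved only if ALL of its precursors are solved.

STOCK = {"bromobenzene", "Mg", "acetone"}  # toy stock set

# Explicit, pre-built AND/OR graph: molecule -> list of candidate
# reactions, each reaction -> list of required precursors.
GRAPH = {
    "2-phenyl-2-propanol": [
        ["phenylmagnesium bromide", "acetone"],  # retro Grignard addition
    ],
    "phenylmagnesium bromide": [
        ["bromobenzene", "Mg"],                  # retro Grignard formation
    ],
}

def solved(molecule, graph, stock):
    """True iff a complete route exists from `stock` to `molecule`."""
    if molecule in stock:            # terminal: available starting material
        return True
    return any(                      # OR over candidate disconnections...
        all(solved(p, graph, stock)  # ...AND over each one's precursors
            for p in precursors)
        for precursors in graph.get(molecule, [])
    )

print(solved("2-phenyl-2-propanol", GRAPH, STOCK))  # True
```

Real planners cannot enumerate the graph up front: they interleave this check with incremental expansion, add cycle detection, and score branches rather than returning a bare Boolean, but the OR-over-reactions / AND-over-precursors structure is the same.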

Within this framework, it is helpful to distinguish two computational problems. The search feasibility problem asks whether any valid route exists that connects the target to $\mathcal{S}_\text{stock}$. The optimality problem asks which of these feasible routes is best under a chosen cost function (e.g., step count, price, or safety). In the current literature, most benchmarks primarily measure search feasibility, often termed “solvability”, because defining a universally valid cost function for chemical optimality remains an open challenge.

The central difficulty in planning is that the AND/OR graph is implicit. The graph is too large to precompute; it must be built incrementally during search. Since the number of plausible disconnections grows exponentially with depth (see Table 3), exhaustive enumeration is intractable. The planner’s task, therefore, is to allocate a limited computational budget to expand only the most promising branches. This is complicated by the sparse reward signal: a precursor set may appear chemically sound at step 1 but fail to connect to stock at a later step.

Note

Table 3. The tree-size table is presented as an interactive figure in the source version; its core point is that exhaustive retrosynthetic search scales as O(b^d), so even moderate increases in branching factor b or route depth d render brute-force enumeration intractable.
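The scaling behind this note can be reproduced in a few lines (a sketch; the specific b and d values below are illustrative, not the ones tabulated in the paper). With branching factor b and depth d, the leaf count is b^d and the total node count is the geometric series (b^(d+1) - 1)/(b - 1):

```python
# Sketch of the O(b^d) combinatorial explosion: leaf states b**d and
# total expanded nodes sum(b**i for i in range(d + 1)).

def tree_sizes(b, d):
    """Return (leaf count, total node count) for a uniform b-ary tree."""
    leaves = b ** d
    total = (b ** (d + 1) - 1) // (b - 1) if b > 1 else d + 1
    return leaves, total

# Illustrative branching factors and route depths.
for b in (5, 10, 50):
    for d in (3, 6, 10):
        leaves, total = tree_sizes(b, d)
        print(f"b={b:>2} d={d:>2}  leaves={leaves:.3e}  total={total:.3e}")
```

Even b = 10 and d = 10 already exceeds ten billion nodes, which is why planners must learn to expand only a tiny, promising fraction of the implicit graph.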

:::caption{#fig-combinatorial-explosion}[Figure 1. Schematic Diagram of the Retrosynthetic Tree Search.] A target molecule (e.g., nitrazepam) presents multiple viable strategic bond disconnections. Each disconnection (marked with a different color) yields a new set of required precursors, initiating a recursive branching process. :::

5.2 Reaction Templates and Chemical Rules

:::caption{#fig-template-extraction}[Figure 2. Extraction and Representation of Reaction Templates.] The sequential process of deriving local graph-transformation rules from reaction data (Section 5.2). (Step 1) A chemical transformation is represented as a Reaction SMILES string. (Step 2) Atom-to-atom mapping establishes a rigorous correspondence between reactant and product atoms to identify the reaction center (bonds formed and cleaved). (Step 3) Unchanged molecular topology is discarded to isolate the generalized rule. Manual extraction typically yields broad, minimal templates, whereas automated algorithms (e.g., RDChiral) retain explicit local environment guards (e.g., atom degree, hydrogen counts) to constrain applicability. By definition, these templates strictly enforce Syntactic and Topological validity (Tiers 0–1, Section 5.2.1) but cannot guarantee molecule-wide Selectivity (Tier 2). :::

In template-based planning, the legal moves are defined by reaction templates. A template is a subgraph transformation rule: it specifies the atoms whose bonds change (the reaction center) and a minimal neighborhood of context, typically encoded as reaction SMARTS. For example, an amide hydrolysis template describes the transformation of [C:1](=[O:2])[N:3] into [C:1](=[O:2])[OH] + [N:3].

Crucially, templates are local: they define the change at the reaction site but contain no information about the rest of the molecule. For instance, a template for Grignard addition may match a ketone substructure perfectly, even if an unprotected carboxylic acid elsewhere in the molecule would quench the reaction immediately. Consequently, applying a template guarantees only syntactic and topological validity (meaning the graph edit is structurally legal) but does not guarantee selectivity, i.e., that the reaction will actually proceed as intended in the presence of competing functional groups.
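Template locality can be illustrated without a cheminformatics toolkit. The following toy sketch (hypothetical helper `retro_amide_hydrolysis`; real planners operate on molecular graphs via SMARTS engines such as RDKit/RDChiral, not on raw SMILES text) applies the retro amide-hydrolysis rule as a plain string rewrite, which makes its blindness to the rest of the molecule explicit:

```python
import re

# Toy, string-level sketch of template locality -- NOT a real SMARTS
# engine. The "template" matches only the local pattern C(=O)N and
# rewrites the string there, regardless of what else the molecule
# contains: exactly the Tier-2 selectivity gap described above.

AMIDE = re.compile(r"C\(=O\)N")

def retro_amide_hydrolysis(smiles):
    """Apply the retro amide-hydrolysis rule to a simple linear SMILES."""
    m = AMIDE.search(smiles)
    if m is None:
        return None                              # template does not apply
    acid  = smiles[: m.start()] + "C(=O)O"       # acyl side -> carboxylic acid
    amine = "N" + smiles[m.end():]               # N side -> free amine
    return acid, amine

# Works on the toy case (N-ethylpropanamide)...
print(retro_amide_hydrolysis("CCC(=O)NCC"))      # ('CCC(=O)O', 'NCC')
# ...and would match just as happily if the rest of the string contained
# an interfering functional group: the rule sees only its reaction center.
```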

Templates originate from two sources. Expert systems like Chematica [28-29] rely on hundreds of hand-coded rules [99-102] augmented with steric and electronic guards. In contrast, modern deep learning approaches automatically extract templates from reaction databases by atom-mapping reactants to products and identifying the changed core [103-104] (Figure 2). While automated extraction scales to corpora of tens or hundreds of thousands of reactions, it often produces noisy rule sets that lack the rigorous context guards of expert systems.

| Tier | Definition | Treatment in Current Planners |
| --- | --- | --- |
| 0. Syntactic | Obeys graph-theoretic rules (valency, aromaticity, charge balance). | Templates: enforced. Sequence models: no guarantee. |
| 1. Topological | Correct reaction center modification per mechanistic template. | Templates with applicability check: enforced. |
| 2. Selectivity | Correct outcome among multiple chemically plausible pathways. | |
| — Chemoselectivity | Correct functional group reacts; no incompatible FG conflicts. | Human-curated rules: partial. Learned policies: statistical. |
| — Regioselectivity | Correct site among non-equivalent positions. | Learned policies: statistical. QM: rarely integrated. |
| — Diastereoselectivity | Correct relative stereochemistry. | Learned policies: statistical. QM: rarely integrated. |
| — Enantioselectivity | Correct absolute stereochemistry. | Largely ignored. |
| — Stoichiometry | Control of single vs. multiple equivalent transformations. | Largely ignored; single-equivalent assumed. |
| 3. Executability | Lab-realistic conditions (yield, purification, safety, scale). | Condition predictors: single-step. Route-level: rarely integrated. |

:::caption{#tab-validity-hierarchy}[Table 4. Hierarchy of Chemical Validity in Retrosynthetic Planning.] A proposed transformation must satisfy constraints at multiple levels to be experimentally realizable. Specific failure modes are visualized in Figures 3—5. Template-based methods guarantee syntactic and topological validity (Tiers 0—1) by construction, but provide no formal control over selectivity (Tier 2). Sequence-based methods lack formal guarantees at any tier, requiring explicit post-hoc validation. :::

5.2.1 Hierarchy of Chemical Validity

The term validity in retrosynthetic planning frequently obscures the distinction between graph-theoretic connectivity and experimental feasibility. To address this ambiguity, we define four levels of constraints (Table 4) that a proposed transformation must satisfy.

Syntactic (Tier 0) and Topological (Tier 1) validity refer to the construction of a well-formed molecular graph and a legal reaction center modification. Template-based methods enforce these constraints by definition, whereas template-free sequence models must learn them from the training data. As a result, unconstrained generation can yield proposals that violate valence rules or posit chemically impossible bond migrations, as illustrated in Figure 3.

Selectivity (Tier 2) requires that the transformation be chemically plausible in the presence of competing functional groups and stereochemical requirements. While learned policies implicitly capture some of these constraints from training data, the standard template formalism does not guarantee them. A template defined by a local graph edit may be topologically applicable yet fail to encode the global molecular context needed to ensure that the intended transformation occurs selectively at the target site rather than at chemically similar competing sites (Figure 4). Similarly, while reaction SMARTS can encode stereochemistry, automated extraction frequently yields non-specific rules. Consequently, a planner may satisfy Tier 1 validity by applying a generic template that discards the necessary stereochemical information (Figure 5). Without explicit verification, these selectivity constraints are never directly enforced: whether they are satisfied depends entirely on the statistical quality of the policy rather than on any structural guarantee.

Validating Tier 2 constraints requires distinguishing reactive environments that standard fingerprints often fail to differentiate. Kogej et al. address this with SMARTS-RX, a curated vocabulary of functional group patterns (e.g., distinguishing heteroaryl from phenyl halides) that captures the electronic context necessary to predict reaction failure [105]. Integrating such granular definitions into the planning loop is likely a prerequisite for automated Tier 2 verification.
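The distinction SMARTS-RX draws can be illustrated with a small classifier that separates phenyl-type from heteroaryl halides, two environments a generic "aryl halide" fingerprint bit would conflate. The code below is a simplified illustration of the idea, not the published SMARTS-RX patterns.

```python
from rdkit import Chem

def classify_aryl_halide(smiles: str):
    """Label each aromatic C-halide bond by whether the attachment ring
    contains a heteroatom (a toy proxy for electronic context)."""
    mol = Chem.MolFromSmiles(smiles)
    halide_on_aromatic_c = Chem.MolFromSmarts("[Cl,Br,I][c]")
    ring_info = mol.GetRingInfo()
    labels = []
    for _hal_idx, c_idx in mol.GetSubstructMatches(halide_on_aromatic_c):
        # Inspect every ring that contains the attachment carbon.
        hetero = any(
            any(mol.GetAtomWithIdx(i).GetSymbol() != "C" for i in ring)
            for ring in ring_info.AtomRings() if c_idx in ring
        )
        labels.append("heteroaryl halide" if hetero else "phenyl-type halide")
    return labels

print(classify_aryl_halide("Clc1ccccc1"))  # chlorobenzene
print(classify_aryl_halide("Clc1ccccn1"))  # 2-chloropyridine
```

A planner with access to such labels could, for instance, gate SNAr templates on the heteroaryl class while routing phenyl-type halides toward cross-coupling disconnections.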

Insight

We are not claiming that existing methods constantly violate Tier 0-2 validity (manual expert audits show that they sometimes do); rather, these violations are an epistemic blind spot, which is arguably even worse: progress begins with measurement.

Executability (Tier 3) demands that the step be viable under specific laboratory conditions, including yield, purification, and safety.

:::caption{#fig-sequence-hallucinations}[Figure 3. Illustrative Failure Modes of Template-Free Sequence Policies Across the Solv-N Hierarchy.] Unconstrained autoregressive models (Section 6.2.2) may propose transformations that violate fundamental chemical and topological constraints (Section 5.2.1). (Rxn 1) A Tier 2 (Selectivity) violation: the proposed disconnection is topologically valid (Solv-1) but chemically implausible. Methyllithium strongly favors 1,2-addition over the implied conjugate addition (Solv-2C). Furthermore, even if methyllithium were replaced with a soft nucleophile to force 1,4-addition, the reaction violates regioselectivity constraints (Solv-2R), as conjugate addition occurs at the β-carbon, not the α-carbon depicted. (Rxn 2) A Tier 1 (Topological) violation: the model hallucinates a non-physical migration of the aryl substituents from a para- to a meta-relationship during the addition step. (Rxn 3) A Tier 0 (Syntactic) violation: the generation of a pentavalent carbon atom, violating basic valency rules. While template-based policies prevent Tier 0 and 1 errors by construction, sequence models require rigorous post-hoc sanitization to identify such structural anomalies. :::

:::caption{#fig-template-selectivity}[Figure 4. The Insufficiency of Topological Validity: Tier 2 Failures in Template Application.] While reaction templates guarantee Syntactic (Tier 0) and Topological (Tier 1) validity (Section 5.2.1) by enforcing valid local graph edits, they are inherently blind to the global molecular context governing Selectivity (Tier 2). All four proposed disconnections perfectly match the methyllithium addition template, but only one is experimentally viable. (Rxn 1) A fully valid (Solv-2) transformation, correctly depicting the exhaustive alkylation of two equivalent carbonyls. (Rxn 2) A Solv-2S (Stoichiometric) violation: proposing mono-addition to a symmetric dicarbonyl lacking a control mechanism, which would inevitably over-react to form the Rxn 1 product. (Rxn 3) A Solv-2R (Regioselective) violation: attempting selective addition to one of two competing electrophilic sites. The intrinsic reactivity difference is insufficient, yielding a complex mixture. (Rxn 4) A Solv-2C (Chemoselective) violation: the strongly basic organolithium reagent will be immediately quenched by the unprotected carboxylic acid via proton transfer, precluding the intended nucleophilic addition. These failure modes demonstrate why planners must integrate explicit selectivity verification (Section 8.1) rather than relying on local template applicability. :::

:::caption{#fig-template-stereochemistry}[Figure 5. Stereochemical Blind Spots in Topological Planning: Tier 2 Failures.] Diels-Alder cycloadditions highlight the inability of standard 2D reaction templates (Section 5.2) to enforce 3D spatial constraints. All four proposed disconnections perfectly match the [4+2] cycloaddition template (Solv-1), but three fail critical experimental constraints (Section 5.2.1). (Rxn 1) A fully valid (Solv-2) transformation: the inclusion of a specific chiral organocatalyst (e.g., MacMillan imidazolidinone) correctly maps to the enantiopure endo product. (Rxn 2) A Solv-2E (Enantioselective) violation: the planner proposes the (S,S,S) enantiomer, but the specified catalyst strictly induces the (R,R,R) geometry. (Rxn 3) A Solv-2E violation: attempting to synthesize an enantiopure target without a source of chiral induction. The forward reaction will yield a racemic mixture. (Rxn 4) A Solv-2D (Diastereoselective) violation: the proposed disconnection targets the exo isomer, but the unconstrained forward reaction intrinsically favors the endo transition state via secondary orbital interactions. These examples emphasize that physical executability requires models to internalize geometric and kinetic control, not merely graph connectivity. :::

5.3 Inventory Definitions and Search Boundaries

Retrosynthetic search terminates only when all required precursors lie in the chosen inventory of starting materials. The definition of that stock therefore acts as a difficulty dial: expanding the inventory reduces the depth the planner must reach and increases success rates. In practice, evaluations often treat two distinct inventory types as interchangeable. The physical tier comprises genuinely in-stock, rapidly deliverable compounds (typically $\sim 10^5$–$10^6$ entries). The virtual tier consists of make-on-demand listings that are purchasable in name but typically require vendor synthesis (often $\sim 10^7$–$10^9$ entries).

Allowing termination in the virtual tier relaxes the planning task by permitting routes to stop at complex intermediates whose remaining synthesis is simply outsourced. This shifts the operational burden from the algorithm to the vendor, often incurring lead times that are incompatible with iterative screening cycles. Consequently, high success rates against make-on-demand inventories reflect a different, less constrained objective than delivering actionable routes from physical stock.
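The difficulty dial is easy to state as code: the same route can count as solved or unsolved depending solely on which stock set the termination predicate consults. The inventories below are tiny hypothetical stand-ins for the $10^5$-to-$10^9$-entry catalogs discussed above.

```python
# Hypothetical two-tier inventory; real stocks are keyed by canonical SMILES
# or InChIKey and hold between ~1e5 (physical) and ~1e9 (virtual) entries.
PHYSICAL_STOCK = {"CC(=O)O", "CN", "Brc1ccccc1"}          # in-stock, ships now
VIRTUAL_STOCK = PHYSICAL_STOCK | {"CC(=O)c1ccc(F)cc1"}    # make-on-demand listings

def route_is_solved(leaf_precursors, stock):
    """A retrosynthetic route terminates only when every leaf lies in stock."""
    return all(p in stock for p in leaf_precursors)

leaves = ["CC(=O)O", "CC(=O)c1ccc(F)cc1"]
print(route_is_solved(leaves, PHYSICAL_STOCK))  # False: one leaf must still be made
print(route_is_solved(leaves, VIRTUAL_STOCK))   # True: that synthesis is outsourced
```

Reported "solvability" numbers are therefore only comparable when the underlying stock definition is held fixed.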

5.4 Evaluation Metrics

The literature relies on three primary metrics, each probing a different aspect of validity. Solvability measures the fraction of targets for which a planner finds any route terminating in the stock set. Because this metric strictly evaluates topological connectivity (Tier 1), we adopt the term stock-termination rate (STR) [106]. This redefinition clarifies that the metric assesses the capacity to navigate the search graph rather than the chemical correctness of the result.

To approximate higher-tier validity, studies typically employ two proxies. Route reconstruction (Top-$K$ accuracy) assesses whether the planner recovers known experimental routes. While reconstructed steps inherit the validity of the historical data (Tier 2—3), this metric is conservative; it penalizes valid, novel routes that differ from the reference. Round-trip accuracy evaluates self-consistency by applying a forward reaction predictor to the proposed precursors. While useful for filtering syntactic errors, this check is model-dependent. If the forward predictor shares the training distribution or architectural biases of the planner, a successful round-trip confirms consistency rather than independent chemical correctness.
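The round-trip check reduces to a small function whose trustworthiness is exactly that of the forward model plugged into it. In the sketch below, the forward predictor is a hypothetical lookup-table stub, which makes the circularity explicit: the check certifies agreement with that model, nothing more.

```python
def round_trip_accurate(target, precursors, forward_model, top_k=1):
    """Round-trip check: does the forward model map the proposed precursors
    back to the target within its top-k predictions? Note the circularity:
    this certifies self-consistency with `forward_model`, not independent
    chemical correctness."""
    predictions = forward_model(precursors)[:top_k]
    return target in predictions

def toy_forward_model(precursors):
    # Hypothetical stub standing in for a trained forward reaction predictor.
    lookup = {frozenset({"CC(=O)O", "CN"}): ["CC(=O)NC"]}
    return lookup.get(frozenset(precursors), [])

print(round_trip_accurate("CC(=O)NC", ["CC(=O)O", "CN"], toy_forward_model))   # True
print(round_trip_accurate("CC(=O)NC", ["CC(=O)O", "NCC"], toy_forward_model))  # False
```

An independent audit would instead swap in a forward model trained on disjoint data, or a physics-based check, so that agreement carries evidential weight.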

5.5 Data Sources and Reaction Databases

Data-driven planners are fundamentally bounded by the quality of their training datasets. Most modern systems rely on reactions extracted from the patent literature (USPTO) [40, 107], which introduces distinct biases. First, because patents overwhelmingly report successful transformations, the model is trained exclusively on reactions that worked, with no exposure to failed attempts or undesired outcomes. Models learn feasible disconnections but receive no direct supervision regarding failure modes; this absence of negative data weakens their ability to identify infeasibility and selectivity boundaries. Second, automated extraction frequently obscures reaction roles. Datasets often represent reactions as unordered mixtures, treating structural reactants, auxiliary reagents, catalysts, and solvents as interchangeable participants rather than distinguishing their roles in the transformation. This forces models to infer chemical roles from co-occurrence statistics rather than explicit labels, occasionally leading to incoherent proposals where solvents or bases are treated as stoichiometric building blocks. While proprietary databases (e.g., Reaxys, Pistachio) offer cleaner curation, their licensing restrictions limit their utility for reproducible benchmarking. Consequently, open-source development is still limited by the noise and ambiguity of raw patent text.

6 Algorithmic Architectures for Planning

Computational approaches to retrosynthesis have converged on two distinct paradigms: search-based planning (verify-then-search) and direct route generation (generate-then-verify). In the former, retrosynthesis is cast as a discrete optimization problem over an AND/OR graph. A single-step model proposes local disconnections, which a search algorithm then assembles into a complete route under explicit constraints (e.g., inventory availability and depth limits). In the latter, the route is serialized as a token sequence, and transformer architectures model the conditional probability of the entire pathway. This section traces the evolution of both approaches and makes explicit their central trade-off: the formal validity guarantees of explicit search (Section 5.2.1, Tiers 0—1) versus the global conditioning learned implicitly by sequence models.

Note

Table 6. Methods comparison. The architecture comparison table is easier to browse in the interactive Methods section, which lets you filter by paradigm, inspect representative systems, and compare method families side by side. The core contrast remains: explicit search offers stronger formal guarantees, direct sequence generation offers global conditioning and speed, and hybrid systems try to combine both.

Search-based planners decouple chemical logic from algorithmic traversal. The operational pipeline typically consists of three distinct modules: (1) an expansion model that maps a product to candidate precursors; (2) a feasibility filter that prunes invalid or out-of-scope transformations; and (3) a scoring function that estimates the cost or probability of completing the route from a given state. By separating these components, search-based architectures allow for the modular improvement of chemical reasoning without altering the underlying search logic (Figure 6).

:::caption{#fig-explicit-search-mechanics}[Figure 6. Mechanics of Explicit Graph Search in Retrosynthetic Planning.] The operational pipeline of verify-then-search architectures (Section 6.1). (Stage 1) The expansion phase utilizes a single-step policy (Section 6.2)—either a template-based classifier (Section 6.2.1) or a template-free sequence generator (Section 6.2.2)—to propose candidate precursor sets (Nodes A–E) and assign prior expansion probabilities (p). (Stage 2) An optional feasibility policy explicitly prunes invalid or out-of-scope moves prior to evaluation. (Stage 3) Value estimation updates the expected utility (Q) of the expanded node. Probabilistic Exploration approaches (e.g., MCTS, Section 6.1.1) rely on stochastic rollouts using a lightweight, latency-optimized policy to estimate termination probability, assigning negative rewards for unpurchasable dead ends. Value-guided optimization frameworks (e.g., Retro*, Section 6.1.2) replace rollouts with learned heuristics, directly predicting the cost-to-go either from the isolated target node or by aggregating context across the entire AND/OR search graph (e.g., RetroGraph). :::
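The three-module loop in Figure 6 can be sketched as a best-first search over sets of unsolved precursors. Everything below (the callables `expand`, `is_feasible`, and `score`, and the toy one-step chemistry) is an illustrative assumption, not the API of any published planner.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    cost: float                                   # nodes are ordered by cost only
    open_molecules: tuple = field(compare=False)  # precursors still to be solved
    route: tuple = field(compare=False)           # (product, precursors) steps so far

def plan(target, expand, is_feasible, score, stock, max_iters=1000):
    """Verify-then-search skeleton: expansion model proposes, feasibility
    filter prunes, scoring function orders the frontier."""
    frontier = [Node(0.0, (target,), ())]
    while frontier and max_iters:
        max_iters -= 1
        node = heapq.heappop(frontier)
        unsolved = [m for m in node.open_molecules if m not in stock]
        if not unsolved:
            return node.route  # every leaf is purchasable: route complete
        mol = unsolved[0]
        for precursors, prior in expand(mol):      # (1) expansion model
            if not is_feasible(mol, precursors):   # (2) feasibility filter
                continue
            rest = tuple(m for m in node.open_molecules if m != mol)
            child = Node(node.cost + score(prior), # (3) scoring function
                         rest + tuple(precursors),
                         node.route + ((mol, precursors),))
            heapq.heappush(frontier, child)
    return None

# Toy usage: one disconnection, both precursors purchasable.
expand = lambda m: [(("A", "B"), 0.9)] if m == "AB" else []
route = plan("AB", expand, lambda m, p: True, lambda prior: 1.0 - prior, {"A", "B"})
print(route)
```

Because the three callables are injected, any one module (say, swapping a template classifier for a sequence generator in `expand`) can be upgraded without touching the traversal logic, which is the modularity argument made above.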

6.1.1 Probabilistic Exploration (MCTS)

The historical analogy between retrosynthesis and combinatorial games [108] was computationally realized when Segler et al. adapted Monte Carlo tree search (MCTS) to chemical planning [34]. This adaptation, 3N-MCTS, established the modern paradigm by replacing hand-engineered heuristics with learned components within the search loop. In this framework, the expansion model ranks transformation rules, a feasibility filter screens the proposed reactions, and stochastic simulations (rollouts) estimate the value of a node by probing how likely a given intermediate is to lead, within a fixed number of steps, to a completed route terminating in purchasable starting materials.

To train the feasibility filter without experimental failure data, the authors [34] employed algorithmic negatives: applying templates to known reactants and labeling any unreported products as invalid (Figure 7). While effective on dense proprietary databases like Reaxys, this closed-world assumption degrades on sparser open databases such as USPTO. Conflating undocumented products with chemically impossible ones creates significant false negatives, systematically penalizing valid but novel routes. Consequently, most (but not all [109]) subsequent open-source planners [110] have abandoned explicit feasibility filters, embedding feasibility assessments implicitly into the expansion model.

:::caption{#fig-algorithmic-negatives}[Figure 7. Generation of Algorithmic Negatives for Feasibility Policies.] The construction of synthetic negative data, a standard technique for training in-scope filters and feasibility classifiers (e.g., 3N-MCTS  and RetroGFN ). (Step 1–2) A literature reaction is abstracted into a reaction template (Section 5.2). (Step 3) Template Misapplication: the rule is applied exhaustively to other reactants in the database. Any generated product not explicitly recorded in the positive corpus is labeled an algorithmic negative. While this captures genuine chemical impossibilities (True Negatives), the closed-world assumption—that unreported equals impossible—systematically generates False Negatives. For example, the intramolecular lactamization shown is chemically viable but penalized simply for lacking precedent. (Step 4) Product Swapping: true reactants are paired with a structurally similar but incorrect product to train discriminators. The reliance on algorithmic negatives highlights the epistemic limits of positive-only patent databases (Section 5.5) and motivates the integration of explicit physics-based supervision. :::
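The closed-world labeling in Figure 7 can be reproduced in a few lines: apply a forward template exhaustively and mark any product absent from the positive corpus as a negative. The amidation SMARTS and the two-reaction corpus below are toy assumptions; note that the corpus must be stored under canonical SMILES so that string variants of the same molecule compare equal.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Simplified forward amidation template (a real extractor would carry guards).
amidation = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[OX2H1].[NX3;H2:3]>>[C:1](=[O:2])[N:3]"
)

# Toy positive corpus, canonicalized so lookups are representation-independent.
reported_products = {Chem.CanonSmiles("CC(=O)NC")}

def algorithmic_negatives(reactant_pairs):
    """Apply the template to all pairs; label unreported products as negatives.
    The closed-world assumption makes some of these false negatives."""
    negatives = []
    for acid, amine in reactant_pairs:
        mols = (Chem.MolFromSmiles(acid), Chem.MolFromSmiles(amine))
        for products in amidation.RunReactants(mols):
            Chem.SanitizeMol(products[0])
            smi = Chem.MolToSmiles(products[0])
            if smi not in reported_products:
                negatives.append(((acid, amine), smi))
    return negatives

negs = algorithmic_negatives([("CC(=O)O", "CN"), ("CC(=O)O", "NCC")])
print(negs)  # the N-ethyl amide is labeled negative despite being chemically viable
```

The second pair illustrates the false-negative problem: N-ethylacetamide is perfectly feasible chemistry, yet it is penalized simply because this toy corpus never recorded it.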

A persistent limitation of MCTS is the high variance induced by the branching factor of chemical synthesis. At any state, numerous plausible disconnections exist (OR nodes), but each chosen disconnection may yield multiple precursors that must all be solved (AND constraints). Kishimoto et al. formalized this asymmetry, demonstrating that stochastic simulations can be dominated by shallow branches favored by the model’s initial score, thereby failing to explore longer-horizon routes that terminate only after many steps [35]. To address this exploitation bias, Wang et al. introduced dynamic exploration schedules (mUCT) to broaden coverage of low-scoring but chemically plausible branches, demonstrating that such mechanisms can also incorporate auxiliary objectives like green solvent selection [111]. More recently, Tripp et al. reframed the planning objective: rather than identifying a single optimal route, the goal becomes selecting a portfolio of routes that together maximize the successful synthesis probability (SSP), accounting for the possibility that individual steps may fail in practice [112].
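The exploration-exploitation balance that these works tune is typically mediated by a PUCT-style selection score. The sketch below uses illustrative numbers (not values from any cited system) to show the mechanism behind dynamic exploration schedules: raising the exploration weight lets an unvisited, low-prior branch out-score a well-explored favorite.

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_explore=1.5):
    """PUCT-style selection score: mean value q plus a prior-weighted
    exploration bonus that decays as the child accumulates visits."""
    exploration = c_explore * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + exploration

# Two candidate disconnections at one node (hypothetical statistics).
common = dict(q=0.6, prior=0.70, visits=40)  # shallow branch the policy favors
rare = dict(q=0.0, prior=0.05, visits=0)     # plausible but low-scoring branch
parent_visits = 41

for c in (1.5, 10.0):  # a larger c mimics a more exploratory schedule
    s_common = puct_score(common["q"], common["prior"], parent_visits, common["visits"], c)
    s_rare = puct_score(rare["q"], rare["prior"], parent_visits, rare["visits"], c)
    winner = "common" if s_common > s_rare else "rare"
    print(f"c_explore={c}: selects the {winner} disconnection")
```

With the default weight the common branch keeps winning selection, which is the exploitation bias described above; the larger weight redirects simulations toward the unexplored branch.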

6.1.2 Best-First Search with Learned Heuristics

To reduce the variance and computational cost of stochastic simulations, a complementary line of work replaces rollout-based evaluation with learned estimates of synthetic cost. Schreck et al. pioneered this by framing planning as a single-player game, training a value function to predict the expected remaining synthesis cost from any given intermediate [113]. At inference time, this yields a deterministic planner that selects disconnections by minimizing the immediate reaction cost plus the predicted cost of the resulting precursors. Retro* subsequently formalized this principle within a neural-guided A* framework, decomposing node evaluation into accumulated cost and a learned heuristic for future difficulty on an AND/OR tree [41]. Xie et al. extended this logic to graph-based structures (RetroGraph), merging identical intermediates to share value estimates across redundant branches [114]. Finally, recent efforts such as InterRetro [115] demonstrate that the search process itself can be distilled into a single-step policy via self-imitation learning, allowing for greedy inference that approximates the results of MCTS without the runtime cost of tree expansion.
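The shared idea in this line of work is a node evaluation of the form "accumulated cost plus predicted cost-to-go," with an AND aggregation over all unsolved precursors. The sketch below renders that decomposition directly; the dictionary standing in for the learned value function is a hypothetical stub.

```python
import math

def route_cost(step_probs, open_leaves, value_fn):
    """Retro*-style node evaluation: accumulated reaction cost (negative
    log-likelihood of the steps taken) plus a learned estimate of the
    remaining synthesis cost for every unsolved precursor. The sum over
    leaves is an AND aggregation: all of them must eventually be solved."""
    g = sum(-math.log(p) for p in step_probs)        # cost of steps taken so far
    h = sum(value_fn(leaf) for leaf in open_leaves)  # predicted cost-to-go
    return g + h

# Stub standing in for the learned value function (hypothetical numbers).
value_fn = {"in_stock": 0.0, "simple_intermediate": 1.2, "complex_core": 4.5}.get

cost = route_cost([0.8, 0.5], ["in_stock", "complex_core"], value_fn)
print(round(cost, 3))
```

Replacing the rollout of Section 6.1.1 with this single evaluation is what removes the variance of stochastic simulation, at the price of trusting the heuristic; graph variants such as RetroGraph additionally share `value_fn` estimates across merged duplicate intermediates.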

However, controlled comparisons suggest that the choice of traversal algorithm yields only incremental gains relative to the quality of the chemical model. Within the AiZynthFinder framework [110], Roucairol and Cazenave found that nested Monte Carlo search and greedy best-first search improved solvability only modestly over standard MCTS, concluding that performance is primarily bounded by the expansion model’s ability to propose valid steps [116].

6.1.3 Horizon Effects and Strategic Control

A central limitation of explicit search is the horizon effect: when node scoring is optimized for short-term objectives (e.g., maximizing single-step likelihood), the planner systematically penalizes steps whose utility is only realized later in the route. This includes strategic disconnections [117] that enable convergent assembly, as well as auxiliary operations like protection/deprotection that temporarily increase molecular complexity. MEEA* addresses these limitations by combining best-first expansion with short exploratory simulations a few steps ahead before committing to a node, while training the value function with a path-consistency regularizer to stabilize cost estimates across the search trajectory [118].

Complementary approaches enforce long-range intent explicitly. ReTReK [119] injects curated chemical knowledge, such as preferences for ring disconnections or convergent steps, directly into the selection rule, biasing exploration toward strategies that pure likelihood models often neglect. Similarly, Westerlund et al. demonstrate that in medicinal chemistry, satisfying user intent (e.g., preserving a specific scaffold) often requires forcing the planner to break or freeze specific bonds, a constraint effectively implemented via multi-objective MCTS [120].

Structural modifications to the search process have also been proposed to capture dependencies that span many steps ahead. DESP [121] handles the constraint of reaching specific starting materials by combining top-down retrosynthesis with bottom-up forward expansion — growing the route from both ends simultaneously and using a learned metric of synthetic proximity to guide the convergence of the two frontiers. Beyond the single target, Picazo et al. introduced MultiAiZ [122] to exploit shared intermediates across batches of molecules, dynamically updating the available inventory to encourage convergent plans — routes that reduce overall synthetic effort by routing multiple targets through common key intermediates. A complementary approach elevates the planning problem to a higher level of abstraction entirely. Roh et al. replace atom-level reaction templates with rules that operate on generalized synthons, effectively decoupling strategic bond disconnections from the tactical implementation of specific functional group interconversions [123]. This abstraction allows the planner to bypass the tactical complexity of protecting group sequences, which often creates local minima that trap myopic search algorithms. While this approach yields dramatic improvements in solvability on complex targets, it does so by changing the objective: instead of delivering a fully specified and executable route, it produces a high-level plan whose individual steps must still be instantiated before the synthesis can be carried out.

6.2 Single-Step Reaction Prediction

For search-based planners, the multistep algorithm serves only to assemble routes from the local disconnections proposed by its expansion model: a predictor that maps a target product $P$ to a set of candidate precursors $\{R\}$, typically generating dozens or hundreds of suggestions per step and ranking them by criteria such as predicted reaction feasibility, chemical similarity to known precedents, estimated yield, or learned heuristics. Expansion models are diverse, spanning template-based approaches (RetroSym [32], NeuralSym [124], GLN [125], RetroPath2.0 [126], LocalRetro [127]), template-free sequence and graph models (Seq2Seq [31], MEGAN [128], Chemformer [70]), and hybrids (RetroXpert [129], GraphRetro [130], BioNavi [131]). The ranking is crucial because it determines the order in which disconnections are evaluated during tree expansion; neural expansion models typically output probabilistic scores (e.g., softmax distributions over templates or reactants), prioritizing the disconnections deemed most synthetically viable based on patterns extracted from reaction databases. Differences in traversal algorithms (e.g., MCTS vs. A*) alter how computational resources are allocated, such as through biased sampling in MCTS or heuristic-guided queuing in A*, but they cannot compensate for a deficit of chemical knowledge in the expansion model itself. Under limited computational resources, a route is effectively unreachable if its critical disconnection is never ranked highly enough: search algorithms typically limit branching to the top-$k$ proposals (where $k$ is a hyperparameter such as 10 or 50), so rare but optimal disconnections buried in lower ranks are never explored, whether due to model biases, incomplete training data, or overemphasis on common reaction motifs.
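The top-$k$ truncation effect can be shown with a few lines of arithmetic. The candidate disconnections and logits below are hypothetical; the point is only that a rare but potentially critical move ranked below the cutoff never enters the tree, regardless of the traversal algorithm used afterwards.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical expansion-model logits over candidate disconnections.
candidates = ["amide coupling", "SNAr", "Suzuki", "rare ring contraction", "ester hydrolysis"]
logits = [4.1, 3.2, 2.8, 0.3, 2.5]
ranked = sorted(zip(candidates, softmax(logits)), key=lambda kv: -kv[1])

k = 3  # branching limit imposed by the search algorithm
expanded = [name for name, _ in ranked[:k]]
print(expanded)  # only these proposals enter the search tree
print("rare ring contraction" in expanded)  # any route through it is unreachable
```

Increasing $k$ recovers coverage but multiplies the cost of every level of the tree, which is why the expansion model's ranking quality, not the traversal strategy, usually bounds performance.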

6.2.1 Template-Based Prediction

The canonical expansion model ranks a finite library of reaction templates (Section 5.2) and applies the top-scoring rules to generate precursors. This classification-based approach underlies 3N-MCTS [34] and open-source standards like AiZynthFinder [110], typically employing lightweight neural networks to score on the order of $10^4$–$10^5$ possible transformations based on molecular fingerprints.

The primary advantage of template-based models is their structural discipline: they enforce local syntactic and topological constraints by construction (Section 5.2.1, Tiers 0—1). Because every proposed step results from applying a pre-validated graph edit, the output is guaranteed to be a valid molecular graph. Architectural refinements have largely focused on the ranking problem itself: models such as LocalRetro [127] and GLN [125] improve accuracy by identifying reaction centers directly on the molecular graph, rather than relying solely on global fingerprint vectors.

However, these advantages are accompanied by an intrinsic limitation in coverage: the model’s chemical vocabulary is restricted to transformations represented in the predefined template library. As a result, performance degrades dramatically under distribution shift, when the target molecules or required reactions deviate significantly from those in the training data. In such cases, the model simply lacks the templates needed to propose valid disconnections, leading to incomplete or failed route predictions, as it cannot generalize beyond its hardcoded rules. For instance, AiZynthFinder achieves a solvability rate of 70.9% on ChEMBL but drops to 10.1% on the enumerated GDB MedChem set [132]. This reduction suggests that fixed template libraries struggle to generalize to chemical space outside the historical reaction space. To address this, approaches like RetroGFN [109] and generative template models [133] replace discrete classification over fixed templates with sequential template construction, building reaction templates step-by-step through autoregressive generation [109] or by sampling from the latent space of an autoencoder [133]. This enhances the diversity of proposed disconnections by enabling novel transformations beyond predefined libraries, but at the cost of higher inference latency, since each proposed disconnection requires multiple iterative model evaluations rather than a single, parallel classification pass.

6.2.2 Template-Free Sequence Generation

Template-free models bypass the fixed library by directly generating precursor SMILES strings, typically formulating prediction as a sequence-to-sequence translation task. The Molecular Transformer established this paradigm for forward reaction prediction [134], and subsequent work extended it to retrosynthesis by coupling transformer-based generation with explicit search procedures [36]. AutoSynRoute integrated these sequence models into MCTS, using likelihood scores to prioritize node expansion [135], while Chemformer utilized BART-style pre-training to improve robustness on smaller datasets [70].

The central benefit of template-free generation is coverage: the model is not bounded by the discretization artifacts of template extraction and can, in principle, propose any chemically valid string. The trade-off is the loss of formal guarantees (Section 5.2.1). Because the model emits tokens rather than applying a graph edit, it may generate invalid SMILES or chemically impossible bond changes (Tier 0 failures). These errors necessitate rigorous post-hoc validation and correction. Furthermore, RetroRanker identifies a “frequency bias” in these models, where high-confidence proposals often reflect the statistical prevalence of common reactants rather than the specific structural logic of the target [136]. This has motivated the development of ensemble architectures such as RetroChimera [137], which combine the coverage of sequence models with the structural constraints of graph-based editors.

6.2.3 Throughput as a Planning Constraint

When an expansion model is embedded within a search loop, its inference speed becomes a structural constraint on planning depth. MCTS and A* algorithms require hundreds to thousands of expansions to explore a tree effectively. Template-based classifiers (MLPs/GNNs) can be queried in milliseconds, whereas autoregressive sequence models require computationally expensive token-by-token decoding, often taking seconds per expansion.
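The latency gap is easy to make concrete with back-of-the-envelope arithmetic. The per-call latencies below are illustrative orders of magnitude consistent with the text, not measured benchmarks of any specific system.

```python
# Expansion budget under a fixed wall-clock limit per target.
budget_s = 600  # a 10-minute-per-target evaluation budget

latency_s = {
    "template classifier (MLP/GNN)": 0.005,  # ~milliseconds per query
    "autoregressive transformer":    2.0,    # ~seconds of token-by-token decoding
}

expansions = {model: round(budget_s / s) for model, s in latency_s.items()}
for model, n in expansions.items():
    print(f"{model}: ~{n} expansions in {budget_s // 60} min")
```

At these (assumed) latencies the classifier performs roughly 120,000 expansions where the transformer manages about 300, a 400-fold difference in the number of tree nodes the search can visit, which is why a modest single-step accuracy advantage rarely survives the multistep setting.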

This latency differential creates a speed-accuracy frontier. Maziarz et al. report that under a fixed time budget (e.g., 10 minutes per target), transformer-based planners execute so few expansions that they fail to solve complex targets, despite having higher single-step accuracy [138]. Similarly, Hassen et al. observed that Chemformer, despite superior top-1 predictive performance, underperformed the faster LocalRetro model in multistep solvability (53.4% vs. 80.6%) simply because the search could not explore deep enough within the practical time limit [139].

Consequently, accelerating inference is not merely an engineering detail but a prerequisite for the utility of sequence models in planning. Recent work has applied speculative decoding to increase transformer throughput, recovering some of the performance gap under strict time limits [140]. The broader implication is that for retrosynthesis, model throughput is a component of chemical capability: a marginally less accurate but orders-of-magnitude faster model may be the superior planner.

6.3 Hybrid and Neurosymbolic Approaches

The coverage and latency constraints identified in Section 6.2 motivated the development of hybrid architectures. These systems retain explicit search as the scaffold for compositional planning but inject learned modules (rankers, scoring functions, and heuristics) to steer which steps and partial routes are expanded. By integrating sequence models and large language models (LLMs) as guidance mechanisms rather than primary planners, these approaches aim to combine the rigor of tree search with the semantic reasoning of generative models.

6.3.1 Ensemble and Re-Ranking Architectures

A direct form of integration combines models with complementary inductive biases to improve candidate quality. Maziarz et al. exemplify this with RetroChimera [137], which combines a graph-based editing component (NeuralLoc) with a sequence-based generator (R-SMILES 2) and trains a learning-to-rank model to prioritize proposals from both sources. By fusing graph-based logic, which preserves rule-constrained edits (Tier 1 validity, Section 5.2.1), with sequence-based generation that expands coverage into rare transformations, the architecture mitigates the specific failure modes of each paradigm. RetroChimera further validates this approach through expert preference evaluations, where chemists frequently rated model proposals as superior to historical ground truth, suggesting that the ensemble improves chemical plausibility even when it diverges from specific reference routes [137].

A modular alternative keeps the expansion model fixed but adds a learned verifier to score the plausibility of proposed bond changes. RetroRanker [136] implements this strategy by encoding the reaction center via a graph neural network, outputting a re-ranking score designed to suppress high-confidence but chemically implausible proposals attributed to frequency bias. Although the reported multistep gains are modest, likely because the re-ranking is practically constrained to the first expansion step, the work demonstrates that verification models conditioning on reaction-center structure provide a control mechanism that token-likelihood models lack. Effectively, these methods use structural constraints to filter the syntactic errors common to sequence models, thereby reinforcing Tier 0 validity.

Other hybrid approaches impose preferences at the route level rather than the reaction step. Zipoli et al. steer exploration by comparing the evolving sequence of reactions to embeddings of successful patent routes [141]. This strategy retains explicit search for compositional correctness while supplying an external retrieval signal that encourages the model to mimic the structural patterns of known chemistry, effectively mitigating the tendency of local scoring functions to miss strategic long-term dependencies.

Hybrid architectures can also address specific chemical deficits in pure search. Westerlund et al. developed a post-hoc graph modification framework that wraps the AiZynthFinder planner with a physics-guided selectivity module [142]. The system first diagnoses potential chemoselectivity conflicts, such as competing nucleophiles, using graph neural networks to predict condensed Fukui coefficients: quantum chemical reactivity descriptors that quantify how susceptible each atom in a molecule is to nucleophilic or electrophilic attack. Upon detecting a valid conflict, it automatically “repairs” the route by inserting protection and deprotection steps generated by a specialized transformer, overriding the planner’s tendency to favor shorter, unprotected pathways. This mechanism explicitly bridges the gap between topological connectivity (Solv-1, Section 5.2.1) and experimental selectivity (Solv-2), proving that chemical validity often requires structural complexity that purely data-driven, length-minimizing policies will systematically avoid.

6.3.2 Large Language Models as Heuristic Guides

General-purpose LLMs provide a semantic layer often absent in purely structural planners. Rather than serving as primary expansion models, they function as heuristic guides that inject non-topological constraints, such as safety, material availability, or procedural complexity, into the search loop. Liu et al. (Llamole) implement this by embedding an LLM within A* search to compute the heuristic cost-to-go, h(n), directly from textual descriptions of synthetic feasibility [143]. Baker et al. (LARC) similarly deploy the LLM as a “critic” agent within the MEEA* framework [118, 144], pruning hazardous or impractical intermediates that topologically valid policies might otherwise pursue.
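Architecturally, the key point is that the heuristic is a pluggable module in an otherwise standard A* loop. The sketch below uses a toy graph; `expand`, `is_solved`, and the zero heuristic are placeholder callables rather than any published system's API, and the comment marks where an LLM-derived cost-to-go estimate would enter:

```python
import heapq
from typing import Callable, Dict, Iterable, List, Tuple

def a_star(
    start: str,
    is_solved: Callable[[str], bool],
    expand: Callable[[str], Iterable[Tuple[str, float]]],  # (successor, step cost)
    h: Callable[[str], float],  # cost-to-go; an LLM critic could supply this
) -> List[str]:
    """Generic A*: nodes are ordered by f = g + h(n), so the heuristic is the
    single injection point for learned or LLM-derived guidance (illustrative)."""
    frontier = [(h(start), 0.0, start, [start])]
    best_g: Dict[str, float] = {}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if is_solved(node):
            return path
        if best_g.get(node, float("inf")) <= g:
            continue  # already reached this node more cheaply
        best_g[node] = g
        for succ, cost in expand(node):
            heapq.heappush(frontier, (g + cost + h(succ), g + cost, succ, path + [succ]))
    return []  # no route found

# Toy retrosynthesis graph with placeholder node names.
graph = {"target": [("intermediate", 1.0)], "intermediate": [("buyable", 1.0)]}
route = a_star("target", lambda n: n == "buyable",
               lambda n: graph.get(n, []), lambda n: 0.0)
print(route)  # ['target', 'intermediate', 'buyable']
```

With h ≡ 0 this degenerates to uniform-cost search; substituting a feasibility-aware estimate reorders the frontier without changing the algorithm's correctness guarantees (provided the heuristic remains admissible).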

Beyond scalar scoring, Song et al. utilize LLMs for macro-expansion via AOT* (And-Or Tree search) [145]. Here, the model proposes coherent multi-step route fragments which are subsequently verified and grafted into the search tree by template-based logic. This neurosymbolic design effectively treats the LLM as a source of strategic intuition while retaining the symbolic engine for rigorous Tier 1 graph validation. However, the stochastic nature of text generation introduces significant reproducibility challenges. Unlike standard discriminative models that provide deterministic outputs, LLM-based guidance depends heavily on decoding strategies and prompt formulation, requiring rigorous standardization of the inference context to guarantee consistent planning behavior.

6.4 Direct Sequence Generation

While hybrid systems retain an explicit search procedure that assembles a route step by step, the direct sequence paradigm shifts the entire burden of multistep planning onto a single generative model. In this framework, retrosynthetic routes are not assembled by recursive expansion of a reaction tree but are generated directly as a sequence of tokens produced via autoregressive decoding, the same way a language model generates a sentence. This architectural shift replaces the modular complexity of heuristic search algorithms with scalable representation learning, relocating the computational effort from inference-time tree exploration to the training stage: strategic knowledge is distilled into model parameters once and then applied during inference.

6.4.1 Autoregressive Template and Trajectory Modeling

Intermediate approaches in this domain focus on predicting the sequence of reaction templates rather than the molecular graph itself. This strategy aims to preserve Tier 1 structural constraints while enabling the model to condition on the entire planning history.

Xuan-Vu et al. introduced TempRe to explore this middle ground, utilizing a transformer to generate reaction templates (SMARTS) autoregressively from the product rather than emitting reactant SMILES directly [146]. This representation expands the effective chemical vocabulary while ensuring that every proposed transformation remains chemically valid by construction. Since each generated token corresponds to a predefined graph edit, the model cannot produce reactions that violate basic chemical rules. Crucially, empirical evaluations demonstrate that constrained generation (filtering outputs against a template library) yields significantly higher route reconstruction fidelity than unconstrained decoding.
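In its simplest form, the constrained-generation idea reduces to filtering generated templates against a trusted library. The SMARTS strings and scores below are illustrative placeholders; a real system would canonicalize templates before comparison and then apply the surviving ones as graph edits:

```python
# Minimal sketch of constrained template generation: keep only proposals
# that appear in a curated SMARTS library, preserving the model's ranking.
# Templates are illustrative stand-ins, not a real reaction corpus.

TEMPLATE_LIBRARY = {
    "[C:1](=[O:2])[OH]>>[C:1](=[O:2])Cl",  # acid -> acid chloride (illustrative)
    "[C:1][OH]>>[C:1]Br",                  # alcohol -> bromide (illustrative)
}

def filter_generated(templates_with_scores):
    """Discard generated templates outside the library."""
    return [(t, s) for t, s in templates_with_scores if t in TEMPLATE_LIBRARY]

proposals = [
    ("[C:1][OH]>>[C:1]Br", -0.2),
    ("[C:1][OH]>>[C:1][C:1]", -0.4),  # hallucinated edit, absent from the library
]
print(filter_generated(proposals))  # [('[C:1][OH]>>[C:1]Br', -0.2)]
```

Because every surviving token sequence corresponds to a predefined graph edit, downstream steps inherit Tier 1 validity by construction, which is the property the unconstrained decoding baseline lacks.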

A complementary strategy models the planning trajectory itself. Granqvist et al. [147] train a decision transformer using the records of complete planning episodes of multistep retrosynthesis (the current intermediate molecule, the disconnection applied, and the quality of the resulting route) drawn from the PaRoutes dataset [40]. However, this approach reveals a computational bottleneck: to achieve solvability rates competitive with standard search algorithms, the model requires a large beam width (e.g., 50), resulting in inference throughput significantly lower than efficient MCTS implementations. This suggests that without the pruning logic of a search tree, sequence models must implicitly recreate the search process during decoding to ensure valid termination.

6.4.2 Full-Route Sequence Prediction

Full-route generators push this paradigm to its logical conclusion: the model emits the entire retrosynthetic tree (including intermediates, branching points, and termination leaves) as a single structured sequence. This formulation addresses a key limitation of step-by-step greedy search: because the model plans the entire route at once, it can recognize when two separate branches of a synthesis should converge on a shared intermediate, and account for the fact that an early disconnection may only make chemical sense in light of what comes several steps later. Such strategic considerations are often overlooked by a planner working one step at a time.

Recent implementations frame retrosynthetic planning as a translation task, where the target molecule is mapped directly to a complete synthetic route represented as a structured sequence. Shee et al. developed DirectMultiStep [42], a family of transformer-based models trained from scratch to generate multistep retrosynthetic routes as a single string, leveraging a mixture-of-experts approach to improve efficiency and accuracy. The architecture employs a classical encoder-decoder transformer setup [9] with additional gated mixture-of-expert blocks [148], with the encoder processing the input sequence (the target SMILES string and, optionally, the desired route length or starting material) and the decoder predicting the output route. The mixture-of-experts elements significantly improve coverage across the diverse reaction types encountered in multistep planning, routing different chemical subproblems to specialized sub-networks within the model. This design allows DirectMultiStep to predict full routes, including branches and termination leaves, in a single pass, outperforming iterative search methods by capturing long-range dependencies and convergent strategies. The flagship variant, DMS Explorer XL, which requires only the target structure as input, outperformed contemporary search-based methods on the PaRoutes benchmark, achieving 1.9-fold and 3.1-fold improvements in Top-1 route reconstruction accuracy on the n_1 and n_5 evaluation sets, respectively, and demonstrated generalization to FDA-approved drugs absent from the training data.

More recently, Sun et al. introduced SynLlama [149], a pair of specialized models derived from Meta’s Llama-3.1-8B (8 billion parameters) and Llama-3.2-1B (1 billion parameters) through supervised fine-tuning on retrosynthetic data constructed by applying pre-defined templates to the Enamine library [150]. This approach enables the direct generation of linear route descriptions from target molecular structures represented as SMILES strings. By leveraging the pre-trained linguistic capabilities of the base models, SynLlama bypasses recursive tree expansion, focusing instead on deconstructing targets into commercially available precursors. The model achieves high reconstruction rates (up to 74% on unseen Enamine sets) with reduced training data compared to prior generative baselines.

A similar approach was proposed by Wang et al. Their model, LLM-Syn-Planner [151], adapts general-purpose LLMs such as GPT-4o or DeepSeek-V3, without additional fine-tuning, to produce linear retrosynthetic routes directly from target SMILES strings. The model outputs sequential decision lists, encompassing rationales, products, reactions, and reactants for each step, terminating when all precursors are purchasable from databases like eMolecules [152]. The design employs an evolutionary optimization algorithm that initializes, evaluates, and mutates full routes, drawing on examples from similar historical syntheses retrieved via molecular fingerprints. The result is enhanced performance on benchmarks like USPTO and Pistachio, with solve rates exceeding 90% on simpler sets.

This architectural shift towards full-route sequence prediction results in a characteristic performance profile: while explicit search algorithms (e.g., MCTS) excel at finding any topological path to a starting material (high navigability), sequence-based generators typically demonstrate superior fidelity in recovering the specific convergent logic of experimental reference routes (high validity). By optimizing for the joint probability of the entire sequence, these models avoid the locally valid but strategically incoherent decisions often made by step-by-step planners. However, end-to-end generation introduces structural rigidity. In explicit planners, the chemical logic (expansion model) and search constraints (inventory, forbidden reactions) are modular; one can swap stock lists without retraining the policy. In direct sequence generators, these boundary conditions are implicitly baked into the model weights during training, requiring fine-tuning or complex constrained decoding to adapt to new inventories.

6.5 Comparative Overview of Multistep Planning Method Families

To summarize the qualitative comparisons developed in this section, Table 6 organizes the major families of methods side by side. It highlights representative systems, their dominant training signals, the extent to which explicit search is required, and their characteristic strengths and limitations. The final columns relate each family to the Solv-NN hierarchy introduced in Section 5.2.1, indicating the level at which current practice most naturally supports robust evaluation, and list common failure modes that motivate the transition from navigability-focused benchmarks to validity-oriented assessment.

7 From Navigability to Validity: A Critical Analysis of Benchmarking

From 2018 to 2023, the primary challenge in computational retrosynthesis was demonstrating that algorithms could navigate the combinatorial explosion of the search tree (Figure 1). This period, which we characterize as the Era of Navigability, focused on finding any topological path connecting a target to a starting material. By the primary metric of this era, the stock-termination rate (STR), modern planners have largely solved the navigation problem, routinely achieving success rates exceeding 99% on standard benchmarks. Further improvements (e.g., from 99.5% to 99.8%) are statistically marginal and often reflect hyperparameter tuning rather than algorithmic progress. This necessitates a transition into the Era of Validity, where the objective is no longer to find a path but to verify its chemical correctness.

Figure 8 illustrates the conceptual distinction between topological connectivity and experimental feasibility. While a planner optimizing for Solv-1 identifies a dense graph of potential connections, experimental constraints such as selectivity and purification requirements impose a filter that likely prunes this graph. Currently, standard evaluation metrics do not quantify this reduction; they treat all topologically valid routes as equal. Consequently, the field currently lacks the instrumentation to distinguish chemically sound proposals from those that are merely graph-theoretically connected.

:::caption{#fig-validity-sieve}[Figure 8. The Phase Transition from Navigability to Validity.] (A) The Era of Navigability: When evaluated strictly on topological stock-termination (Solv-1, Section 8.1), planners routinely achieve 99% success, perceiving a dense, highly connected graph of plausible routes (blue). (B) The Validity Sieve: Transitioning to experimental reality requires filtering proposed routes through higher-order chemical constraints. The inset illustrates a characteristic Tier 2 (Chemoselectivity) failure hidden within Solv-1 graphs: a planner proposes a Bromo-Suzuki coupling, ignoring that the more reactive iodide will undergo preferential oxidative addition. (C) The Era of Validity: Upon applying Selectivity (Solv-2) and Executability (Solv-3) constraints, the vast majority of topologically valid routes are rendered chemically non-viable (red). The true experimentally actionable search space (green) is drastically sparser than legacy metrics suggest, underscoring the necessity of the rigorous benchmarking framework proposed in Section 5.2.1. :::

Note

Table 7. Stock Inflation Across Retrosynthesis Benchmarks (full version in the Evaluation section). Core point: inventory scope is at least as important as algorithmic choice; planners benchmarked against virtual libraries (∼231M compounds) report Solv-1 rates far above those using physical buyables (∼85k–330k compounds).

7.1 Inventory Size as a Difficulty Dial

Because the STR measures only whether a route ends, it is strictly dependent on the definition of the available inventory. The size of the stock set functions as a difficulty dial: expanding the inventory increases the density of termination points, statistically shortening the required search depth and increasing the probability of success. Consequently, STR values are not portable across studies unless the inventory is fixed.
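The dependence of STR on the stock definition can be made explicit in a few lines: the same proposed routes succeed under a broad virtual inventory and fail under a narrow physical one. All molecule names below are placeholders:

```python
# STR depends entirely on the inventory: a route "solves" iff every leaf
# of its retrosynthetic tree is a stock compound. Names are illustrative.

def str_rate(routes, stock):
    """Fraction of routes whose every leaf node is in the inventory."""
    solved = sum(1 for leaves in routes if all(leaf in stock for leaf in leaves))
    return solved / len(routes)

routes = [
    {"aniline", "acetic_anhydride"},  # terminates in commodity chemicals
    {"complex_boronate"},             # terminates in a make-on-demand intermediate
]
physical_stock = {"aniline", "acetic_anhydride"}
virtual_stock = physical_stock | {"complex_boronate"}

print(str_rate(routes, physical_stock))  # 0.5
print(str_rate(routes, virtual_stock))   # 1.0
```

Nothing about the planner changed between the two calls; only the termination set did, which is why STR values are incomparable across studies that use different inventories.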

In practice, evaluations often treat two distinct inventory types as interchangeable. As defined in Section 5.3, the physical tier (∼10^5–10^6 compounds) forces the planner to deconstruct targets into simple commodity chemicals. The virtual tier (∼10^7–10^9 compounds), essentially a list of make-on-demand targets, relaxes the problem by allowing termination at complex intermediates. This effectively transforms the computational task from deep route planning to intermediate retrieval.

Empirical comparisons confirm that varying the inventory alters apparent performance more than the choice of search algorithm (Table 7). For example, Guo et al. showed that for a fixed planner, expanding the inventory from a physical subset to a large virtual set increased STR from 73.5% to 87.3% [153]. Similarly, recent state-of-the-art results, such as the 100.0% STR for MEEA* [118] and 99.5% for PDVN [154], rely on virtual catalogs exceeding 230 million entries. When the inventory is restricted to physically deliverable buyables, success rates drop sharply; standard MCTS planning achieves only 18.7% on high-difficulty targets under strict stock constraints [155]. High success rates against virtual libraries therefore reflect the breadth of the inventory rather than the depth of the planning logic.

Furthermore, even nominally identical stock sources do not guarantee stable evaluation conditions. Studies citing eMolecules or ZINC inventories often differ in whether they employ full catalogs, screening subsets, or merged composites [114, 151, 156-157]. For instance, Retro* evaluations utilized an eMolecules inventory of ∼231 million entries [41], whereas subsequent work reported substantially smaller sets (∼23 million) or hybrid lists (∼35 million for ZINC+eMolecules) [151, 156]. This variance confirms that STR values are not portable across the literature unless the stock definition is rigorously standardized.

Note

Evaluation Tables A & B (full versions in the Evaluation section). Stock Inflation compares inventory tier, stock source, and Solv-1 rates; Validity Gap & Complexity Cliff breaks down route reconstruction by depth. Takeaway: the same planner can look dramatically better when the stock set expands from physical buyables to massive virtual catalogs, and near-perfect Solv-1 does not imply chemically faithful routes.

7.2 The Saturation and Fragmentation of Test Sets

The second major confounder in current benchmarking is target selection. The field’s reliance on a small number of historical test sets, particularly USPTO-190 (Retro*-190), has led to a saturated evaluation environment. This benchmark was constructed by explicitly filtering for targets whose synthetic steps were ranked highly by a baseline model, a process designed to isolate and test graph search algorithms in 2020 [41]. While instrumental in the navigability era, this pre-conditioning means the set is not representative of novel or challenging chemical space. As a result, modern planners now routinely achieve STRs between 93% and 100% [112, 114-115, 118, 121, 123, 145, 151, 153-154, 158-162], rendering the benchmark non-discriminative for state-of-the-art systems.

Outside this saturated standard, the evaluation landscape is highly fragmented. Studies frequently employ bespoke test sets, including custom subsets from ChEMBL or Reaxys, hazard-filtered targets, or case studies selected for specific chemical features [111, 116, 119, 135, 144, 163-166]. While useful for specific investigations, this practice prevents the accumulation of shared knowledge about model strengths and weaknesses, and makes it difficult to perform meaningful cross-paper comparisons.

To resolve these issues, evaluation must move from reporting aggregate success to performing category-specific analysis on representative benchmarks. The PaRoutes dataset established a standard for this by binning targets by difficulty (n_1 vs. n_5) [40]. An even more granular analysis using reference route length (as a proxy for planning difficulty) has revealed a complexity cliff: STR remains high for short routes of 2-4 steps but collapses on longer syntheses of more than 6 steps [106]. Ultimately, verifying true generalization requires that this category-specific analysis be paired with rigorous hold-out sets defined by scaffolds, reaction classes, or temporal splits, thereby disentangling chemical reasoning from the memorization of training patterns [138, 146, 167].

7.3 The Divergence Between Topological Success and Chemical Validity

Beyond issues of target selection, the stock-termination rate metric is insensitive to the distinction between topological connectivity and chemical plausibility. A high STR certifies that a planner can find a path, but provides no information about whether that path corresponds to a viable experimental procedure.

This divergence is apparent in benchmarks that report both STR and Top-K route reconstruction. For example, on the PaRoutes n_5 set, planners with near-identical STR show measurable differences in route reconstruction accuracy [40]. In the Torren-Peraire et al. audit, one planner achieved 99.7% STR yet recovered only 11.9% of ground-truth routes (Table 8, Panel A) [157]. Performance also degrades with route complexity; as shown in Table 8 (Panel B), the reconstruction accuracy of explicit search planners often collapses as route length increases. The issue is particularly acute on the USPTO-190 benchmark, where despite near-universal STR, audits find that Top-10 route reconstruction is in the low single digits [106].
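A minimal sketch clarifies why reconstruction diverges from STR: reconstruction asks whether a specific reference route appears among the top-ranked predictions, not merely whether any prediction terminates in stock. Representing a route as a frozenset of reaction strings is a simplifying assumption that ignores step ordering and tree structure:

```python
def topk_reconstruction(predictions, reference, k):
    """True iff the reference route appears among the top-k predictions.
    Routes are frozensets of reaction SMILES: a simplification that
    ignores step ordering and branching structure."""
    return any(pred == reference for pred in predictions[:k])

reference = frozenset({"A>>B", "B.C>>D"})
predictions = [
    frozenset({"A>>B", "E>>D"}),    # may terminate in stock, yet is not the reference
    frozenset({"A>>B", "B.C>>D"}),
]
print(topk_reconstruction(predictions, reference, 1))  # False, even if STR is 100%
print(topk_reconstruction(predictions, reference, 2))  # True
```

The first prediction illustrates the gap directly: it can count as "solved" under Solv-1 while contributing nothing to Top-1 reconstruction accuracy.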

A common explanation for this disparity is that planners may find valid routes that differ from the historical reference, which strict Top-K matching penalizes. While this is possible, the validity of such algorithmically generated alternatives is difficult to verify without experimental follow-up, and recent analyses of the underlying single-step models provide reason for skepticism. An audit by Tran et al., for instance, reveals a systemic bias in these models toward proposing simpler transformations than those recorded experimentally [168]. This bias manifests as frequent errors in stereochemistry, leaving group assignment, and an underestimation of reaction complexity, suggesting that “novel” routes may often be chemically naïve artifacts rather than viable alternatives. Audits of “solved” multistep routes confirm this ambiguity, revealing proposals such as single steps with seven distinct reactants [106]. Consequently, until the field develops automated Tier 2/3 metrics, route reconstruction, however conservative, remains a primary proxy for grounding evaluation in experimental reality [168].

Addressing the limitations of relying on a single reference route, Guo et al. proposed a learned scoring function that predicts a route’s similarity to a hypothetical expert-designed path, even for novel targets [169]. By fine-tuning this predictor on human ratings, they developed a metric that correlates with chemical intuition better than binary solvability. However, the model architecture treats the reactions in a route as an unordered collection, a design choice that limits its ability to evaluate syntheses where the precise sequence of transformations is critical, such as those involving protecting groups.

7.4 Evaluator Dependence and Metric Fragmentation

Finally, evaluation is compromised when the metric is dependent on the model being evaluated. A prominent example is round-trip accuracy, in which a forward reaction predictor is used to validate the retrosynthetic disconnections proposed by the planner. While useful for filtering syntactic errors, these checks do not provide an independent measure of validity. When the forward model shares the same training distribution and architectural biases as the planner, a successful round-trip primarily confirms internal consistency rather than objective chemical correctness.
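The round-trip check itself is straightforward to express. In the sketch below, the stand-in `forward_model` lookup replaces the learned forward predictor, which in practice may share training data and biases with the planner it is meant to validate:

```python
# Round-trip consistency sketch: a forward model re-predicts the product
# from the proposed reactants. `forward_model` is an illustrative lookup
# table, not a real predictor; names and SMILES are placeholders.

forward_model = {"CC(=O)Cl.Nc1ccccc1": "CC(=O)Nc1ccccc1"}  # acylation (illustrative)

def round_trip_ok(reactants: str, product: str) -> bool:
    """Pass iff the forward prediction recovers the original product."""
    return forward_model.get(reactants) == product

print(round_trip_ok("CC(=O)Cl.Nc1ccccc1", "CC(=O)Nc1ccccc1"))  # True
print(round_trip_ok("CC(=O)Cl.Nc1ccccc1", "CCO"))              # False
```

The structural weakness is visible in the sketch: the check is only as independent as `forward_model` is from the retrosynthesis model, so a shared training distribution turns "round-trip pass" into a measure of internal agreement rather than chemical truth.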

This challenge is further compounded by metric fragmentation, in which new methods are routinely introduced together with custom evaluation criteria. Metrics such as novel diversity scores or composite “feasibility scores” [162] can effectively highlight the strengths of a particular architecture, yet they hinder direct comparisons across the literature. Composite metrics, in particular, obscure the underlying causes of performance gains by combining multiple distinct factors, such as termination rate and the confidence of a learned classifier, into a single scalar value.

Progress in the Era of Validity requires that method development be decoupled from the definition of evaluation metrics. Rigorous evaluation is most informative when conducted within independent community-standardized frameworks that fix the evaluator, the stock set, and the target distribution (e.g., Syntheseus [138], RetroCast [106]). Only by standardizing the measurement protocols can the field reliably attribute performance gains to algorithmic advances rather than to choices in metric design.

8 A Framework for Validity-Centric Evaluation

As retrosynthesis planners have matured, the limitations of evaluating them with a single aggregate success rate have become apparent. The saturation of stock-termination metrics on standard benchmarks (Section 7) necessitates a move toward more granular and chemically meaningful evaluation protocols. To facilitate this transition and enable more rigorous comparison of model capabilities, we propose a framework for validity-centric evaluation. This framework is designed to differentiate between topological connectivity and experimental plausibility, assess the quality of ranked route suggestions, and encourage standardized and reproducible benchmarking practices.

8.1 The Solvability Hierarchy (Solv-N)

The cornerstone of this framework is a tiered classification of solvability designed to resolve the ambiguity of current metrics. We propose expanding the term “solvability” (currently used to refer to the stock-termination rate) into a formal series, Solv-N (Table 9), where each level corresponds to a progressively stricter set of chemical validity constraints, as previously outlined in Section 5.2.1 and Table 4.

  • Solv-0 (Syntactic Solvability): The route consists of syntactically valid molecular graphs.

  • Solv-1 (Topological Solvability): The route connects the target to the stock set via topologically valid transformations. This is equivalent to the current stock-termination rate (STR) metric.

  • Solv-2 (Selectivity Solvability): The route’s transformations are chemically plausible, satisfying selectivity constraints.

  • Solv-3 (Executability Solvability): The route is experimentally viable under realistic laboratory conditions.

| Tier | Core Question | Minimum Required Inputs | Validator Type | Benchmark Output | Typical Failure Modes | Label-Noise Sources | Human Adjudic. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| S0 | Syntactic validity | Canonical molecular representation (e.g., SMILES, SELFIES, graph); reaction format; atom mapping if relevant | Deterministic parser, sanitizer, valence checker, grammar validator | Validity rate; parsable fraction; invalid-string frequency | Invalid strings or graphs, valence errors, malformed reaction records, atom-mapping corruption | Toolkit disagreement, canonicalization differences, parser behavior, representation-conversion artifacts | No |
| S1 | Stock-terminated topological route | Target; one-step model or reaction network; search policy; stock definition; stopping rule; evaluator protocol | Route-connectivity checker, stock-membership checker, search/evaluator audit | Stock-termination rate; topological solvability; route depth/length; success under fixed budget | Circular routes, evaluator-dependent success, termination in inflated inventories, shortcut artifacts | Inventory leakage, virtual-stock inflation, stock normalization differences, target overlap with stock or training data | Usually no |
| S2 | Chemically plausible / selective route | Ordered route; stereochemistry; reaction context; optional reagents/roles; precedent or forward-model scores | Rule-based filter, forward model, stereo/selectivity checker, precedent matching, expert panel | Step plausibility; route pass/fail; all-steps-plausible fraction; selectivity-aware success | Chemoselectivity or regioselectivity errors, stereo loss/inversion, incompatible functional groups, missing protecting-group logic | Missing stereochemistry, incomplete metadata, noisy atom mapping, uneven precedent coverage, model miscalibration | Often yes |
| S3 | Executable route under realistic constraints | Route order; reagents/conditions; stoichiometry if available; protecting-group strategy; purification assumptions; scale/cost/time constraints | Condition model, workflow/rule engine, availability/cost checker, process filter, lab record or expert review | Executable-route rate; constrained success; expected cost/time/yield; fraction passing all constraints | Condition incompatibility, unavailable reagents, cumulative yield collapse, purification bottlenecks, unsafe or unscalable steps | Underspecified conditions, missing yield/scale data, supplier drift, undocumented purification burden, lab-to-lab variability | Frequently yes |

:::caption{#tab-solv-protocol}[Table 9. Operational Benchmark Scaffold for the Hierarchy of Chemical Validity (Solv-N).] Each tier corresponds to a stronger notion of success, moving from syntactic well-formedness (Solv-0) and stock-terminated topological planning (Solv-1) to chemically plausible, selective routes (Solv-2) and experimentally executable routes under realistic constraints (Solv-3). The table is intended as a minimal operational scaffold rather than a finalized community standard. As the hierarchy ascends, evaluation requires richer metadata, more heterogeneous validators, and increasing expert involvement. :::

This tiered system allows for more precise reporting. For example, a planner achieving near-perfect stock termination without verified chemical correctness would be accurately described as having a high Solv-1 rate. This notation clarifies that higher-order chemical constraints remain unverified, preventing the conflation of graph connectivity with experimental feasibility.
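This reporting convention can be operationalized as "the highest tier whose validator passes," evaluated in order. The validators below are placeholders for the tier-specific checkers of Table 9, not implementations of them:

```python
from typing import Callable, List

def solv_level(route, validators: List[Callable]) -> int:
    """Return the highest Solv-N tier the route passes, applying validators
    in order (Solv-0, Solv-1, ...). Returns -1 if even Solv-0 fails.
    Validators are illustrative stand-ins for the checkers in Table 9."""
    level = -1
    for n, passes in enumerate(validators):
        if not passes(route):
            break  # tiers are cumulative: a failure caps the level here
        level = n
    return level

# Toy route: syntactically valid (Solv-0) and stock-terminated (Solv-1),
# but failing the selectivity check (Solv-2).
checks = [lambda r: True, lambda r: True, lambda r: False, lambda r: False]
print(solv_level("toy-route", checks))  # 1, reported as "high Solv-1 rate"
```

Reporting the integer tier rather than a bare "solved" flag is precisely what prevents a Solv-1 result from masquerading as experimental feasibility.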

Achieving Solv-2 is particularly challenging, as it requires satisfying all sub-criteria—chemoselectivity (C), regioselectivity (R), diastereoselectivity (D), enantioselectivity (E), and stoichiometry (S)—simultaneously for every step. A proposed route is only fully validated at this level if it meets the Solv-2C, -2R, -2D, -2E, and -2S constraints concurrently. Given the difficulty of building automated verifiers for all these aspects, we suggest that as an interim measure of progress, individual sub-tier success rates (e.g., a Solv-2C rate) can serve as valuable diagnostics for specific model capabilities. The development of open-source community-standardized Solv-2 verifiers, such as the recently proposed ChemCensor [170], would therefore be a critical step toward enabling large-scale validity-centric benchmarking.

8.2 Toward Operational Solv-2/3 Benchmarks

The higher tiers of the Solv-N hierarchy extend evaluation beyond graph connectivity and therefore require judgments about chemical plausibility and experimental feasibility. At present, these tiers lack universally accepted benchmark protocols. We do not claim that Solv-2 and Solv-3 are presently available as universally standardized labels; rather, we propose a minimal operational scaffold that could make validity-centric benchmarking progressively more reproducible across datasets and model classes. [38, 40, 171]

8.2.1 Minimal metadata requirements.

Evaluating Solv-2 plausibility requires more information than a bare reaction graph. At minimum, benchmarks should provide an ordered sequence of steps, explicit stereochemical annotations, atom mappings or equivalent atom correspondence information, and, where available, reagent or reaction-context fields. Additional metadata such as reaction-class labels, precedent links, or forward-model confidence scores can support automated plausibility checks, but should not be treated as mandatory if absent from the underlying corpus. [171-172] Solv-3 evaluation requires richer metadata still, including reaction conditions (e.g., solvent, catalyst, temperature when known), starting-material availability, and coarse workflow constraints such as protecting-group strategy, purification assumptions, and route-level resource constraints. [38, 173]

8.2.2 Stepwise versus routewise evaluation.

Solv-2 is most naturally evaluated at the step level, since selectivity, stereochemical fidelity, and functional-group compatibility are local properties of individual transformations. In practice, however, the benchmarked quantity should still be reported at the route level: a route passes Solv-2 only if all constituent steps satisfy a defined plausibility criterion. This mirrors the logic of route-benchmark frameworks such as PaRoutes, where route quality is ultimately assessed over complete multistep plans rather than isolated disconnections. [40] Stepwise validation is therefore the mechanism; routewise success is the reporting unit.

8.2.3 Treatment of stereochemistry.

Stereochemistry should be treated as an explicit constraint rather than an optional annotation. A step should count as stereochemically valid only if the proposed transformation preserves, removes, or induces stereochemical information in a manner consistent with known precedent, mechanistic expectation, or a calibrated forward predictor. [172, 174] If stereochemistry is missing or underspecified in the benchmark record, this should be recorded as ambiguity rather than silently defaulted to a pass, since incomplete stereochemical metadata is a known source of reaction-data noise. [171]
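The pass/fail/ambiguous treatment can be sketched as a tri-state check; the verdict categories follow the protocol above, while the interface and the outcome vocabulary ('retention', 'inversion', ...) are illustrative assumptions.

```python
from enum import Enum

class StereoVerdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    AMBIGUOUS = "ambiguous"  # recorded explicitly, never defaulted to PASS

def check_stereo_step(declared_outcome, precedent_outcomes):
    """Toy stereochemistry check: a step passes only if its declared
    stereochemical outcome (e.g. 'retention', 'inversion') is consistent
    with known precedent; missing or unverifiable annotations are flagged
    as ambiguity rather than silently passed."""
    if declared_outcome is None or not precedent_outcomes:
        return StereoVerdict.AMBIGUOUS
    return (StereoVerdict.PASS if declared_outcome in precedent_outcomes
            else StereoVerdict.FAIL)
```

Tracking the AMBIGUOUS fraction separately also quantifies how much of a benchmark's stereochemical metadata is missing, itself a useful data-quality signal.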

8.2.4 Stoichiometry and reagent roles.

Where available, stoichiometric roles (reactant, reagent, catalyst, solvent) should be retained and checked for consistency with the proposed transformation. In many public reaction corpora these fields are incomplete or noisy, so stoichiometry may need to be treated as a soft constraint at Solv-2 rather than a hard requirement. [171] For Solv-3, however, missing or implausible reagent-role assignments should count against executability, because condition selection and workflow feasibility depend directly on them. [38, 173]

8.2.5 Protecting-group logic.

Protecting-group strategy is a common source of hidden complexity in multistep routes. At Solv-2, missing or incompatible protecting-group logic can be counted as a plausibility failure when functional-group conflicts are evident locally. At Solv-3, protecting-group handling must be assessed in the context of the entire route, including the feasibility and cost of installation and removal steps, compatibility with neighboring transformations, and the downstream purification burden they impose. [38, 175]

8.2.6 Stepwise versus route-level executability.

Solv-3 is inherently a route-level property. Individual steps can be screened for condition compatibility, reagent availability, or likely failure modes, but true executability depends on cumulative effects such as yield attrition, purification bottlenecks, scheduling constraints, and cross-step incompatibilities. [40, 175] Stepwise filters are therefore useful as preliminary screens, but final Solv-3 assessment should be defined over the full route.
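One cumulative effect is simple to quantify: yield attrition. Under the idealized assumption that step yields compose multiplicatively, route-level yield decays geometrically with route length:

```python
import math

def overall_yield(step_yields):
    """Route-level yield under the (idealized) assumption that step yields
    compose multiplicatively; real routes can fare worse due to cross-step
    incompatibilities and purification losses."""
    return math.prod(step_yields)

# Ten steps at a respectable 90% isolated yield each leave only about a
# third of the material: overall_yield([0.9] * 10) is roughly 0.349.
```

This is one reason stepwise filters can only be preliminary screens: every step may look fine in isolation while the route as a whole is materially untenable.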

8.2.7 Reconciling validator disagreement.

In practice, rule-based filters, predictive models, and expert judgment will sometimes disagree. Rather than forcing a single authority, benchmarks should record validator outputs separately and treat disagreement as structured uncertainty. One pragmatic protocol is to use a consensus or weighted-consensus rule for binary pass/fail reporting while preserving the underlying validator scores for downstream analysis. Expert adjudication can then be reserved for disputed or high-impact cases. This is especially important because reaction-data preparation, atom mapping, and condition annotation are themselves nontrivial sources of label noise. [171, 176] A validity-centric benchmark should therefore report not only aggregate success but also the provenance and uncertainty of the labels used to define that success.
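A weighted-consensus rule of the kind described might look like the following sketch; the weights, threshold, and return shape are illustrative choices rather than a proposed standard.

```python
def consensus(validator_scores, weights, threshold=0.5):
    """Weighted consensus over heterogeneous validators (rule-based filters,
    learned models, expert labels). Returns a binary pass/fail for reporting
    while preserving the raw per-validator scores, so disagreement survives
    as structured uncertainty for downstream analysis.

    validator_scores: dict mapping validator name -> score in [0, 1]
    weights:          dict mapping validator name -> nonnegative weight
    """
    total = sum(weights[name] for name in validator_scores)
    weighted = sum(weights[name] * score
                   for name, score in validator_scores.items()) / total
    return weighted >= threshold, weighted, dict(validator_scores)
```

Cases where the weighted score sits near the threshold, or where individual scores diverge sharply, are natural candidates for the expert adjudication reserved for disputed or high-impact routes.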

8.3 Ranking Beyond Termination: MRR-V

Binary solvability metrics are insufficient for planners that generate multiple candidate routes. A system that outputs 99 chemically invalid routes and one valid route achieves 100% solvability, but is practically inferior to a system that consistently ranks the valid route first. To capture this, we propose the Mean Reciprocal Rank of Validity (MRR-V$_i$), defined as the mean reciprocal rank of the first route that satisfies Tier-$i$ validity.

$$\text{MRR-V}_i = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_i(q)}$$

Here, $Q$ is the set of test targets, and $\text{rank}_i(q)$ is the rank of the first route proposed for target $q$ that passes the Tier-$i$ validity check. Adopting MRR-V$_2$, for example, would incentivize the development of models that not only find a selectivity-valid route but also rank it highly, directly aligning algorithmic optimization with practical laboratory utility.
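Under the common convention that a target with no valid route among its candidates contributes a reciprocal rank of zero, MRR-V$_i$ can be computed directly from ordered per-route validity flags; the data layout below is an assumption for illustration.

```python
def mrr_v(ranked_validity):
    """MRR-V_i over a test set. ranked_validity maps each target to the
    rank-ordered list of booleans produced by a Tier-i validity check on
    its candidate routes; targets with no valid route contribute zero."""
    total = 0.0
    for flags in ranked_validity.values():
        rank = next((k + 1 for k, ok in enumerate(flags) if ok), None)
        total += 1.0 / rank if rank is not None else 0.0
    return total / len(ranked_validity)
```

A planner whose only valid route sits at rank 4 for one target and rank 1 for another scores (1/4 + 1)/2 = 0.625, rewarding systems that surface valid chemistry early in the candidate list.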

8.4 A Call for Independent and Standardized Benchmarking

The history of machine learning in other domains suggests that rigorous progress requires separating method development from benchmarking. In the navigability era, it was common for papers to introduce a new search algorithm and a new custom benchmark simultaneously. This practice of coupled method-and-metric development introduces structural confounding, where apparent algorithmic gains cannot be isolated from relaxed boundary conditions or favorable test set selection. To ensure that progress is transparent and reproducible, the field must move toward evaluation by independent, standardized protocols.

A significant practical barrier to this goal is the difficulty of conducting fair, head-to-head comparisons. A truly rigorous comparison of two algorithms requires retraining both on identical datasets, using the same reaction templates and stock definitions. However, the lack of publicly available training scripts and the high computational cost of retraining often make such comparisons infeasible. As a result, the literature contains few direct, controlled studies of planner performance, impeding the community’s ability to discern which algorithmic innovations are most impactful.

To address this challenge, we advocate for the dual-track evaluation model formalized by Morgunov and Batista [106], which distinguishes between two complementary evaluation goals:

  1. The Developer Track: This protocol is designed for the rigorous assessment of algorithmic novelty. It requires that method creators demonstrate the advantages of a new approach through fair, retrained comparisons against established baselines under fixed boundary conditions.

  2. The Chemist Track: This protocol addresses the needs of practical application by facilitating the evaluation of pre-trained, off-the-shelf models as-is, without the requirement of retraining. A practicing chemist is often less concerned with theoretical algorithmic superiority and more with which available tool provides the most reliable routes for a given target.

By distinguishing between the assessment of algorithmic novelty and practical utility, this dual-track framework allows for both rigorous validation and pragmatic, application-focused assessment. Adopting such a standard would create a clearer path for both foundational research and the development of tools that serve the daily needs of the chemistry community.

8.5 Enabling Progress Through Shared Data and Outputs

A primary obstacle to automating higher-tier validity checking (Solv-2 and Solv-3) is the scarcity of large-scale, annotated datasets of chemically plausible but invalid routes. The patent literature provides an abundance of positive examples but offers no explicit supervision on why alternative synthetic paths fail.

To bootstrap the development of the next generation of automated validity models, we propose that the community adopt a standard of open route reporting. If future retrosynthesis studies were to publish their full, raw generated route trees in a standardized, machine-readable format (e.g., JSON), it would create an invaluable community resource. This shared data would enable researchers to crowdsource the auditing process, progressively building the ground-truth datasets of both successful and failed proposals required to train robust automated Solv-2 verifiers. The existence of a public infrastructure for sharing such outputs, like SynthArena [106], demonstrates that this practice is technically feasible and would significantly accelerate the field’s transition into the Era of Validity.
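A minimal machine-readable route record might look like the sketch below. Every field name here is hypothetical rather than a proposed schema, but it illustrates two habits argued for above: recording gaps explicitly instead of omitting them, and shipping per-validator outputs alongside the route itself.

```python
import json

# Hypothetical record for one generated route (all field names illustrative).
route_record = {
    "target": "CC(=O)Oc1ccccc1C(=O)O",  # aspirin, as a stand-in target
    "rank": 1,
    "stock_terminated": True,
    "steps": [
        {
            "reaction_smiles": (
                "OC(=O)c1ccccc1O.CC(=O)OC(C)=O"
                ">>CC(=O)Oc1ccccc1C(=O)O"
            ),
            "atom_mapping": None,  # unknown: record the gap, don't hide it
            "conditions": {"solvent": None, "catalyst": "H2SO4"},
        }
    ],
    # None = unchecked, distinct from a failed check.
    "validator_outputs": {"solv1": True, "solv2": None},
}

print(json.dumps(route_record, indent=2))
```

Publishing full raw route trees in this form, including low-ranked and failed candidates, is what would let auditors assemble the negative examples that patent corpora never provide.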

9 Beyond Topology: Toward Generation of Reliable Synthetic Procedures

This review has focused on topological planning: the construction of a stock-terminated sequence of graph edits (Tier 1, Solv-1). However, laboratory success is governed by executability (Solv-3), which requires that each step $A \to B$ admits a workable experimental procedure: choice of reagents and catalysts, solvents, temperature and time profile, and a purification strategy that yields material suitable for subsequent steps. This layer is difficult to learn from patent-derived datasets, where conditions are often unreported, inconsistent, or implicit.

Recent efforts to bridge this gap have gone beyond simple condition regression toward agentic frameworks, a shift documented in recent comprehensive surveys [23, 87, 177] and general capability evaluations [178-180]. These frameworks range from retrieval-augmented generators of standard operating procedures grounded in equipment manuals and safety documents for compliant laboratory workflows [181] to robotic agents that translate natural-language instructions into executable hardware controls for autonomous experimentation [182-184]. Specialized LLMs such as ChemCrow [185], ChemActor [186], ReactXT [187], and ReactGPT [188] translate between reaction SMILES representations and natural-language descriptions of experimental protocols, using pretraining and in-context fine-tuning of base LLMs to improve yield prediction and condition optimization. Reported outcomes include 67% yields in novel Suzuki-Miyaura couplings via human-AI collaboration (Chemma [189]), 94.5% yields in optimizations and scale-ups (LLM-RDF [190]), product confirmation in autonomous runs [184], and 97% execution success in robotic tasks (CLAIRify [182]). Training strategies based on reinforcement learning from verifiable rewards, exemplified by the scientific reasoning model QFANG [191], push the boundary further by producing chemically consistent step-by-step synthetic workflows that sometimes even improve on verified literature protocols [191]. Collectively, these efforts show a shift from static text prediction toward tool-augmented and agentic LLM architectures that bridge high-level reaction plans and executable synthetic procedures.

Overall, demonstrated advantages include substantial acceleration of synthesis planning and documentation, along with clear evidence that fine-tuned and tool-augmented models can outperform prior methods on tasks like procedure prediction and condition recommendation. However, despite these capabilities, generalist models frequently suffer from two distinct failure modes: regression to the global mode (predicting average conditions for specific chemistry) and semantic hallucination (generating fluent but chemically incoherent procedures).

9.1 Overcoming Regression to the Mode: The Specialist Approach

To address the limitations of global averaging, Li et al. introduced multiple optimized specialists for AI-assisted chemical prediction (MOSAIC) [192]: a framework that abandons the single-model paradigm in favor of an ensemble of local experts (see Fig. 1 in [192]).

The central object of MOSAIC is the reaction universe: a learned metric space of reaction-specific fingerprints (RSFP). It is generated by a kernel metric network (KMN) from concatenated Morgan fingerprints of the reactants and products of chemical reactions in the training set, together with natural-language descriptions of the reaction procedures, conditions, and yields. Training ensures that similar reactions are represented by neighboring points in the metric space of the reaction universe. The reaction classes are then defined by partitioning the entire reaction space into $\sim$2,500 distinct Voronoi cells using the FAISS clustering algorithm [193]. Clustering is purely metric-driven and does not enforce any traditional reaction type labels, allowing the system to learn and exploit similarities among chemical transformations directly from RSFP space rather than inheriting the reaction classifications adopted in the literature. Therefore, each Voronoi cell is a cluster of transformations with empirically similar fingerprints, often spanning multiple closely related named reactions and reflecting similarity in conditions, reagents, and synthetic procedures. These cells are interpreted as domains of chemical knowledge and serve as training sets for fine-tuning $\sim$2,500 low-rank adaptation (LoRA [194]) models (dubbed “chemical experts” [192]) that capture region-specific statistics and precise reagent and condition profiles associated with the chemical subclass rather than regressing to a global average.

MOSAIC operates by routing queries to the most relevant expert model based on chemical similarity identified by the nearest Voronoi cell. The expert prediction yields a reproducible and human-readable experimental protocol that includes full details such as reagents, order of addition, reaction conditions, purification steps, and yield of the procedure. Furthermore, the distances from a query to the centroid of the nearest Voronoi cell or to the nearest training database example within the cell provide explicit confidence metrics, allowing the system to distinguish confident interpolations from out-of-distribution predictions. These metrics can also serve as scoring functions for assessing the novelty and feasibility of queried reactions: small distances indicate high confidence in protocol executability, as the query lies near the heart of a well-characterized domain or close to a well-characterized and validated reaction, while larger distances signal potential extrapolation into underrepresented or unexplored chemical territories. Potential applications of these metrics include prioritizing synthesis candidates in drug discovery or integrating with tree search algorithms in retrosynthetic computational pipelines.
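The routing-and-confidence idea can be illustrated with a toy nearest-centroid sketch. MOSAIC itself routes over learned RSFP embeddings with FAISS at roughly 2,500-cell scale; the plain-Python stand-in below (hypothetical names, Euclidean metric) captures only the core logic that the routing distance doubles as a confidence and novelty signal.

```python
import math

def route_query(query_fp, centroids, ood_threshold):
    """Toy stand-in for expert routing: pick the expert whose cluster
    centroid is nearest to the query fingerprint, and return the distance
    itself as a confidence signal (small = confident interpolation,
    large = possible out-of-distribution extrapolation)."""
    dists = [math.dist(query_fp, c) for c in centroids]
    expert = min(range(len(dists)), key=dists.__getitem__)
    return expert, dists[expert], dists[expert] > ood_threshold
```

The same distance could feed a retrosynthetic tree search as a feasibility score, as suggested above, deprioritizing disconnections that land far from any well-characterized region.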

In wet-lab validation on 37 de novo compounds spanning pharmaceuticals and agrochemicals, MOSAIC achieved a 71% success rate across 52 attempted transformations. Notably, the system successfully proposed executable protocols for challenging Buchwald-Hartwig aminations and olefin metathesis reactions that were structurally distinct from the training examples, suggesting that the learned metric space effectively clusters chemically related transformations. However, this performance relies on training and maintaining thousands of independent disjoint models, trading simplicity for the precision of extreme specialization.

9.2 Grounding Procedural Generation

Procedure generation demands more than predicting catalysts and solvents; it requires orchestrating a chemically coherent sequence of operations (addition, quench, workup, isolation) consistent with the intended transformation. Unconstrained language models frequently produce procedures that are stylistically plausible yet chemically invalid because the generated text is decoupled from the underlying structural change. Liu et al. [191] demonstrate that even advanced models like GPT-5 can misinterpret reaction intent---for example, misidentifying a benzylic oxidation as a benzoylation---resulting in fluent protocols that synthesize the wrong molecule. Similarly, models often suggest catalytic hydrogenation for intermediates containing reducible motifs that must be preserved (e.g., hydrogenolysis of C-N bonds).

QFANG [191] addresses these failure modes by explicitly grounding procedural generation in the atom-mapped graph edit, ensuring that generated experimental protocols remain chemically consistent with the underlying molecular transformations. The approach is based on the chemistry-guided reasoning (CGR) framework, a two-stage process designed to produce high-quality reasoning datasets at scale. In the first stage, a “factual scaffold” is extracted from the atom-mapped reaction SMILES, capturing the essential functional group changes, bond formations, and disconnections that define the core topological logic of the reaction. This scaffold acts as a set of graph transformation constraints, enforcing rules such as atom conservation, valence adherence, and stereochemical fidelity to prevent hallucinatory deviations. In the second stage, an LLM expands this set of constraints into a detailed procedural narrative, incorporating contextual elements such as reagent quantities and reaction conditions while remaining tethered to the ground-truth graph edits. By conditioning the text generation on these explicit graph-based constraints, the model aligns the procedural steps with the verifiable topology, demonstrating superior generalization to out-of-domain reactions and adaptability to user-specified parameters such as scale and temperature. This architecture suggests that, for generative chemistry, textual fluency is a liability unless it is strictly constrained by the syntax of the graph transformation; left unconstrained, models tend to produce plausible but chemically invalid outputs.

9.3 The Non-Markovian Nature of Purity

Finally, a persistent blind spot in both synthesis planning and procedure generation is the assumption of modularity. Current systems typically treat purification as a solved abstraction, assuming that each individual reaction step yields sufficiently pure material to serve as the input for the subsequent step without interference. In practice, however, organic synthesis is inherently non-Markovian: the outcome of a given step depends not only on its immediate precursors but also on the accumulated history of the route. Impurities, residual catalysts, and inseparable byproducts can propagate through multiple steps, leading to cascading failures. For instance, a copper-catalyzed cross-coupling reaction achieving a 90% isolated yield may appear successful in isolation, but it becomes functionally untenable if trace copper carryover poisons a downstream palladium-catalyzed cycle or interferes with a biological assay of the final product. Effects of such propagation underscore the need for models that explicitly account for intermolecular interactions and contaminant persistence across the entire synthetic sequence.

Leading platforms like ASKCOS [195] have begun to address this challenge by integrating explicit impurity prediction modules, which filter topologically valid but chemically undesirable steps during retrosynthetic analysis. These modules leverage data-driven approaches, such as template-free forward prediction models trained on large reaction databases, to anticipate impurities arising from five primary modes: minor products, side reactions, dimerizations, solvent adducts, and reactions involving subsets of reactants. By simulating these side processes, ASKCOS can identify and discard proposals that would introduce problematic contaminants, thereby enhancing the practical feasibility of suggested routes. However, these impurity checks predominantly operate as local constraints, applied step-wise without fully considering downstream consequences. True Solv-3 planning, which emphasizes executability under realistic laboratory conditions, including yield optimization, purification requirements, safety considerations, and scalability, requires a more holistic approach. This could involve introducing a route-level objective function for the synthetic sequence that penalizes not only individual step inefficiencies but also cumulative factors such as separation complexity and impurity propagation. Formally, this function might incorporate terms for predicted impurity profiles, chromatographic separability scores, and compatibility assessments across steps, shifting the optimization paradigm from mere step-wise yield maximization to route-wise material quality and overall process robustness.
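A route-level objective of the kind just described might be sketched as below. The three penalty terms (yield attrition, separation burden, and persistent contaminant carryover) mirror the factors discussed above, but the functional form, field names, and weights are illustrative assumptions, not a validated scoring function.

```python
def route_objective(steps, w_yield=1.0, w_sep=0.5, w_imp=2.0):
    """Hypothetical route-level cost (lower is better). Each step is a dict
    with 'yield' in (0, 1], 'separation' in [0, 1] (1 = hardest to purify),
    'carryover' in [0, 1] (fraction of upstream contaminant surviving this
    step's purification), and optional 'introduced' (new contaminant load)."""
    cumulative_yield = 1.0
    contaminant = 0.0
    separation_cost = 0.0
    for step in steps:
        cumulative_yield *= step["yield"]
        separation_cost += step["separation"]
        # contaminants introduced earlier persist, attenuated by purification
        contaminant = contaminant * step["carryover"] + step.get("introduced", 0.0)
    return (w_yield * (1.0 - cumulative_yield)
            + w_sep * separation_cost
            + w_imp * contaminant)
```

Because the contaminant term is a running state rather than a per-step property, the objective captures the non-Markovian character described above: a contaminant introduced early continues to tax the route until some downstream purification removes it.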

10 Outlook: Toward a Chemical Foundation Model

This review has charted the evolution of synthesis planning through a phase transition. The challenges of the Era of Navigability, which focused on finding any valid path through a combinatorial search space, have largely been solved by modern algorithms. The field now enters the Era of Validity, a period defined by the pursuit of chemical correctness. The central task of the coming decade will be to build systems that not only connect molecular graphs but do so with the causal logic and experimental reliability of a trained chemist.

Achieving robust performance on Solv-2 (Selectivity) and Solv-3 (Executability) metrics will require more than improved evaluation alone; it necessitates concerted efforts in infrastructure design, data generation, and the fundamental architecture of the models themselves. We conclude by outlining these key areas, which together form a path toward bridging the gap between topological search and physical reality.

10.1 Infrastructure as a Scientific Instrument

The advancement of synthesis planning is an inherently interdisciplinary endeavor, relying on distinct expertise from both organic chemistry and computer science. In the field’s formative years, progress was driven by framing retrosynthesis in terms that were most amenable to established computational techniques: specifically, as a search problem on an exceptionally large graph. This formulation was a necessary and productive abstraction, as it allowed for the direct application of powerful search algorithms and spurred rapid innovation in solving the topological connectivity challenge.

This early focus, however, also illustrates a crucial principle for the future of computational chemistry. The software we build is not a neutral tool; it is the scientific instrument through which we probe a problem. The design of that instrument profoundly influences the questions we ask and the answers we obtain. An instrument architected to optimize graph traversal will naturally orient the field toward measuring success with graph-based metrics. To ask the deeper chemical questions of selectivity and experimental feasibility, the instrument must be designed with those principles as its foundational logic. This requires a paradigm of co-design that moves beyond consultation to active architectural contribution. The most significant advances in this new era will likely be driven by a new generation of researchers fluent in both reaction mechanisms and performant software design. The success of integrated platforms like ASKCOS [195] provides a compelling demonstration that when this deep collaboration is achieved, the research focus naturally shifts from computational benchmarks to practical, experimental utility.

10.2 Physics-Based Supervision and Automated Experimentation

A prevailing narrative in chemical AI suggests that progress is fundamentally constrained by the scale of available experimental data, with automated laboratories often presented as the primary solution. While automated experimentation is an invaluable tool for generating ground-truth data, this perspective may underutilize the vast predictive power of established theoretical and computational chemistry. Decades of research have yielded robust physical models that can, in principle, predict the very outcomes — selectivity, stability, and reactivity — that are most critical for building valid planners.

Historically, however, these powerful theoretical tools have been deployed as artisanal, single-molecule calculations rather than as at-scale data generation engines. The critical bottleneck, therefore, may not be a deficit in scientific theory, but rather a deficit in the engineering required to transform these physical models into high-throughput supervision pipelines. Instead of relying solely on the sparse and biased data from patent literature, the field can generate its own high-fidelity labels.

10.3 Search-Augmented Generation

The tension between explicit graph search and direct sequence generation is likely a transient phase in the field’s development. A recurring observation in computationally intensive sciences is that general-purpose architectures capable of scaling with computation eventually outperform systems that rely on complex, hand-engineered heuristics [196]. This suggests a powerful, symbiotic path forward for synthesis planning.

In this paradigm, explicit search, guided by rigorous physical constraints (e.g., automated Solv-2 filters), acts as the “teacher.” It can explore the vast combinatorial space of synthesis to generate large, high-fidelity datasets of valid routes. High-capacity sequence models can then act as the “student,” distilling this complex physical and strategic logic into a fast, generalizable policy. This approach amortizes the immense computational expense of search into the inference step, combining the rigor of symbolic methods with the speed and pattern-recognition capabilities of deep learning.

10.4 Synthesis Planning as a Pre-training Objective

We return to the central thesis of this review: that synthesis planning is the chemical analogue of next-token prediction. Current chemical foundation models, largely trained on static graph masking or SMILES reconstruction, often fail to generalize to activity cliffs (Section 2). We hypothesize that this fragility arises because they learn the syntax of representation rather than the syntax of transformation.

Retrosynthesis is a uniquely demanding generative objective because it forces the model to internalize the structural grammar of transformation under explicit validity constraints, rather than to correlate static motifs with labels. This is also why synthesis planning is a plausible route to emergence. The same electronic-structure determinants that govern reactivity---functional-group electronics, polarization, steric accessibility, and conformational preferences---also shape many downstream properties by controlling how a molecule interacts with its environment.

The implication is not that property prediction disappears, but that it becomes a readout of a representation shaped by planning. If the community adopts the validity-centric framework proposed here---pairing this objective with auditable routes and rigorous Solv-2 metrics---synthesis planning offers a path to grounded chemical representation learning. Under this view, a chemical foundation model---or, more cautiously, a program toward artificial chemical intelligence---becomes a concrete research agenda rather than a branding term: pre-train on planning as the chemical analogue of next-token prediction, and evaluate progress by whether planning competence transfers to new chemistry and new functional questions.

10.5 What evidence would challenge this thesis?

The hypothesis of this review is that pre-training on the causal logic of synthesis planning will yield more robust and generalizable chemical representations than objectives based on static molecular structures. This thesis would be substantially weakened if empirical results demonstrated any of the following outcomes:

  • A failure of learned planning skills to generalize to broader chemical reasoning. The thesis would be falsified if a model becomes an expert at route-finding, yet its learned representations provide no significant advantage for unrelated chemical tasks, such as reasoning about structure-property relationships and discerning activity cliffs. This would demonstrate that learning the “syntax of matter” yields no more general chemical intelligence than the “syntax of notation”, ultimately failing to surpass the known performance plateaus of static pre-training. Synthesis planning would thus be revealed as a narrow, specialized skill, not a foundational one.

  • The data requirements for effective pre-training prove prohibitive. The thesis is predicated on the existence of a sufficiently large and diverse corpus of valid synthetic routes to learn from. It would be practically falsified if the performance of planning-based models saturates at a low level of competence due to the inherent limitations of available experimental data, and if the physics-based data generation (e.g., high-throughput QM) proves unable to bridge this gap at a reasonable computational cost.

  • The analogy to natural language processing proves to be flawed. The success of large language models may depend critically on post-training alignment techniques like reinforcement learning from human feedback [23], which are used to refine raw predictive models into useful assistants. This thesis would be significantly weakened if a similar massive-scale feedback loop from expert chemists proves to be the true bottleneck for achieving artificial chemical intelligence, and if that data-generating process cannot be scaled. In this scenario, the pre-training objective alone would be insufficient.

In summary, reproducible evidence that alternative pre-training objectives or validators consistently outperform synthesis-planning-centered approaches on synthesis and property-transfer tasks would motivate rethinking the centrality of synthesis planning in chemical foundation modeling.

11 Conclusions

Multistep synthesis planning has advanced rapidly in recent years, but the meaning of reported progress depends strongly on what is being measured. Across much of the recent literature, benchmark success has often reflected improvements in navigability: the ability to find a stock-terminated route through a large combinatorial search space. That progress is real and important. At the same time, this review has argued that navigability alone is an incomplete proxy for practical synthetic competence, because topologically valid routes may still fail at the level of selectivity, stereochemical fidelity, protecting-group logic, condition compatibility, or overall executability.

To clarify this distinction, we introduced the Hierarchy of Chemical Validity (Solv-NN), which separates syntactic validity (Solv-0), topological solvability (Solv-1), chemically plausible and selective routing (Solv-2), and executable route construction under realistic constraints (Solv-3). We view this hierarchy not as a finalized standard, but as a minimal scaffold for organizing evaluation and for making explicit which levels of validity a given model, benchmark, or claim actually addresses. In particular, the review has highlighted that many widely reported near-saturation results pertain primarily to Solv-1, and that these results can depend strongly on stock definition, evaluator design, and inventory scope.

Within the domain considered here, recent work suggests a shift in emphasis from route finding alone toward stronger notions of route validity. Search-based systems, direct sequence generators, and hybrid or neurosymbolic architectures all contribute differently to this transition, but none yet resolves the full Solv-2/3 problem in a standardized and experimentally grounded way. For this reason, we argue that future benchmarking should place greater weight on chemical plausibility, selectivity, and execution constraints, and should report these dimensions separately rather than collapsing them into a single notion of solvability.

A broader interpretive claim of this review is that synthesis planning may be a valuable organizing objective for learning chemistry-aware representations. More specifically, multistep retrosynthesis appears to be a strong candidate objective for tasks that depend on reactivity, synthetic accessibility, and route-level compositional reasoning. We have deliberately framed this as a hypothesis supported by converging evidence, rather than as a settled conclusion. Forward reaction modeling, condition and workflow prediction, multimodal structure–property learning, 3D or physics-informed objectives, and lab-in-the-loop systems each capture aspects of chemical intelligence that retrosynthesis alone does not. The most plausible path forward is therefore not a single universal objective, but a broader foundation-model stack in which retrosynthesis provides one important organizing prior among several complementary learning signals.

We emphasize that the conclusions of this review are grounded primarily in multistep small-molecule organic retrosynthesis, especially database-driven planning systems built on patent and reaction-corpus precedent. Whether the same framework transfers unchanged to catalysis, inorganic and organometallic synthesis, polymer and materials chemistry, electrochemistry, or peptide and biocatalytic synthesis remains unresolved. These domains differ substantially in representation, data quality, mechanistic structure, and criteria for experimental success, and will likely require domain-specific extensions of both the Solv-NN hierarchy and the modeling conclusions developed here.

Taken together, the literature supports a more validity-centric view of progress in contemporary retrosynthesis. If the field can move from topological route finding toward reproducible evaluation of chemical plausibility and executability, then synthesis planning may become not only a benchmark task, but also a useful organizing framework for more general chemical machine learning. The central challenge ahead is therefore not simply to find more routes, but to measure, and ultimately to learn, which routes are chemically credible, experimentally actionable, and robust under realistic constraints.

Cite this work

Anton Morgunov, Yu Shee, Alexander V. Soudackov, Victor S. Batista. “The Syntax of Matter: Synthesis Planning as the Foundation of Generative Chemistry.” ChemRxiv (2026). Preprint.