From Structure to Function: The Next Phase of Foundation Models in Drug Discovery

The application of foundation models to drug discovery represents a significant methodological shift in how therapeutic candidates are identified, designed, and optimized. Where traditional computational approaches relied on physics-based simulations or task-specific machine learning models trained on narrow datasets, foundation models offer a different paradigm: large-scale pretraining on biological data followed by adaptation to specific downstream tasks.

This memo examines the current state of foundation models in drug discovery, with particular attention to the technical characteristics that distinguish domain-specific models from general-purpose architectures, the bottlenecks that constrain their practical utility, and the directions we believe are most promising for future development. Our perspective is informed by conversations with researchers and practitioners at our recent AI4Healthcare Hackathon, as well as ongoing engagement with the technical literature.

We argue that while substantial progress has been made—particularly in structure prediction—the field is now entering a more challenging phase where the problems of binding affinity, conformational dynamics, and experimental validation become central. The models and approaches that succeed in this next phase will likely look quite different from those that dominated the previous one.

Motivation: Why Specialized Models for Drug Discovery

General-purpose language models have demonstrated remarkable capabilities across a wide range of tasks, raising a natural question: why develop specialized foundation models for drug discovery rather than adapting existing large language models?

The answer lies in the particular characteristics of molecular and biological data. Unlike natural language, where the relationship between tokens is primarily semantic and statistical, biological sequences encode physical and chemical constraints that govern molecular behavior. A protein sequence determines a three-dimensional structure; that structure determines binding interfaces; those interfaces determine function. These relationships are governed by physics, not convention, and models that fail to respect them produce outputs that are chemically implausible or biologically inert.

Furthermore, the data landscape in drug discovery differs substantially from that of natural language processing. Medical and biological data is fragmented across institutions, often privacy-sensitive, and subject to regulatory constraints that limit sharing and aggregation. Clinical data in particular is sparse relative to the complexity of human disease. Models trained primarily on web text lack exposure to the specialized vocabularies, experimental conventions, and domain knowledge encoded in scientific literature, laboratory notebooks, and clinical records.

Perhaps most importantly, the cost of errors in drug discovery is asymmetric in ways that differ from typical language model applications. A hallucinated molecule that appears promising computationally but fails experimentally represents not just a wrong answer but potentially years of misdirected effort and resources. This creates strong incentives for models that can quantify uncertainty, provide interpretable rationales, and integrate with experimental validation workflows—capabilities that require architectural and training choices tailored to the domain.

Historical Development and Current Landscape

Structure Prediction as Foundational Capability

The debut of AlphaFold2 at the CASP14 assessment in 2020 marked a watershed moment for computational biology, demonstrating that deep learning could predict protein structures with accuracy approaching experimental methods. The subsequent development of AlphaFold3 extended these capabilities to protein-protein, protein-nucleic acid, and protein-ligand complexes, swapping the original Evoformer-plus-structure-module design for a revised trunk and a diffusion-based generative module, a significant architectural departure.

The significance of this work—recognized by the 2024 Nobel Prize in Chemistry—extends beyond the immediate practical applications. AlphaFold demonstrated that transformer-based architectures, when trained on appropriate biological data with suitable inductive biases, could learn representations that captured deep structural regularities in protein space. This suggested that similar approaches might prove effective for other aspects of molecular biology.

However, structure prediction, while necessary for many drug discovery applications, is not sufficient. Knowing that a molecule can adopt a particular conformation does not tell us how tightly it will bind to a target, whether that binding will produce a therapeutic effect, or whether the molecule can be synthesized, formulated, and administered safely. These downstream questions—which ultimately determine whether a computational prediction translates to a successful drug—require capabilities beyond structure prediction alone.

From Prediction to Generation

A parallel line of development has focused on generative models capable of designing novel molecules rather than merely predicting properties of existing ones. The RFdiffusion family of models, developed at the Baker Lab, adapted diffusion techniques—originally developed for image generation—to the problem of protein design. By learning to denoise random coordinates into coherent protein backbones that satisfy specified constraints, these models enabled the design of novel binders, enzymes, and symmetric assemblies.

The recent RFdiffusion3 represents a significant advance in this lineage, operating at all-atom resolution rather than residue-level approximation. This enables joint generation of protein structures and their interactions with ligands, DNA, and other non-protein molecules within a unified framework. The architectural choice to treat all atoms equivalently—regardless of whether they belong to the protein or its binding partner—reflects a broader trend toward models that capture the full complexity of biomolecular systems rather than treating non-protein components as fixed constraints.

Similarly, Chai Discovery's recent work on de novo antibody design demonstrates that generative models can achieve meaningful success rates on therapeutically relevant design tasks. Their approach generates novel antibody sequences conditioned only on target antigen and epitope information, without requiring starting templates or existing binders. While validation remains ongoing, the reported improvements over previous computational methods suggest that generative approaches may be approaching practical utility for certain design applications.

Affinity Prediction as Emerging Frontier

Perhaps the most consequential recent development is the emergence of models that jointly predict structure and binding affinity. Boltz-2, developed collaboratively by MIT and Recursion, represents the clearest example of this trend. Unlike previous approaches that treated structure prediction and affinity estimation as separate problems requiring different methodologies, Boltz-2 integrates both within a single foundation model trained on structural data, molecular dynamics simulations, and experimental binding measurements.

The significance of this integration is practical as much as theoretical. Traditional physics-based methods for affinity prediction—particularly free energy perturbation (FEP) calculations—are computationally expensive, limiting their application to small numbers of candidates in late-stage optimization. A model that can estimate affinity with comparable accuracy at dramatically lower computational cost enables qualitatively different workflows: screening larger libraries earlier in the discovery process, exploring broader regions of chemical space, and iterating more rapidly on design hypotheses.
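
To make the workflow shift concrete, the sketch below triages a ligand library with a fast learned affinity estimate before committing FEP compute to a shortlist. `predict_affinity` and `run_fep` are hypothetical placeholders for a Boltz-2-style predictor and a physics-based pipeline, not actual APIs of either.

```python
# Illustrative triage workflow: rank a large library with a cheap learned affinity
# estimate, then reserve expensive FEP calculations for the top slice.
from typing import Callable, Iterable, List, Tuple

def triage_library(
    smiles_library: Iterable[str],
    predict_affinity: Callable[[str], float],   # fast learned estimate (lower = tighter binding)
    run_fep: Callable[[str], float],            # slow physics-based estimate
    shortlist_size: int = 50,
) -> List[Tuple[str, float]]:
    # Cheap pass: score every candidate with the learned model.
    scored = sorted(
        ((smi, predict_affinity(smi)) for smi in smiles_library),
        key=lambda pair: pair[1],               # most favorable predictions first
    )
    # Expensive pass: refine only the shortlist with FEP.
    return [(smi, run_fep(smi)) for smi, _ in scored[:shortlist_size]]
```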

The decision to release Boltz-2 under an open-source license, with full access to code, weights, and training pipeline, has implications for how the field develops. Open availability lowers barriers to adoption and enables independent validation, but also means that any competitive advantage from the model itself is short-lived. This may shift value creation toward proprietary data, integration with experimental capabilities, or downstream application rather than model development per se.

Technical Characteristics of Domain-Specific Models

The foundation models that have proven most effective for drug discovery share several technical characteristics that distinguish them from general-purpose architectures. Understanding these characteristics illuminates both why specialized models are necessary and what design principles are likely to guide future development.

Physically-Informed Architecture Design

Molecular systems obey physical symmetries that constrain their behavior in ways that natural language does not. Rotating a molecule in space does not change its properties; permuting equivalent atoms does not change its identity. Effective models for molecular data respect these symmetries, either by encoding them directly in the architecture through equivariant neural networks that preserve geometric relationships under transformation, or by learning them approximately through augmentation of the training data.

AlphaFold2's structure module, for example, used invariant point attention to operate on coordinates in a way that respects SE(3) symmetry—the group of rotations and translations in three-dimensional space—whereas AlphaFold3's diffusion module drops explicit equivariance and instead learns the symmetry approximately by training on randomly rotated and translated structures. RFdiffusion3, for its part, employs sparse attention mechanisms designed to capture local atomic geometry while maintaining global coherence across the molecular system. These design choices are not merely implementation details; they encode domain knowledge about the structure of the problem space in ways that improve sample efficiency and generalization.
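
As a concrete illustration of the symmetry at stake, the toy check below verifies that pairwise-distance features of a point cloud are unchanged by an arbitrary rotation and translation; it is a didactic sketch, not a reimplementation of any of the architectures above.

```python
# Minimal illustration of SE(3) invariance: pairwise distances of a point cloud
# do not change under rotation and translation.
import numpy as np
from scipy.spatial.transform import Rotation

def pairwise_distances(coords: np.ndarray) -> np.ndarray:
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
atoms = rng.normal(size=(10, 3))        # toy "molecule": 10 atoms in 3D

R = Rotation.random().as_matrix()       # random rotation
t = rng.normal(size=3)                  # random translation
transformed = atoms @ R.T + t

# Invariant features agree to numerical precision under the SE(3) transform.
assert np.allclose(pairwise_distances(atoms), pairwise_distances(transformed))
```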

More recent work has extended this principle beyond symmetry to incorporate other physical constraints. Boltz-2 introduces "Boltz-steering" to improve physical plausibility of generated structures, while also incorporating conditioning mechanisms that allow users to guide predictions using experimental templates or known molecular contacts. These features reflect a broader recognition that purely data-driven approaches benefit from integration with domain knowledge, even when that knowledge enters softly, as an inductive bias or inference-time guidance, rather than as a hard constraint.

Training on Domain-Specific Data Distributions

The pretraining data for biological foundation models differs fundamentally from the web-scraped corpora used for language models. Protein language models like ESM-2 and ESM-3 are trained on evolutionary sequence data—millions of protein sequences from across the tree of life that encode billions of years of natural selection. This training data is not merely large but structured: evolutionarily related sequences share functional and structural properties, and models that learn to predict masked residues implicitly learn something about the constraints that govern protein function.
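
The masked-residue objective itself is easy to exercise directly. The hedged sketch below uses a small public ESM-2 checkpoint through Hugging Face transformers (assuming the facebook/esm2_t6_8M_UR50D weights can be downloaded) to recover a hidden residue from its sequence context.

```python
# Sketch of the masked-residue objective used by protein language models,
# exercised with a small public ESM-2 checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# Mask one residue of a toy sequence and ask the model to fill it in.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
tokens = tokenizer(sequence, return_tensors="pt")
position = 10                                   # tenth residue (position 0 is a special token)
tokens["input_ids"][0, position] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**tokens).logits

probs = torch.softmax(logits[0, position], dim=-1)
top = torch.topk(probs, k=5)
for token_id, p in zip(top.indices.tolist(), top.values.tolist()):
    print(tokenizer.convert_ids_to_tokens(token_id), round(p, 3))
```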

The EDEN family of models from Basecamp Research extends this principle by training on metagenomic data specifically enriched for diversity. Their dataset—collected from over 150 locations across 28 countries—is intentionally biased toward environmental and host-associated metagenomes, phage sequences, and mobile genetic elements that are underrepresented in public databases. The hypothesis underlying this approach is that evolutionary diversity encodes design principles that transfer to therapeutic applications: mechanisms for DNA insertion, antimicrobial activity, and other functions that have been refined by selection across diverse ecological contexts.

This raises a question that will likely become increasingly important as the field matures: to what extent do proprietary biological datasets constitute durable competitive advantages? If model architectures converge and training procedures become standardized, the primary differentiator may be access to data that captures biological variation absent from public resources. Basecamp's thesis—that the majority of public sequence data derives from a small number of well-studied organisms—suggests substantial room for improvement through more comprehensive sampling of biological diversity.

Multimodal Integration

Biological systems are inherently multimodal: understanding a disease requires integrating genomic sequences, protein structures, imaging data, clinical records, and experimental measurements. Foundation models that can jointly process multiple data modalities have potential advantages over specialized models that address each modality in isolation.

Several recent models demonstrate this capability. Medical vision-language models combine radiological images with clinical text to enable more comprehensive diagnostic reasoning. EndoDINO processes high-dimensional endoscopy video to predict disease progression in inflammatory conditions. The xHAIM framework generates interpretable summaries from multimodal patient data, demonstrating that integration can improve both predictive performance and clinical utility.

The challenge of multimodal integration extends beyond model architecture to data availability and alignment. Different modalities are often collected by different institutions under different protocols, with limited overlap between patients who have comprehensive data across all relevant types. Foundation models that can learn useful representations from incomplete or partially-aligned multimodal data—rather than requiring complete feature sets for all training examples—may have significant practical advantages.
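
One way to tolerate missing modalities is to fuse only whatever is present. The toy PyTorch sketch below averages the embeddings of available modalities and simply skips absent ones; it illustrates the principle rather than reproducing any model cited above.

```python
# Toy multimodal fusion that tolerates missing modalities: embed what is present,
# skip what is not, and pool over the available embeddings.
import torch
import torch.nn as nn
from typing import Dict, Optional

class MaskedFusion(nn.Module):
    def __init__(self, dims: Dict[str, int], hidden: int = 128):
        super().__init__()
        # One small encoder per modality, projecting into a shared space.
        self.encoders = nn.ModuleDict({name: nn.Linear(d, hidden) for name, d in dims.items()})
        self.head = nn.Linear(hidden, 1)

    def forward(self, inputs: Dict[str, Optional[torch.Tensor]]) -> torch.Tensor:
        # Embed only the modalities that are actually present for this batch.
        embeddings = [self.encoders[name](x) for name, x in inputs.items() if x is not None]
        fused = torch.stack(embeddings, dim=0).mean(dim=0)  # average over present modalities
        return self.head(fused)

model = MaskedFusion({"genomics": 64, "imaging": 256, "labs": 16})
batch = {"genomics": torch.randn(4, 64), "imaging": None, "labs": torch.randn(4, 16)}
print(model(batch).shape)   # torch.Size([4, 1]) despite the missing imaging modality
```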

Efficiency and Deployability

A notable feature of several successful biological foundation models is their relatively modest parameter counts compared to frontier language models. RFdiffusion3 operates with approximately 168 million parameters—substantially smaller than its predecessors while achieving improved performance. Models like Reason2Decide achieve competitive results on clinical triage tasks with architectures dramatically smaller than general-purpose alternatives.

This efficiency likely reflects the structured nature of biological data: physical constraints reduce the effective dimensionality of the problem space, and domain-specific architectures can exploit this structure in ways that generic architectures cannot. From a practical standpoint, smaller models enable deployment in settings with limited computational resources—academic laboratories, hospital systems, and resource-constrained research environments where access to large-scale GPU clusters cannot be assumed.

Current Bottlenecks and Limitations

Despite substantial progress, significant limitations constrain the practical utility of current foundation models for drug discovery. Understanding these bottlenecks is essential for identifying productive directions for future research and for setting appropriate expectations about near-term capabilities.

The Experimental Validation Gap

The most fundamental limitation is the gap between computational prediction and experimental reality. A molecule that scores well according to a model's objective function must still be synthesized, purified, and tested in biological systems before its therapeutic potential can be assessed. This experimental validation step remains expensive, time-consuming, and subject to failure modes that computational models do not capture.

The implications extend beyond simple validation. The properties that make a molecule computationally attractive—high predicted binding affinity, favorable docking scores—do not necessarily align with the properties that make it a viable drug. Solubility, metabolic stability, membrane permeability, toxicity, and manufacturability all matter for therapeutic development but are imperfectly captured by current models. A computational approach that optimizes only for binding may produce molecules that are potent but undruggable.
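
A minimal way to encode this tension is a composite objective that trades predicted potency against developability proxies. The sketch below combines a hypothetical ML affinity estimate with RDKit's QED and logP descriptors; the weights and descriptor choices are illustrative, not a validated scoring function.

```python
# Sketch of a multi-objective score that balances predicted potency against simple
# developability proxies. QED and logP stand in for the much richer ADMET panel a
# real program would require; `predicted_affinity` is a hypothetical ML estimate
# in kcal/mol (more negative = tighter binding).
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def composite_score(smiles: str, predicted_affinity: float) -> float:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return float("-inf")                       # unparseable SMILES: reject outright
    potency = -predicted_affinity                  # reward tighter predicted binding
    drug_likeness = QED.qed(mol)                   # 0..1 composite of drug-likeness properties
    logp_penalty = max(0.0, Descriptors.MolLogP(mol) - 5.0)    # penalize very lipophilic molecules
    return potency + 5.0 * drug_likeness - 2.0 * logp_penalty  # weights are illustrative

print(composite_score("CC(=O)Oc1ccccc1C(=O)O", predicted_affinity=-7.2))  # aspirin as a toy input
```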

Addressing this gap requires tighter integration between computational and experimental workflows. The merger between Recursion and Exscientia reflects recognition that value creation in AI-driven drug discovery depends on closing the loop between prediction and validation. Similarly, Chai Discovery's explicit focus on "developability"—optimizing for properties required to turn computational designs into actual drugs—represents an attempt to incorporate practical constraints earlier in the design process. The degree to which foundation models can learn to predict experimentally-relevant properties, rather than merely computationally-convenient ones, will substantially determine their practical impact.

Conformational Dynamics and Flexibility

Current structure prediction models excel at predicting static conformations but struggle with proteins that undergo significant conformational changes upon binding or that exist in equilibrium between multiple functional states. This limitation is particularly consequential for G protein-coupled receptors (GPCRs), one of the most important classes of drug targets, which signal by transitioning between active and inactive conformations.

The challenge is partly a matter of data and partly one of architecture. Training data is biased toward crystallizable conformations—states stable enough to form the ordered lattices required for X-ray crystallography. Flexible regions, intrinsically disordered domains, and transient binding states are underrepresented in structural databases and correspondingly difficult for models trained on this data to capture. Architecturally, diffusion models that generate single structures must be extended or adapted to represent conformational ensembles and dynamic transitions.

Recent work has begun to address this limitation. AlphaFlow and BioEmu demonstrate that generative models can learn to sample from conformational distributions rather than predicting single structures. Boltz-2 incorporates predictions of B-factors—crystallographic measures of atomic flexibility—as a proxy for dynamic behavior. However, accurate prediction of conformational dynamics remains an open problem, and many therapeutically relevant targets remain challenging for current approaches.
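
The connection between conformational ensembles and flexibility measures is straightforward to illustrate. The sketch below computes per-atom root-mean-square fluctuations from a synthetic ensemble and converts them to B-factor-like values via the standard B = (8π²/3)⟨u²⟩ relation; real inputs would come from MD or an ensemble-generating model.

```python
# Summarize an ensemble of conformations as a per-atom flexibility profile (RMSF),
# a quantity closely related to crystallographic B-factors. The ensemble here is
# synthetic, with the last 10 atoms made deliberately "floppier" than the rest.
import numpy as np

rng = np.random.default_rng(1)
n_conformers, n_atoms = 100, 50
base = rng.normal(size=(n_atoms, 3))
noise_scale = np.concatenate([np.full(40, 0.2), np.full(10, 1.5)])[:, None]
ensemble = base + rng.normal(size=(n_conformers, n_atoms, 3)) * noise_scale

mean_coords = ensemble.mean(axis=0)                              # average structure
rmsf = np.sqrt(((ensemble - mean_coords) ** 2).sum(-1).mean(0))  # per-atom fluctuation
b_factor_like = (8.0 * np.pi**2 / 3.0) * rmsf**2                 # B = 8*pi^2/3 * <u^2>

print(rmsf[:5].round(2), rmsf[-5:].round(2))                     # rigid core vs. floppy tail
print(b_factor_like[:3].round(1), b_factor_like[-3:].round(1))
```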

Data Availability and Distribution

Foundation models are data-hungry, and the availability of appropriate training data varies substantially across different aspects of drug discovery. Protein sequences are abundant—hundreds of millions are available in public databases—but binding affinity measurements, clinical outcomes, and experimental validation data are far more limited. This creates an imbalance where models may be well-calibrated for tasks with abundant data but unreliable for tasks where data is scarce.

The distribution of available data also introduces biases that may limit generalization. Well-studied protein families, common disease targets, and successful drug classes are overrepresented in training sets. Models may perform well on targets similar to those in the training distribution while failing on novel target classes or underexplored regions of chemical space. Boltz-2's developers explicitly note that "strong performance on public benchmarks does not always immediately translate to all complexities of real-world drug discovery"—a caveat that applies broadly across the field.

Addressing data limitations requires both expanding data collection and developing methods that generalize from limited examples. Basecamp Research's global biological sampling program represents one approach to the former; transfer learning, few-shot adaptation, and physics-informed architectures represent approaches to the latter. The relative importance of these strategies—more data versus better algorithms—remains an open empirical question.

Regulatory and Clinical Translation

The path from computational prediction to approved therapeutic passes through regulatory frameworks that were not designed with AI-generated candidates in mind. Questions of model validation, uncertainty quantification, and interpretability take on particular importance when regulators must assess the safety and efficacy of AI-designed molecules. The FDA's recent draft guidance on AI in drug development represents a first step toward addressing these questions, but substantial uncertainty remains.

Early clinical results from AI-designed therapeutics are encouraging but limited. Insilico Medicine's fibrosis candidate has advanced through Phase II with positive results; Schrödinger's physics-enabled design approach has produced a molecule now in Phase III trials. These examples demonstrate that AI-designed molecules can succeed in clinical development, but the sample size remains small and the extent to which AI contributions accelerated or improved outcomes versus traditional approaches is difficult to assess. More clinical data, across more targets and therapeutic modalities, will be needed to establish the practical value of foundation models for drug discovery.

Directions for Future Development

The preceding analysis suggests several directions that we believe will be important for the continued development of foundation models in drug discovery.

Integration of Prediction and Generation

The separation between models that predict properties and models that generate candidates is increasingly artificial. Boltz-2's joint prediction of structure and affinity points toward architectures that can simultaneously generate novel molecules and estimate their properties, enabling more efficient optimization and reducing the iteration required between generation and evaluation steps. Coupling generative models with accurate property predictors—whether within a single architecture or through tight integration of separate models—represents a natural direction for development.

EDEN's demonstration that a single foundation model can design therapeutics across multiple modalities—gene therapy constructs, antimicrobial peptides, engineered microbiomes—suggests that the boundaries between different therapeutic approaches may be more permeable than previously assumed. Models that learn general principles of biological design, rather than specializing in particular molecular types, may prove more flexible and broadly applicable.

Closing the Experimental Loop

The wet lab bottleneck will not be solved by better computational models alone. Progress requires integration of foundation models with automated experimental platforms that can rapidly synthesize, test, and characterize computational predictions. Self-driving laboratories, high-throughput screening systems, and automated synthesis platforms are maturing in parallel with computational methods; their integration represents a substantial opportunity.

Models that incorporate experimental feedback—learning from the successes and failures of their own predictions—may achieve performance that pure in silico approaches cannot match. The DrugReflector system's demonstrated improvement from incorporating lab feedback illustrates this principle. More generally, the design-make-test-learn cycle that characterizes effective drug discovery must be accelerated at every step, not merely the computational one.
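
A schematic version of such a loop is sketched below: a surrogate model proposes candidates, hypothetical assay results come back, and the surrogate is refit on the accumulated measurements. The function names are placeholders, not any published system's API.

```python
# Schematic design-make-test-learn loop driven by experimental feedback.
from typing import Callable, Dict, Sequence

def dmtl_loop(
    generate_candidates: Callable[[object], Sequence[str]],   # propose designs from the current model
    run_assay: Callable[[Sequence[str]], Dict[str, float]],   # wet-lab measurements (the slow step)
    fit_surrogate: Callable[[Dict[str, float]], object],      # retrain on all measurements so far
    n_rounds: int = 5,
) -> Dict[str, float]:
    measurements: Dict[str, float] = {}
    model = fit_surrogate(measurements)            # start from a prior or pretrained model
    for _ in range(n_rounds):
        batch = generate_candidates(model)         # design
        measurements.update(run_assay(batch))      # make and test: experimental feedback
        model = fit_surrogate(measurements)        # learn from successes and failures alike
    return measurements
```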

Expanding Biological Coverage

Current models are trained primarily on well-characterized proteins, common disease targets, and established drug modalities. Expanding coverage to underexplored regions of biological space—orphan targets, rare diseases, novel therapeutic mechanisms—requires both expanded data collection and methods that can generalize from limited examples.

The EDEN approach of training on evolutionarily diverse metagenomic data represents one strategy for expanding biological coverage. Transfer learning from data-rich domains to data-poor applications represents another. As foundation models mature, their ability to support drug discovery for conditions beyond the well-trodden paths of oncology and common chronic diseases will be an important measure of their broader impact.

Uncertainty Quantification and Interpretability

For foundation models to be trusted in high-stakes drug discovery decisions, they must provide calibrated uncertainty estimates and interpretable rationales for their predictions. Current models often produce confident predictions without indicating when those predictions are unreliable. Developing methods for uncertainty quantification that scale to foundation model architectures, and for generating explanations that are meaningful to domain experts, represents an important technical challenge. The Reason2Decide framework's approach of jointly training predictions with rationales illustrates one direction; integrating attention mechanisms, SHAP values, and other interpretability techniques with biological domain knowledge illustrates others.
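
One widely used recipe, sketched below under the assumption that simple bootstrapped regressors stand in for foundation-model predictors, treats disagreement across an ensemble as the uncertainty signal that decides which predictions go to the lab first.

```python
# Ensemble-based uncertainty sketch: train several regressors on bootstrap resamples
# and use their disagreement to flag predictions that warrant experimental follow-up.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 16))                                    # toy molecular descriptors
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)     # toy affinity labels

models = []
for seed in range(5):
    idx = rng.integers(0, len(X), size=len(X))    # bootstrap resample
    models.append(GradientBoostingRegressor(random_state=seed).fit(X[idx], y[idx]))

X_new = rng.normal(size=(10, 16))
preds = np.stack([m.predict(X_new) for m in models])
mean, std = preds.mean(axis=0), preds.std(axis=0)
needs_validation = std > np.quantile(std, 0.8)    # send the least certain 20% to the lab first
print(np.round(mean, 2), needs_validation)
```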

The Short

Foundation models for drug discovery have progressed rapidly from proof-of-concept demonstrations to tools with genuine practical utility. Structure prediction is now routine; affinity prediction is improving; generative design is producing candidates that succeed in experimental validation. The field has moved beyond asking whether AI can contribute to drug discovery to asking how that contribution can be maximized.

Yet substantial challenges remain. The gap between computational prediction and experimental reality persists; conformational dynamics are imperfectly captured; data limitations constrain generalization; and regulatory frameworks are still adapting to AI-designed therapeutics. The models and approaches that address these challenges will likely differ from those that achieved the initial breakthroughs in structure prediction.

We see this as a transition from a phase dominated by architectural innovation—where the primary question was whether deep learning could work for biological problems—to a phase where integration, data, and experimental validation become central. The researchers and organizations that navigate this transition successfully will be those that combine computational sophistication with deep engagement with biological reality, that build capabilities across the full discovery pipeline rather than optimizing isolated steps, and that maintain appropriate humility about the limitations of current approaches while pushing to overcome them.

The ultimate test of foundation models for drug discovery will be clinical outcomes: do AI-designed therapeutics reach patients faster, work better, or address diseases that would otherwise remain untreated? That test is now beginning in earnest, and its results will determine the lasting significance of the technical advances we have surveyed here.