#1: "The Most Underrated Unlock for Working Scientists Isn't a New Model — It's a Coding Agent"
Nobody came to this dinner to talk about coding agents. The conversation went there anyway.
The emerging consensus: the most immediate and underrated unlock for AI in science right now is coding agents, not larger models or better architectures. The reason is structural. Scientific work — prediction models, survival analysis, omics pipelines, multi-parameter optimization — ultimately reduces to code. Deep learning automated the optimization of models. Coding agents automate the construction and iteration of the code that runs them. The loop closes.
In clinical research, the gap is measurable: pipelines that used to take two years to build and validate now take weeks. In materials science and physical modeling, agents run autonomous optimization loops — improving one target variable without degrading another, surfacing tradeoffs, proposing the next experiment. The iteration speed is the unlock, not the model itself.
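One way to make "improving one target variable without degrading another" concrete is a Pareto filter: keep only the candidates that no other candidate matches or beats on both targets at once. The sketch below is illustrative and does not describe any particular agent. The function name `pareto_front` and the two-number candidate tuples are hypothetical, and it assumes higher is better on both targets.

```python
from typing import List, Tuple

Candidate = Tuple[float, float]  # (target_a, target_b), higher is better on both


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that are not dominated by any other candidate.

    A candidate is dominated if some other candidate is at least as good on
    both targets and strictly better on at least one of them.
    """
    front = []
    for a in candidates:
        dominated = any(
            b[0] >= a[0] and b[1] >= a[1] and (b[0] > a[0] or b[1] > a[1])
            for b in candidates
        )
        if not dominated:
            front.append(a)
    return front


if __name__ == "__main__":
    # Made-up measurements for five hypothetical experiments.
    trials = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)]
    print(pareto_front(trials))
```

The surviving candidates form the tradeoff curve an optimization loop would surface before proposing the next experiment.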
The caveat everyone raised, without disagreement: healthcare runs into HIPAA almost immediately. You can't simply pipe identifiable patient data to a cloud-based agent. Workarounds exist (de-identification, synthetic records, local inference), but they add friction that slows down precisely the iteration cycle the agents are supposed to accelerate. The vision of an AI research scientist working freely on clinical data remains constrained by a regulatory layer designed for a different era.
#2: "Clinical and Multi-Modal Biology Is Still Pre-ChatGPT"
Protein modeling has already had its pre-training moment. AlphaFold2 transformed structure prediction. ESM-2 showed that large-scale training on protein sequences produces representations rich enough to predict structure and function from sequence alone. For proteins, scale and self-supervised learning worked. That ground has largely been won.
Clinical and multi-modal biology are different.
Electronic health records, single-cell data, multi-omics, perturbation screens, imaging, and longitudinal patient trajectories do not come with one clean self-supervised objective. There is no obvious universal task that forces a model to learn biology across all these modalities the way next-token prediction forces a language model to learn language.
That is why the LLM analogy matters. Before ChatGPT, the dominant assumption was that labeled data was the bottleneck: label more, fine-tune more, align better. Then the paradigm flipped. Pre-training at scale learned the latent structure, and labels became the steering layer. Most of the core knowledge came from pre-training. The labels just let you steer it.
Clinical biology may still be before that flip. Expert-annotated Q&A datasets, molecular RLHF pipelines, and carefully curated clinical benchmarks are being built now, and they have real value. But the concern is that they are being treated as the primary investment rather than the finishing layer on top of a foundation that doesn't yet exist. The self-supervised objective that captures biological meaning across sequence, structure, cellular state, and clinical outcome hasn't been established convincingly.
Until it is, labeled data can polish model behavior. It is unlikely to change what the model fundamentally knows.
In other words: we may be labeling the internet before pre-training on it.
The counterpoint deserves honest weight. The data being built now can be banked. When the right foundation model arrives, these annotations will matter enormously. The real disagreement isn't whether to build labeled datasets — it's about sequencing and sizing. Are these investments being scaled correctly relative to the pre-training problem that still needs to be solved? That question doesn't have a clean answer yet, and anyone claiming otherwise is ahead of the evidence.
The central bottleneck remains: clinical and multi-modal biology has not had its ChatGPT moment.
#3: "More DNA Than the Entire Internet — And Context Is the Bottleneck"
The data sparsity problem in biology is genuinely strange when examined carefully. By raw sequence volume, there is more DNA publicly available in a single database than all text on the internet. Biology is not data-scarce by volume. And yet.
The problem is context. A snippet of DNA from a specific patient, in a specific cell type, in a specific disease state, at a specific moment in that person's life — that is one data point. Generalize to a different cell type, disease, or individual and the rules change. The model has seen enormous amounts of data, but enormous amounts of non-overlapping, one-off data. Every sample is a special case. And special cases don't generalize.
The single-cell layer runs deeper still. Almost all virtual cell data today is purely transcriptomic, and single-cell transcriptomics is already riddled with embedded assumptions. Standard short-read assays detect families of isoforms, not specific ones, so you often don't know which isoform is actually present. Long-read single-cell methods are beginning to address this, but they remain less accessible at scale. The zeros that fill most single-cell matrices are also ambiguous: true absence, or a missed observation? Many pipelines fill those zeros by imputation, that is, by inference from other cells and genes. This is a well-known problem and an active research area in the scRNA-seq community, but it hasn't been resolved. The risk: each downstream census model learns, in part, from the systematic biases of every prior experiment that fed into it, and we may be training models to replicate our instruments' limitations rather than to understand biology.
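The ambiguity of zeros can be illustrated with a toy simulation. This is a minimal sketch under loudly stated assumptions, not a model of any real assay: the fraction of silent genes, the range of true molecule counts, and the per-molecule capture probability are all made-up parameters, and `simulate_zeros` is a hypothetical helper. The point is only that an observed zero can come from two different causes that the count matrix alone does not distinguish.

```python
import random
from typing import Tuple


def simulate_zeros(
    n_genes: int = 10_000,
    frac_silent: float = 0.3,   # assumed share of genes with no transcripts
    capture_prob: float = 0.1,  # assumed chance each molecule is detected
    seed: int = 0,
) -> Tuple[int, int]:
    """Count observed zeros by cause: (true absence, missed observation)."""
    rng = random.Random(seed)
    true_absent = 0
    missed = 0
    for _ in range(n_genes):
        # Hypothetical ground truth: silent gene, or 1 to 20 molecules present.
        true_count = 0 if rng.random() < frac_silent else rng.randint(1, 20)
        # Each molecule is captured independently with probability capture_prob.
        observed = sum(rng.random() < capture_prob for _ in range(true_count))
        if observed == 0:
            if true_count == 0:
                true_absent += 1
            else:
                missed += 1
    return true_absent, missed


if __name__ == "__main__":
    absent, dropped = simulate_zeros()
    print(f"zeros from true absence:       {absent}")
    print(f"zeros from missed observation: {dropped}")
```

In the simulation the cause of each zero is known because the ground truth was generated. In a real matrix only the zero is visible, which is why any method that fills zeros in has to bring its own assumptions.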
Text-to-image was invoked as a counterargument. The connection between a caption and the image it describes is also sparse — barely there. Somehow it worked. Maybe biology's sparsity problem dissolves the same way: not by solving it cleanly, but by assembling enough noisy, mismatched signal that structure emerges anyway. Nobody was confident this would happen. Nobody was confident it wouldn't.
#4: "Whole-Genome Generation Needed Two Breakthroughs: Architecture and Imagination"
The story of whole-genome DNA generation is really two stories that converged at the same moment.
The first was architectural.
Standard transformers, the backbone of modern language models, have a fundamental scaling problem: the cost of attention grows quadratically with sequence length. Double the sequence, and compute cost roughly quadruples. For natural language this is expensive but manageable. For DNA it becomes prohibitive. A single human chromosome spans tens to hundreds of millions of base pairs. Applying standard attention across genomic-scale sequences wasn't merely inefficient; it was effectively impossible on any hardware that exists.
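A back-of-the-envelope calculation makes the quadratic scaling concrete. The sketch below counts the query-key scores in one full self-attention map and the memory needed to store it. The assumptions are mine, not the speakers': one token per base pair, a single head in a single layer, and 2 bytes per score (fp16). The function names are illustrative.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key scores in one full self-attention map."""
    return seq_len * seq_len


def attention_matrix_bytes(seq_len: int, bytes_per_score: int = 2) -> int:
    """Bytes needed to materialize one attention map (assumed fp16 scores)."""
    return attention_pairs(seq_len) * bytes_per_score


def describe(seq_len: int) -> str:
    """One-line summary of score count and storage for a sequence length."""
    gib = attention_matrix_bytes(seq_len) / 2**30
    return (
        f"{seq_len:>12,} tokens -> "
        f"{attention_pairs(seq_len):.2e} scores, {gib:.2e} GiB"
    )


if __name__ == "__main__":
    # Two language-model-sized contexts, then two genome-sized ones.
    for n in (4_096, 8_192, 1_000_000, 100_000_000):
        print(describe(n))
```

Going from 4,096 to 8,192 tokens quadruples the score count. At 100 million tokens, a single fp16 map would hold 10^16 scores and need about 2 × 10^16 bytes, roughly 20 petabytes. Memory-efficient attention kernels avoid storing the full map, but the number of score computations remains quadratic.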
Architectures like Hyena and StripedHyena changed that. By replacing most or all of the attention layers with long-convolution mechanisms that scale far more efficiently with sequence length, they made genomic-scale modeling computationally tractable. This was not a minor optimization. It changed what could be modeled at all.
The second was conceptual.
Even after long-sequence modeling became possible, the field had to ask a more basic question: should we generate DNA in the first place? For decades, the dominant frame was that DNA was something to read, annotate, and interpret. You analyzed genomes. You did not design them from scratch. Treating the genome like language — learning its distribution and sampling from it to produce new biological sequences — required a shift in imagination as much as a shift in architecture.
The CRISPR-Cas systems that came out of this approach functioned. The proteins were novel. And the key insight, looking back, is not that generation requires a complete understanding of the rules. It requires sampling well from the space that follows them — the same way no one fully understands how a video model turns a text prompt into a coherent scene, but the scene is coherent. You don't need to understand the grammar to learn the distribution.
That may be the deeper lesson for biology broadly.
Some breakthroughs are blocked by hardware. Some by algorithms. But others are blocked by inherited assumptions about what is worth trying — and those are the hardest barriers to see. There are probably other areas of biology right now where the architecture exists but the conceptual frame is still stuck. Whoever identifies those gaps first has a significant head start.
#5: "Many Orders of Magnitude Is Not a Rounding Error"
The dream of a unified biological foundation model — one system that goes from base pairs to tissue behavior — got a systematic dissection.
The problem is time scale. The full range of protein dynamics spans from the picosecond timescale of atomic motion through to seconds for protein folding and conformational change. Cell signaling unfolds over minutes. Tissue remodeling takes weeks. Organ-level disease progression takes years to decades. The dynamic range from molecular to physiological is conservatively fifteen or more orders of magnitude; picoseconds to decades is closer to twenty. No existing paradigm spans convincingly from molecular to physiological timescales, though computational tools like molecular dynamics do cover significant ranges within those levels.
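The size of that span can be checked with a few lines of arithmetic. The sketch below picks one representative duration for each level named above (one picosecond, one second, one minute, one week, one decade). Those representative values are my assumption for illustration; real processes at each level cover a range.

```python
import math

# Representative durations in seconds (illustrative choices, one per level).
SECONDS = {
    "atomic motion (1 ps)": 1e-12,
    "protein folding (1 s)": 1.0,
    "cell signaling (1 min)": 60.0,
    "tissue remodeling (1 wk)": 7 * 86_400.0,
    "disease progression (10 yr)": 10 * 365.25 * 86_400.0,
}


def orders_of_magnitude(fast_s: float, slow_s: float) -> float:
    """Base-10 logarithm of the ratio between two durations."""
    return math.log10(slow_s / fast_s)


if __name__ == "__main__":
    fastest = SECONDS["atomic motion (1 ps)"]
    for name, seconds in SECONDS.items():
        span = orders_of_magnitude(fastest, seconds)
        print(f"{name:<30} {span:5.1f} orders above 1 ps")
```

With these choices, one picosecond to one decade comes out between 20 and 21 orders of magnitude, comfortably above fifteen.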
The physics analogy partially holds, and then breaks. In atmospheric science, as in much of physics, you can build multi-scale models because you know, from first principles, what information to discard as you move up the scales. Quantum chemistry offers the cleanest example: the Born-Oppenheimer approximation tells you that electrons adjust so quickly to nuclear motion that, at the timescales relevant to the nuclei, you don't need to track electron dynamics explicitly. There is no equivalent principle in biology. Going from protein to cell to tissue, nobody knows which molecular details can be safely abstracted away and which ones quietly determine everything. We choose what to drop not because we know it's safe to drop, but because we have no choice.
The rough consensus: not one model, but a family of models in conversation. Excellent specialized models at each scale, trained on the data that matters for that scale, progressively passing learnings upward. Not intellectually satisfying. Probably correct.
The more uncomfortable implication: even well-designed models at every level may be encoding the same false hypothesis. The field's most trusted assumptions — the signaling pathways, the causal chains, the disease mechanisms that absorbed billions in failed drug trials — feed directly into training data. The hypotheses we believe most confidently are the ones most likely to corrupt the models we train on them.
#6: "The Best Captioned Dataset in Biology Is an X-Ray Report"
Radiology offered the clearest case study of the evening in what actually works — and why.
The field has something structurally unique: every image comes with a long, detailed natural-language description. The radiology report is not a label — it's a rich, expert-generated account of what the image contains, what it means clinically, and what should happen next. It is, structurally, the best naturally occurring paired dataset in all of medicine.
CNN-based classifiers, trained on large labeled image datasets, plateaued when asked to generalize beyond narrow benchmarks. They reported superhuman accuracy on specific tasks like diabetic retinopathy detection or pneumonia classification, but struggled to generalize reliably to new populations, new scanners, or new presentation patterns. Then vision-language models arrived and changed the picture. The reason: much of what the model needed to understand about the image wasn't in the image. It was in the text. The labels captured the classification. The reports captured the reasoning. Without the reports, the model learned to classify what was obviously visible. With them, it learned what the finding meant.
The larger point the table drew from this: wherever you have rich, naturally occurring annotation linking modality to meaning — not artificial labels, but the kind of paired observation that experts generate as part of their normal work — AI closes the gap quickly. The unlocks in biology may follow wherever those pairings can be assembled: drug mechanisms linked to experimental outcomes, clinical notes linked to specific biomarkers, cell morphology linked to molecular state.
Most of biology doesn't have this. Radiology arrived at it through its own disciplinary culture — radiologists have always been trained to explain what they see, not just classify it. The report is the professional artifact, not an annotation added later. The question is where else that annotation structure might exist, or could be engineered into existence.
#7: "The Chess Computer Has Already Become Incomprehensible"
The night's sharpest metaphor arrived late.
Chess had a recognizable arc: humans vs. humans for centuries; then computers improving, catching up, surpassing; a brief window where human-plus-computer was the strongest combination; and then computers leaving humans entirely behind. Many of the moves top chess engines make today are, to grandmasters, genuinely unintelligible. They are not variations on moves a human might consider but a different category of strategy altogether: moves that would be rejected before evaluation and turn out to be deeply correct.
Biology may follow the same arc. The AI systems that eventually make progress on large-scale biological problems may do so through reasoning that no human scientist can follow — not approximately, but completely. The way a cell integrates signals across many timescales simultaneously, translating cascades of molecular events into a developmental decision that plays out over days — that complexity may become legible to a computational system long before it becomes legible to us.
There is one important asymmetry with chess worth naming. In chess, correctness is verifiable immediately — you run the game forward and see if the move was good. In medicine, verification requires clinical trials that take years, sometimes decades. The trustworthiness problem in biology is structurally harder: the system may be right, but you cannot confirm it quickly enough to act with confidence. This is not a reason to stop building. It is a reason to be deliberate about what 'trustworthy' requires in each application context.
If the output is correct, the reasoning doesn't need to be inspectable. But it does need to be trustworthy. That distinction — between inspectable and trustworthy — is probably where the hardest work of the next decade happens.
The pragmatists leaned toward: does it matter? The question isn't whether AI understands biology the way a scientist does. The question is whether it can produce outputs that work — drugs that are safer, diagnostics that catch disease earlier, treatments calibrated to the actual molecular state of the actual patient. But interpretability isn't irrelevant — it's load-bearing in a different way. It determines how much we can trust the output before we act on it, how we catch failures before they reach patients, and how regulators decide what's submittable. The goal isn't interpretability for its own sake. The goal is having enough of it to be trustworthy in the contexts that require it.



