The Case for Structuring at the Point of Work
Most AI initiatives in life sciences follow a recognizable arc. There is a compelling demo. Leadership is excited. A budget is approved. A team is assembled. Six months later, the initiative is quietly deprioritized, not cancelled, but no longer the priority it once was.
When you ask what happened, the answers are rarely about the model. The model worked fine in the demo. The problem emerged when the AI was pointed at real data and asked a real question. The output was plausible-sounding, but no one could verify it. The scientists who were supposed to use it didn't trust it. The governance review stalled because the outputs weren't traceable. The initiative ran into the actual data environment: fragmented, inconsistent, never built with AI in mind. It didn't survive the encounter.
This pattern is not an anomaly. It is the dominant experience of AI adoption in life sciences right now. And the reason it keeps repeating is not that the models are insufficient. It is that the structured scientific data underneath them is not ready, and most organizations are discovering this later than they should.
What a Fragile Foundation Actually Looks Like
The data problem in life sciences is not mysterious. Most researchers know it intimately. It looks like this:
Experimental records captured in free-text notebook entries, written differently by every scientist who has ever worked on a project. Results stored across PDFs, ad hoc spreadsheets, and proprietary instrument exports that were never designed to talk to each other. Context (the reason an experiment was run, the hypothesis it was meant to test, the decision it was supposed to inform) sitting in someone's memory, or a Slack thread, or a presentation that was never filed anywhere useful.
The data is technically accessible. It exists. In many cases, a great deal of effort has gone into collecting it. But it is semantically opaque to an AI. It is a collection of values without relationships, numbers without lineage, records without meaning.
An AI operating on this kind of foundation can produce outputs. It can summarize. It can surface patterns. It can answer questions. But it cannot produce answers that are verifiable, traceable, or trustworthy, because the data underneath them isn't any of those things either. When a scientist asks "am I moving in the right direction?" (in the end, the only question that matters), an AI drawing on free-text notebook entries can offer a confident-sounding guess. That is not the same as a useful answer.
The instinct, when this becomes apparent, is to reach for connectors. If we can just pull in more data sources, integrate more systems, build more pipes into and out of the platform, the thinking goes, the AI will have more to work with and the answers will improve. This instinct is understandable, and it is wrong. Connectors move data. They do not make data mean something. Adding more pipelines into a fragmented foundation does not change what the AI is operating on. It just moves the fragmentation around faster.
What Structured Data Actually Means
The phrase "structured data" gets used loosely enough that it has started to lose precision. In a lab context, it is worth being specific about what it actually means, and what it makes possible.
Structured data, at its most basic, is data that is organized according to a defined schema, with consistent formats, labels, and relationships. A result has a defined type. A compound has a defined relationship to the assay it was tested in. An experiment has a defined connection to the protocol it followed and the decision it was meant to inform. None of this requires exotic technology. It requires a commitment to capturing data correctly at the moment it is created, not cleaning it up afterward.
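To make "defined schema" concrete, here is a minimal sketch in code. The class names and IDs (AssayResult, ResultType, CMP-17, and so on) are hypothetical, not any particular platform's data model; the point is that the result type, the relationships, and the decision being informed are required fields at the moment of recording, not annotations added later.

```python
from dataclasses import dataclass
from enum import Enum

class ResultType(Enum):
    # A result has a defined type, with units built into the definition.
    IC50_NM = "ic50_nm"
    SOLUBILITY_UM = "solubility_um"

@dataclass(frozen=True)
class AssayResult:
    result_id: str
    result_type: ResultType
    value: float
    compound_id: str    # defined relationship to the compound tested
    assay_id: str       # and to the assay it was tested in
    protocol_id: str    # defined connection to the protocol followed
    decision_id: str    # and to the decision it was meant to inform

# Structured at the point of capture: the record cannot exist without
# naming the compound, assay, protocol, and decision it belongs to.
result = AssayResult(
    result_id="R-0001",
    result_type=ResultType.IC50_NM,
    value=42.0,
    compound_id="CMP-17",
    assay_id="ASY-03",
    protocol_id="PRT-09",
    decision_id="DEC-02",
)
```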
That "at the point of capture" distinction matters more than it might initially seem. Data that is structured after the fact is always a partial reconstruction. Context has been lost. Ambiguities have been resolved by whoever did the cleaning, not by the scientist who ran the experiment. The relationships are approximations. The AI operating on that data is operating on an approximation of the truth, and its answers will reflect that.
Data structured at the point of work is different. It carries its context with it. The scientist recording a result is also, as a natural consequence of how their workflow is set up, recording the relationships that give that result meaning. Nothing extra is required. The structure is a consequence of doing the work correctly, not an additional burden imposed on top of it.
The next level of this is ontology-backed data, and it is worth spending a moment on what that means in plain language, because it is where the real competitive difference in AI performance lives.
An ontology is a formal map of how concepts in a domain relate to each other. In life sciences, that means defining not just what a piece of data is (a compound, an assay result, a material state, a protocol step) but what it means in relation to everything else in the system. A compound is not just a row in a table. It has a structure. That structure was used in an assay. The assay produced a result. The result informed a decision. The decision shaped what happened next. That semantic richness (the meaning behind the data, not just the data itself) travels with the information as it moves through the system.
This is what allows AI to answer questions about direction rather than just questions about facts. "What is the IC50 of compound X?" is a retrieval question. Almost any system can answer it. "Am I moving in the right direction with this compound class?" is a reasoning question, and answering it requires understanding not just what the data says, but what it means and how the pieces relate to each other. Ontology-backed data makes that kind of reasoning possible. Data that lacks it does not.
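A toy sketch can make the distinction concrete. The identifiers and relationship names below are invented for illustration, and real systems use formal ontology languages rather than Python lists, but the shape is the same: the retrieval question is a lookup, while the reasoning question is a walk across stored relationships.

```python
# Ontology-style data as subject-predicate-object triples (invented IDs).
triples = [
    ("CMP-17", "member_of", "CLASS-A"),
    ("CMP-17", "tested_in", "ASY-03"),
    ("ASY-03", "produced", "R-0001"),
    ("R-0001", "has_ic50_nm", 42.0),
    ("R-0001", "informed", "DEC-02"),
    ("DEC-02", "led_to", "EXP-11"),
]

def objects(subject, predicate):
    """Read stored relationships directly; nothing is inferred."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Retrieval question: "What is the IC50 of CMP-17?" A simple lookup.
assay = objects("CMP-17", "tested_in")[0]   # ASY-03
result = objects(assay, "produced")[0]      # R-0001
print(objects(result, "has_ic50_nm"))       # [42.0]

# Reasoning question: "Where is this compound class heading?" requires
# walking the chain: class -> compound -> assay -> result -> decision
# -> next experiment. Possible only because the links are stored.
compounds = [s for s, p, o in triples if p == "member_of" and o == "CLASS-A"]
for compound in compounds:
    for assay in objects(compound, "tested_in"):
        for result in objects(assay, "produced"):
            for decision in objects(result, "informed"):
                print(compound, assay, result, decision,
                      objects(decision, "led_to"))
```

Nothing in the sketch is exotic. The leverage comes entirely from the relationships being recorded rather than guessed at after the fact.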
Why Retrofitting Is So Difficult
At this point, the obvious question is: why not just clean up the data you have? If the problem is that historical data was captured without sufficient structure, surely a data cleaning project can address it.
The honest answer is that it can, sometimes, partially. But the obstacles are significant enough that most organizations that attempt it underestimate them.
The first is irreversibility. Data that was never structured at the point of capture is often ambiguous in ways that cannot be resolved after the fact. The context is gone. The experimental intent was never recorded. The scientist who knows what that ambiguous entry actually meant left the organization two years ago. Cleaning can impose a structure on this data, but it cannot recover meaning that was never captured in the first place.
The second is scale. Retroactive structuring of years of scientific data is an enormous undertaking, and it almost always competes for resources with active research. These projects have a well-documented tendency to stall partway through, organized enough that the problem seems addressed, not complete enough to actually solve it.
The third, and most important, is that the underlying workflow that produced unstructured data is still in place. Even when a cleaning project succeeds, new data continues to arrive the same way it always did. The cleaned dataset becomes stale before the work is finished. The problem recreates itself.
The implication is uncomfortable but important: the data foundation is not a one-time project. It is an architectural decision about how scientific work is captured in the first place. Organizations that build AI on a strong foundation do not do it by cleaning up what they already have. They do it by changing how new data is created, by building structure into the moment of scientific work, so that the AI-ready data is a byproduct of normal operations rather than a separate effort.
What Good Looks Like
It is worth being concrete about what an AI-ready data environment actually looks like in practice, not as an abstract ideal, but as a set of observable characteristics.
The data is structured at the point of capture. Scientists do not need to do additional work to make their data AI-ready. The structure is a consequence of how their workflows are set up, not an overhead imposed on top of them.
Relationships are preserved. The connections between entities (compound to assay, assay to result, result to decision, decision to next experiment) are stored explicitly, not inferred after the fact. When an AI queries this data, it is not guessing at relationships. It is reading them directly.
Experimental context travels with the data. The intent behind an experiment, the lineage of a material, the rationale for a decision: these are stored alongside the results they informed, not separately in a notebook or a conversation. An AI asked to explain what an experiment was about can answer because that information is there, not because it is extrapolating from incomplete records.
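As a sketch, with hypothetical field names, context-carrying data can be as simple as a record whose hypothesis, lineage, and downstream decision live in the record itself, so a question about intent is answered by reading, not reconstructing.

```python
# Hypothetical record shape: context stored with the experiment,
# not in a separate notebook, slide deck, or chat thread.
experiment = {
    "experiment_id": "EXP-11",
    "hypothesis": "Fluorine at R2 improves metabolic stability",
    "informs_decision": "DEC-03",               # the decision it feeds
    "material_lineage": ["LOT-88", "LOT-91"],   # where the material came from
    "protocol_id": "PRT-09",
    "result_ids": ["R-0002", "R-0003"],
}

def what_was_this_about(record):
    # Answerable because the intent is present, not extrapolated.
    return (f"{record['experiment_id']} tested '{record['hypothesis']}' "
            f"to inform {record['informs_decision']}, using material "
            f"from {' and '.join(record['material_lineage'])}.")

print(what_was_this_about(experiment))
```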
Answers are verifiable. When an AI produces an output, the query that generated it can be re-run independently and inspected. The answer is not a black box. It is the result of a traceable process that a scientist, a reviewer, or a regulatory body can follow back to its source.
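One way to make that property concrete: the answer object carries the exact query and the source records it was derived from, so verification is re-execution rather than trust. This is a minimal sketch with a stubbed query engine and invented names, not a real system's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Answer:
    text: str
    query: str             # the exact query that produced the answer
    source_ids: list[str]  # the records it was derived from
    produced_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def run_query(query: str) -> Answer:
    # Stub standing in for execution against the structured store.
    return Answer(
        text="Mean IC50 for CLASS-A this quarter: 38 nM (n=12)",
        query=query,
        source_ids=["R-0001", "R-0002"],
    )

answer = run_query("avg(ic50_nm) where class = 'CLASS-A'")

# A reviewer verifies by re-running the stored query and following the
# source IDs back to the original records, not by trusting the text.
reproduced = run_query(answer.query)
assert reproduced.source_ids == answer.source_ids
```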
The platform is open enough for external tools to connect to the structured foundation. Data scientists who want to use their own models, their own pipelines, or their own analytical tools can access the same structured, harmonized scientific foundation without rebuilding the underlying data bridges. The value of the foundation extends beyond any single AI interface.
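Here is a sketch of what that openness means for a data scientist working outside the platform's own interface. The records are inlined to keep the example self-contained; in practice they would arrive via the platform's export or API, whose details are not assumed here.

```python
import pandas as pd

# Structured records as an external tool would receive them.
records = [
    {"compound_id": "CMP-17", "assay_id": "ASY-03", "ic50_nm": 42.0,
     "decision_id": "DEC-02"},
    {"compound_id": "CMP-18", "assay_id": "ASY-03", "ic50_nm": 35.0,
     "decision_id": "DEC-02"},
    {"compound_id": "CMP-21", "assay_id": "ASY-04", "ic50_nm": 110.0,
     "decision_id": "DEC-03"},
]

df = pd.DataFrame(records)

# Because relationships arrive with the data, an external pipeline can
# group on them directly instead of rebuilding the joins by hand.
print(df.groupby("decision_id")["ic50_nm"].mean())
```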
None of these characteristics require exotic solutions. They require a platform that was designed with structured scientific work at its center, and a recognition that the AI return on investment is not separate from the data investment. It is the same investment, looked at from a different angle.
The Governance Dimension
There is one more reason to take the data foundation seriously, and it is becoming increasingly hard to ignore.
In regulated environments, which describes most of life sciences, AI outputs need to be auditable. This is not a bureaucratic requirement imposed from outside. It is a practical one. An AI that produces a summary no one can verify will not survive a governance review, regardless of how impressive the underlying model is. The question a compliance team or a regulatory reviewer will ask is not "how good is the model?" It is "can you show me exactly how this answer was produced, and can you reproduce it?"
Gartner has flagged this as the primary reason agentic AI initiatives in healthcare and life sciences are expected to stall in 2026: not model capability, but the inability to demonstrate the kind of traceability and control that regulated environments require. The prediction is that 80% of agentic AI initiatives will not progress beyond initial governance checkpoints. The bottleneck is not ambition or investment. It is the data foundation those initiatives are built on.
The connection worth making is this: governance is not a separate problem from data quality. It is downstream of it. An AI operating on structured, ontology-backed data produces answers that are inherently more auditable because the query that generated the answer can be re-run, inspected, and verified. The audit trail is not a compliance feature bolted onto the system. It is a natural consequence of how the data was structured in the first place. Getting the data foundation right and getting the governance story right are not two different things. They are the same thing.
The Question Worth Asking
Before the next AI initiative kicks off, before the vendor evaluation, before the model selection, before the architecture decisions, there is one question that should come first.
What is the AI going to be operating on?
If the honest answer involves significant data cleanup, a connector strategy to compensate for fragmentation, or a hope that a sufficiently capable model will work around what the underlying data cannot provide, the initiative is starting from the wrong place. The cleanup will take longer than planned. The connectors will move data without making it mean something. The model will produce outputs that nobody can verify, and the initiative will join the long list of life sciences AI projects that stalled somewhere between a compelling demo and actual use.
The organizations seeing real, durable returns from AI in life sciences are not always the ones with the most sophisticated models or the most aggressive AI roadmaps. They are the ones whose data was ready: structured at the point of work, connected across the research cycle, semantically rich enough for AI to produce answers that are meaningful rather than merely fast.
The model is the last mile. The data is the road. Building the road correctly from the start is not merely a prerequisite for AI. It is the investment that makes AI worth making.
Learn More
Interested in how Luma approaches structured scientific data capture? See how the Luma platform is built for AI from the ground up.
