Rancho BioSciences started life as a data curation company. Thirteen years on, we've grown into data science, engineering, and consulting, but we still run large, complex data harmonization projects for pharma. We often get asked: what does that actually look like?
The answer is: it looks a lot like people doing complex work, much of it involving other people, that AI can't yet do for them. And using AI for the parts it can do.
Here's a rough breakdown of a typical project:
Generally we're working inside a secure customer environment, so there's mandatory security training to be done. As external contractors we do a lot of this. Some of it is relevant, some of it less so. It just needs to be done. Security is important.
You might assume the scope was all wrapped up before contracts were signed. You would be wrong. Sure, words were written down. But priorities shift, stakeholders disagree on what they actually wanted, and someone has reorganised since then. We spend time figuring out what matters, what's realistic, and what needs to happen first. This is also where a lot of implicit organisational knowledge starts surfacing. What does this team actually need the data to do? What does "harmonized" mean to this stakeholder versus that one? Getting those answers out of people's heads and into a shared brief is itself valuable work.
Next comes the data model. Which datasets need to be harmonized, now and in the future? Which public ontologies apply? Who governs the model once it exists? Our knowledge engineers use AI to explore design choices and iterate quickly, but if the model is going to be the source of truth, it needs to be correct. That needs humans making the calls. The model is also where a lot of captured knowledge lives permanently: agreed definitions, relationships between concepts, constraints on what can be entered. It's not just a technical artefact, it's a record of what the organisation knows.
This is the part people usually think of as "manual curation," and it's where the knowledge excavation really happens. Looking at fields, working out mappings, tracking down whoever actually knows what that undocumented database column means (it's always a Dave). The goal is to take knowledge that exists only in people's heads, in old spreadsheets, in database schemas with no descriptions, and make it explicit, documented, and durable. Once it's in the model as synonyms, definitions, and annotations, future users understand it. More importantly, AI agents can use it. This is the core of what Rancho's curation team does, and experience matters enormously here. AI can assist with field mapping but it struggles with partial, inconsistent information, which is most of what you're working with.
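To make "in the model as synonyms, definitions, and annotations" a little more concrete, here is a minimal sketch of what a curated field record and a synonym lookup could look like. Everything in it is hypothetical and for illustration only: the `FieldDefinition` class, the `MODEL` entries, and the column names are made up, not taken from a real project or from Rancho's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDefinition:
    """One curated field in a (hypothetical) harmonized data model."""
    name: str                 # canonical field name
    definition: str           # agreed, human-readable meaning
    synonyms: tuple = ()      # labels seen in source systems
    annotation: str = ""      # curator notes, e.g. who confirmed the meaning


def normalise(label):
    """Lower-case a label and drop anything that is not a letter or digit."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


def build_synonym_index(fields):
    """Map every normalised name or synonym to its canonical field name."""
    index = {}
    for f in fields:
        for label in (f.name, *f.synonyms):
            key = normalise(label)
            if key in index and index[key] != f.name:
                raise ValueError(f"{label!r} is claimed by two fields")
            index[key] = f.name
    return index


def map_columns(columns, index):
    """Split source columns into those the model recognises and those it doesn't."""
    mapped, unmapped = {}, []
    for col in columns:
        target = index.get(normalise(col))
        if target is None:
            unmapped.append(col)   # needs a human (possibly Dave)
        else:
            mapped[col] = target
    return mapped, unmapped


# Illustrative model entries only.
MODEL = [
    FieldDefinition(
        name="subject_id",
        definition="Identifier of the study participant.",
        synonyms=("SUBJID", "patient id"),
    ),
    FieldDefinition(
        name="sex",
        definition="Sex of the participant as recorded at enrolment.",
        synonyms=("gender", "SEX_CD"),
        annotation="Meaning of SEX_CD confirmed with the data owner.",
    ),
]

mapped, unmapped = map_columns(["SUBJID", "Gender", "col_17"], build_synonym_index(MODEL))
```

In this sketch the lookup is the easy part. The list of unmapped columns is where the curation work described above begins.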
Once mappings are defined, we build reusable ETL pipelines so future data in the same format flows through automatically. This is where AI and traditional automation genuinely earn their keep, and it's possible because the hard thinking has already been done upstream.
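As a rough illustration of why the pipeline becomes reusable once mappings are settled, here is a small sketch that applies an agreed mapping to a CSV file using only the Python standard library. The names `COLUMN_MAP`, `VALUE_MAP`, `transform_row`, and `run_pipeline`, along with the sample data, are assumptions invented for this example, and a real pipeline would do considerably more.

```python
import csv
import io

# Hypothetical outputs of the curation step: agreed column and value mappings.
COLUMN_MAP = {"SUBJID": "subject_id", "Gender": "sex"}
VALUE_MAP = {
    "sex": {"m": "male", "f": "female", "male": "male", "female": "female"},
}


def transform_row(row, column_map, value_map):
    """Rename columns and translate values into the model's controlled terms."""
    out = {}
    for source, target in column_map.items():
        raw = (row.get(source) or "").strip()
        allowed = value_map.get(target)
        if allowed is None:
            out[target] = raw            # free-text field, passed through
            continue
        key = raw.lower()
        if key not in allowed:
            # Fail loudly so a curator reviews the value instead of guessing.
            raise ValueError(f"Unmapped value {raw!r} for field {target!r}")
        out[target] = allowed[key]
    return out


def run_pipeline(source, column_map, value_map):
    """Read CSV rows from a file-like object and return harmonized records."""
    reader = csv.DictReader(source)
    return [transform_row(row, column_map, value_map) for row in reader]


sample = io.StringIO("SUBJID,Gender\n001,M\n002,female\n")
rows = run_pipeline(sample, COLUMN_MAP, VALUE_MAP)
```

Any future file in the same format can go through `run_pipeline` unchanged. Raising an error on an unknown value is one possible design choice: it keeps new, unreviewed terms from slipping into the harmonized data unnoticed.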
There's sometimes a perception from leadership that this whole harmonization problem can be packaged up and handed to AI. Understandable given the hype, but I'd push back: if it were that straightforward, it would already be done. The challenge isn't execution, it's getting to the point where instructions are clear enough to execute. Turning tacit knowledge into explicit knowledge is the hard part, and it requires people with the right mix of scientific understanding, data modelling experience, and the interpersonal skills to go find Dave.
The reason this work matters is that it builds the foundation everything else depends on: model training, multi-omics analysis, agentic workflows. At Rancho we've spent more than a decade building the team and the processes to do this well. The tools may have changed but the need for experts who really understand the data hasn't.