What is Correct?
As data engineers, we often receive implicit (or explicit) requirements for our data to be correct. Naturally, the people writing those requirements are usually unable to articulate what correct actually means in a given context—and they are often reluctant to commit to any one definition.
So, let’s explore different interpretations of correctness to help us have better conversations the next time someone insists that "the data has to be correct."
Correct to the Source
Sometimes, data needs to be correct to the source. In essence, we’re aiming to ensure that through all the transformations and communication protocols we use, the data remains representative of what was measured at the origin.
Is the data at the end of our process an accurate reflection of the data at the beginning? Are we handling encoding specifics properly? Can our final representation accommodate all possible inputs?
This type of correctness offers a clear definition. Testing is straightforward: given a set of input messages, we can write clean end-to-end tests to compare them with the corresponding output messages.
However, we must accept that any flaws present in the source will propagate through the system. After all, our goal here is correctness to the source.
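To make the comparison concrete, here is a minimal sketch of such an end-to-end test. The `run_pipeline` entry point and the dictionary-shaped messages are hypothetical stand-ins for whatever your system actually does between source and sink:

```python
# A minimal end-to-end test sketch. `run_pipeline` is a hypothetical
# entry point standing in for every transformation and protocol hop
# between source and sink in a real system.

def run_pipeline(messages: list[dict]) -> list[dict]:
    # Placeholder implementation: pass the messages through unchanged.
    return [dict(m) for m in messages]


def test_output_matches_source():
    source_messages = [
        {"id": 1, "value": 42.0, "unit": "°C"},   # non-ASCII must survive encoding
        {"id": 2, "value": -3.5, "unit": "°C"},
    ]
    output_messages = run_pipeline(source_messages)

    # Correct to the source: every input is represented, unchanged,
    # in the output -- including encoding-sensitive fields.
    assert output_messages == source_messages
```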
Correct to the Model
There is no such thing as "correct to reality." Precision is limited. Time delays are inevitable. The only thing exactly correct to reality is reality itself.
So when someone tells you to “just do it how it is,” ask them: What model are we using? What assumptions are we making? What are the acceptable ranges of values? How are we binning our data? What is our tolerance for imprecision?
To achieve this type of correctness, we need clearly specified requirements and constraints. Trade-offs will have to be made—applying algorithms to average noisy inputs, filtering out out-of-bounds data, setting thresholds to exclude errors, and implementing merge strategies to harmonise multiple sources or dimensions.
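As a sketch of what such trade-offs can look like in code, the snippet below drops out-of-bounds readings and smooths the rest with a moving average. The valid range and window size are illustrative assumptions, not recommendations; agreeing on them explicitly is the whole point.

```python
# A sketch of model-level correctness rules. The bounds, window size,
# and filtering policy are illustrative assumptions that a real model
# specification would have to pin down.
from statistics import mean

VALID_RANGE = (-40.0, 85.0)   # assumed sensor operating range
WINDOW = 5                    # assumed smoothing window


def apply_model(readings: list[float]) -> list[float]:
    # Filter out-of-bounds values (assumed to be sensor errors).
    in_bounds = [r for r in readings if VALID_RANGE[0] <= r <= VALID_RANGE[1]]

    # Average noisy inputs with a simple trailing moving average.
    return [
        mean(in_bounds[max(0, i - WINDOW + 1): i + 1])
        for i in range(len(in_bounds))
    ]


print(apply_model([20.1, 19.8, 999.0, 20.3, 20.0]))  # 999.0 dropped, rest smoothed
```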
This is often what people mean when they say, “make the data correct,” yet the assumptions underpinning the model are rarely spelled out. This ambiguity creates a constant tug-of-war between those setting demands and those trying to meet an ever-shifting standard of "correct."
Correct to the Representation
Sometimes what is correct has less to do with the actual data and more to do with who can see it and how. To present data to humans, we often aggregate, colour-code, chart, and filter based on some permission system. In those cases, what can be seen is the only metric that counts: does the output of the system look correct, given all of the inputs?
Here too, we need to state our representation assumptions explicitly. Correctness to the representation is, in essence, just another specialised form of correctness to the model; without a clear definition of the requirements, we are bound to fail.
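A minimal sketch of what this can look like, assuming a hypothetical permission map and a revenue-by-region view; the roles, regions, and fields are invented for illustration:

```python
# A sketch of representation-level correctness: what a viewer sees is an
# aggregate filtered by their permissions. Roles and data are illustrative.
from collections import defaultdict

ROWS = [
    {"region": "EU", "revenue": 120.0},
    {"region": "EU", "revenue": 80.0},
    {"region": "US", "revenue": 200.0},
]

PERMITTED_REGIONS = {"analyst_eu": {"EU"}, "admin": {"EU", "US"}}


def revenue_by_region(role: str) -> dict[str, float]:
    allowed = PERMITTED_REGIONS.get(role, set())
    totals: dict[str, float] = defaultdict(float)
    for row in ROWS:
        if row["region"] in allowed:
            totals[row["region"]] += row["revenue"]
    return dict(totals)


# "Correct" here means: for this role, these totals -- nothing more.
assert revenue_by_region("analyst_eu") == {"EU": 200.0}
```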
Pipelines
In many modern data-driven organisations, we need all three types of correctness as data flows from raw, unreliable inputs to the polished presentation layer we show customers and stakeholders.
Understanding these different types of correctness allows us to build cleaner, more reliable pipelines.
A well-structured, well-defined pipeline will always have all three stages (a skeletal sketch follows the list):
- Correct to the Source: Start with clean, standardised data extracted from source systems. Ensure it’s transformed into a format suitable for ingestion.
- Correct to the Model: Use the source-aligned data to build models that reflect domain assumptions. Apply validation rules, filters, and transformations as needed.
- Correct to the Representation: Construct views and dashboards that present data in an understandable, permission-appropriate format.
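A skeletal version of such a pipeline might look like the sketch below. The stage bodies are placeholders standing in for real extraction, validation, and presentation logic:

```python
# A three-stage pipeline skeleton. Each stage owns one kind of correctness;
# the bodies here are placeholders, not real transformations.

def source_stage(raw_messages: list[dict]) -> list[dict]:
    # Correct to the source: standardise the format, preserve every input.
    return [dict(m) for m in raw_messages]


def model_stage(records: list[dict]) -> list[dict]:
    # Correct to the model: apply the agreed validation rules and filters.
    return [r for r in records if r.get("value") is not None]


def representation_stage(records: list[dict], role: str) -> dict:
    # Correct to the representation: aggregate for a permitted audience.
    return {"count": len(records), "viewer": role}


def run(raw_messages: list[dict], role: str) -> dict:
    return representation_stage(model_stage(source_stage(raw_messages)), role)
```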
This three-tiered architecture has several advantages:
- Linear Flow: Data moves from less refined to more refined states in a straight path.
- Clear Assumptions: By explicitly defining correctness at each stage, we avoid redundant validation and reduce complexity.
- Transparent Lineage: Instead of convoluted feedback loops and untraceable transformations, we always know what kind of dataset we’re working with and what we’re producing.
- Easier Debugging: When errors arise, we can quickly isolate which stage violated its correctness contract. What is wrong directly translates to where the error is.
- Effective Monitoring: Each stage’s assumptions can be monitored (see the sketch after this list). For example:
- The source stage can alert on outages or malformed inputs.
- The model stage can flag violations of expected patterns.
- The presentation stage can notify us when aggregates or metrics diverge from norms.
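A rough sketch of what those per-stage checks could look like. The `alert` hook is a stand-in for whatever alerting your platform provides, and the thresholds are illustrative assumptions:

```python
# Per-stage monitoring sketches. Thresholds and the alert hook are
# illustrative assumptions, not recommendations.

def alert(message: str) -> None:
    print(f"ALERT: {message}")


def monitor_source(raw_messages: list[dict]) -> None:
    if not raw_messages:
        alert("source stage: no input received (possible outage)")


def monitor_model(kept: list[dict], dropped: int) -> None:
    if kept and dropped / (len(kept) + dropped) > 0.05:
        alert("model stage: more than 5% of records violated model rules")


def monitor_representation(metric: float, expected: float, tolerance: float = 0.2) -> None:
    if abs(metric - expected) > tolerance * expected:
        alert("representation stage: aggregate diverges from expected norm")
```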
Conclusion
Thinking critically about what “correct” means helps us build simpler and more robust data architectures. When correctness is well-defined, everyone is happier with the outcome—and trade-offs made along the way become more transparent and justifiable.