Self-service analytics makes data accessible to users throughout an organization. But without careful management, it can also lead to data sprawl, disorganized dashboards, and degraded data quality.
Data contracts create clear agreements between data producers and consumers, reducing the potential for misunderstandings, inaccuracies, and unintended uses of data that put quality and reliability at risk.
In our recent webinar, industry experts Chad Sanderson, CEO at Gable.ai, and Shinji Kim, founder and CEO at Select Star, dug into these challenges.
Let’s untangle the complexities of data contracts, understand the roles of data producers, and explore how column-level lineage can advance efficient data management.
What are Data Contracts, and Why Do They Matter?
At its core, a data contract is an agreement between data producers, who generate or transform data, and data consumers, who use it.
Contracts introduce structure to data processes, streamlining operations and aligning data production with organizational objectives.
With this structure, organizations can ensure that data is trustworthy, well-documented, and meets consumers’ expectations.
A well-designed contract establishes clear guidelines for data generation, fosters transparency, and ultimately contributes to the robustness of data governance.
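To make this concrete, here is a minimal sketch of what a data contract might look like when expressed in code: a declared owner, an expected schema, and a freshness SLA, plus a check that flags violations. The field names and checks are illustrative assumptions, not a standard or any particular vendor's format.

```python
# Illustrative data contract: owner, expected schema, and a freshness SLA.
# Field names and checks are assumptions for the sake of example.
from dataclasses import dataclass, field


@dataclass
class DataContract:
    dataset: str            # fully qualified table name
    owner: str              # producing team accountable for the data
    schema: dict            # column name -> expected Python type
    freshness_hours: int    # SLA: maximum acceptable data age
    allowed_nulls: set = field(default_factory=set)

    def validate(self, row: dict) -> list:
        """Return a list of contract violations for a single record."""
        violations = []
        for column, expected_type in self.schema.items():
            if column not in row:
                violations.append(f"missing column: {column}")
            elif row[column] is None and column not in self.allowed_nulls:
                violations.append(f"unexpected null in: {column}")
            elif row[column] is not None and not isinstance(row[column], expected_type):
                violations.append(f"wrong type for {column}: {type(row[column]).__name__}")
        return violations


orders_contract = DataContract(
    dataset="analytics.orders",
    owner="checkout-team",
    schema={"order_id": str, "amount": float, "created_at": str},
    freshness_hours=24,
)

print(orders_contract.validate({"order_id": "A-100", "amount": None}))
# -> ['unexpected null in: amount', 'missing column: created_at']
```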
What are data products?
A data product is the collective output of a data generation activity. Data products come in various forms and are accessible to a wide range of users within organizations: an interactive dashboard in a tool such as Tableau or Looker, a recommendation engine behind a retail website, or a large language model that interacts with users directly.
Drawing a parallel from software engineering, a data product can be defined as a collection of components that creates business value in the data realm.
The challenge is differentiating between data produced in an experimental environment and data that comes from a production environment. This is something software engineering does very well, Chad said, but the data industry struggles with.
“There’s some stuff in the data warehouse that is just constantly in QA mode,” he said. “Then there’s some stuff in there that is incredibly high quality and adds a ton of business value, but no one has gone through the process of productizing those things. So you have some really nasty data quality and data governance issues that apply equally to everything.”
A lack of standardized procedures for ensuring data quality and governance in various data warehouse environments contributes to pervasive issues in the field.
Who are data producers?
A data producer is anyone who generates data, regardless of job title. The definition spans both technical and non-technical roles, from engineers to analysts and even product managers.
The unique challenge in data governance is that producers can be unintentional: someone creates, owns, and maintains a data asset that others come to depend on, whether or not they hold an engineering role.
For example, a product manager may not even realize other people are using the dashboard they created. Without that knowledge, they have no idea of the downstream impacts they cause when they change the dashboard to suit their own needs.
It’s important for each producer to understand their role as a link in the data supply chain, amid the complexity of data dependencies, and their unique responsibility as an arbiter of truth within the organization.
How does data sprawl happen and how do contracts combat it?
Data contracts distinguish an asset in an experimental sandbox environment from one in production, preventing premature dependencies on unfinished or experimental data.
Preventing the unintended use of incomplete or experimental data makes contracts a key weapon in combating the challenges of data sprawl.
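As a rough sketch of how that distinction might be enforced, a pipeline could refuse to take a dependency on anything not marked as production and contracted. The catalog structure and stage labels below are illustrative assumptions, not a specific tool's API.

```python
# Illustrative gate: downstream jobs may only depend on production, contracted assets.
# The catalog entries and stage labels are assumptions for the sake of example.
CATALOG = {
    "analytics.orders": {"stage": "production", "has_contract": True},
    "sandbox.orders_experiment": {"stage": "experimental", "has_contract": False},
}


def require_contracted(dataset: str) -> None:
    """Raise if a job tries to depend on sandbox or uncontracted data."""
    entry = CATALOG.get(dataset)
    if entry is None:
        raise ValueError(f"{dataset} is not registered in the catalog")
    if entry["stage"] != "production" or not entry["has_contract"]:
        raise ValueError(
            f"{dataset} is experimental or uncontracted; "
            "publish it with a data contract before depending on it"
        )


require_contracted("analytics.orders")  # passes silently

try:
    require_contracted("sandbox.orders_experiment")
except ValueError as err:
    print(err)  # the premature dependency is rejected
```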
“Data contracts are almost like a next step after you have your data mart with medallion architecture,” Shinji explained. “You have these raw tables and available analytical tables anyone can use. I can publish these datasets I always use or this model I built so other people can use what I’ve done and don’t have to reinvent the wheel. But when I publish it, I will add documentation to it as part of the contract.”
Chad emphasized the importance of the contract being an agreement between two sides, with expectations on both producers and consumers. He outlined two implementation approaches:
- the standard software engineering approach, where a producer makes their service accessible to the organization;
- the inverse approach, unique to the data world, where a data scientist who has found a meaningful use case approaches the producers for a contract that ensures quality and SLA expectations are met for that use case.
How Column-Level Lineage Makes Data Contracts More Effective
Automated column-level lineage is crucial to discovering necessary data contracts. Lineage delineates data sources and transformations, making it clear whether a data asset is in the experimental or product stage and revealing how it is used throughout the organization.
While contracts can exist without lineage, they require prior knowledge and understanding of the datasets, which many data consumers lack.
Column-level data lineage bridges the gap between producers and consumers, providing context to the data and enhancing the effectiveness of data contracts. In a democratized data environment, lineage guards against unsophisticated users making unauthorized or unexpected changes to the data.
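Conceptually, column-level lineage is a directed graph from source columns to the columns derived from them, which makes “what depends on this?” a simple traversal. Below is a toy sketch with hypothetical column names; a real lineage tool builds this graph automatically by parsing queries and pipelines.

```python
# Toy column-level lineage graph: edges point from a source column to a
# column derived from it. Column names here are hypothetical examples.
from collections import defaultdict

lineage = defaultdict(list)


def add_edge(source_column: str, derived_column: str) -> None:
    lineage[source_column].append(derived_column)


add_edge("raw.orders.amount", "staging.orders.amount_usd")
add_edge("staging.orders.amount_usd", "marts.revenue.daily_total")
add_edge("marts.revenue.daily_total", "dashboards.exec_revenue.total")


def downstream(column: str) -> set:
    """Every column (and thus dashboard or model) that depends on `column`."""
    impacted, stack = set(), [column]
    while stack:
        for child in lineage[stack.pop()]:
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted


print(sorted(downstream("raw.orders.amount")))
# -> every staging, mart, and dashboard column built on the raw amount
```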
One Select Star customer is saving millions simply by integrating the column-level lineage API into their continuous integration pipeline, Shinji said. Continuous integration catches any changes that will have a downstream impact and flags them. The person making the change has to acknowledge the impact before pushing it through.
“Sometimes you have to make changes, but you need to make sure all the consumers are aware of the impact, too,” she said.
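That workflow maps to a simple gate in CI: look up the downstream impact of a changed column and fail the build until the change is explicitly acknowledged. The sketch below hardcodes the impact map; in practice it would come from an automated lineage API, and the column and asset names are hypothetical.

```python
# Illustrative CI gate: block a schema change until its downstream impact is acknowledged.
# The impact map is hardcoded here; in practice it would come from a lineage tool.
import sys

DOWNSTREAM_IMPACT = {
    "raw.orders.amount": ["marts.revenue.daily_total", "dashboards.exec_revenue"],
}


def check_change(changed_column: str, acknowledged: bool) -> int:
    impacted = DOWNSTREAM_IMPACT.get(changed_column, [])
    if impacted and not acknowledged:
        print(f"Changing {changed_column} impacts downstream assets:")
        for asset in impacted:
            print(f"  - {asset}")
        print("Re-run with --ack once consumers have been notified.")
        return 1  # non-zero exit code fails the CI job
    return 0


if __name__ == "__main__":
    sys.exit(check_change("raw.orders.amount", acknowledged="--ack" in sys.argv))
```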
Column-level lineage also streamlines the data warehouse by making it possible to deprecate data products that aren’t being used.
“In the beginning, when there are a lot of disorganized dashboards and tables, we recommend customers deprecate them,” she explained. “The reason many of them can’t deprecate is because they don’t know who’s using it or what impact it may have downstream.”
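The same lineage and usage metadata can drive that cleanup. A rough sketch: flag assets with no downstream dependents and no recent queries as deprecation candidates. The asset stats below are invented for illustration; a real implementation would read them from lineage and query logs.

```python
# Illustrative deprecation check: assets with no downstream dependents and no
# recent queries are safe candidates for cleanup. The stats are made up.
from datetime import date, timedelta

ASSETS = {
    "dashboards.exec_revenue": {"downstream": 3, "last_queried": date(2024, 5, 30)},
    "dashboards.old_funnel_v1": {"downstream": 0, "last_queried": date(2023, 1, 12)},
    "marts.tmp_backfill": {"downstream": 0, "last_queried": date(2023, 3, 2)},
}


def deprecation_candidates(assets: dict, today: date, idle_days: int = 90) -> list:
    cutoff = today - timedelta(days=idle_days)
    return [
        name
        for name, stats in assets.items()
        if stats["downstream"] == 0 and stats["last_queried"] < cutoff
    ]


print(deprecation_candidates(ASSETS, today=date(2024, 6, 1)))
# -> ['dashboards.old_funnel_v1', 'marts.tmp_backfill']
```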
Unlock the Potential of Column-Level Lineage
As your organization navigates the complex landscape of data contracts, data products, and the challenges of data sprawl, the integration of column-level lineage and data contracts emerges as a powerful solution.
Data contracts maintain order and quality by separating data being developed in a sandbox environment from datasets ready to use. Column-level lineage strengthens contracts by clarifying the origin and flow of data.
Discover how you can approach the C-suite and advocate for an automated data lineage tool in your organization. And when you’re ready, schedule a demo.