Data discovery arises as a major pain point in most data mesh discussions. Discovery is one of the hardest data management problems to solve, and many data platform managers simply don’t know where to start.
Select Star CEO and founder Shinji Kim recently participated in a panel discussion to help explain why data discovery is so critical in data mesh and share key points to consider, including automation, governance, and integrations. Hosted by Paco Nathan of Data Mesh Learning and Open Source Data Podcast, the panel included fellow data experts Sophie Watson, Mark Grover, and Shirshanka Das.
Here are some highlights from the discussion. To take a deeper dive, watch the full replay.
Why is data discovery important?
Before data can be queried and put into action, people need access to and an understanding of the data available to them. The ultimate goal of data discovery is to make it easier and faster for users to locate their data, know what it means, and then put the new information into practice.
“Being able to find and understand the data is in the critical path of any data analysis, modeling, or manipulation of data,” Shinji explained. “Companies are increasingly looking at the area of data discovery because it’s no longer just the centralized data team that’s using the data. Product managers, analysts, sales ops, and engineers are continuously looking at and finding different parts of company-owned data.”
“Utilizing and analyzing data is a pattern, not just a catalogue schema you can set and forget,” Paco added. “It’s changing all the time, and data scientists need to be able to get into all of the data and touch it.”
Unfortunately, while there are plenty of tools available for data analysis, most modern data lives in decentralized cloud systems, which makes it hard to know what data exists and where to find it.
Why data discovery solutions are important now
There has been an explosion of data in the last 5–10 years, and all the while, our technology and our understanding of data have morphed and changed, making data increasingly difficult to find and use. Companies have made huge investments to centralize and process data in the cloud, but these challenges still persist.
Data used to mean tables in the data warehouse; now it spans cloud systems, creating an even larger body of data to contend with. The very definition of data has changed as enterprises place increasing emphasis on adopting machine learning, extending business intelligence capabilities, and mapping operations to key performance indicators (KPIs), all of which make end-to-end data operationalization difficult.
As in many digital realms, the availability of tools for data is both a blessing and a curse. “Not only is there too much data in too many places, hyperspecialization has made it so that people don’t know where data is or what is even available to them,” Shirshanka noted.
More than ever, organizations are seeing teams across the business that need access to data and are making data-driven decisions, though those teams often lack the understanding or background necessary to do so. Sophie commented, “Making data-driven decisions is fantastic, but many people don’t have the background to know how to do it in a sensible and ethical way. So anything that can make that data easier to find, understand, and use responsibly is a winner.”
Organizations also face challenges while moving from the more traditional extract, transform, load (ETL) pattern to extract, load, transform (ELT). “This provides less risk of data corruption and loss in transit; however, it can leave people not knowing how to access, discover, and utilize data they didn’t create if it is in a format or location they are unfamiliar with,” Shinji explained. This is an escalating and frustrating issue for organizations across the board.
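The contrast Shinji describes can be sketched in a few lines. This is a minimal in-memory illustration, not a real pipeline framework: the “warehouse” is just a dict of named tables, and all function and table names are invented for the example.

```python
# Minimal sketch contrasting ETL and ELT. In ELT, the untouched raw copy
# lands in the warehouse first, reducing the risk of loss in transit --
# but it also means raw data sits in places consumers may not recognize.

def extract(source):
    return list(source)  # pull rows from the source system

def transform(rows):
    # Example cleanup: keep only rows with a positive amount.
    return [r for r in rows if r["amount"] > 0]

def run_etl(source, warehouse):
    """ETL: transform in transit; only the shaped result is stored."""
    warehouse["orders"] = transform(extract(source))

def run_elt(source, warehouse):
    """ELT: land the raw data first, then transform inside the warehouse."""
    warehouse["orders_raw"] = extract(source)
    warehouse["orders"] = transform(warehouse["orders_raw"])

source = [{"amount": 10}, {"amount": -3}, {"amount": 7}]
etl_wh, elt_wh = {}, {}
run_etl(source, etl_wh)
run_elt(source, elt_wh)
print(sorted(etl_wh))  # only the transformed table exists
print(sorted(elt_wh))  # raw and transformed tables both exist
```

The extra `orders_raw` table in the ELT warehouse is exactly the kind of dataset that discovery tooling needs to explain: it exists, but people who didn’t create it may not know what it is or how to use it.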
The role of data discovery in data mesh
Data mesh is about splitting up your centralized data into data domains. How we model, produce, and consume data is decoupled. Data discovery platforms are necessary to discover and understand these components, and then teams can go in and apply tags and ownership to that data. Data mesh allows these teams to invite other users through democratized access while still maintaining full governance and control, giving everyone visibility into what other teams are doing.
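Once domain teams apply tags and ownership to their data, a discovery layer is essentially a search over that metadata. The sketch below assumes a toy catalog structure; the table names, owners, and tags are invented examples, not any particular product’s data model.

```python
# Hedged sketch: discovery as search over tagged, owned data products.
# Any team can find data across domains without central gatekeeping,
# while ownership stays visible on every entry.

catalog = [
    {"table": "commerce.orders",  "owner": "checkout-team", "tags": ["orders", "revenue"]},
    {"table": "growth.signups",   "owner": "growth-team",   "tags": ["users"]},
    {"table": "commerce.refunds", "owner": "checkout-team", "tags": ["revenue"]},
]

def discover(tag=None, owner=None):
    """Find data products matching an optional tag and/or owner filter."""
    return [e["table"] for e in catalog
            if (tag is None or tag in e["tags"])
            and (owner is None or e["owner"] == owner)]

print(discover(tag="revenue"))        # all revenue-tagged tables, across domains
print(discover(owner="growth-team"))  # everything a given team owns
```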
It’s important to make the distinction between discovery for code versus for users. When teams start implementing compliance, controls need to be in place so people aren’t combing through all the data looking for what they need. Discovery has to be consistent and needs programmatic abstractions for both humans and code.
Key points to consider when deciding on data discovery in data mesh
Oftentimes, data teams have to spend a lot of time finding the correct datasets that they can trust. Data discovery in data mesh makes it possible for teams to be able to easily locate the data they need.
The panelists discussed key points that data platform managers should consider, including:
1. Automation and User Experience (UX)
Good annotation and documentation are very important, but data discovery should also include an automated way to display operational metadata so users can see the latest status of a dataset. Looking at a dataset’s usage patterns helps users understand how the data is already being used today, which in turn helps them identify the right datasets to use.
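One common way to automate this is to derive usage metadata from the warehouse’s query log. The sketch below assumes a simplified log format with invented field names; real catalogs typically pull this from sources like a warehouse’s query history.

```python
# Hedged sketch: turn a query log into operational metadata
# (query counts and last-queried dates) so well-used datasets
# surface automatically, without manual annotation.
from collections import Counter

query_log = [
    {"table": "analytics.orders",   "user": "ana", "ts": "2024-05-01"},
    {"table": "analytics.orders",   "user": "raj", "ts": "2024-05-03"},
    {"table": "staging.orders_tmp", "user": "ana", "ts": "2024-04-11"},
]

def usage_stats(log):
    """Summarize how each dataset is actually being used today."""
    counts = Counter(entry["table"] for entry in log)
    last_used = {}
    for entry in log:
        t = entry["table"]
        # ISO date strings compare correctly as strings.
        last_used[t] = max(last_used.get(t, entry["ts"]), entry["ts"])
    return {t: {"query_count": n, "last_queried": last_used[t]}
            for t, n in counts.items()}

stats = usage_stats(query_log)
# The most-queried table is a strong signal for "the right dataset to use":
print(max(stats, key=lambda t: stats[t]["query_count"]))
```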
Sophie emphasized that things must be searchable by everybody who is touching data and everyone must have confidence in the tags and be able to contribute to them. If data is easier to define and discover, it can be used in new and interesting ways.
2. Data Ownership and Governance
Sophie mentioned that because of democratized access and tagging, there is a lot of opportunity for ownership to fall less on a single person within an organization and more on the collective, helping to prevent burnout and resentment. Different people can contribute and will put time and effort into tagging data because they know it will eventually be easier to locate and utilize in the future.
This promotes an open source culture without requiring everyone working on the project to have deep technical knowledge of the data. Instead of the same mistakes being made over and over again, people can address problems immediately and save others from the same struggle of finding and organizing the data.
3. Metadata as Code
Shirshanka again brought up treating data and metadata as code. In the data mesh community, it is important to be able to ask, when creating a data product, what rules make it valid. Compliance and built-in checks need to be integrated into tools, and those tools need to handle many versions of metadata combined with a lot of code. In many cases, domain-specific teams create their own APIs and pipelines for their respective data consumers. This helps govern and control access rights, which considerably reduces the likelihood of producing bad data while also enhancing its security.
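In practice, “metadata as code” often means a data product’s metadata lives in a versionable structure and its validity rules run as ordinary checks in CI. The required fields and rules below are illustrative assumptions, not a data mesh standard or any specific tool’s API.

```python
# Hedged sketch: a data product descriptor plus a validity check that
# could run in CI. Built-in rules (like requiring an access policy for
# PII) catch problems before a product ships bad or ungoverned data.

REQUIRED_FIELDS = ("name", "owner", "domain", "schema", "tags")

def validate_product(metadata: dict) -> list[str]:
    """Return the list of rule violations; empty means the product is valid."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in metadata]
    if "pii" in metadata.get("tags", []) and "access_policy" not in metadata:
        errors.append("PII products must declare an access_policy")
    return errors

orders_product = {
    "name": "orders",
    "owner": "checkout-team",
    "domain": "commerce",
    "schema": {"order_id": "string", "amount": "decimal"},
    "tags": ["pii"],
}
print(validate_product(orders_product))  # flags the missing access policy
```

Because the descriptor is plain data, it can be versioned alongside code, diffed across releases, and validated the same way in every domain team’s pipeline.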
Pressing questions on data discovery and real-world applications
The discussion wrapped up with Paco mentioning the possibility of AI and machine learning within data discovery, as well as how testing tags for uncertainty can help improve the likelihood that data is good and relevant to the user. This was followed by a Q&A where attendees asked about handling PII securely, the best skills to have in order to get a job in the data sphere, the effect of company size on data discovery, and more. Some of the panel questions and answers are posted here.
Still have questions about how to apply data discovery in data mesh? Talk to us by requesting a demo at selectstar.com