Introduction
As someone who has spent 13 years in the weeds of data, I witnessed the rise of the “data-driven” trend firsthand. Before starting and selling my first data startup, I spent time as a statistical analyst building sales forecasting models in R, a software engineer creating data transformation jobs, and a product manager running A/B tests and analyzing user behavior. What all these roles had in common was that they taught me that the context of data — what it represents, how it was generated, when it was last updated, and the ways it can be joined with other datasets — is essential to maximizing the data’s potential and driving successful outcomes.
However, accessing and understanding the context of data is quite difficult. This is because the context of data is often tribal knowledge, meaning it lives only in the brains of the engineers or analysts who have worked with it recently. When other data consumers need to understand the context of data, the shortest path is to find someone who has used the data before and learn it from them.
That becomes a real problem as organizations scale. Finding the right person with the correct context takes time, and you might need to talk to multiple people in order to gather a full understanding of your data.
I first encountered a data discovery problem back in 2008 at Sun Microsystems Research Lab. I was responsible for incorporating different sources of data into our sales forecasting model using Bayesian networks. We already had 10 years’ worth of billing data that I needed to join with the CRM and sales data.
I thought this would be a straightforward task until I realized that the data was scattered across 100+ different fields in 15–20 different tables. A mistake could have resulted in miscalculated sales forecasts. I had to track down several people who had worked with the data in the past to tease out which fields were the right ones to use. If this context had been easily accessible, the project would’ve taken me a week. Because it wasn’t, it took me a month!
This is not a new problem, but it is a growing one. Data is an increasingly central part of good decision making, and the amount of data that companies collect is increasing exponentially. On top of that, the teams working with data have also grown and become more distributed. Different teams have different ways of using the same data.
While there are lots of solutions for storing and querying data, sharing the contextual knowledge around data remains a largely unsolved problem. Data catalogs — software that lets you search your metadata — don’t do enough to help solve the issue of data context.
In this blog post, I’ll explain how most data catalogs approach the data context problem, why their approach falls short, and a better path forward: the data discovery platform. I’ll also introduce Select Star, the data discovery platform my team and I have been working on to help companies better approach data discovery.
Data Catalogs vs. Data Discovery Platforms
Data catalogs have been around for as long as databases. Most databases come with a repository of metadata, usually called INFORMATION_SCHEMA, which holds all the table names, column names, and descriptions (called database comments).
The information schema can tell you how the data is structured (Which field belongs to which table?) and the latest operational information about the data (How many rows are there? When was this last changed?). Traditional data catalogs like IBM InfoSphere or Informatica integrate across different databases to pull this metadata and make it searchable.
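For a concrete sense of what lives in the information schema, here is a minimal query against the standard views that lists every table, its columns, and their types. The exact view and column names vary slightly by database, so treat this as a sketch rather than something that runs verbatim everywhere:

```sql
-- List every table and column, with data types, from the standard
-- INFORMATION_SCHEMA views (names vary slightly across databases).
SELECT
    t.table_schema,
    t.table_name,
    c.column_name,
    c.data_type,
    c.is_nullable
FROM information_schema.tables AS t
JOIN information_schema.columns AS c
  ON  c.table_schema = t.table_schema
  AND c.table_name   = t.table_name
WHERE t.table_type = 'BASE TABLE'
ORDER BY t.table_schema, t.table_name, c.ordinal_position;
```

This is exactly the kind of metadata a traditional catalog aggregates: structure and basic operational facts, but nothing about what the data means or how it’s used.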
It sounds nice to have a centralized place for all the metadata. But even with a great data catalog, finding the right data is inefficient and oftentimes impossible. You search “revenue,” and you see hundreds of tables that include “revenue” — so how do you know which one is the right one to use? You have to ask someone.
This is why so many innovative tech companies ended up building custom internal tools to address this problem. Airbnb, Facebook, LinkedIn, Lyft, Netflix, Spotify, and Uber have all written about it, and they’ve each launched a “data discovery platform”. These data discovery platforms all aim to be a centralized place for anyone in the company to find the data they’re looking for, see who else is using it and where, and write documentation about it.
The good news is that these new data discovery platforms can make a difference. At Spotify, 95% of data scientists are utilizing Lexicon, their data discovery platform. Facebook boasts that their data discovery platform has tens of thousands of internal users.
The bad news is that most companies don’t have the resources or expertise to build their own data discovery platform. So the decision usually boils down to one of three choices — buy an expensive enterprise data catalog like Alation or Collibra, try implementing an open source project like Amundsen or DataHub, or attempt to document everything manually in Google Docs or an internal wiki.
Each of these approaches is risky. Manual documentation is hard to keep up to date, and data consumers won’t trust it once it goes stale. Enterprise data catalogs cost hundreds of thousands of dollars at minimum and take several months to integrate. Open source projects require engineering time to integrate, customize, and maintain the infrastructure.
Also, after choosing to buy a proprietary data catalog or implement an open source one, many companies discover that they still haven’t fully solved their data discovery issues. This is usually because a centralized metadata repository — which is what most of these options amount to — is not enough on its own to solve data discovery problems. The context of data, such as its popularity and lineage, is still left to companies to implement on their own.
A true data discovery platform should provide the data context — who’s using the data, how it’s calculated, what other datasets are related — automatically, on top of pulling and displaying all the metadata. With the full context of the data, a data discovery platform allows any data consumer in your organization to easily answer questions like:
- Where is this data or metric? What is it called and who else is using it?
- Can I trust this data? When was this updated last?
- What’s the source of this data? How is this field calculated?
- Where is this data being used today? Is there a materialized view or dashboard that gets generated from this data?
- What are the different ways to use this data today? Are there other similar or related datasets?
- Who are the top users of this data? How do they use the data today?
Announcing Select Star: the data discovery platform I always wished existed
Over the last year, my team and I have been building Select Star, an automated, easy-to-use, intelligent data discovery platform that “just works.” Our goal is to help companies solve the data discovery problems that I’ve experienced, and that data analysts, data scientists, and engineers are struggling with today.
We believe the ideal data discovery platform should:
1. Expose up-to-date operational metadata along with the documentation
What does this data represent? An important part of data discovery is having good documentation. But writing good documentation is time consuming and difficult, and most people don’t like doing it. Select Star automatically surfaces all the metadata and annotates it with insights gathered from SQL queries — whether it was recently added, how popular it is, who its top users are, and which downstream dashboards depend on it. This doesn’t mean you won’t need documentation. The domain-specific context of data is still important and should be documented by the domain experts. Select Star takes care of making that documentation process easy and, once something is documented, surfacing it everywhere it’s relevant.
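To make the “insights from SQL queries” idea concrete, here is a rough sketch of how popularity-style signals can be derived from a warehouse’s query history. This is purely illustrative, not how Select Star is implemented; the `query_log` table with `(user_name, table_name, executed_at)` columns is hypothetical, and date arithmetic syntax varies by warehouse:

```sql
-- Illustrative only: rank tables by recent query volume and by how many
-- distinct people queried them, using a hypothetical query_log table.
SELECT
    table_name,
    COUNT(*)                  AS queries_last_90_days,
    COUNT(DISTINCT user_name) AS distinct_users
FROM query_log
WHERE executed_at >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY table_name
ORDER BY queries_last_90_days DESC;
```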
2. Track the provenance of data back to its source
How was this metric calculated? Understanding where your data comes from and how it was generated is crucial when you’re analyzing it. Data lineage is a key feature for providing this insight. Showing both upstream sources and downstream dependencies, from raw data all the way to dashboards and metrics, gives a true understanding of how data flows through the organization and the potential impact of any changes.
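As a sketch of what lineage traversal looks like once the dependencies are known, the query below walks all upstream sources of a single table with a recursive CTE. The `lineage_edges` table of `(source_table, target_table)` pairs and the `analytics.daily_revenue` table name are hypothetical, standing in for edges parsed out of ETL jobs and view definitions:

```sql
-- Illustrative only: walk everything upstream of one table, level by level,
-- assuming a hypothetical lineage_edges table of (source_table, target_table).
WITH RECURSIVE upstream AS (
    SELECT source_table, target_table, 1 AS depth
    FROM lineage_edges
    WHERE target_table = 'analytics.daily_revenue'
    UNION ALL
    SELECT e.source_table, e.target_table, u.depth + 1
    FROM lineage_edges AS e
    JOIN upstream AS u ON e.target_table = u.source_table
)
SELECT source_table, target_table, depth
FROM upstream
ORDER BY depth;
```

Running the same traversal in the other direction gives the downstream dependencies — the derived tables, materialized views, and dashboards that would be affected by a change.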
3. Guide data usage
Which field is the table’s key? Who else is using this data and how do they use it? These are questions that every data analyst asks as their data warehouse gets bigger. Being able to find the answers without having to ask other people empowers data consumers to explore and fully utilize the data.
Data usage isn’t just about individual tables and columns. Understanding data usage at a high level can also be very insightful: it can guide which new derived tables should be created next, or which ETL jobs should be deprecated because the tables they generate are no longer being used.
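For example, one rough way to surface deprecation candidates is to look for tables that haven’t been queried in months — again an illustrative sketch that reuses the hypothetical `query_log` table from earlier, not a description of Select Star’s internals:

```sql
-- Illustrative only: find tables with no queries in the last 90 days,
-- joining catalog metadata against a hypothetical query_log table.
SELECT
    t.table_schema,
    t.table_name
FROM information_schema.tables AS t
LEFT JOIN query_log AS q
  ON  q.table_name   = t.table_schema || '.' || t.table_name
  AND q.executed_at >= CURRENT_DATE - INTERVAL '90 days'
WHERE t.table_type = 'BASE TABLE'
GROUP BY t.table_schema, t.table_name
HAVING COUNT(q.table_name) = 0;
```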
By combining all three, we are building a true data discovery platform that embraces how data changes over time and provides its own analysis and recommendations for every data consumer in the organization.