
How to survive the data explosion: Learnings from J.P. Morgan and Fivetran

Veronica Zhai
February 7, 2023

Sifting through data spread across multiple platforms can be challenging. Here is how we solved this problem with the help of a data discovery platform.

I run analytics products and operations at Fivetran, a $5.6 billion “internet plumber” and market leader in data integration. Prior to Fivetran, I built the first enterprise data stack at J.P. Morgan.

At Fivetran, we have worked hard to build a strong data culture. Over 90% of our employees use data tools regularly. Behind such a powerful data culture, there is a problem that is ubiquitous among companies of all sizes undergoing a digital transformation — the data explosion, in which the volume and variety of data become overwhelming. Today, I want to share tools and best practices that address this problem.

The Data Explosion Problem

Adopting a modern data stack leads to an explosion of data sources, user personas and data assets. This explosion results from the following causes:

  • A proliferation of SaaS tools, event streams, and other systems that produce data. In 2021, the average enterprise deployed 187 SaaS apps. Organizations of all sizes strive to centralize data from hundreds of sources into one database for analytics.
  • Modern data pipelines make it extremely easy to put huge and varied volumes of data into a centralized place.
  • Many different people depend on reporting for day-to-day operations. These people span many teams and all levels of seniority, from executives to individual contributors.
  • The proliferation of business intelligence platforms and dashboards can make it hard to find the right data: it is difficult to pinpoint which dashboard to reference, and navigating similar assets full of hard-to-understand, poorly defined, or ambiguous fields and values is confusing.

One of the biggest pains that many data teams experience is the constant flurry of requests to build more — more data sources, more dashboards, and more metrics. Meanwhile, due to the sheer volume and variety of data, users often have trouble finding the right data to use.

These are the inevitable, unintended consequences of crossing the digital transformation chasm. For data leaders, this can make enterprise data overwhelming to manage. For business users, even those who use a set of operational dashboards regularly, it can make them hesitant to explore data beyond their initial scope.

The solution: one map that just works

To contain the data explosion, we need one platform that helps us piece everything back together and make sense of it.

This requires:

  • A complete inventory of all data assets, including tables, dashboards and metrics.
  • Relevant context for each data asset. This includes who owns them, the popularity of these assets, who uses them the most, and most importantly, column-level lineage (how each field relates to both upstream sources and to downstream assets).
  • Ease of use. Managing enterprise data is already complex; there is no reason to add more overhead to the data management process. The ideal platform is intuitive to navigate and just works.
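As a rough illustration of the first two requirements, here is a minimal sketch of what such an inventory could look like. All asset names, owners, and fields below are hypothetical; a real discovery platform derives this metadata automatically from query logs and BI tools rather than from hand-written records.

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    # Column-level lineage: which upstream "table.column" references feed this column.
    upstream: list = field(default_factory=list)

@dataclass
class DataAsset:
    name: str                  # e.g. "finance.revenue_dashboard" (hypothetical)
    kind: str                  # "table", "dashboard", or "metric"
    owner: str
    query_count_30d: int = 0   # a simple popularity signal
    columns: list = field(default_factory=list)

# A tiny catalog: one source table feeding one dashboard.
catalog = [
    DataAsset("raw.orders", "table", "data-eng", query_count_30d=420,
              columns=[Column("amount_usd")]),
    DataAsset("finance.revenue_dashboard", "dashboard", "analytics", query_count_30d=97,
              columns=[Column("monthly_revenue",
                              upstream=["raw.orders.amount_usd"])]),
]

def upstream_of(catalog, asset_name, column_name):
    """Walk column-level lineage one hop upstream for a given column."""
    for asset in catalog:
        if asset.name == asset_name:
            for col in asset.columns:
                if col.name == column_name:
                    return col.upstream
    return []

print(upstream_of(catalog, "finance.revenue_dashboard", "monthly_revenue"))
# -> ['raw.orders.amount_usd']
```

Even this toy model shows why context matters: once every column records its upstream sources, answering "where does this number come from?" becomes a lookup instead of an archaeology project.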

In a future state, a data consumer should be able to easily learn which data asset is best to use for both their day-to-day work and ad-hoc data exploration. A data team can easily identify and deprecate unused and duplicative data assets, organize and standardize data assets, and create easy-to-maintain data inventories.

In short, to address the explosion of data assets, we need a data discovery platform that consolidates all data assets in one place, provides useful data context and makes data assets easily searchable. In the past, people called this a data catalog tool.

Evaluation criteria and decision process

We’ve examined every major data discovery tool on the market. There are a few key criteria we consider for tool selection:

Criterion 1: compatibility with the modern data stack. We need a tool that is compatible with cloud data warehouses including BigQuery, Snowflake and Redshift, business intelligence tools such as Looker and Tableau, and pipeline tools such as Fivetran.

Criterion 2: pricing. Our budget is set at less than 10% of our total spend on business intelligence tools (excluding cloud data warehouse costs).

Criterion 3: ease of use. This covers three major aspects:

  • Easy to set up — only lightweight configuration is required.
  • Intuitive to use — analysts and business users can find relevant assets they need easily.
  • Minimal ongoing maintenance overhead — we do not want to be “data librarians”!

While most people argue that data discovery is a highly crowded space, with more than 10 tools in our initial pool of candidates, the decision process was actually very simple for us. This may sound hyperbolic, but let me explain:

  • The established players, including Collibra, Informatica, Alation and Data.World, are not the best fit for our data stack. They are too costly for our budget, and their core users are IT personas rather than data analysts and business users.
  • We also did not consider open-source tools such as Amundsen and DataHub, because we wanted an easy-to-use, out-of-the-box solution. This way, our team wouldn’t have to spend time and resources configuring a new tool. Because we wanted to save time, it was also important to have customer support at our fingertips.
  • We did not consider tools with vertical integration like Sled, because we don’t use Snowflake internally. Sled is a modern data catalog tool that integrates with the Snowflake ecosystem. Their vertical approach to solving data discovery, going deep rather than wide, is different and interesting: Sled covers capabilities from automated data quality checks to data discovery and lineage, and from documentation to the metric layer.
  • We narrowed down our options between Select Star and Atlan. We ran a month-long trial, and we chose Select Star. It was the best fit for our use case because it provides the core experience of data discovery better than any of the other tools in the market.

Specifically, Select Star is the best fit for our needs for the following reasons:

  • They are reasonably priced.
  • We can easily set up integrations with BigQuery, Looker and Sigma. With Select Star, we were able to spin up an instance and configure our data sources in a day.
  • It just works. We are emphatic that we don’t want analysts to become “data ops” or “data librarians.” While many tools require heavy configuration by the data team before they become useful and maintainable, with Select Star users can easily find the assets they need and trace their lineage.

Leap over the exploration chasm: stay ahead in enterprise data management

We have selected a tool that helps facilitate data discovery, but now what? Whether you are a data executive in a powerful enterprise like J.P. Morgan or a fast-growing enterprise like Fivetran, there are five core pillars of data governance that you need to consider in order to manage a successful data organization.

Quality — ensure data completeness, accuracy, and low latency from source to destination.

Integrity — protect and maintain the validity of centralized metrics.

Taxonomy — define and maintain data taxonomy of key business metrics.

Standardization — define and maintain consistent logical data models and metrics to support cross-functional use cases.

Access:

  • Entitlement — ensure that individual access to data is controlled and consistent
  • Discovery — ensure that data assets are easily discoverable

While enterprise data governance can feel overwhelming, I will share some simple action points to help you get started:

  • Invest in clean, dependable data in the source system — the majority of data quality issues come from bad data in the source system. You want to invest in capabilities that ensure the completeness and accuracy of source data.
  • Invest in a robust data pipeline tool — the process of building a customized data pipeline is time-consuming, costly, extremely brittle, and requires constant maintenance. Your data pipeline tool must be reliable and easy to use. You don’t want your data team to become data plumbers. The tool should not require heavy configuration and maintenance to ensure data quality.
  • Leverage a data discovery platform or a data usage tool to organize assets — you don’t want your data team to become data ops or data janitors, but you need to design a process to trim your data garden. Invest in a data discovery tool that integrates seamlessly with your core data platforms or build a data usage tool internally to easily deprecate unused, outdated and duplicative assets. Most business intelligence tools have usage information.
  • Set up a steering committee for approving new assets — every organization has public folders to store assets in their business intelligence tool. Among the public folders, there are also gold assets that consist of highest priority dashboards and north star metrics. New assets in public folders and changes to the gold assets should be reviewed by a steering committee. Without this, your data garden will grow wild.
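To make the third action point concrete, here is a small sketch of the kind of internal data usage tool described above. The usage records are hypothetical stand-ins for an export from a BI tool's usage statistics; the thresholds are illustrative, not a recommendation.

```python
from datetime import date, timedelta

# Hypothetical usage export: one record per asset, with its last-viewed
# date and a 90-day view count (the shape of data most BI tools expose).
usage = [
    {"asset": "sales_pipeline_dashboard",    "last_viewed": date(2023, 2, 1),  "views_90d": 310},
    {"asset": "q3_2021_campaign_dashboard",  "last_viewed": date(2021, 11, 3), "views_90d": 0},
    {"asset": "revenue_by_region_v2",        "last_viewed": date(2022, 6, 15), "views_90d": 2},
]

def deprecation_candidates(usage, today, stale_days=180, max_views=5):
    """Flag assets that are both stale (not viewed recently) and rarely viewed."""
    cutoff = today - timedelta(days=stale_days)
    return [r["asset"] for r in usage
            if r["last_viewed"] < cutoff and r["views_90d"] <= max_views]

print(deprecation_candidates(usage, today=date(2023, 2, 7)))
# -> ['q3_2021_campaign_dashboard', 'revenue_by_region_v2']
```

A script like this only produces candidates; the review and sign-off before anything is actually deprecated is exactly what the steering committee in the last action point is for.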

I hope this article can help you better empower your organization to stay ahead in large-scale digital transformation. While there have been significant technological advances in data management so far, I believe this is only the beginning. Read the full blog on Towards Data Science.
