In today's data-driven world, the importance of effective data documentation, especially on your base data assets (i.e., tables, columns, dashboards, and charts), cannot be overstated. Data documentation provides visibility into what data you have and how it’s defined; supports knowledge management, curation, standardization, and transfer; and drives organizational alignment about the state of your data.
Data documentation also plays a vital role in enabling data discovery, allowing organizations to scale their data literacy with self-service capabilities and democratize data internally. When done well, it provides a foundation for understanding what data exists, along with its structure, meaning, and usage, which facilitates data analysis and data-driven decision-making.
Yet data documentation is often an afterthought at best, and let’s be honest - it’s no one’s favorite task. Technical writing takes time because you need to get it right, both semantically and technically. You need to re-explain the domain or business-level process with technical details, and you need to account for edge cases and the different conditions the data can be in.
At Select Star, we believe having good data documentation means the following:
- There is relevant documentation on data objects, so users can easily understand the data they are looking for and are more likely to come back the next time they’re trying to find something.
- Analysts make fewer mistakes when working with data, because they understand what it means.
- Any search across data objects is immediately enhanced as more items in your data catalog are now well documented.
As the quality and availability of artificial intelligence (AI) technologies and solutions continue to advance, there are exciting opportunities to enhance many aspects of data documentation that allow data consumers and data producers to work better together.
Select Star’s Approach to Documentation
We know documentation is important, and we know that documentation is hard to execute on. This is why it’s one of the top challenges that we decided to wrangle at Select Star.
The needs of documentation seem simple at a glance:
- For consumers - documentation needs to be available, accurate, useful, and in context
- For creators - documentation needs to be as easy to create as possible and auditable
Simple does not mean easy though. Who does the documentation need to be useful for? Which assets or types of documentation are the most important to tackle? How do you ensure that documentation stays up-to-date, and is consistent throughout all the applications that the data goes through?
At Select Star, we've chosen to focus on a few key principles:
Focus on what matters - Not all data assets are created equal, and chances are that 20% of your dashboards and tables get 80% of the consumption. Select Star quickly shows you which of your most popular dashboards and tables are missing documentation, so you can focus on the documentation that has the most impact.
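As a rough illustration of this principle, the sketch below ranks undocumented assets by popularity. It is not Select Star's implementation: the `Asset` class, the `query_count` popularity signal, and the sample asset names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    """Hypothetical catalog entry; field names are assumptions for illustration."""
    name: str
    query_count: int       # assumed popularity signal (e.g. queries in the last 90 days)
    description: str = ""  # empty string means "not documented yet"


def undocumented_by_popularity(assets, top_n=10):
    """Return the most-used assets that still lack a description."""
    missing = [a for a in assets if not a.description.strip()]
    return sorted(missing, key=lambda a: a.query_count, reverse=True)[:top_n]


assets = [
    Asset("analytics.orders", 950),
    Asset("analytics.customers", 720, "One row per customer."),
    Asset("staging.events_raw", 40),
    Asset("finance.revenue_dashboard", 1300),
]

# The two most popular assets that nobody has documented yet.
to_document = undocumented_by_popularity(assets, top_n=2)
for asset in to_document:
    print(asset.name, asset.query_count)
```

Sorting only the undocumented assets keeps the to-do list short, which is the point of the 80/20 framing: document the handful of assets that most people actually use first.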
Force multipliers - Because the needs of data documentation are so vast, focusing on solutions that make things a tiny bit easier won’t cut it. We look for solutions that can greatly improve efficiency and process. For example, Select Star’s documentation propagation ensures that you write documentation once and use it everywhere.
Select Star’s propagation job looks at upstream and downstream lineage for fields that use the same data without transformation, as well as at similar-looking tables, and shows documentation where it makes sense.
In addition to documentation, it propagates field-level tags the same way. If you tag a field or a column as sensitive or PII, Select Star will propagate that tag upstream and downstream, saving you the hassle of tagging it manually in other places. This ensures that your sensitive data is labeled consistently, and you don't have to spend hours tagging it.
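To make the idea concrete, here is a minimal sketch of propagation over column-level lineage. It assumes lineage is available as (upstream, downstream, transformed) edges and only walks edges where the data passes through unchanged. The function name, the edge format, and the column names are hypothetical, and the sketch leaves out the similar-table matching mentioned above.

```python
from collections import defaultdict, deque


def propagate(edges, descriptions, tags):
    """Copy descriptions and tags along pass-through lineage edges.

    edges: iterable of (upstream, downstream, transformed) tuples (assumed format).
    descriptions: {column: text} written by a human.
    tags: {column: set of tag names} applied by a human.
    """
    # Treat untransformed edges as undirected so values flow up and down.
    neighbors = defaultdict(set)
    for up, down, transformed in edges:
        if not transformed:
            neighbors[up].add(down)
            neighbors[down].add(up)

    out_desc = dict(descriptions)
    out_tags = {col: set(t) for col, t in tags.items()}

    for source in sorted(set(descriptions) | set(tags)):
        seen = {source}
        queue = deque([source])
        while queue:  # breadth-first walk from each documented column
            col = queue.popleft()
            for nxt in sorted(neighbors[col]):
                if nxt in seen:
                    continue
                seen.add(nxt)
                if source in descriptions:
                    # Never overwrite a description that already exists.
                    out_desc.setdefault(nxt, descriptions[source])
                if source in tags:
                    out_tags.setdefault(nxt, set()).update(tags[source])
                queue.append(nxt)
    return out_desc, out_tags


edges = [
    ("raw.users.email", "staging.users.email", False),
    ("staging.users.email", "mart.customers.email", False),
    ("staging.users.email", "mart.customers.email_domain", True),  # derived column
]
descriptions = {"staging.users.email": "Customer's primary email address."}
tags = {"staging.users.email": {"PII"}}

all_desc, all_tags = propagate(edges, descriptions, tags)
```

In this example the description and the PII tag reach `raw.users.email` and `mart.customers.email`, while the derived `email_domain` column is left alone because its lineage edge is marked as transformed.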
We’re also adding new AI-powered capabilities to do the heavy lifting of writing documentation where none exists, so our users don’t have to start from scratch.
Ownership and auditability - No documentation stands the test of time without ownership and auditability. Select Star attaches owners to tables and authors to documentation, and has an activity log (and API) that tells you who is making what change. This level of transparency builds accountability and trust, which we plan to build upon.
AI Generated Documentation with Select Star
Our approaches to automated documentation have already helped numerous customers advance their documentation efforts, with 2-5x higher column description fill rates and active contribution from their data team members. However, sometimes we come across customers whose data assets have no documentation at all - and have had none for a while.
Someone may be able to dig through related tables and SQL queries to write up the documentation, but it takes significant effort to get started.
With the rapid development of LLMs and ML models like OpenAI’s ChatGPT, we saw this as the next-level challenge that we can help our customers with. Select Star already generates its own context about data assets, such as where the data is, where it came from, its related dashboards and reports, and the SQL queries that use the tables and columns – all great ingredients to pass to an AI model for summarization.
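As a sketch of how such context might be handed to a model, the snippet below assembles a plain-text prompt from lineage and usage metadata. It is an illustration under stated assumptions, not Select Star's actual pipeline: the context keys, the prompt wording, and the sample table are hypothetical, and the call to the model itself is deliberately left out.

```python
def build_description_prompt(context: dict) -> str:
    """Turn catalog context into a prompt for a text-generation model.

    The keys used here (table, columns, upstream, dashboards, queries)
    are assumptions made for this example.
    """
    lines = [
        f"Write a concise description for the table `{context['table']}`.",
        "Use only the context below; say so if something is unclear.",
        "",
        "Columns: " + ", ".join(context.get("columns", [])),
        "Upstream sources: " + ", ".join(context.get("upstream", [])),
        "Used by dashboards: " + ", ".join(context.get("dashboards", [])),
        "Example queries:",
    ]
    lines += [f"- {q}" for q in context.get("queries", [])]
    return "\n".join(lines)


context = {
    "table": "analytics.orders",
    "columns": ["order_id", "customer_id", "order_total", "created_at"],
    "upstream": ["raw.shop_orders"],
    "dashboards": ["Weekly Revenue"],
    "queries": ["SELECT SUM(order_total) FROM analytics.orders GROUP BY 1"],
}

prompt = build_description_prompt(context)
print(prompt)
```

The generated text would then be reviewed by a person before it is saved, which keeps a human in the loop for accuracy.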
Today, I’m excited to announce that we’re releasing AI Generated Documentation as part of Select Star’s Automated Documentation. Now you can easily generate descriptions for tables, columns, dashboards, and more, helping end users and data consumers quickly understand your data. No one, including AI, is “perfect” at documentation, but combining Select Star’s context, user input, and AI makes it possible to bring data documentation up to the desired quality bar.
Conclusion
Data documentation at scale is a hard but valuable problem to tackle. By focusing on what matters, force multipliers, and ownership and auditability, Select Star helps you focus, work smarter, and build trust in your documentation. The latest features continue to invest in these principles, which are the foundation of successful data documentation in a world where we all have too many demands on our time. This is just the beginning, and we look forward to continuing to invest in solutions that deliver substantial value to data teams.
Yes, this blog post was drafted with the assistance of ChatGPT :)