Most businesses today understand the huge opportunity the data age offers them, and an ecosystem of modern technologies has emerged to help them capture it. But building a modern, comprehensive data ecosystem that delivers data value from the available offerings is confusing and challenging. Ironically, some of the technologies that have made certain segments easier and faster have made governance and data protection more difficult.
Big data and the multiverse of madness
“Using data to make decisions? That’s crazy talk!” was a common attitude in the 2000s. Information technology groups didn’t really understand the value of data – they treated it like money in the bank. They thought that if they stored it in a database and kept it pristine, it would gain value – so they resisted letting people use it (especially in its more granular form), as if locked-away data earned compound interest. A better analogy is the food in your freezer. You have to go through it. You have to take things out and eat them (er, I mean use them) or they go bad. Data is the same – it must be used, updated, and refreshed, or it loses its value.
Over the past few years, we have developed a better understanding of how data should be used and managed to maximize value. With this new understanding came disruptive technologies to help enable and speed up the process, simplifying difficult tasks and minimizing the cost and complexity required to accomplish a task with data.
But when you look at the whole ecosystem, it’s hard to make sense of it all. If you try to arrange the companies into a tech stack, the result looks more like a game of 52-card pickup: few cards land exactly on top of each other, because very few companies present the exact same offering, and few cards line up side by side to offer perfectly complementary technologies. This is one of the challenges of trying to assemble best-of-breed offerings: integration is hard, and the interstitial points are difficult to deal with.
We can look at Matt Turck’s data ecosystem diagrams from 2012 through 2020 and see a clear trend of increasing complexity – both in the number of companies and in the categorization. It’s extremely confusing, even for those of us in the industry, and while he’s done a good job of organizing it, I’d argue that pursuing a taxonomy of the analytics industry isn’t productive. Some technologies are miscategorized or misrepresented, and some companies should be listed in two or more places. It’s no surprise that companies trying to build their own modern stack are at a loss. No one really knows or understands the whole ecosystem, because it’s just too massive. Diagrams like these have value as a loosely organized catalog, but should be taken with a grain of salt.
A healthier approach, but still legacy
Andreessen Horowitz (a16z) offers a different way of looking at the ecosystem, based more on the data lifecycle, which they call a “unified data infrastructure architecture.” It starts with data sources on the left, then moves through ingestion/transformation, storage, historical processing, and predictive processing, ending with output on the right. Along the bottom are the data quality, performance, and governance functions that are ubiquitous across the stack. This model should look familiar: it closely resembles the linear pipeline architectures of legacy systems.
Much like the previous model, many modern data businesses don’t fit neatly into a single section. Most span two adjacent spaces, and others straddle “storage” – bundling, for example, ETL and visualization capabilities – which makes for an apparently discontinuous value proposition.
Starting from the left, the sources are obvious but worth detailing. These are the transactional databases, applications, application data, and other data sources that have been covered in big data infographics and presentations over the past decade. The key thing to remember is the three Vs of big data: volume, velocity, and variety. These factors have had a significant impact on the modern data ecosystem simply because traditional platforms could not handle at least one of them. Within any given enterprise, data sources are constantly changing.
Ingestion and processing
The next section is more complicated – ingestion and transformation. You can divide it into traditional ETL and newer ELT platforms, programming languages that promise ultimate flexibility, and finally event streaming for real-time or near-real-time data. The ETL/ELT field has seen innovations driven by the need to handle semi-structured and JSON data without lossy transformations. The reason there are so many solutions in this space today is not just the variety of data, but also the variety of use cases. Each solution maximizes ease of use, efficiency, or flexibility – and I would argue you can’t get all three in one tool. And since data sources are dynamic, ingestion and transformation strategies and technologies must follow suit.
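To make the ELT pattern concrete, here is a minimal sketch: land raw JSON first, then transform it with SQL inside the warehouse. It uses an in-memory SQLite database as a stand-in for a cloud warehouse (an assumption for illustration only), and assumes SQLite’s built-in JSON functions, which virtually all modern builds include.

```python
import json
import sqlite3

# In-memory SQLite stands in for a cloud warehouse (sketch assumption).
db = sqlite3.connect(":memory:")

# E + L: land the raw, semi-structured events as-is -- no upfront schema.
db.execute("CREATE TABLE raw_events (payload TEXT)")
events = [
    {"user": "a1", "action": "login", "ms": 120},
    {"user": "b2", "action": "purchase", "ms": 340, "sku": "X-9"},  # extra field is fine
]
db.executemany("INSERT INTO raw_events VALUES (?)",
               [(json.dumps(e),) for e in events])

# T: transform inside the warehouse with SQL, after loading.
db.execute("""
    CREATE TABLE events AS
    SELECT json_extract(payload, '$.user')   AS user,
           json_extract(payload, '$.action') AS action,
           json_extract(payload, '$.ms')     AS latency_ms
    FROM raw_events
""")

print(db.execute("SELECT user, action, latency_ms FROM events").fetchall())
```

The point of the ordering: because the raw payload is kept, a new field like `sku` costs nothing at load time and can be promoted into the transformed table later, without re-ingesting.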
Recently, storage has also been a center of innovation in the modern data ecosystem, driven by the need to meet capacity requirements. Traditionally, databases were designed with compute and storage tightly coupled: any upgrade required a complete system shutdown, and managing capacity was difficult and expensive. Today, innovations are coming fast from cloud-based data warehouses like Snowflake, which separated compute from storage to enable elastic scaling of capacity. Snowflake is an interesting and difficult case to categorize. It is a data warehouse, but through its Data Marketplace it can also be a data source. Additionally, as ELT gains momentum and Snowpark gains capabilities, Snowflake becomes a transformation engine. Although the industry offers many solutions – EDW, data lake, data lakehouse, and so on – the main disruptions underneath them all are cheap, effectively infinite storage and elastic, flexible compute.
BI and data science
The a16z model breaks this area into historical, predictive, and output categories. In my opinion, most software vendors in this space occupy more than one category, or even all three, which makes these groupings purely academic. Challenged to find a better way to make sense of an incredibly dynamic industry, I gave up and oversimplified. I’ve narrowed it down to database clients of just two types: BI and data science. You can think of BI as the historical category and data science as the predictive category, and argue that each has absorbed “output.” Both have created challenges for governance through their ease of use and ubiquity.
Business intelligence has also come a long way in the past 15 years. Legacy BI platforms required extensive data modeling and semantic layers to harmonize how data was visualized and to overcome the performance limitations of slower OLAP databases. Because those platforms were centrally managed by a few people, the data was easier to control. Users only had access to aggregate data that was rarely updated, and the analyses delivered at the time were far less sensitive than today’s. BI in the modern data ecosystem has brought a sea change: the average office worker can create their own analytics and reports, data is more granular (when was the last time you touched an OLAP cube?), and data approaches real time. It is now common for a data-savvy company to get reports updated every 15 minutes. Teams across the enterprise can see their performance metrics against current data, enabling rapid changes in behavior and efficiency.
While data science as a technology has been around for a long time, the idea of democratizing it has gained traction only in the last few years. I use the term in a very general sense: statistical and mathematical methods focused on complex prediction and classification that go beyond basic rule-based calculations. The new platforms make it easier to analyze data in sophisticated ways without worrying about standing up computational infrastructure or wrestling with code. “Citizen data scientists” (a term I also use in the broadest possible sense) are people who know their field and have a basic understanding of what data science algorithms can do, but lack the time, skills, or inclination to manage the coding and infrastructure. Unfortunately, this shift has also increased the risk of exposing sensitive data. Analyzing PII may be necessary to predict consumer churn, lifetime value, or detailed customer segmentation, but I argue that it need not be analyzed in its raw form or as raw text.
Data tokenization, which allows modeling while securing the underlying data, can reduce this risk. Users don’t need to know who the people are, just how to group them, so they can run cluster analysis without ever being exposed to sensitive data at the granular level. Additionally, with deterministic tokenization, the tokens are predictable yet indecipherable, which keeps database joins working even when sensitive fields are used as keys.
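A minimal sketch of the idea, using keyed hashing (HMAC-SHA256) as a stand-in for a commercial tokenization service: the key, the `tokenize` helper, and the sample tables below are all hypothetical, but they show why determinism matters – the same email always maps to the same token, so a join across tables still lines up without exposing the raw value.

```python
import hmac
import hashlib

# Hypothetical secret held by the tokenization service, never by analysts.
SECRET_KEY = b"keep-me-in-a-vault"

def tokenize(value: str) -> str:
    """Deterministic, keyed pseudonymization: same input -> same token,
    but infeasible to reverse without the key."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

# Two tables keyed on the same sensitive field (email) -- illustrative data.
orders  = [("alice@example.com", 120.0), ("bob@example.com", 75.5)]
support = [("alice@example.com", "billing"), ("bob@example.com", "login")]

# Tokenize before the data reaches the analytics tier.
orders_t  = {tokenize(email): amount for email, amount in orders}
support_t = {tokenize(email): topic for email, topic in support}

# The join still works on tokens -- no raw PII exposed to the analyst.
joined = {t: (orders_t[t], support_t[t]) for t in orders_t if t in support_t}
for token, (amount, topic) in sorted(joined.items()):
    print(token, amount, topic)
```

Note the design trade-off: determinism is exactly what preserves joins and grouping, but it also means equal inputs are linkable across datasets, which is why the key must stay out of analysts’ hands.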
Call it digital transformation, data democratization, or self-service analytics: the combination of historical, predictive, and output capabilities – or BI and data science – makes the modern data ecosystem more accessible to the domain experts in a company. It also greatly reduces reliance on IT beyond the storage tier. The dynamic nature of what data can do forces users to iterate, and iteration is painful when multiple teams, processes, and technologies stand in the way.