Is Big Data dead?

Parviz Deyhim
Feb 10, 2023

I’ve recently had the chance to read Jordan Tigani’s blog post, “Big Data is Dead”. I’ve had mad respect for Jordan since the Google and BigQuery days. Jordan’s post sparked a number of questions for me, so I decided to analyze each section of his post to gain more clarity. Eventually, after arguing through some of the points, I came to the conclusion that saying big data is dead is not completely accurate. There is a fundamental question about what “big data” is in the first place; it can have several meanings, depending on how you look at it. What should be said instead, I think, is that we are starting to be able to reduce the complexity of dealing with large amounts of data, to the point where the majority of workloads can be processed with less sophisticated systems. What does that imply, and what are the impacts? Let’s start by going through Jordan’s post.

The post starts off by talking about the “Obligatory Intro slide” where every data pitch deck has a graph that shows how much data we will create and consume over the next several years. I personally never liked that slide because it kept repeating itself in every conversation and every data product pitch. Nowadays, if any slide starts by telling me how much data will be produced in the coming years, I stop listening to the pitch.

For the last 10 years, every pitch deck for every big data product starts with a slide that looks something like this:

[Figure: data generated over time, trending up and to the right]

Next there’s a comparison between MySQL and MongoDB. From the post:

[Figure: DB-Engines scores over time, MongoDB versus MySQL]

MongoDB is the highest ranked NoSQL or otherwise scale-out database, and while it had a nice run-up over the years, it has been declining slightly recently, and hasn’t really made much headway against MySQL or Postgres, two resolutely monolithic databases. If Big Data were really taking over, you’d expect to see something different after all these years.

Frankly, I did not understand that argument for several reasons:

MySQL and MongoDB are distinct systems with different use-cases. Given its distributed nature, I agree that MongoDB was designed to scale with data. But charting the popularity of two systems that are meant to satisfy different workloads side by side (it is also not clear how popularity was measured, but let’s not get into that) is not really an apples-to-apples comparison.

And let us assume for a second that the chart is telling us that MongoDB = big data and MySQL = not-so-big data, and that because MongoDB is less popular than MySQL, big data must not be so popular either. Is it possible that MongoDB has lost ground because other competitive NoSQL offerings have taken market share from it? Perhaps its licensing changes have hampered its growth? There could be factors behind MongoDB’s decline in popularity other than companies simply not having a big data problem. Overall, I just can’t see how this example argues that big data is dead.

Is there a better way to support the hypothesis that we simply do not have a big data problem? I really liked how Jordan took different approaches to supporting it: quantitatively, qualitatively, and inductively.

To quantitatively test this hypothesis, Jordan talks about his analysis of BigQuery customer workloads:

When I worked at BigQuery, I spent a lot of time looking at customer sizing. The actual data here is very sensitive, so I can’t share any numbers directly. However, I can say that the vast majority of customers had less than a terabyte of data in total data storage. There were, of course, customers with huge amounts of data, but most organizations, even some fairly large enterprises, had moderate data sizes.

Customer data sizes followed a power-law distribution. The largest customer had double the storage of the next largest customer, the next largest customer had half of that, etc. So while there were customers with hundreds of petabytes of data, the sizes trailed off very quickly. There were many thousands of customers who paid less than $10 a month for storage, which is half a terabyte. Among customers who were using the service heavily, the median data storage size was much less than 100 GB.

Jordan argues that when he analyzed BigQuery workloads, the majority of customers had no more than gigabytes of data. My first challenge with that argument is that BigQuery has always made it super easy for anyone with almost no data to get started. This has always been BigQuery’s main customer acquisition channel. However, not all of those users are qualified customers; some, if not most, are just kicking the tires. I’m one of them. And what generally happens with product-led growth, such as BigQuery’s, is the 80/20 rule: you attract tons of users, but only 20% of them generate 80% of the revenue. In other words, the claim that BigQuery has thousands of users with very little data counts many users who were never the right target customers for the product in the first place. So it’s unfair to include them in the “how big is your data” analysis.
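
To see how a head-heavy, power-law shape like the one Jordan describes plays out, here is a toy sketch; the exponent, customer count, and sizes are made-up numbers for illustration, not BigQuery data:

```python
# Toy Zipf-like model of customer storage sizes: customer k stores
# largest/k^2. All numbers here are illustrative assumptions.
n = 10_000                      # hypothetical customer count
largest_gb = 100_000_000        # 100 PB for the biggest customer
sizes_gb = [largest_gb / k**2 for k in range(1, n + 1)]

total = sum(sizes_gb)
top_1pct = sum(sizes_gb[: n // 100])
median = sizes_gb[n // 2]       # list is already sorted largest-first

print(f"top 1% of customers hold {top_1pct / total:.1%} of all data")
print(f"median customer stores {median:.1f} GB")
```

In a distribution like that, two statements are true at once: “most customers have tiny data” and “almost all of the stored data belongs to big-data customers.” That is why the median alone doesn’t settle the question.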

Another challenge (I’m a BigQuery enthusiast and hate to say this) is that, unfortunately, BigQuery historically has had a hard time acquiring large enterprise customers. This was certainly true several years ago (when I think Jordan was doing his analysis). Recently, BigQuery has done an amazing job building features and functionalities that will attract more customers. However, using BigQuery as a proxy for “how big is your data,” especially given Jordan’s research was conducted at a time when BigQuery was not taken seriously by enterprises, is not strong evidence.

My last challenge to that argument is about the role BigQuery plays in the data journey. BigQuery has mostly been considered a data warehouse, which is meant to store clean and aggregated data. So if the argument is that the entire data journey doesn’t have a big data problem, using BigQuery as a proxy is not very representative of reality.

There were other quantitative examples such as:

In order to understand why large data sizes are rare, it is helpful to think about where the data actually comes from. Imagine you’re a medium sized business, with a thousand customers. Let’s say each one of your customers places a new order every day with a hundred line items. This is relatively frequent, but it is still probably less than a megabyte of data generated per day. In three years you would still only have a gigabyte, and it would take millennia to generate a terabyte.

Alternately, let’s say you have a million leads in your marketing database, and you’re running dozens of campaigns. Your leads table is probably still less than a gigabyte, and tracking each lead across each campaign still probably is only a few gigabytes. It is hard to see how this adds up to massive data sets under reasonable scaling assumptions.

To give a concrete example, I worked at SingleStore in 2020–2022, when it was a fast-growing Series E company with significant revenue and a unicorn valuation. If you added up the size of our finance data warehouse, our customer data, our marketing campaign tracking, and our service logs, it was probably only a few gigabytes. By any stretch of the imagination, this is not big data.
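
For what it’s worth, the arithmetic in the first example does hold up, under a plausible assumption of roughly 10 bytes per line item after compression (the post doesn’t state a record size):

```python
# Back-of-the-envelope check of the "thousand customers" example.
# bytes_per_item is my assumption; the post gives no record size.
customers = 1_000
line_items = 100                 # per customer per day
bytes_per_item = 10              # assumed compressed size per line item

mb_per_day = customers * line_items * bytes_per_item / 1e6
gb_after_3y = mb_per_day * 3 * 365 / 1e3
years_to_1tb = 1e6 / mb_per_day / 365

print(f"{mb_per_day:.1f} MB/day")              # 1.0 MB/day
print(f"{gb_after_3y:.1f} GB after 3 years")   # ~1.1 GB
print(f"{years_to_1tb:,.0f} years to 1 TB")    # ~2,740 years
```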

Unfortunately, the examples did not resonate with me. For one, the example of “customer orders and millions of leads” is very simplistic; in the past decade of working with customers of all sizes, I hardly ever ran into anything that resembled it. I do not deny that there are many early startups or small companies that fit that example. But I do not think using them as a quantitative example to prove the “big data is dead” hypothesis is compelling. It demonstrates, in my opinion, that there is a long tail of use-cases that do not have the big data problem. But that has always been the case, and it does not prove that the big data problem does not exist.

Let’s look at some of the arguments that were made from a first-principles perspective:

All large data sets are generated over time. Time is almost always an axis in a data set. New orders come in every day. New taxi rides. New logging records. New games being played. If a business is static, neither growing nor shrinking, data will increase linearly with time. What does this mean for analytic needs? Clearly data storage needs will increase linearly, unless you decide to prune the data (more on this later). But compute needs will likely not need to change very much over time; most analysis is done over the recent data. Scanning old data is pretty wasteful; it doesn’t change, so why would you spend money reading it over and over again? True, you might want to keep it around just in case you want to ask a new question of the data, but it is pretty trivial to build aggregations containing the important answers.

This bias towards storage size over compute size has a real impact in system architecture. It means that if you use scalable object stores, you might be able to use far less compute than you had anticipated. You might not even need to use distributed processing at all.
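
Jordan’s “build aggregations containing the important answers” is the familiar rollup pattern. A minimal sketch with pandas (the file and column names are invented for illustration):

```python
import pandas as pd

# Roll rarely-queried detail rows up into a small daily summary, so later
# queries never need to rescan the cold data. Names are illustrative.
orders = pd.read_parquet("orders_2019.parquet")   # hypothetical cold data
daily = (
    orders.groupby("order_date")
          .agg(orders=("order_id", "count"),
               revenue=("amount", "sum"))
          .reset_index()
)
daily.to_parquet("orders_2019_daily.parquet")     # keep this; archive the rest
```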

I agree that, as data accumulates over time, compute does not have to scale linearly with it. As a matter of fact, that is the entire selling point of the decoupled storage and compute architecture that has been talked about for a very long time: pay to store the data, and only pay for the amount of compute needed to process it. But the scale of the compute has another driver that is independent of storage size: concurrency and speed. While it is often unnecessary to scale compute linearly with data, the challenge arises when you want to support many users in parallel. Think of hundreds of analysts using a dashboard that fires off hundreds of queries at the same time. Is a distributed system needed to provide the required parallelism and low latency? Yes and no. It really depends on how much data and what scale of concurrency we’re talking about. On a smaller scale, a simple system on a single node can get you a decent amount of parallelism. When concurrency increases and a single node cannot handle the scale within an acceptable latency, things become more difficult, and at that point there is no option but to scale out to multiple nodes. So we’re back to the same theme I mentioned previously: dealing with the long tail. There will always be a long tail of use cases that require a more sophisticated solution.
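
A rough sizing sketch of that concurrency pressure (every number below is an assumption for illustration, not a benchmark):

```python
# How concurrency, not data size, can force the scale-out decision.
scan_gb_per_s = 10        # assumed effective scan rate of one node
gb_per_query = 2          # assumed data scanned per dashboard query
concurrent_queries = 200  # analysts refreshing dashboards at once

backlog_gb = concurrent_queries * gb_per_query
drain_time_s = backlog_gb / scan_gb_per_s
print(f"~{drain_time_s:.0f}s to drain the queue on a single node")
# ~40s: fine for a weekly report, painful for an interactive dashboard.
# That latency target, not raw data volume, is what pushes you to more nodes.
```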

Customers with moderate data sizes often did fairly large queries, but customers with giant data sizes almost never queried huge amounts of data. When they did, it was generally because they were generating a report, and performance wasn’t really a priority. A large social media company would run reports over the weekend to prepare for executives on Monday morning; those queries were pretty huge, but they were only a tiny fraction of the hundreds of thousands of queries they ran the rest of the week.

I totally agree that if we’re talking about analyzing data rather than transforming it, the majority of workloads act on aggregated and cleaned data. However, as Jordan alluded to, there are going to be occasional cases where processing much larger data sizes is needed. Think data transformation and aggregation. Sure, some of that can be done incrementally so the entire dataset is not reprocessed. But there are common scenarios in which aggregation or transformation is applied to several TB of data. Could that be counted as big data? Potentially.

Even when querying giant tables, you rarely end up needing to process very much data. Modern analytical databases can do column projection to read only a subset of fields, and partition pruning to read only a narrow date range. They can often go even further with segment elimination to exploit locality in the data via clustering or automatic micro partitioning. Other tricks like computing over compressed data, projection, and predicate pushdown are ways that you can do less IO at query time. And less IO turns into less computation that needs to be done, which turns into lower costs and latency.
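
Those techniques are easy to see even from a client library. A minimal sketch with pyarrow (the file, column, and filter names are hypothetical):

```python
import datetime
import pyarrow.parquet as pq

# Column projection reads only the requested columns; the filter lets the
# reader skip row groups whose min/max statistics rule the predicate out.
table = pq.read_table(
    "events.parquet",
    columns=["user_id", "amount"],                           # projection
    filters=[("event_date", ">=", datetime.date(2023, 1, 1))],  # pushdown
)
print(table.num_rows)
```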

And later in the post:

A huge percentage of the data that gets processed is less than 24 hours old. By the time data gets to be a week old, it is probably 20 times less likely to be queried than from the most recent day. After a month, data mostly just sits there. Historical data tends to be queried infrequently, perhaps when someone is running a rare report.

I think this is a very important point, and one that suggests that while we may still need to store and analyze a decent amount of data (big data? medium data? who cares what it’s called), recent advancements and optimizations in data analytics platforms mean the system of use does not need to scale to the moon for the majority of workloads; it can intelligently scan and consume just the relevant portions of the data. So this is where I start arguing that it’s not the data size or compute needs that tell us we don’t have a big data problem. We still have cases where we need to process tons of data. It is the advancement of technology that has given us the ability to process a decent amount of data more intelligently, so that it appears to be small. I’ve tried to illustrate my point with the following graph:

[Figure: the data-size threshold beyond which a distributed system becomes necessary, rising from 2009 to today]

Going back to 2009, when I think the big data buzzword got started (let’s not get caught up on exact dates): due to limitations in compute, storage, and networking, the data size threshold beyond which a distributed system was the only way to get the job done was much lower than it is today. Several advancements in the data processing landscape, such as partitioning, columnar data formats (Parquet, etc.), innovations such as Delta Lake, and the various other optimizations Jordan pointed out, have allowed us to process more data on a single node rather than a distributed system. So in my opinion, it is not the case that we are not generating large amounts of data. Nor is it that we lack use cases that require processing large amounts of data. To me, both still stand. What I think is happening is that we are starting to be able to manage the complexity of dealing with large amounts of data in a way that lets the majority of workloads be processed with a less sophisticated system.
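
To make “one node instead of a cluster” concrete: a scan-and-aggregate that in 2009 would have justified a Hadoop job now runs comfortably in a single process. A sketch using DuckDB as my example engine (the path and column names are hypothetical):

```python
import duckdb

# Columnar reads plus partition/row-group pruning keep the actual IO small,
# so one process handles what once needed a cluster.
result = duckdb.sql("""
    SELECT event_date, count(*) AS events, sum(amount) AS revenue
    FROM read_parquet('events/*.parquet')
    WHERE event_date >= DATE '2023-01-01'
    GROUP BY event_date
    ORDER BY event_date
""").df()
print(result.head())
```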

Given what I said above, I asked myself two questions: what are the implications of this trend? And does it call for a new way of doing things?

In my opinion, the biggest implication of this trend is a significant improvement in cost-performance. More specifically, when more of what users do today can be done with a less sophisticated system (e.g., a single node instead of a distributed system), the cost of processing data will drop significantly over time without impacting overall performance. However, it is hard for me to see a need for yet another new data processing platform. The ability to get more done with less is an optimization that current data processing systems can implement. It is not out of the question, and may in fact already be happening, that platforms such as Databricks and others simply decide that instead of processing data with multiple nodes, a job can be done with one.

But maybe Jordan’s post was meant to nudge us towards a completely different paradigm. If the conclusion is that the majority of workloads can be processed by a single node, what’s stopping us from making our laptops, phones, or tablets that single node? In other words, given the ability to process just what we need, can we use our personal devices to do the job? Recent advances in browser technology, such as WASM, have brought us very close to this paradigm. However, I still have to ask: do users care where the job gets executed and how? Can current platforms provide the same cost and performance as running things locally on your laptop? I think the answer is yes. While the new paradigm of running things in the user’s browser is impressive, I think the same speed and cost can be accomplished by current analytics platforms.

