Better productivity and efficiency with Databricks serverless SQL

Parviz Deyhim
8 min read · Oct 26, 2022

Summary

Databricks Serverless SQL provides instantaneous access to analyze your data at scale. It introduces productivity and better cost performance by predicting cloud infrastructure demand and providing an always-on experience. In this post, we’ll look at how Databricks serverless SQL can make you more productive and improve the cost performance of your Lakehouse architecture.

Introduction

No one likes to wait for things to happen. We simply avoid waiting at all costs. What does this have to do with Databricks Serverless architecture? More than a year ago, we launched Databricks SQL on our managed lakehouse platform. While our platform has simplified the lives of many of our users, the reality of today's cloud infrastructure is that launching and configuring resources simply takes time. And our users have repeatedly told us that they want a truly instant-compute experience. We gave our users what they wanted last year when we introduced Databricks Serverless SQL, which lets them start their analysis in a matter of seconds. It turns out that cutting down on wait time had a number of positive effects. It made users more productive, made the overall system more reliable, and, as a result, made our product more cost-effective. Let's dig deeper into why Databricks Serverless SQL is better for productivity.

Economic impact of a better user experience

We all know what it's like to have a better experience and be productive, but what's often missing from the conversation is a way to measure the economic impact of these benefits. After all, if we can't link these intangible benefits to real results, we'll always end up questioning how valuable they are. In other words, what does it mean for your business and bottom line when we say "better user experience" and "improved productivity"? Let's look at some interesting experiments by Google and Amazon that show how a poor user experience makes an economic difference.

Let me ask you, how many Google search results do you want to see on a page? 10 or 30? When they tried both options, Google noticed user traffic and revenue dropped by 20% when the number of search results went from 10 to 30. Why? The page with ten results loaded in 0.4 seconds. The page with 30 results loaded in 0.9 seconds. The longer wait made the user experience worse, which caused revenue to drop by 20%. Amazon had a similar experience of reduced traffic and revenue when page rendering took 100 milliseconds longer. To me, the conclusion is simple: improve the user experience and you’ll be rewarded. Ignore user experience and you’ll be sorry.

The evolution of cloud

Imagine you are tasked with putting together a report that shows how your customers are using your product. You're excited to work on this report, so you get to the office early in the morning. After making coffee, you sit down in front of your favorite reporting tool and start analyzing the data. You drill in, drill out, and try several different analyses. Every time you do that, the BI tool submits one or more queries to the backend infrastructure hosted on on-premises servers. Since it's early in the morning and no one else is using the backend infrastructure, your queries run very quickly. But once you share your analysis with the rest of the company, more queries are sent to the backend. At some point, the backend infrastructure doesn't have enough resources to handle all of the queries; everything slows down, and your coworkers get frustrated.

Setting up and managing your own on-premises infrastructure was common a decade ago. The hardest part was how long it took to get resources, which reduced user productivity. And when you did get the resources, you had to make sure you had enough to cover your peak demand, which meant reduced cost efficiency during non-peak hours.

What changed everything was the advent of the cloud and the ability to get resources when you need them and return them when you don’t. You don’t have to plan for capacity months in advance, and you don’t have to pay for peak capacity even if you only need it for a short time. One word, “autoscaling,” describes the process of getting resources when needed.
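As a back-of-the-envelope sketch of that trade-off (all prices and node counts here are made-up illustrative numbers, not Databricks pricing), compare paying for peak capacity around the clock with paying only for what an autoscaler actually runs on average:

```python
# Hypothetical sizing: demand peaks at 100 nodes but averages 20.
HOURLY_NODE_COST = 0.50   # assumed $/node-hour, purely illustrative
HOURS_PER_MONTH = 730

peak_nodes = 100          # static provisioning must cover the peak, 24/7
avg_nodes = 20            # an autoscaled fleet pays roughly for the average

static_cost = peak_nodes * HOURLY_NODE_COST * HOURS_PER_MONTH
autoscaled_cost = avg_nodes * HOURLY_NODE_COST * HOURS_PER_MONTH

print(f"static (peak-provisioned): ${static_cost:,.0f}/month")
print(f"autoscaled (pay-per-use):  ${autoscaled_cost:,.0f}/month")
print(f"savings: {1 - autoscaled_cost / static_cost:.0%}")
```

With these assumed numbers, the peak-provisioned fleet costs five times the autoscaled one; the gap grows with how spiky the demand is.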

Now imagine that you are the same analyst who’s looking to build a product analytics report and share it with the rest of the company. This time, however, the backend infrastructure hosted on-prem has now moved to Databricks. Moving things to Databricks has made a ton of undifferentiated tasks (eg. purchasing hardware, hardwares/software maintenance, etc) that were essential to maintain your on-prem backend, completely obsolete. The major value to you is that you and your colleagues no longer have to worry about slow reports. When your amazing report is shared with the rest of the company and results in thousands of queries in a short period of time, the Databricks SQL autoscaling feature kicks in and adds resources to keep up with the demand. As a result of all this, everyone in the company is more productive. To put it another way, by improving the user experience, more can be accomplished in the time that was previously spent waiting.

Are we there yet? Well, almost. We still have room to improve productivity. Historically, autoscaling has been very effective for demand that is more or less predictable. Consider Netflix usage patterns, where the majority of usage occurs later in the day and is at its lowest during work hours. Where autoscaling has generally been less effective is with what is commonly referred to as "spiky" or "bursty" traffic, in which the demand for resources increases rapidly and without any noticeable predictability. While autoscaling can still improve the user experience even with bursty usage, as you can see in the chart, it is not a perfect solution.

It is extremely difficult to solve the bursty traffic problem solely with autoscaling. I've seen some interesting approaches to the problem in the past, such as attempting to predict spikes with sophisticated machine learning algorithms. But none has proven more effective than the go-to approach to dealing with bursty traffic: keeping some resources always running in anticipation of a spike or burst. And if you've been following along, you're probably wondering if this isn't the same inefficiency as the always-on on-premises architecture. The answer is yes, which brings me to the topic of serverless architecture.
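To make the problem concrete, here is a toy simulation (made-up demand curve and provisioning delay, not a model of any real autoscaler) of why purely reactive scaling struggles with bursts: new nodes take a few minutes to come online, so capacity trails demand for the entire ramp-up window.

```python
# Toy model: demand bursts at t=5, and each node ordered by the
# autoscaler takes LAUNCH_DELAY minutes to come online.
LAUNCH_DELAY = 3  # assumed provisioning time, in minutes

demand = [2, 2, 2, 2, 2, 50, 50, 50, 50, 2, 2, 2]  # nodes needed per minute
capacity = 2       # nodes currently running
pending = []       # (order_time, node_count) launches still in flight
shortfall = 0      # node-minutes during which queries had to wait

for t, need in enumerate(demand):
    # nodes ordered LAUNCH_DELAY minutes ago come online now
    capacity += sum(n for s, n in pending if s + LAUNCH_DELAY == t)
    pending = [(s, n) for s, n in pending if s + LAUNCH_DELAY > t]
    in_flight = sum(n for _, n in pending)
    if need > capacity + in_flight:              # reactively order the gap
        pending.append((t, need - capacity - in_flight))
    if need > capacity:
        shortfall += need - capacity             # unmet demand this minute
    # (scale-down is omitted for simplicity)

print(f"node-minutes of unmet demand: {shortfall}")
```

Every minute of shortfall here is a user staring at a spinner, which is exactly the experience the always-on buffer (and, as we'll see, serverless) is meant to eliminate.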

Serverless architecture

Almost all definitions of serverless architecture, including mine in this paragraph, are, in my opinion, incomplete. Serverless is often defined as an architecture where you no longer have to manage the resources. My definition is similar: serverless is an architecture in which the platform provider is responsible for maintaining an invisible pool of generally always-on resources (servers) that users can access when needed. And it's easy to see how serverless architecture can be beneficial: the responsibility of managing resources shifts to the provider. You no longer have to worry about managing resources in your cloud environment. However, I believe these definitions overlook one of the most significant advantages of serverless architecture: the power of aggregation.

Good things generally happen when you start doing things as a team rather than individually. It's often the case that the inefficient behavior of an individual can become more efficient as a group. In the context of a serverless architecture, when individual workloads start running on a common platform, an interesting pattern starts to emerge: spiky, random workloads can start looking more predictable. How's that exactly? Take a look at the chart below. The left side shows the world without a serverless architecture: all workloads happen at random times and in separate environments. By aggregating all of the workloads into a single platform, the graph on the right emerges. A collection of random events forms a more predictable pattern. And I think this is by far the most important aspect of serverless architecture. Instead of thousands of customers running their random, unpredictable, spiky traffic in their own cloud environments, by managing all of the workloads in a unified platform, Databricks can bring some predictability into the picture. With better predictability, Databricks can in turn anticipate when more resources are needed ahead of customer demand, which means getting access to resources can be instantaneous.

Instantaneous access to resources brings a number of benefits. An obvious one is better user productivity, since users spend less time waiting. Another is that users can comfortably shut down resources when they are no longer needed, knowing that getting started again is instantaneous. In other words, users can stop keeping resources always on just to have an instantaneous experience. By doing that, most users can immediately see improved total cost of ownership and/or cost-performance.

Putting it all together

To put everything together, let's revisit our earlier example: you are again the same analyst who's looking to build a product analytics report and share it with the rest of the company. You've been using Databricks SQL and have been more productive than in the days of on-prem infrastructure. This time, however, you decide to use Databricks SQL serverless to improve your productivity even more. You connect your favorite reporting tool to Databricks SQL serverless and notice that in a matter of seconds everything is ready for you to get started. When you share your report with colleagues, the simultaneous execution of your report triggers Databricks' autoscaling feature, and in a matter of seconds more resources are added to support the additional demand. And when demand eventually drops, the Databricks serverless architecture takes resources away to ensure that you are no longer paying for resources you are not using. This pattern continues every time you interact with the platform. All of this happens without you managing any resources in your cloud environment. Databricks manages everything, and all you have to do is connect and analyze.
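As a sketch of that "connect and analyze" experience, here is roughly what a programmatic connection looks like using the open-source databricks-sql-connector Python package. The hostname, HTTP path, and token below are placeholders; you would copy the real values from your SQL warehouse's connection details page in the workspace.

```python
# pip install databricks-sql-connector
from databricks import sql

# Placeholder connection details -- substitute the values from your own
# serverless SQL warehouse's "Connection details" tab.
with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        # A serverless warehouse is ready in seconds; there is no cluster
        # for you to size, start, or manage. (Table name is illustrative.)
        cursor.execute(
            "SELECT product, COUNT(*) AS uses FROM events GROUP BY product"
        )
        for product, uses in cursor.fetchall():
            print(product, uses)
```

The same connection details work from BI tools like the one in the story above; the connector is just the scripted equivalent of pointing your reporting tool at the warehouse.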

In this post, I’ve touched on how Databricks SQL serverless architecture can further improve the user experience by aggregating multiple diverse and random workloads into Databricks’ unified lakehouse platform. In the next post, I’ll go a bit deeper on additional unique Databricks Serverless SQL features and differentiators related to performance, security, and cost efficiencies.

Please get in touch with me if you have any questions or feedback: Twitter, LinkedIn


Parviz Deyhim

Data lover and cloud architect @databricks (ex-google, ex-aws)