How to Simplify Data Lake Architectures

Data lakes surrounded by specialized data marts, connected by complex workflows, and supported by cross-functional teams of engineers are challenging for several reasons, as discussed in the previous post.

Ideally, a single system could provide the flexibility of a data lake for data producers while simultaneously serving data consumers with minimal data transformation. Picture a system where data producers easily store data in Google Cloud Storage while data consumers run SQL and graph-based queries, all with sub-second latency. However, this is not the reality of today’s data landscape. The fact is that organizations need more specialized systems optimized for…


The data ecosystem has come a long way since the days of using databases and expensive traditional data warehouses to make sense of large amounts of data. Organizations commonly store and analyze data using open-source tools hosted on-prem or in the cloud on platforms capable of processing more data than ever. Nowadays, data lakes and data marts are the most common platform architectures. Data lakes allow organizations to consolidate data from various operational data sources into a single storage system. …


In my previous blog, I demonstrated how to leverage BigQuery’s AEAD encryption functions to achieve data deletion, also referred to as crypto-deletion. However, I limited this demonstration to data in BigQuery. But data rarely exists only in one system. What if we have to delete data from all of the existing systems in our pipeline? How can we apply the same data crypto-deletion strategy beyond just BigQuery?
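
To make that concrete, here is a minimal sketch of the idea in BigQuery SQL (the table and column names are hypothetical): if every system in the pipeline encrypts a given user’s data with that user’s own keyset, then deleting the keyset row is enough to render the ciphertext unrecoverable everywhere it lives.

```sql
-- Hypothetical keyset table: one AEAD keyset per user, shared by every
-- system in the pipeline that encrypts that user's data. A user's row
-- might be created with:
--   SELECT 'user_123' AS user_id,
--          KEYS.NEW_KEYSET('AEAD_AES_GCM_256') AS keyset;

-- Crypto-deletion: dropping the keyset makes every payload encrypted
-- with it permanently unreadable, in BigQuery and in any other store.
DELETE FROM my_dataset.user_keysets
WHERE user_id = 'user_123';

-- Any later attempt to decrypt that user's data finds no key to join on:
SELECT AEAD.DECRYPT_STRING(k.keyset, e.encrypted_payload, e.user_id)
FROM my_dataset.encrypted_events AS e
JOIN my_dataset.user_keysets AS k USING (user_id)
WHERE e.user_id = 'user_123';  -- zero rows once the keyset is gone
```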

To set the stage, let’s make some assumptions about the use case, requirements, and outcomes. …


The concept of incremental processing can have a major impact on the design of data analytics pipelines. Processing large amounts of data in increments introduces resource efficiencies, faster processing times, and consequently lower processing costs. However, not all analytical functions have incremental properties; a common example is the count distinct function. In this post, I would like to talk about incremental count distinct processing using BigQuery’s HyperLogLog++ functions and how they provide fast, scalable, incremental processing.

I’ve always been fascinated by how algebraic concepts can greatly influence the way we process data at scale. For example, commutative and associative properties…
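
As an illustration of those properties, here is a minimal sketch in BigQuery SQL, assuming a hypothetical events table with a user_id column: sketches are built once per day, and any date range can then be answered by merging sketches instead of rescanning the raw data.

```sql
-- Step 1: incrementally materialize one HLL++ sketch per day.
-- Each sketch is a small binary summary, not the raw user IDs.
CREATE TABLE IF NOT EXISTS my_dataset.daily_user_sketches AS
SELECT
  event_date,
  HLL_COUNT.INIT(user_id) AS users_sketch
FROM my_dataset.events
GROUP BY event_date;

-- Step 2: merge the daily sketches to approximate distinct users over
-- any date range. Because merging is commutative and associative, new
-- days can be added incrementally without reprocessing old ones.
SELECT HLL_COUNT.MERGE(users_sketch) AS approx_distinct_users
FROM my_dataset.daily_user_sketches
WHERE event_date BETWEEN '2019-01-01' AND '2019-01-31';
```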



There are a number of features that set BigQuery apart from other data warehouses — large-scale streaming ingest, automatic data archival without performance penalties, and integrated machine learning functions are just a few. Most recently, we released BigQuery encryption functions which enable a broad and important set of abilities, including what we’ll discuss today: data deletion and retention using crypto-shredding.
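
As a rough sketch of what that looks like (the table and column names below are hypothetical), crypto-shredding gives each user their own AEAD keyset and encrypts that user’s rows with it; deleting the keyset later effectively deletes the data.

```sql
-- One keyset per user.
CREATE TABLE my_dataset.user_keysets AS
SELECT user_id, KEYS.NEW_KEYSET('AEAD_AES_GCM_256') AS keyset
FROM (SELECT DISTINCT user_id FROM my_dataset.raw_events);

-- Encrypt each user's payload with their own keyset, passing user_id as
-- the additional authenticated data. Deleting a user's keyset row later
-- renders everything encrypted with it unreadable.
CREATE TABLE my_dataset.encrypted_events AS
SELECT
  e.user_id,
  AEAD.ENCRYPT(k.keyset, e.payload, e.user_id) AS encrypted_payload
FROM my_dataset.raw_events AS e
JOIN my_dataset.user_keysets AS k USING (user_id);
```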

Why might you want to delete data from BigQuery? Many businesses operate within regulations around data retention — for example, GDPR’s “right to be forgotten” clause, which stipulates that a user’s data should be deleted when requested. …


There’s a lot going on at Google. Our customers solve challenging problems using Google Cloud’s services and solutions. What has become obvious to me is that the challenges our customers solve are not always unique: there are common, repeatable patterns that can be applied across different customers. For that reason, I’ve decided to blog about some of the common data & analytics patterns and solutions that we deliver on behalf of our customers. If there are challenges you’d like to see solved, I would love to hear from you. Find me on Twitter: @pdeyhim


Not long ago, I worked on a project where the client was looking to reduce their AWS costs by optimizing their AWS cloud environment. After further investigation, I concluded that applying Auto Scaling to their AWS infrastructure would substantially reduce the customer’s monthly AWS bill. The reason was simple: the customer had idle resources that were statically provisioned to handle peak-time traffic. While this is a common issue in most AWS environments and hardly deserves a blog post, the process by which I collected data and calculated the potential cost savings did deserve further attention. More specifically…


After spending the past six years helping clients build cloud data processing platforms, I’ve decided to share my thoughts and vision. I’m by no means a thought leader in this space; far from it, actually. But I’ve worked with interesting clients and been involved in challenging projects, and what I’ve learned and experienced could provide a different point of view for others walking down the same path.

Parviz Deyhim

Data lover and cloud architect. Ex-AWS, ex-Databricks, and now a Googler.
