Re-thinking data lake architectures with Google Cloud — Part II

Parviz Deyhim
Apr 29, 2021

How to simplify data lake architectures

Data lakes surrounded by specialized data marts, connected by complex workflows, and supported by specialized, cross-functional teams of engineers are challenging for several reasons, as discussed in the previous post.

Ideally, a single system could provide the flexibility of a data lake for data producers while simultaneously serving data consumers with minimal data transformation. Picture a system where data producers easily store data in Google Cloud Storage while data consumers run SQL and graph-based queries, all with sub-second latency. However, this is not the reality of today’s data landscape. The fact is that organizations need more specialized systems optimized for specific workloads. Still, there are noteworthy innovations, both in the OSS ecosystem and from the cloud providers, enabling a new breed of architectures. These innovations can reduce the need for specialized storage systems, limit the need for data movement, and simplify the team structure supporting the data platform, resulting in improved data quality and freshness.

Google Cloud BigQuery is one of these solutions. Its features allow organizations to build data infrastructure in a new way: BigQuery acts as a data lake and a data mart simultaneously, limiting data movement within the organization and reducing data transformation complexity.

BigQuery as data lake and data mart

BigQuery’s architecture is built around decoupled storage and compute. While this architecture allows for better performance, scalability, and price/performance, it also provides an additional benefit that may not be immediately obvious: BigQuery is data lake storage and a data mart simultaneously.

BigQuery as storage: One of the benefits of decoupled storage and compute is the ability to treat BigQuery as pure storage. One can think of it as HDFS or GCS, except that it is optimized for analytical access and fully managed. Additionally, BigQuery provides direct API access to the storage layer, so frameworks such as Spark, Hive, and Apache Beam can read from BigQuery storage without utilizing BigQuery compute resources. The combination of storing data in BigQuery and accessing it directly via APIs makes BigQuery an ideal choice for data lake storage.
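
As a rough illustration, here is how a process outside BigQuery’s query engine might scan a table through the Storage Read API using the Python client library; the project, dataset, and table names are hypothetical:

```python
# Read table data directly from BigQuery storage, bypassing the
# query engine entirely (no BigQuery compute slots are consumed).
# Requires the google-cloud-bigquery-storage package.
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()

# Open a read session against a hypothetical table.
session = client.create_read_session(
    parent="projects/my-project",
    read_session=types.ReadSession(
        table="projects/my-project/datasets/my_dataset/tables/events",
        data_format=types.DataFormat.AVRO,
    ),
    max_stream_count=1,
)

# Stream rows from the session's single stream.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row)
```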

BigQuery as a virtual data mart: The decoupled compute and storage model enables organizations to treat BigQuery as a data mart. Unlike traditional architectures, where a data mart is often an isolated, specialized system, a data mart in BigQuery is simply a logical construct rather than additional hardware or software. Using logical constructs such as Google Cloud projects and permission-based APIs, organizations can build and modify data marts in a matter of minutes. This simplifies the overall architecture by eliminating the need to manage additional resources.
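
For example, a producer team could expose its dataset to a consumer group purely through permissions. A minimal sketch with the Python client follows; the dataset and group names are hypothetical:

```python
# Grant a consumer group read access to a producer-owned dataset.
# The "data mart" is just this permission boundary, not a copy.
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("producer-project.sales")  # hypothetical

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="marketing-analysts@example.com",  # hypothetical group
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```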

Limiting data movement

One of the most important steps towards a more manageable and scalable data platform is to avoid data replication. Data replication creates complexity and dependencies, and it reduces data quality and consistency. Several BigQuery features allow organizations to limit data movement.

Data mart without data movement: Traditionally in data lake and data mart architectures, data is replicated between the data lake storage and the data marts, which often creates complexity and data consistency issues. Because BigQuery acts as storage and compute simultaneously, “copying” data to a data mart can instead be accomplished by sharing data with other parts of the organization, creating a virtual data mart. Sharing data in BigQuery is permission-based and does not require hard copies of the data. This limits data movement and puts data ownership and control in the hands of data producers rather than a specialized team of data engineers.
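
One way to share a curated slice of producer data without copying it is an authorized view. The sketch below, with hypothetical project and dataset names, registers a view in a consumer-facing dataset (assumed to already exist) and authorizes it against the producer’s dataset:

```python
# Consumers query the view in their own dataset; the source table
# is never copied or moved.
from google.cloud import bigquery

client = bigquery.Client()

# Define a filtered view over the producer's table.
view = bigquery.Table("consumer-project.mart.us_orders")  # hypothetical
view.view_query = """
    SELECT order_id, amount, order_date
    FROM `producer-project.lake.orders`
    WHERE country = 'US'
"""
view = client.create_table(view)

# Authorize the view to read the producer dataset on consumers' behalf.
source = client.get_dataset("producer-project.lake")
entries = list(source.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source.access_entries = entries
client.update_dataset(source, ["access_entries"])
```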

BigQuery for low-latency queries: As discussed in the previous sections, data lakes are leveraged mostly as general-purpose storage, while data gets replicated to a more performant storage system for consumption. For example, copying data from GCS (or HDFS) to Apache Pinot, Druid, or MySQL/Postgres is a common pattern in data architectures. Those architectures use a more specialized storage system as a cache to improve data consumption performance. To simplify this process, BigQuery has introduced BI Engine, an in-memory execution engine that gives users low-latency query execution without additional data movement. In other words, BI Engine serves the same purpose as a cache in front of BigQuery, except that BigQuery intelligently manages data movement between storage and cache, eliminating the need for organizations to move data between systems.
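
Notably, no code changes are needed to benefit. The sketch below issues an ordinary dashboard-style query; when a BI Engine reservation covers the project, BigQuery transparently serves eligible queries from memory (the table name is hypothetical):

```python
# The same query API is used with or without BI Engine; acceleration
# is applied transparently by the service.
from google.cloud import bigquery

client = bigquery.Client()
results = client.query(
    """
    SELECT country, SUM(revenue) AS total_revenue
    FROM `my-project.sales.daily_revenue`
    GROUP BY country
    """
)
for row in results:
    print(row["country"], row["total_revenue"])
```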

BigQuery for ML workloads: Another common driver of data movement is machine learning model training and building. Recognizing this problem, BigQuery has introduced native ML functionality that allows users to train, build, and serve models without exporting data out of the system. BigQuery supports built-in ML models as well as models built using TensorFlow. In addition, Google’s AutoML tools integrate natively with BigQuery as federated storage.
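
As a sketch, training and scoring a model can stay entirely inside BigQuery using BigQuery ML’s SQL surface; the model name, tables, and label column here are hypothetical:

```python
# Train and apply a model without exporting data out of BigQuery.
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression model directly over a BigQuery table.
client.query("""
    CREATE OR REPLACE MODEL `my-project.analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT * FROM `my-project.analytics.customer_features`
""").result()

# Score new rows in place with ML.PREDICT.
predictions = client.query("""
    SELECT customer_id, predicted_churned
    FROM ML.PREDICT(
        MODEL `my-project.analytics.churn_model`,
        (SELECT * FROM `my-project.analytics.new_customers`))
""").result()
```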

BigQuery for GIS workloads: Storing and querying geographical data used to require a more specialized system such as a PostgreSQL database. BigQuery has native support for storing and querying GIS data, which eliminates the need to move data outside of BigQuery storage.
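
For instance, a radius search that would traditionally require a specialized geospatial database can run directly in BigQuery SQL over a GEOGRAPHY column; the table and column names are hypothetical:

```python
# Find stores within 5 km of a point using BigQuery's native
# geography functions (no export to a GIS-specific system).
from google.cloud import bigquery

client = bigquery.Client()
nearby = client.query("""
    SELECT store_id, name
    FROM `my-project.retail.stores`
    WHERE ST_DWITHIN(location, ST_GEOGPOINT(-122.42, 37.77), 5000)
""").result()
for row in nearby:
    print(row["store_id"], row["name"])
```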

Reduce data workflow complexities

Simplify data integration and ingestion with streaming inserts: Integrating and ingesting data from external sources into a data lake is a complex and challenging task. The complexity arises from the heterogeneous nature of data sources. Historically, the integration between external data sources and data lakes has been implemented with batch processing frameworks, which introduces another layer of complexity: the need for batch workflow orchestration. With the proliferation of data sources, the complexity of adding and orchestrating a series of batch workflows to ingest data into the data lake increases. This process can be simplified if batch workflows are converted to streaming pipelines, where data movement and transformation run continuously, eliminating the need for additional orchestration. Additionally, streaming pipelines lend themselves well to operational data sources with a real-time nature.

Unlike traditional systems where batch loading is the only way to ingest data, BigQuery supports streaming data ingestion. Streaming ingestion into BigQuery allows users to store and consume data almost instantaneously. It simplifies the architecture by eliminating the need for workflow orchestration and for managing additional logic to optimize streaming data for consumption.
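
A minimal sketch of streaming ingestion with the Python client’s streaming-insert API; the table name and row schema are hypothetical:

```python
# Stream rows into a table; they become queryable within seconds,
# with no load job or orchestration layer in between.
from google.cloud import bigquery

client = bigquery.Client()
errors = client.insert_rows_json(
    "my-project.events.clickstream",  # hypothetical table
    [
        {"user_id": "u123", "event": "page_view", "ts": "2021-04-29T10:00:00Z"},
        {"user_id": "u456", "event": "click", "ts": "2021-04-29T10:00:01Z"},
    ],
)
if errors:
    print("Insert errors:", errors)
```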

Reduce data optimization workflows with the BigQuery storage optimizer: Data lake storage is often optimized for writes but requires ongoing changes to stay optimized for consumption. Here are several common optimizations and what BigQuery, namely the storage optimizer, does to reduce the operational overhead:

Columnarizing newly ingested data: As more data gets added, the system requires additional workflows to create a columnar representation of the new, row-oriented data. Engineers often solve this by converting Avro, CSV, and JSON into the Parquet or ORC format using Apache Spark, Hive, Flink, or other tools. BigQuery storage, on the other hand, manages this behind the scenes: as data gets added to storage, whether in batch or streaming, it is automatically transformed into a columnar representation.

Defragmentation: As more data gets added to the system, either as the result of newly ingested data or due to mutations applied to existing data (DML operations), the system tends to become fragmented, reducing data consumption performance. Keeping the system optimized requires additional workflows to defragment it (consolidate smaller files into larger files and remove deleted data). OSS tools such as Apache Iceberg and Apache Hudi can continuously keep the storage layer optimized. BigQuery’s storage optimizer, however, automates defragmentation and dynamically optimizes how data is stored on disk.

Optimizing data layout: To optimize how data is laid out, subsequent externally managed workflows re-order or partition data, generally by sorting it on one or more keys. Similarly to BigQuery’s automated defragmentation, BigQuery’s storage optimizer partitions and sorts data continuously without any user intervention, as the sketch after this list illustrates.
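
As a sketch, the layout is declared once at table creation and the storage optimizer then maintains it as data arrives, with no maintenance workflow; the table and column names are hypothetical:

```python
# Create a partitioned, clustered table; BigQuery keeps the data
# partitioned by date and sorted by the clustering key over time.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
    CREATE TABLE `my-project.events.pageviews`
    PARTITION BY DATE(event_ts)
    CLUSTER BY user_id
    AS SELECT * FROM `my-project.staging.pageviews_raw`
""").result()
```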

Simplify data transformation with materialized views: The current landscape of data architecture is full of examples where data residing inside the data lake needs to be transformed (denormalized, aggregated, deduplicated, etc.) by an externally managed workflow prior to consumption. To help reduce the complexity of data transformation workflows, BigQuery has introduced materialized views. With materialized views, users provide SQL transformation logic and register it as a BigQuery view. When users query the view, BigQuery intelligently decides whether the transformation logic needs to be executed on the fly or can be fulfilled from precomputed data stored on disk. In the end, users offload the heavy lifting to BigQuery, reducing the transformation logic they have to maintain.
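
A minimal sketch: registering an aggregation as a materialized view so BigQuery can maintain it and serve matching queries from it automatically (the project, dataset, and table names are hypothetical):

```python
# Register transformation logic once; BigQuery keeps the view
# refreshed and rewrites eligible queries to use it.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
    CREATE MATERIALIZED VIEW `my-project.sales.daily_totals` AS
    SELECT order_date, SUM(amount) AS total_amount, COUNT(*) AS orders
    FROM `my-project.sales.orders`
    GROUP BY order_date
""").result()
```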

The BigQuery features and capabilities discussed above empower organizations to lower the complexity of data lakes by making specialized storage systems and complex workflows unnecessary. The challenges that remain are the monolithic architecture of data lakes and the organizational difficulty of maintaining the platform. In the next post, we’ll discuss how BigQuery can solve those challenges.
