Welcome to this issue of the WeAreDevelopers Live Talk series. This article recaps an interesting talk by Adnan Rahic, who introduces the audience to data warehouses and how to build apps on top of them.
What are Data Warehouses?
Adnan describes them as central locations for data from multiple sources. It's up to the user to set up a process to extract, transform, and load data from source systems on a schedule or in response to a set of events. Reports and data analyses can then be run on the warehouse to gain business insights.
Usually, data warehouses are optimized for online analytical processing (OLAP). This means running a relatively low volume of complex analytical queries with tables that contain billions of rows. The calculations in these queries often include aggregate functions like AVG, COUNT DISTINCT, PERCENTILE_CONT, and so on.
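To make those aggregate functions concrete, here is a minimal Python sketch of what such a query computes, using a hypothetical in-memory fact table (the table, column names, and values are made up for illustration):

```python
import statistics

# Hypothetical fact table: one dict per row, as an OLAP scan would see it.
rows = [
    {"region": "EU", "amount": 120.0, "customer": "a"},
    {"region": "EU", "amount": 80.0,  "customer": "b"},
    {"region": "US", "amount": 200.0, "customer": "a"},
    {"region": "US", "amount": 40.0,  "customer": "c"},
]

amounts = [r["amount"] for r in rows]

avg_amount = sum(amounts) / len(amounts)                 # AVG(amount)
distinct_customers = len({r["customer"] for r in rows})  # COUNT(DISTINCT customer)
median_amount = statistics.median(amounts)               # PERCENTILE_CONT(0.5)

print(avg_amount, distinct_customers, median_amount)
```

In a real warehouse these aggregates run over billions of rows, which is exactly why a columnar engine that scans only the referenced columns pays off.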
The direct consequence is that data warehouses are mostly designed as auto-scaling, heavily distributed systems with columnar engines.
Examples of those would be Snowflake Data Cloud, Google BigQuery, and Amazon Redshift.
OLAP vs OLTP
OLTP (online transaction processing) means running a high volume of not-so-complex queries where data is mostly inserted or updated rather than read. So, in contrast to OLAP, the amount of data being scanned is really low.
Companies want insight into the (often very large amounts of) data they have collected, and internal users become more interested in on-demand reports than in, for example, weekly printouts.
The developer needs to figure out how to build responsive and performant apps on top of data warehouses to keep users happy. But what does "responsive and performant" mean in this context?
Data Warehouse Performance
The goal of every developer is to give users the feeling that they are operating directly on the data, regardless of the operation, the data volume, or the number of other users acting in parallel. An app feels instantaneous when it responds to an action performed by the user within 10 ms. One second is the limit above which an application risks interrupting the user's flow of thought.
Query latency is the amount of time it takes to execute a query and receive the result.
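Measuring that round-trip time is straightforward. Here is a minimal sketch; `fake_execute` is a made-up stand-in for a real warehouse client call, used only so the example is self-contained:

```python
import time

def run_query(query, execute):
    """Time a query round-trip; `execute` stands in for a warehouse client call."""
    start = time.perf_counter()
    result = execute(query)
    latency = time.perf_counter() - start
    return result, latency

def fake_execute(query):
    time.sleep(0.05)  # simulate ~50 ms of warehouse work
    return f"result of {query!r}"

result, latency = run_query("SELECT COUNT(*) FROM orders", fake_execute)
print(f"latency: {latency * 1000:.0f} ms")  # well under the 1-second limit
```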
So, regardless of where in the world a user is operating, they should get a response in less than a second.
What affects query execution time? First and foremost, the amount of data that needs to be scanned: with larger datasets, these scans obviously take more time. To distribute this load, modern data warehouses use multiple nodes. So, let's see how they perform.
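The divide-and-combine idea behind those multi-node scans can be sketched in a few lines of Python. This is only a conceptual illustration (threads stand in for nodes, and a sum stands in for an arbitrary aggregate):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy dataset: pretend each int is one row's value in a scanned column.
data = list(range(1_000_000))

def scan_partition(partition):
    # Each "node" scans only its own shard and returns a partial aggregate.
    return sum(partition)

def distributed_sum(rows, nodes=4):
    # Split the table into one shard per node.
    size = (len(rows) + nodes - 1) // nodes
    shards = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = pool.map(scan_partition, shards)
    # Combine partial results, as a coordinator node would.
    return sum(partials)

total = distributed_sum(data)
```

More nodes means smaller shards per node, which is how warehouses keep scan time flat as data grows.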
Query Latency in BigQuery
BigQuery is a serverless big data warehouse available as part of the Google Cloud Platform. It's highly scalable, meaning that it can process tiny datasets as well as petabytes of data in seconds. Its performance is based on the pricing model the user chooses, which determines how BigQuery allocates slots: its virtual compute units with CPUs, memory, and temporary storage used to execute queries. BigQuery automatically calculates how many slots each query requires, depending on its size and complexity.
Adnan imports a TPC-H dataset with 150 million rows and runs a few sample queries.
The query takes 1.4 seconds to run and processes 2.7 GB of data. When he runs it a second time, the latency is well under a second because the results of the first run were cached.
Query Latency in Snowflake
Unlike BigQuery, Snowflake doesn’t share its compute resources between users. It processes queries using virtual warehouses, and each virtual warehouse contains multiple dedicated compute nodes allocated from a cloud provider for the different projects. That means each individual warehouse doesn’t share compute resources with others.
When Adnan ran the query on his dataset, it took 4.7 seconds.
Latency vs Concurrency
Nowadays most applications are built to be used by hundreds, thousands, or millions of users. So, it's very rare for a single user to execute one query at a time. The app should be able to run multiple queries against a data warehouse in parallel. The issue is that these queries compete for available resources.
So, how would these parallel competing queries impact query latency?
Query concurrency is the number of actively co-executing parallel queries. The "actively" part is important here because data warehouses can queue queries over a certain limit and run them only when previous queries have completed.
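The cap-and-queue behavior can be modeled with a semaphore: queries beyond the limit simply wait for a slot. This is a conceptual sketch, not how any particular warehouse is implemented (the cap of 2 and the 0.05-second "execution" are arbitrary):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 2  # a warehouse-style cap, like BigQuery's 100 per project
slots = threading.Semaphore(MAX_CONCURRENCY)
lock = threading.Lock()
active = 0
peak = 0

def run_query(i):
    global active, peak
    with slots:  # queries beyond the cap block here: this is the "queue"
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)  # simulated query execution
        with lock:
            active -= 1
    return i

# Eight clients fire queries at once, but only two ever execute concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_query, range(8)))

print(f"peak concurrency: {peak}")
```

The queued queries all finish eventually, but their latency includes the time spent waiting, which is exactly how concurrency pressure shows up to end users.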
In BigQuery, according to its quotas and limits, concurrency is capped at 100 queries per project. That's a relatively high number, just beware of that "per project" part. It means that the quota is shared between all apps that interact with the same GCP project.
Snowflake is a bit different. A warehouse can use node sizes anywhere from 1 to 128 credits per hour. Here's an explanation of how credits are charged. A warehouse can also scale horizontally, like a cluster, if you add warehouses to a multi-cluster warehouse.
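A simplified sketch of how credit consumption scales with warehouse size, assuming per-second billing with a 60-second minimum (this ignores details like auto-suspend and serverless features):

```python
# Snowflake warehouse sizes and their credits-per-hour rates (1 to 128).
CREDITS_PER_HOUR = {
    "X-Small": 1, "Small": 2, "Medium": 4, "Large": 8,
    "X-Large": 16, "2X-Large": 32, "3X-Large": 64, "4X-Large": 128,
}

def credits_used(size, seconds, clusters=1):
    """Per-second billing with a 60-second minimum per resume (simplified)."""
    billed = max(seconds, 60)
    return CREDITS_PER_HOUR[size] * billed / 3600 * clusters

print(credits_used("Medium", 30))      # the 60-second minimum applies
print(credits_used("4X-Large", 3600))  # one hour at the largest size
```

Note that a multi-cluster warehouse multiplies the rate by the number of running clusters, so horizontal scaling buys concurrency at a proportional cost.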
What is Cube?
Cube is an API server for making sense of huge datasets. It doesn't get any simpler than that. It's the de-facto Analytics API for Building Data Apps.
It's open-source with more than 11,000 stars on GitHub. Cube also integrates with every major database on the market today.
With Cube, you can create a semantic API layer on top of your data and manage access control, caching, and data aggregation. Cube is also visualization-agnostic, so it's up to you to use any front-end visualization library to build your own dashboards.