Snowflake is a Cloud Data Warehouse that is easy to comprehend and use, and arguably one of the most successful cloud platforms of the last few years.
This week slightly challenged my sense of responsibility and communication. It started with some (expected) signs of dissatisfaction from my manager, and ended with some (unexpected) signs of leadership from me. There are areas where I usually put enough pressure on myself, like "delivering what I promised". However, I still definitely need a manager to motivate me in other areas, like "making things happen". As an outcome, I am currently enjoying much better teamwork around our Data Lake project and API Gateway evaluation. It's a shame that this would probably not have happened without an external trigger, but that's also the essence of lifelong learning.
This experience reflects the common mistake people make when focusing on results rather than on outcomes. Let's assume that, consciously, everybody wants a better world. Unconsciously, however, everybody strives to look good and productive (results), not necessarily to make the world easy and sustainable (outcome). We may work on a better world based on a current status that needs to be "under control". Changes then often happen by "force" or by "accident", and the world might become even more hectic and complicated. Consciously again, we will repair broken things and live with the resulting complexity. For this reason, it is generally worth following a Lean approach.
Anyway, I expect more such challenges to come, so in the future I'll definitely try to pay more attention to the outcome than to the result. A fairly good transition to one of the key responsibilities of the Architect: supporting Business decision making not only with the right culture (e.g. consensus) but also with the appropriate infrastructure, processes and tools, including their ability to collect, analyse and reflect the current status of Business operations. "If you can't measure it, you can't improve it" [Peter Drucker]. Such analytical information includes, for example, Key Performance Indicators (KPIs) that help better understand market demand as well as the effective capacity to operate, drive change and innovate. See also Gartner's 5 Steps to become a Digital Leader.
What is a Data Warehouse
Bill Inmon is recognized as the father of the Data Warehouse (DW) thanks to his book Building The Data Warehouse, first released in 1992, which established the widely accepted definition of what it is: a subject-oriented, non-volatile, integrated, time-variant collection of data in support of management's decisions. The book also advocates designing all enterprise assets using an Entity-Relationship (ER) model, a top-down approach also known as the Enterprise Data Warehouse.
- General idea
- Reports are implemented by a central IT.
- Suited to large and established organisations with broadly-scoped requirements, strict security policies and above-average resource capacity.
- Pro
- Reliable operations.
- Trustworthy outcomes.
- Contra
- Expensive until a minimum level of maturity is reached.
- Time-to-delivery and usability potentially unsatisfactory for the Business.
- Typical setup
- Storage on a single disk and compute on multi-CPU nodes, aka Symmetric Multi-Processing (SMP), an architecture adopted by traditional appliances like Oracle DW, IBM Cognos TM1 and Microsoft SQL Server.
- Visualization through an enterprise BI tool, either vendor-specific (e.g. IBM Cognos BI, SAP BusinessObjects, Microsoft Analysis Services) or independent/agnostic (e.g. MicroStrategy, SAS).
In 1996, Ralph Kimball made the alternative proposal of a bottom-up approach, also known as Dimensional Modeling, in his popular book The Data Warehouse Toolkit (see the star-schema sketch after the list below). Note that I've already written about that kind of modeling in this previous article. Kimball's approach finally reached critical mass with the rise of self-service Business Intelligence (BI) tools over the last five years.
- General idea
- Reports are implemented by the Business itself.
- Suited to smaller and fast-growing organizations with tightly-scoped requirements, flexible access privileges and limited resource capacity.
- Pro
- Implements analysis requirements as they come.
- Made by the Business for the Business.
- Contra
- Shadow IT putting operational systems, data privacy and business hours at risk.
- Unofficial/Excel-like figures.
- Typical setup
- Storage on a commodity database (e.g. PostgreSQL), optionally as a Database-as-a-Service (DBaaS).
- Compute and visualization through proprietary dashboards (BIaaS), e.g. (Microsoft) Power BI, (Salesforce) Tableau, (Google) Looker, or newer open-source alternatives like Metabase and Apache Superset.
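To make the contrast with the ER approach concrete, here is a minimal star-schema sketch in plain SQL, with hypothetical table and column names: one fact table surrounded by two dimensions, queried the way a self-service BI tool typically would.

```sql
-- A minimal star schema: one fact table referencing two dimension tables.
CREATE TABLE dim_date (
  date_key   INTEGER PRIMARY KEY,  -- surrogate key, e.g. 20200918
  full_date  DATE,
  month_name VARCHAR,
  year       INTEGER
);

CREATE TABLE dim_product (
  product_key  INTEGER PRIMARY KEY,
  product_name VARCHAR,
  category     VARCHAR
);

CREATE TABLE fact_sales (
  date_key    INTEGER REFERENCES dim_date (date_key),
  product_key INTEGER REFERENCES dim_product (product_key),
  quantity    INTEGER,
  revenue     NUMERIC(12,2)
);

-- Typical analytical query: aggregate the measures, slice by dimension attributes.
SELECT d.year, p.category, SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.year, p.category;
```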
DW concepts haven't changed much since then, apart from the Data Vault (DV) Modeling proposal in 2000, as well as the reduced cost and increased elasticity of Online Analytical Processing (OLAP) engines since the 2010s.
DW Systems
As a matter of fact, Data Warehousing is among the most exact sciences for Business decisions, but also among the slowest in time-to-value for IT systems. Indeed, a machine shows very perceptible latency when it comes to answering questions that require input aggregation. AFAIK it is comparable to an antibody or DNA test, for which it is currently not possible to provide a result within seconds.
The following system requirements help better fulfill DW needs.
- Performance
- High Write Throughput = Ability to load a maximal amount of data in a minimal amount of time.
- Low Read Latency = Ability to select and output data in a minimal amount of time.
- High Read/Write Concurrency = Ability to let multiple user workloads (both manual and service driven) operate at the same time.
- Scalability
- Isolated Domain Resources = Ability to serve different departments or tenants.
- Unlimited Disk Space = Ability to store historical data and automatically add storage volumes before capacity is reached.
- Parallelized Computing Operations = Ability to treat any item of a queue as "urgent".
- Observability
- USE metrics = Ability to report on resource usage.
- Logs = Ability to report on operational activities.
- Audits = Ability to report on user/service access.
- Access
- Fine-grained RBAC = Ability to reduce access to what is really needed or allowed (see the sketch after this list).
- SQL Interface = Ability to interact with the system via a human-friendly language.
- API = Ability for an application to interact with the system.
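As referenced in the Access item above, here is a minimal sketch of fine-grained RBAC in SQL (Snowflake-flavoured syntax; the role, warehouse, database and user names are hypothetical):

```sql
-- Create a read-only role for analysts and grant only what is needed.
CREATE ROLE IF NOT EXISTS analyst_ro;
GRANT USAGE ON WAREHOUSE reporting_wh TO ROLE analyst_ro;
GRANT USAGE ON DATABASE sales TO ROLE analyst_ro;
GRANT USAGE ON SCHEMA sales.public TO ROLE analyst_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA sales.public TO ROLE analyst_ro;

-- Assign the role to a user.
GRANT ROLE analyst_ro TO USER jane_doe;
```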
A Data Warehouse-as-a-Service (DWaaS), aka Cloud DW, is a kind of DW that provides an abstraction service layer on top of separate storage and compute resources. In 2005, the academic paper and implementation C-Store initiated the concept of the columnar Data Warehouse, which took OnLine Analytical Processing (OLAP) performance to the next level.
Teradata, Google BigQuery, Amazon Redshift (formerly ParAccel) and Azure SQL DW tackled this with Massively Parallel Processing (MPP), aka the "shared-nothing" architecture. With this approach, the data is partitioned across a cluster of independent SMP nodes, so that multiple workloads can run in parallel and reach maximal performance within their own partition. Still, this configuration is very difficult to set up, use and tune at minimal cost, so it often doesn't perform as it actually should.
What is Snowflake
In 2014, three Database Architects (formerly working at Oracle and elsewhere) founded Snowflake Inc. and named their product Snowflake after their passion for winter sports. The new proprietary DW solution was built completely from scratch and especially designed from the ground up for the Cloud. In addition to its amazing performance, it offered a much better User Experience (UX) than the competition, and was ultimately suitable for businesses of any size. As stated at the beginning of this post, this great example reminds us of the benefit of focusing on the outcome rather than on the results.
Minimum Viable Product (MVP) features:
- Simplicity
- Data is persisted on-demand in AWS S3, Azure or Google Cloud Storage and can be shared by an unlimited number of compute instances or "virtual DWs".
- Thanks to auto-indexing and partitioning, customers only need to care about workload separation, and no longer about execution performance tuning.
- Low-cost
- The cluster is provisioned and scaled on-demand in AWS EC2, Azure or Google Cloud VMs at the selected size and maximum units of computation.
- Thanks to auto-suspension (after a configurable idle period) and resumption, customers only pay for the storage and compute they effectively use, not for idle clusters (see the sketch after this list).
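To illustrate the pay-per-use model, here is a minimal Snowflake SQL sketch with a hypothetical warehouse name and arbitrary parameter values:

```sql
-- An X-Small virtual warehouse that suspends itself after 5 minutes of
-- inactivity and transparently resumes on the next incoming query.
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE      = 'XSMALL'
  AUTO_SUSPEND        = 300     -- seconds of idle time before suspension
  AUTO_RESUME         = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Warehouses can also be suspended or resized manually at any time.
ALTER WAREHOUSE reporting_wh SUSPEND;
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'SMALL';
```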
Snowflake's pricing and Credit documentation are also fairly comprehensive. If, for example, I choose a standard subscription for storing 25 TB of data and computing for a total of 75 started node-minutes a day in AWS US East, then my company would be charged about $5,000 every month. In comparison, the same minimal usage of Teradata on AWS would be charged 45% more!
See also: Agile Cost Management in Snowflake.
In 2015, Gartner rewarded this innovation, which would disrupt a DW market previously owned at over 80% by Oracle, IBM and Microsoft. The startup raised amazing funds and convinced customers. After growing at a very fast pace, the company announced an Initial Public Offering (IPO) in 2019 and went public (SNOW) just this week.
How it works
Under the hood, the compute architecture, initially called Elastic Parallel Processing (EPP), is a hybrid of SMP and MPP, in that the platform is able to manage multiple MPP clusters of different sizes. After the term EPP failed to establish itself outside the Snowflake ecosystem, the custom architecture is now referred to as a Multi-cluster DW.
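In practice, this multi-cluster behaviour is exposed through a couple of warehouse parameters (an Enterprise edition feature); a minimal sketch with a hypothetical name and arbitrary values:

```sql
-- A warehouse that transparently spawns between 1 and 3 clusters
-- depending on query concurrency.
CREATE WAREHOUSE IF NOT EXISTS elt_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY    = 'STANDARD';  -- or 'ECONOMY' to favour cost over latency
```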
The storage architecture is equally amazing. Despite the I/O latency introduced by the third-party object store, Snowflake takes advantage of its durability and multipart features by splitting data into chunks of up to 500 MB called micro-partitions. Within each micro-partition, table data is then organized, indexed and compressed at column level. All related metadata is held by the service layer.
Eventually, this pretty unique combination of compute and storage architectures theoretically allows an unlimited number of virtual clusters and virtual nodes to share the same virtual disk while running data-intensive operations, without any drop in performance at the physical infrastructure layer.
Snowflake supports different levels of Storage:
- Result Cache: Results of queries executed in the past 24 hours where the underlying data has not changed. These are available across multiple virtual warehouses (see the sketch after this list).
- Local Disk Cache: Whenever data is needed by a query, it is retrieved from the Remote Disk storage and then cached on Solid State Drive (SSD) or Flash / Random Access Memory (RAM).
- Remote Disk: Long term storage responsible for data resilience and durability.
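As referenced above, the effect of the Result Cache is easy to observe by running the same query twice; a minimal sketch against the sample dataset:

```sql
-- First execution: computed by the warehouse, result cached for 24 hours.
SELECT COUNT(*) FROM snowflake_sample_data.weather.weather_14_total;

-- Second execution: served from the Result Cache, typically within milliseconds,
-- as long as the underlying data has not changed.
SELECT COUNT(*) FROM snowflake_sample_data.weather.weather_14_total;

-- The cache can be disabled per session, e.g. for benchmarking.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```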
Materialized views (MV) are pre-computed SELECT queries whose results are persisted, allowing for lower latency and higher consistency. Snowflake keeps them up to date transparently in the background as the underlying data changes.
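A minimal sketch of a materialized view with hypothetical table and column names, pre-aggregating a large table for dashboard-style access:

```sql
-- Pre-aggregate raw events once; subsequent reads hit the stored result.
CREATE MATERIALIZED VIEW IF NOT EXISTS daily_event_counts AS
SELECT event_date, event_type, COUNT(*) AS event_count
FROM raw_events
GROUP BY event_date, event_type;
```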
External tables were recently released as a preview feature, so that the platform opens up to externally managed object stores. Users can better address their storage compliance requirements while running Data-Lake-style queries. The generally supported file formats (CSV, JSON, Avro, ORC, Parquet, XML) and compressions (gzip, bzip2, Zstandard) are possible. Of course, access is read-only and slower, but we'll discuss the use case in a bit.
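A minimal sketch (hypothetical bucket, stage and table names; credentials omitted) of exposing Parquet files sitting in S3 as an external table:

```sql
-- Point a stage at an externally managed S3 bucket.
CREATE STAGE IF NOT EXISTS lake_stage
  URL = 's3://my-data-lake/events/'
  FILE_FORMAT = (TYPE = 'PARQUET');

-- Expose the files as a read-only external table.
CREATE EXTERNAL TABLE IF NOT EXISTS ext_events
  LOCATION = @lake_stage
  FILE_FORMAT = (TYPE = 'PARQUET')
  AUTO_REFRESH = FALSE;

-- Rows arrive as a single VARIANT column named VALUE and can be queried directly.
SELECT value:event_type::STRING AS event_type, COUNT(*) AS cnt
FROM ext_events
GROUP BY 1;
```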
Hands on
In case you don't have one yet, you need to create an account. There is a 30-day trial plan available for free, including sample data and covering all infrastructure costs.
Then you can access Snowflake via the Web interface, the snowsql CLI, the Snowpipe REST API, or the driver SDKs, which are already supported by a number of third-party applications.
Web-UI
The Web interface is kept minimal, with no complicated menus, configurations or interactions. You just navigate from Databases, where you find your storage collections and schemata, to Warehouses, where you find your compute instances and their status, to Worksheets, where you can query and preview your data, to History, which keeps a trace of your activity log.
Here is an overview of the information schema:
And here is the result of one of the official query examples:
The History page provides graphical query profiling as part of the query details. I find it a pity that the execution plan returned by the EXPLAIN statement (pre-execution) is only available in text mode and is therefore not as comprehensible as the graphical profile (post-execution).
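For reference, a minimal sketch of how the pre-execution plan is requested (reusing the sample table from the CLI session below); TEXT, TABULAR and JSON output variants exist:

```sql
-- Text-only logical plan, estimated before execution.
EXPLAIN USING TEXT
SELECT COUNT(*) FROM snowflake_sample_data.weather.weather_14_total;

-- JSON output is handy for programmatic inspection.
EXPLAIN USING JSON
SELECT COUNT(*) FROM snowflake_sample_data.weather.weather_14_total;
```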
The Enterprise edition extends the Time Travel backup feature with longer retention, and adds some storage/capacity planning so that customers can operate and grow based on a transparent billing report and predictable costs.
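A minimal sketch of Time Travel usage with a hypothetical table name (the retention window depends on the edition and configuration):

```sql
-- Query the table as it looked one hour ago.
SELECT * FROM orders AT(OFFSET => -60*60);

-- Recover a table that was dropped by mistake.
DROP TABLE orders;
UNDROP TABLE orders;
```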
Shell client
Let us see how to connect to the DB snowflake_sample_data and schema weather using the CLI:
```
$ snowsql -a <account_name>.<region_name> -d snowflake_sample_data -s weather
Failed to initialize log. No logging is enabled: [Errno 13] Permission denied: '/home/snowsql_rt.log_bootstrap'
Failed to initialize log. No logging is enabled: [Errno 13] Permission denied: '/home/snowsql_rt.log'
User: tncad
Password:
* SnowSQL * v1.2.9
Type SQL statements or !help
tncad#COMPUTE_WH@SNOWFLAKE_SAMPLE_DATA.WEATHER>select count(*) from WEATHER_14_TOTAL;
+-----------+
| COUNT(*)  |
|-----------|
| 895095522 |
+-----------+
1 Row(s) produced. Time Elapsed: 2.254s
```
Ingest
Four different approaches allow you to import large volumes of data into Snowflake, as depicted below:
- The COPY INTO statement is the recommended option for batches of bulk data. It is synchronous, which means that the client process waits for the ingestion to complete before moving forward (slow load, fast query). See the sketch after this list.
- Using EXTERNAL TABLEs is a suitable option for separating the two concerns of staging and access. But it requires refreshing the tables after changes, and introduces some new constraints at the next stage (fast load, slow query). External tables are typically used as transient staging, or in conjunction with materialized views to allow for low-latency access.
- SNOWPIPE is an interface for auto-ingesting continuous data into a transient table (fast load, slow query). It is typically used in combination with a Snowflake change stream object to make the data transit to a base table (supporting fast query and Time Travel). You have to generate a token (or request it from your admin) to be able to use it. You also need to make assumptions about when aggregations that depend on load completion may run. Alternatively, streaming events can also be pushed to transient tables via the Kafka connector.
- Upserting via the MERGE statement is probably the most consistent option, since changes are staged in a temporary table and then merged into the base/internal table. But its execution is even more difficult to manage in a process sequence.
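As referenced in the first item above, here is a minimal sketch of the batch path with hypothetical stage and table names, combining a bulk COPY INTO with a MERGE-based upsert:

```sql
-- Bulk-load staged CSV files into a transient staging table.
COPY INTO staging_orders
FROM @my_stage/orders/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
ON_ERROR = 'ABORT_STATEMENT';

-- Merge the staged rows into the base table (update existing, insert new).
MERGE INTO orders AS t
USING staging_orders AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.status = s.status, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, updated_at)
  VALUES (s.order_id, s.status, s.updated_at);
```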
See also this Article about Snowflake ACID-Compliance.
Change-Log
Change tracking is enabled via the STREAM object, which basically consists of an activation flag and a history table. With this, external tools can implement pull-based Change Data Capture (CDC) to process new data in near real-time.
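A minimal sketch of such a pull-based CDC loop, with hypothetical table names:

```sql
-- Track inserts, updates and deletes on the base table.
CREATE STREAM IF NOT EXISTS orders_stream ON TABLE orders;

-- A consumer periodically pulls the delta; reading the stream inside a DML
-- statement advances its offset, so each change is only processed once.
INSERT INTO orders_changelog
SELECT order_id, status, METADATA$ACTION, CURRENT_TIMESTAMP()
FROM orders_stream;
```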
Take away
We just took a tour of Snowflake, the next-gen DW built especially for the Cloud. Note that the purpose of this post was to get familiar with the overall concepts, not to run a performance benchmark, since usage was limited by the trial account based on X-Small instances.
I am very much looking forward to further developments already announced, especially:
- Snowsight: BI capability
- Data Marketplace: Dataset sharing
- Dynamic Masking: Column-level security
From what I read, the boundaries between the DW and the DL are currently blurring, and there is a good chance of seeing Snowflake enter the Data Science market in the future.