045: MongoDB to Analytics - Part 2 - Star schema

A star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts.


The weekly mood

My role as a cloud architect is slowly transitioning from supporter to enabler. As a supporter, you might make some background noise, but you are not really part of the game; worse, you are not even part of the staff. That is a perfect position to observe and learn. Now, as an enabler, you may indirectly influence or contribute to activities. Enablers are, for example, advisers and trainers. They are neither playing on the field nor making decisions on the sidelines, but they are a useful part of the staff, and feel rewarded by its achievements.

In a previous post, we introduced the need to infer and flatten the source schema of a MongoDB collection for the purpose of data analytics. In this article, we take the intermediary tables, aka the staging area, for granted. We will focus on defining the target schema, i.e. the "magic" practice of Data Modeling, and we will adopt a Star schema. The goal is to provide business users with fast and easy access to meaningful data via our corporate Data Warehouse (DW) based on Snowflake. Let's go!


Dimensional Modeling

According to Ralph Kimball's bottom-up approach to building a DW, a starting point is to denormalize staging data into a Data Mart.

There are 2 main benefits for doing this:
  1. A Data Mart allows for more efficient OLAP operations, especially low-cost aggregations and low-latency requests. Indeed, compute cost is far more expensive than storage cost in a DW. Reports tend to run the same requests over and over again (think of a dedicated screen with auto-refresh set to 5s), and users do not expect to wait to see results.
  2. A Data Mart is achieved through Dimensional Modeling, using for example a Star schema or a Snowflake schema. In such a schema, dimension tables hold the Business Entities and their attributes, while fact tables mainly hold the relationships and, optionally, some measures in case their granularity fits well. Otherwise, it might be convenient to add another layer dedicated to the aggregation of facts and the calculation of Key Performance Indicators (KPIs), accessible from a summary statistics table or a multi-dimensional database aka cube.
There are a number of academic model examples such as Microsoft Adventure Works, and entire books of proven models for almost any possible purpose, such as The Data Model Resource Book by Len Silverston (1997).

Unfortunately, the design of a Data Mart also comes with 3 main challenges:
  1. Defining the right model requires special skills and might require narrowing the analytical scope to a limited range of business use cases.
  2. Implementing the model requires significant integration, mapping and transformation effort, which also introduces potential risks of technical mistakes and business mistrust.
  3. Populating the model involves additional compute cost each time it needs to be refreshed.
Note that these steps should ideally not be executed in a Waterfall sequence, but in agile, continuously tested iterations instead.


Model definition

A general method consists in identifying Business Entities from the operational domain to analyse, as well as relevant metrics and filter attributes from example user queries. The following deliverables are typically expected (illustrated in the short sketch after this list):
  • Conceptual schema: a technology-independent representation which has the great benefit of helping to identify Business Entities, but also the great weakness of not being mandatory, and therefore of often being skipped.
  • Logical schema: a representation made of entities and relationships (E/R), expressed e.g. in a data modeling language.
  • Physical schema: a persisted state, produced by executing a script written in a Data Definition Language (DDL).
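To make the distinction tangible, here is a minimal sketch using the same SQLAlchemy tooling as the scripts further below: the Table object plays the role of the logical schema, while CreateTable renders the physical DDL. The dim_user structure anticipates the model we derive below; the file name is made up.
## schema_levels_sketch.py (illustrative only)

from sqlalchemy import MetaData, Table, Column, String
from sqlalchemy.schema import CreateTable

# logical schema: entities and attributes, independent of any database instance
meta = MetaData()
dim_user = Table('dim_user', meta,
    Column('_id', String, primary_key=True),
    Column('username', String))

# physical schema: the DDL statement actually executed against the target database
print(CreateTable(dim_user))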
The Rocket.chat REST API resource statistics already provides basic aggregations; for example, the metric "totalMessages" is the result of an object count on the collection "rocketchat_messages". However, this information can be obtained from the MongoDB function db.stats() as well, and we are actually interested in a more "faceted" view of the data, one that allows us to answer more complex questions from the Business Analyst, like for example "how many messages are produced per user, per period of time?"

In order to achieve that, we need to derive at least 1 fact table and 2 dimension tables from our message collection, as follows:
    • dim_user
      • id (pk)
      • username
    • dim_time
      • id (pk)
      • dt
      • hour
      • day
      • month
      • year
      • weekday
      • weeknum
    • message
      • id (pk)
      • user_id (fk)
      • time_id (fk)
      • units
You will find that this schema is very close to the factorized one suggested further above, but actually it is not. Indeed, we intentionally started with the "message" collection because its main Business Entity is of central interest for a messenger. Furthermore, the above question from the Business Analyst suggested that we should attach measures to "message" transactions, turning the object into a so-called fact or event table. At the same time, the question revealed "user" and "time" as examples of grouping and filtering criteria to enable, a practice also known as "slice and dice" of multi-dimensional models aka data cubes.
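As an illustration (not part of the original pipeline), here is a minimal sketch of the kind of query this star schema is meant to serve. It assumes the three tables have been populated in the SQLite staging database produced by the scripts further below, and that the foreign key columns still carry their dotted names, hence the quoting.
## query_sketch.py (illustrative only)

from sqlalchemy import create_engine

engine = create_engine('sqlite:///db.sqlite.denorm', echo=False)

# "how many messages are produced per user, per period of time?"
# count message facts, sliced by user and by day
sql = """
SELECT u.username,
       t.year, t.month, t.day,
       COUNT(*) AS message_count
FROM message m
JOIN dim_user u ON u._id = m."dim_user._id"
JOIN dim_time t ON t._id = m."dim_time._id"
GROUP BY u.username, t.year, t.month, t.day
ORDER BY message_count DESC
"""
for row in engine.execute(sql):
    print(row)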

A less intuitive decision is the type and level of expected granularity. Users may have tons of attributes which were not considered at design time, but could become of interest at a later point in time. Since storing and processing all possible attributes may introduce higher data privacy concerns, implementation efforts and operational costs, it is tempting to limit the size of individual dimensions. Also note that a Snowflake schema, that is to say a schema in which dimensions are stored in their normalized form, might better support information drill-down and schema evolution.
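For illustration only, here is a sketch of what a snowflaked (normalized) user dimension could look like; the department attribute and the dim_department table are purely hypothetical and do not exist in the Rocket.chat data.
## snowflake_dim_sketch.py (hypothetical illustration)

from sqlalchemy import create_engine

engine = create_engine('sqlite:///db.sqlite.snowflake_demo', echo=False)

# in a Snowflake schema the dimension stays normalized:
# dim_user references dim_department instead of embedding its attributes
engine.execute("""
    CREATE TABLE dim_department (
        _id VARCHAR NOT NULL,
        name VARCHAR,
        CONSTRAINT dim_department_pk PRIMARY KEY (_id)
    )""")
engine.execute("""
    CREATE TABLE dim_user (
        _id VARCHAR NOT NULL,
        username VARCHAR,
        department_id VARCHAR,  -- hypothetical attribute, kept in its own table
        CONSTRAINT dim_user_pk PRIMARY KEY (_id),
        FOREIGN KEY (department_id) REFERENCES dim_department (_id)
    )""")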

Although DW is quite an old topic, there has been only a little promising research around design automation so far, e.g. Automating Data Warehouse Conceptual Design. There are also good reasons for that:
  1. Since the concept of a DW was born at Wal-Mart, it has been more or less replicated as a recipe across all retailers, and finally adopted by any industry with a large amount of order transactions, so that the vast majority of productive instances are used for Sales analytics.
  2. Designing a DW is often (wrongly) considered a one-shot initiative, or seen as an activity coming right from the "Dark side":
    • Experimental, so that it doesn't easily justify any long-term investment.
    • Sensitive, so that it may be affected by legislation on data privacy and intellectual property.
    • Confidential, so that it may impact market competition and innovation.
  3. Producing a reliable recommendation out of both an operational schema and some reasonable user input (e.g. from a Business Analyst) is difficult for a machine. Indeed, taking decisions based on static dictionaries or Machine Learning models requires large and representative historical knowledge. While some academic websites exist (e.g. databaseanswers.org, learningdatamodeling.com), there is no widely recognized catalog other than The Data Model Resource Book (DMRB) originally written by Len Silverston.
  4. Making a data schema evolutionary (e.g. Data Vault Modeling) adds some more complexity.
  5. Even Gartner delivers only a wishy-washy picture of actual DW advancements in 2020!

Model implementation

Coming back to our simplistic exercise, note that our source schema (flattened data in the staging area) is normalized, whereas a Star schema is denormalized. Also, having derived 1 simple fact and 2 basic dimensions out of 1 collection doesn't mean that the other collections will yield further facts at all.
If Rocket.chat does that one thing really well, then there is nothing else to do than adding more dimensions like message groups and channels, replies, likes and hashtags... until you are done with a Star schema, i.e. 1 core and many branches.
In case new features are added or a new business model is created, e.g. awarding users with credits, allowing them to place orders etc., then you will need to change your schema, typically by adding new fact tables that share the existing dimensions, as sketched right below.
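Purely as a hypothetical sketch (orders are not a Rocket.chat feature), a second fact table could reuse the existing dim_user and dim_time, which is what keeps the schema extensible without touching the current dimensions; the table and column names below are made up, and the SQLite staging database from the scripts below is assumed to exist already.
## order_fact_sketch.py (hypothetical illustration)

from sqlalchemy import create_engine

engine = create_engine('sqlite:///db.sqlite.denorm', echo=False)

# a second fact table sharing the existing dimensions
engine.execute("""
    CREATE TABLE fact_order (
        "dim_user._id" VARCHAR,   -- same user dimension as the message fact
        "dim_time._id" VARCHAR,   -- same time dimension as the message fact
        credits FLOAT,            -- hypothetical measure
        FOREIGN KEY ("dim_user._id") REFERENCES dim_user (_id),
        FOREIGN KEY ("dim_time._id") REFERENCES dim_time (_id)
    )""")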

The below script assumes that the input collection holds a fact, and that the user manually flags and renames the columns which should translate into dimensions.
## denormalization.py

# read input file
import json
fact_name = 'message'
with open(r'rocketchat_' + fact_name + '_sample.json') as f:
    data = json.load(f)

# custom: flag metrics and dimension columns
data['units'] = 0.0
data['dim_user'] = data.pop('u')
data['dim_time'] = {'_id':0,'dt':data.pop('ts')}

# custom: enrich dimensions
import pandas as pd, dateutil.parser
dt = dateutil.parser.parse(data['dim_time']['dt'])
data['dim_time']['hour'] = dt.hour
data['dim_time']['day'] = dt.day
data['dim_time']['month'] = dt.month
data['dim_time']['year'] = dt.year
data['dim_time']['weekday'] = dt.weekday()
data['dim_time']['weeknum'] = dt.isocalendar()[1]

# normalize data (flatten nested JSON into dot-separated column names)
df = pd.json_normalize(data)

# create file db
from sqlalchemy import create_engine
db_uri = 'sqlite:///db.sqlite.denorm'
engine = create_engine(db_uri, echo=False)

# create temporary dimension tables
for col in df.columns:
    if col.startswith('dim_'):
        tbl = col.split('.')[0]
        pd.json_normalize(data[tbl]).to_sql(tbl,
            con=engine,
            if_exists='replace',
            index=False)

# create temporary fact table (keep measures and dim_*._id foreign keys, drop other dimension attributes)
df.drop(list(df.filter(regex=r'dim_[^.]*\.[^_]')), axis=1, inplace=True)
df.to_sql(fact_name,
        con=engine,
        if_exists='replace',
        index=False)
Command:
$ pip install -r requirements.txt
$ python denormalization.py
A relevant output will be created as part of the next and last script.

Now, our goal is to generate a Data Definition Language (DDL) manifest including informative PK/FK constraints (DWs usually auto-index and do not enforce such constraints). Because of the limitations of the above pandas to_sql function for adding constraints, and of SQLite for altering database objects, we'll use the sqlalchemy-migrate module to modify the schema, and finally re-create the tables based on that.
## reflection.py

fact_name = 'message'

# open file db
from sqlalchemy import create_engine
db_uri = 'sqlite:///db.sqlite.denorm'
engine = create_engine(db_uri, echo=False)

# reflect sample data and create DDL
from sqlalchemy import MetaData, Table, Column
from sqlalchemy.schema import CreateTable
from migrate.changeset.constraint import PrimaryKeyConstraint, ForeignKeyConstraint
meta = MetaData()
meta.reflect(bind=engine)
# sort tables by dependency order
for t in meta.sorted_tables:
    print('-- ', t.name, ' sample data: ', engine.execute(t.select()).fetchall(), '\n') # stdout sample data
    # fact
    if t.name == fact_name:
        for c in t.columns:
            # fk
            if c.name.startswith('dim_'):
                t.append_constraint( ForeignKeyConstraint([c.name], [c.name]) )
        # remove all non-relevant columns from the DDL (keep only measures and dim_* foreign keys)
        ddl_lines = str(CreateTable(t)).split('\n')
        for i in reversed(range(2, len(ddl_lines) - 3)):
            if not [ele for ele in ['units ','"dim_'] if(ele in ddl_lines[i])]:
                del ddl_lines[i]
        tbl_ddl = "\n".join(ddl_lines)
    # dimension
    else:
        # pk
        t.append_constraint( PrimaryKeyConstraint('_id', name=t.name + '_pk') )
        tbl_ddl = CreateTable(t)
    # todo: standardize data types to VARCHAR (except fact units)
    # todo: replace dots with underscores in column names
    print(tbl_ddl) # stdout ddl
    # recreate table
    t.drop(engine)
    engine.execute(tbl_ddl)
Command:
$ pip install -r requirements.txt
$ python reflection.py
Output:
--  dim_time  sample data:  [(0, '2020-08-30T15:21:37.974Z', 15, 30, 8, 2020, 6, 35)] 
CREATE TABLE dim_time (
	_id VARCHAR NOT NULL, 
	dt VARCHAR, 
	hour VARCHAR, 
	day VARCHAR, 
	month VARCHAR, 
	year VARCHAR, 
	weekday VARCHAR, 
	weeknum VARCHAR, 
	CONSTRAINT dim_time_pk PRIMARY KEY (_id)
);

--  dim_user  sample data:  [('3jh6AN3eLZKRCD6E9', 'tncad')] 
CREATE TABLE dim_user (
	_id VARCHAR NOT NULL, 
	username VARCHAR, 
	CONSTRAINT dim_user_pk PRIMARY KEY (_id)
);

--  message  sample data:  [('MshE4AKDejGaiwJkv', 'uj', 'GENERAL', 'tncad', False, '2020-08-30T15:21:37.974Z', 0.0, '3jh6AN3eLZKRCD6E9', 0)] 
CREATE OR REPLACE TABLE message (
	UNITS FLOAT, 
	DIM_USER__ID VARCHAR, 
	DIM_TIME__ID VARCHAR, 
	FOREIGN KEY(DIM_USER__ID) REFERENCES dim_user (_id), 
	FOREIGN KEY(DIM_TIME__ID) REFERENCES dim_time (_id)
);


Model Visualization

As a bonus, we can even auto-generate an Entity-Relationship (ER) diagram directly from the database, using the Python module ERAlchemy.
$ sudo apt install graphviz libgraphviz-dev pkg-config
$ pip install pygraphviz eralchemy
$ eralchemy -i sqlite:///db.sqlite.denorm -o erd_from_sqlite.pdf


Extension

Note that in the above schema, we have 2 tables with a high frequency of change (message, dim_time) and 1 table with a low frequency of change (dim_user). For the latter, it can be interesting to trace changes in order to keep historical data even when it is deleted from the operational database, and to better understand what actually happens with the entity. The Data Warehouse technique that addresses this problem is called Slowly Changing Dimension (SCD). It typically consists in keeping multiple versions of the same record within the table, a common approach categorized as SCD Type 2. In order to differentiate those versions, additional columns need to be added to the table and managed by the integration layer, such as a separate identifier aka "surrogate key" or version number, as well as the start and end date of record validity.
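As a minimal sketch of what an SCD Type 2 variant of dim_user could look like: the table and column names are illustrative and not produced by the scripts above, and the renamed username is made up (the _id value is taken from the sample output).
## scd2_sketch.py (illustrative only)

from sqlalchemy import create_engine

engine = create_engine('sqlite:///db.sqlite.denorm', echo=False)

# SCD Type 2: keep every historical version of a user record,
# identified by a surrogate key and bounded by validity dates
engine.execute("""
    CREATE TABLE dim_user_scd2 (
        user_sk INTEGER PRIMARY KEY,   -- surrogate key (one per version)
        _id VARCHAR NOT NULL,          -- natural/business key from MongoDB
        username VARCHAR,
        valid_from VARCHAR NOT NULL,   -- start of record validity
        valid_to VARCHAR,              -- NULL means "current version"
        is_current INTEGER DEFAULT 1   -- convenience flag for the latest version
    )""")

# when the username changes, close the current version and open a new one
engine.execute("UPDATE dim_user_scd2 SET valid_to = ?, is_current = 0 "
               "WHERE _id = ? AND is_current = 1",
               ('2020-09-01', '3jh6AN3eLZKRCD6E9'))
engine.execute("INSERT INTO dim_user_scd2 (_id, username, valid_from, is_current) "
               "VALUES (?, ?, ?, 1)",
               ('3jh6AN3eLZKRCD6E9', 'tncad_renamed', '2020-09-01'))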

