In the last article we imported Microsoft’s Wide World Importers (WWI) sample database into Snowflake and established it as a data source for dbt. In this article we’ll build our first Kimball dimensional model focusing on Sales data.
This will be broken into five sections:
Because we don’t have a group of business users who can talk us through the data stored in their system and explain what is important and relevant, we must rely on the data alone. To do this we’ll create some entity relationship diagrams (ERDs) of the WWI database to show the relationships between objects and understand their attributes. We'll still be guessing at how the business might operate in reality, but in concert with Microsoft's definition[1], it should be a reasonable guess. In the real world, it would be better to learn about the important concepts, processes and business objects from business stakeholders (or at least read some of their documentation). This is preferred for many reasons, one being that a business might have had to adapt its data to a system during implementation, so the data may be a skewed or incomplete representation of reality.
We start by modifying a SQL script found in this Dataedo article[2] to identify the relationships within the WWI database by their foreign keys. We did this in SQL Server as the primary and foreign key relationships were not migrated to Snowflake:
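A minimal sketch of such a foreign-key listing query against SQL Server's system catalog views; the column aliases and ordering here are illustrative, not necessarily what the adapted Dataedo script used:

```sql
-- List foreign-key relationships, with each table prefixed by its schema
-- (e.g. 'Application.Cities'). Run against the WWI database in SQL Server.
SELECT
    SCHEMA_NAME(fk_tab.schema_id) + '.' + fk_tab.name AS referencing_table,
    SCHEMA_NAME(pk_tab.schema_id) + '.' + pk_tab.name AS referenced_table,
    fk.name                                           AS fk_constraint_name
FROM sys.foreign_keys fk
    INNER JOIN sys.tables fk_tab
        ON fk_tab.object_id = fk.parent_object_id
    INNER JOIN sys.tables pk_tab
        ON pk_tab.object_id = fk.referenced_object_id
ORDER BY referencing_table, referenced_table;
```

Each row of the result is one edge in the ERDs we draw below.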
You’ll note above that the tables are prefixed with the schema they belong to. For example, ‘Application.Cities’ indicates the Cities table is in the Application schema. There are four in total for the WWI database: Application, Warehouse, Sales, and Purchasing.
Next, we move to Snowflake and extract schema, table and column names from the ‘information_schema’ schema to identify all the attributes of each table in the database.
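A sketch of that extraction query; the database name is an assumption, and in Snowflake the `information_schema` views are per-database:

```sql
-- Pull every schema, table and column in the WWI database from
-- Snowflake's information_schema, excluding the metadata schema itself.
SELECT
    table_schema,
    table_name,
    column_name,
    data_type
FROM wide_world_importers.information_schema.columns
WHERE table_schema <> 'INFORMATION_SCHEMA'
ORDER BY table_schema, table_name, ordinal_position;
```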
We now have all the information we need to reverse-engineer the ERDs.
First, we looked at the relationships within each of the four schemas:
Although we could have added in all the relationships, including those between tables in different schemas, this would have made the ERD too busy. Those cross-schema relationships were instead added to the more detailed ERDs for each of the schemas below:
Now we have a good understanding of how all the tables (objects) in this database are related. The ERDs will serve as a reference for the remainder of our work with the WWI database, and they will expedite further analysis by giving a bird's-eye view of the data.
Using Kimball’s four-step dimensional modelling process:

1. Select the business process
2. Declare the grain
3. Identify the dimensions
4. Identify the facts

we’ll develop our dimensional model at a high level.
We’ll start with the sales process (following along with ‘The Data Warehouse Toolkit’).
As I’ve worked in process improvement before, I know a couple of tips for defining a process: include a verb for the action being performed and identify the trigger. Thinking about sales processes for WWI as a wholesaler, candidates for the process might be ‘customer placing an order’ or ‘WWI invoicing a customer’. A quick look at the Orders table in the Sales schema shows that backorders exist, and the WWI database definition[1] confirms that a customer placing an order does not necessarily result in a sale if the whole order cannot be filled (due to out-of-stock items, etc.). For this reason, we’ll use ‘customer invoicing’ as our target process, because at the point a customer is invoiced, we expect to be paid for providing our product/s - i.e., we’ve made a sale. I have no finance or accounting background, but this makes sense to me. :-)
For the grain, we want to capture the finest level of detail available in relation to our ‘customer invoicing’ process. This detail will make up our fact table, and choosing as fine a grain as possible gives us the most flexibility to roll it up or cross-reference it later, which in turn gives the business user more ways to cut up the data. The ‘invoice lines’ table captures this detail, with each record relating to an individual product and identifying its quantity, unit price and line profit at the time of sale.
In summary, the grain is:
‘One row per line item on a customer invoice’
The dimensions are how we’d like to dissect the data to provide insight. Examining the ERDs once again, here is my list of the dimensions that will initially be useful to us, along with my rationale for selecting them.
Using ‘The Data Warehouse Toolkit’ as a reference, we lean on their examples to:
We ended up with:
Collectively, the additive facts above will provide insight into key sales details such as how many items of each product are sold, what they cost the business, how much profit was derived from them and how much potential revenue was lost to discounts during promotions.
The non-additive facts require a little special handling in our analytics software (Power BI analysis coming up next). For example, unit profit will need to be weighted by sales quantity during aggregation, but once available it will help the user understand which products are really profitable (after adjusting for promotions, for instance) and will prompt questions like “should we really stock product X if it only sells on promotion and its adjusted unit profit is negative?”.
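The weighting described above can be sketched in SQL; the table and column names here are illustrative, not the article's actual model:

```sql
-- Weighted-average unit profit per product: weight each line's unit
-- profit by its sales quantity before averaging, so a one-off full-price
-- sale doesn't mask heavy discounting on high-volume lines.
SELECT
    product_key,
    SUM(unit_profit * sales_quantity)
        / NULLIF(SUM(sales_quantity), 0) AS weighted_avg_unit_profit
FROM fact_sales
GROUP BY product_key;
```

A plain `AVG(unit_profit)` over the same rows would treat a 1-unit line and a 1,000-unit line equally, which is exactly the trap non-additive facts set for the unwary.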
The below diagram summarises the dimensional model we’ve just designed:
Now we can get to the fun part of building!
As we build each table in our dimensional model we'll create a new code branch in dbt linked to our GitHub account. In this way we can practice version control, integration and deployment as we build the model out incrementally.
The date dimension is special because it will be the first dimension created and will support many models in the future. Like our other dimensions it will be 'conformed', meaning that any dimensional model that uses it will use this as a source table and reuse the dimension as a whole or in a 'shrunken' form. Conformed dimensions are about the business using the same definitions of the same data when it is used in different contexts.
Conforming the date dimension means that when the Sales Team summarises facts about ‘Weekdays in Quarter 1’, they’re including exactly the same days as the Distribution Team would for their own facts about ‘Weekdays in Quarter 1’. The business can then get insight into what proportion of the sales in that period were shipped the same day - something that couldn’t be done (accurately) if each team was preparing their own reports with different dates included for that period. This helps build up an enterprise view over time and examine whole value streams.
In line with best practice [3], our date dimension will meet the following needs:
Apart from that, we adapted [4] to create the following date dimension create script in Snowflake:
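A condensed sketch of that script using Snowflake's row generator (loosely in the style of [4]); the table name, column set and ten-year date range are illustrative:

```sql
-- Generate one row per calendar day and derive date attributes,
-- including ISO 8601 week numbering for the 'leap week' handling
-- discussed below.
CREATE OR REPLACE TABLE dim_date AS
SELECT
    TO_NUMBER(TO_CHAR(d.calendar_date, 'YYYYMMDD'))  AS date_key,
    d.calendar_date,
    YEAR(d.calendar_date)                            AS calendar_year,
    QUARTER(d.calendar_date)                         AS calendar_quarter,
    MONTH(d.calendar_date)                           AS calendar_month,
    DAYNAME(d.calendar_date)                         AS day_name,
    WEEKISO(d.calendar_date)                         AS iso_week,
    IFF(DAYOFWEEKISO(d.calendar_date) < 6,
        'Weekday', 'Weekend')                        AS day_type
FROM (
    SELECT DATEADD(day, SEQ4(), '2013-01-01'::DATE) AS calendar_date
    FROM TABLE(GENERATOR(ROWCOUNT => 3653))  -- ~10 years of dates
) d;
```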
Here we see the code for the special case where a date type is ‘non-date’. A good example of how we will use this record is in the ‘returned delivery date’ attribute of the invoices table. Most invoices will never be returned - returns are an exception case - so if we ever build a model which looks at the return date for invoices, most of them will need an ‘inapplicable’ value to show that a return has not occurred. It is much easier for a businessperson to interpret the value ‘inapplicable’ for a return date instead of seeing ’NULL’, which could mean anything.
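As a sketch of that special-case record (the sentinel key, date and column names are my assumptions, not the article's actual script):

```sql
-- Special 'non-date' row: fact records with no applicable date
-- (e.g. invoices never returned) point at this key instead of NULL.
INSERT INTO dim_date (date_key, calendar_date, date_type, date_description)
VALUES (19000101, '1900-01-01', 'Non-date', 'Inapplicable');
```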
Speaking of nulls... As a standard practice for Kimball dimensional models, we will:
Here is a sample of the final date dimension; it was formatted in Excel just to make the output a little more compact:
We can also use some of Snowflake's built-in analytics to validate that our programmatically-created date dimension has been populated correctly. In the example below, a heat map shows that quarter 2 of financial year 2016 has an extra 7 days. This is great because, to implement weeks in line with the ISO 8601 standard[5], we expect to see a 'leap week' here.
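The same check can be reproduced without the heat map by a simple aggregate (the fiscal column names are assumed):

```sql
-- Count days per financial quarter; a 'leap week' quarter will show
-- roughly 7 more days than the same quarter in other years.
SELECT
    fiscal_year,
    fiscal_quarter,
    COUNT(*) AS days_in_quarter
FROM dim_date
GROUP BY fiscal_year, fiscal_quarter
ORDER BY fiscal_year, fiscal_quarter;
```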
How we’ll build our fact table:
We will build the fact table incrementally, starting with its facts and then adding keys to each of its dimensions as they are created. Delivering the fact table first means the business can confirm the facts’ accuracy, relevance and value as early as possible while we work to implement the other dimensions for cross-referencing.
Finally, we will add tests to the fact table to confirm that all its foreign key references to dimensions are non-null. In the dbt YAML file containing our tests, you can also see we are testing an assumption that the invoice number combined with the product represents the natural key for the fact table. This is important because it confirms that our grain is accurate. If there could be multiple lines on an invoice for the same product, then our grain might need to be refined, or those lines combined, so as not to affect calculations/aggregation during analysis. In fact, restating the grain as 'one row per product per invoice' might be a better logical description, but it isn't as connected to the business reality as 'one row per line item on a customer invoice'.
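Tests of this shape would look roughly as follows in a dbt schema file; the model and column names are illustrative, and the grain test assumes the `dbt_utils` package is installed:

```yaml
# Sketch of the fact-table tests: not_null on dimension foreign keys,
# plus a uniqueness test over the assumed natural key.
version: 2
models:
  - name: fct_sales
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - invoice_number
            - product_key
    columns:
      - name: date_key
        tests:
          - not_null
      - name: customer_key
        tests:
          - not_null
```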
When dbt builds our models, it will perform these tests. Similarly, after deployment, if this were part of a scheduled job to update the data warehouse, then dbt’s tests can be run again to ensure our assumptions about data coming in remain correct.
To decide what data to add to the staging models, I referred to our high-level design above along with the ERDs to identify which fields were needed to generate the facts.
I first identified the source table attributes required to produce each fact:
And then grouped the attributes by their source table:
Now we can stage each of the source tables, only bringing in the attributes needed.
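As an illustration of one of those staging models, in the usual dbt staging style (the source name and column list are assumptions):

```sql
-- Staging model for invoice lines: rename to snake_case and select
-- only the attributes needed downstream for the facts.
SELECT
    InvoiceLineID AS invoice_line_id,
    InvoiceID     AS invoice_id,
    StockItemID   AS stock_item_id,
    Quantity      AS quantity,
    UnitPrice     AS unit_price,
    LineProfit    AS line_profit
FROM {{ source('wwi_sales', 'InvoiceLines') }}
```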
The 'special deals' table holds information about promotions but is several tables removed from the invoice lines table. To make assembling the final fact table easier, we create an intermediate table to help relate promotions to invoice lines:
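The intermediate model could be sketched as below; the join logic (matching on stock item within the deal's active window) illustrates the idea rather than reproducing the WWI relationships exactly:

```sql
-- Intermediate model: attach any applicable special deal to each
-- invoice line so the fact table can resolve promotions in one join.
SELECT
    il.invoice_line_id,
    sd.special_deal_id,
    sd.deal_description,
    sd.discount_percentage
FROM {{ ref('stg_invoice_lines') }} il
INNER JOIN {{ ref('stg_invoices') }} i
    ON i.invoice_id = il.invoice_id
LEFT JOIN {{ ref('stg_special_deals') }} sd
    ON  sd.stock_item_id = il.stock_item_id
    AND i.invoice_date BETWEEN sd.start_date AND sd.end_date
```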
Assembling the fact table is now quite straightforward as you can see from the script below. It performs some simple calculations and transformations and gives each fact a name which will be easy for the business user to interpret - or the next data analyst. ;-)
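In outline, the assembly looks something like this dbt-style model; the staging model names, columns and derived facts are illustrative of the approach rather than the exact script:

```sql
-- Fact table at the declared grain: one row per line item on a
-- customer invoice, with business-friendly fact names.
SELECT
    il.invoice_id                           AS invoice_number,
    il.stock_item_id                        AS product_key,
    i.invoice_date                          AS invoice_date,
    il.quantity                             AS sales_quantity,
    il.unit_price                           AS unit_price,
    il.quantity * il.unit_price             AS gross_revenue,
    il.line_profit                          AS line_profit,
    il.line_profit / NULLIF(il.quantity, 0) AS unit_profit  -- non-additive
FROM {{ ref('stg_invoice_lines') }} il
INNER JOIN {{ ref('stg_invoices') }} i
    ON i.invoice_id = il.invoice_id
```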
dbt provides a nice Directed Acyclic Graph (DAG) view of the models we have created that the fact table depends upon. This is a great tool to understand dependencies and how our models have been constructed for future reference.
The process was similar for building the remaining dimensions and the highlights are given below. Here is the routine in simple terms:
Each of the final dimension tables and the fact table are put in their own folder in dbt. This serves the dual purpose of logically organising our models and enabling different materialisation rules for different folders. This might mean that incremental models are materialised as views (or not at all), while the dimension and fact tables are materialised as physical tables to improve performance. It should be noted, however, that any performance improvement may be unnoticeable given the exceptionally high performance of modern database technology.
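Per-folder materialisation rules of this kind are set in `dbt_project.yml`; the project and folder names below are illustrative:

```yaml
# Sketch: staging/intermediate models stay as views, while the
# dimension and fact models are materialised as physical tables.
models:
  wwi_dw:
    staging:
      +materialized: view
    intermediate:
      +materialized: view
    marts:
      +materialized: table
```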
Now that we have created the models for all our dimensions and linked them to our fact table we are basically done. The way that dbt works with version control means that as we finalised each model and ‘ran’ it, a table or view was created within a development schema in our Snowflake database.
If this were for a real organisation, the Sales team could take a look at the final dimensional model in the development schema (or maybe a test schema) before approving it for deployment. dbt’s integration with version control software means that the Sales team could even get incremental (Agile) delivery and approve the fact table and dimensions for deployment or raise changes against them. Very powerful.
While we’ve been working away, dbt has automatically been creating documentation for our models and graphs to show their lineage. There is a lot to explore in dbt (perhaps in the future), but for now, it is satisfying to see the lineage for our models. These lineage graphs also make it easier to explain the structure and derivation of a dimensional model to a new person who might need to maintain it.
Finally, we can call the build a success by connecting to the schema in Snowflake using Power BI. Power BI has detected all the relationships between fact and dimension tables for us and we can now see that we’ve made our star-schema Kimball dimensional model a reality! :-)
In the next article, we’ll use Power BI to:
Whew! There was a lot in there, and really we bit off more than we should have for one article which is - hopefully - still under a 10 minute read.
Experimenting with dbt and thinking about an incremental approach to dimensional model development is something to drill down on further as it represents a lot of value to the business. Delivering a data model piece by piece, making a standard practice out of unit testing data models and then passing them on in a test environment to user acceptance testing (UAT) at a high frequency can both benefit the data engineer - by making each delivery smaller - and benefit the business user - by realising value early, providing more input to model development, and simplifying UAT by keeping the changes to review small.
What I took away from this exercise is what fertile ground there is in data engineering to implement standard practices. The kind of data engineering involved here - working within the data warehouse - really should be distinct from the data engineering associated with building system-to-system integrations, working with APIs, etc.
As an aspiring data analyst, this was fertile ground for me too, as it helped me to understand how I can deliver value incrementally. I certainly intend to adapt my process to deliver downstream analytics products in much smaller chunks now.
[1] Wide World Importers sample databases for Microsoft SQL
[2] Dataedo: List foreign keys in Azure SQL Database
[3] Design Tip #51: Latest Thinking On Time Dimension Tables
[4] Sonra: Snowflake date and time dimension
[5] ZapBI: About Configuring Fiscal Calendars
Cover Photo by Adi Goldstein on Unsplash