Wide World Importers Part 02: The Sales Dimensional Model

Overview

In the last article we imported Microsoft’s Wide World Importers (WWI) sample database into Snowflake and established it as a data source for dbt. In this article we’ll build our first Kimball dimensional model focusing on Sales data.

This will be broken into five sections:

  1. Understanding the data: where we’ll take a look at the WWI database to understand how business objects are related
  2. Designing the model: where we decide the business process, the grain and the facts and dimensions
  3. Building the date dimension: where we create a date dimension from scratch which will be reused for all of our dimensional models in the future
  4. Building the fact table: where we decide what facts are relevant to this business process and grain
  5. Building other dimensions: where we see how a fact table can provide additional insight by attaching multiple dimensions to it

Understanding the Data

Because we don’t have a group of business users who can talk us through the data stored in their system and explain what is important and relevant, we must rely on the data alone. To do this we’ll create some entity relationship diagrams (ERDs) of the WWI database to show the relationships between objects and understand their attributes. We’ll still be guessing at how the business might operate in reality, but in concert with Microsoft’s definition [1], it should be a reasonable guess. In the real world, it would be better to learn about the important concepts, processes and business objects from business stakeholders (or at least to read some of their documentation). This is preferred for many reasons, including that a business often has to adapt its data to fit a system during implementation, so the stored data may be a skewed or incomplete representation of reality.

We start by modifying a SQL script found in this Dataedo article [2] to identify the relationships within the WWI database by their foreign keys. We did this in SQL Server, as the primary and foreign key relationships were not migrated to Snowflake:
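A sketch of the kind of query involved is below. It is not the exact script from [2] - the column aliases and ordering are illustrative - but it shows the idea: read the system catalog views to list each foreign key and the two tables it connects.

    -- Run in SQL Server against the WideWorldImporters database.
    -- Lists each foreign key constraint with the referencing and referenced tables.
    SELECT
        SCHEMA_NAME(fk_tab.schema_id) + '.' + fk_tab.name AS foreign_table,
        SCHEMA_NAME(pk_tab.schema_id) + '.' + pk_tab.name AS primary_table,
        fk.name                                           AS fk_constraint_name
    FROM sys.foreign_keys AS fk
    JOIN sys.tables AS fk_tab ON fk_tab.object_id = fk.parent_object_id
    JOIN sys.tables AS pk_tab ON pk_tab.object_id = fk.referenced_object_id
    ORDER BY foreign_table, primary_table;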

You’ll note above that the tables are prefixed with the schema they belong to. For example, ‘Application.Cities’ indicates the Cities table is in the Application schema. There are four in total for the WWI database: Application, Warehousing, Sales, and Purchasing.

Next, we move to Snowflake and extract schema, table and column names from the ‘information_schema’ schema to identify all the attributes of each table in the database.
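A minimal sketch of that query, assuming the database was loaded into Snowflake as WIDE_WORLD_IMPORTERS, might be:

    -- List every schema, table and column in the database from the information schema.
    SELECT table_schema,
           table_name,
           column_name,
           data_type
    FROM wide_world_importers.information_schema.columns
    WHERE table_schema <> 'INFORMATION_SCHEMA'
    ORDER BY table_schema, table_name, ordinal_position;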

We now have all the information we need to reverse-engineer the ERDs.

First, we looked at the relationships within each of the four schemas:

Although we could have included every relationship, including those between tables in different schemas, this would have made the ERD too busy. Cross-schema relationships were instead added to the more detailed ERD for each of the schemas below:

Now we have a good understanding of how all the tables (objects) in this database are related. The ERDs will be used as a reference for the remainder of our work with the WWI database and they will expedite further analysis by giving a bird’s-eye view of the data.

Designing the Dimensional Model

Using Kimball’s four-step dimensional modelling process:

  1. select the business process
  2. declare the grain
  3. identify the dimensions
  4. identify the facts

we’ll develop our dimensional model at a high level.

1 Select the business process

We’ll start with the sales process (following along with ‘The Data Warehouse Toolkit’).

As I’ve worked in process improvement before, I know that a couple of tips for defining a process are to include a verb for the action being performed and to identify the trigger. Thinking about sales processes for WWI as a wholesaler, candidates for the process might be ‘customer placing an order’ or ‘WWI invoicing a customer’. A quick look at the orders table in the Sales schema shows that backorders are a thing, and if we look at the WWI database definition [1] we can see that a customer placing an order does not necessarily result in a sale if the whole order cannot be filled (due to out-of-stock scenarios, etc.). For this reason, we’ll use ‘customer invoicing’ as our target process, because at the point a customer is invoiced we expect to be paid for the products we’ve provided - i.e., we’ve made a sale. I have no finance or accounting background, but this makes sense to me. :-)

2 Declare the grain

For the grain, we want to capture the finest level of detail available in relation to our ‘customer invoicing’ process. This detail will make up our fact table, and choosing as fine a grain as possible gives us the most flexibility to roll it up or cross-reference it later, which in turn gives the business user more flexibility to cut up the data. The ‘invoice lines’ table captures this detail, with each record relating to an individual product and identifying its quantity, unit price and line profit at the time of sale.

In summary, the grain is:

‘One row per line item on a customer invoice’

3 Identify the dimensions

The dimensions are how we’d like to dissect the data to provide insight. Examining the ERDs once again, here is my list of the dimensions that will be useful to us initially, along with my rationale for selecting them.

  • date: a date dimension is virtually essential to any dimensional model. The business will absolutely want to understand how sales this week compare to last week, how they change based on the day of the week, and so on.
  • salesperson: in a wholesale context we expect the salesperson to play a larger role in driving sales than in retail, and we expect each salesperson’s impact on overall business profit to be much greater, since each customer they interact with is likely to place a larger order than a retail customer would. For this reason it is important to be able to dissect our sales by salesperson and understand how each is performing.
  • customer: again, because of the wholesale context and each customer’s spending being much greater than in retail, it is important to understand our customers’ characteristics and how they might help or hinder sales.
  • promotion: we need to understand the impact of promotions, particularly on uplift and profit. The business will want to know which promotions are effective in driving sales upward and whether they produce a profit.
  • product: apart from date, product is probably the most important dimension. If the business doesn’t understand anything else, it should absolutely understand what products sell!

4 Identify the facts

Using ‘The Data Warehouse Toolkit’ as a reference, we lean on its examples to:

  • identify additive facts - the low-hanging fruit that can be easily summarised
  • identify discounts for sales
  • identify regular prices
  • identify profit
  • define all of the above - and other facts - consistently with the grain (one row per invoice line)

We ended up with:

  • sales quantity (additive)
  • regular unit price
  • net unit price
  • unit profit
  • unit cost
  • discount unit price
  • extended regular dollar amount (additive)
  • extended discount dollar amount (additive)
  • extended sales dollar amount (additive)
  • extended cost dollar amount (additive)
  • extended gross profit dollar amount (additive)
  • extended gross margin percentage

Collectively, the additive facts above will provide insight into key sales details such as how many items of each product are sold, what they cost the business, how much profit was derived from them and how much potential revenue was lost to discounts during promotions.

The non-additive facts require a little special handling in our analytics software (Power BI analysis coming up next). For example, unit profit will need to be weighted by sales quantity during aggregation, but once available, it will help the user to understand which products are really profitable (after adjusting for promotions, for instance) and will prompt questions like “should we really stock product X if it only sells on promotion and its adjusted unit profit is negative?”.
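To make the weighting idea concrete, here is a sketch of the equivalent SQL aggregation. The fct_sales table and its column names are placeholders for whatever the final fact model exposes:

    -- Weighted average unit profit per product: weight each line's unit profit
    -- by its sales quantity rather than averaging the per-unit values directly.
    SELECT product_key,
           SUM(unit_profit * sales_quantity) / NULLIF(SUM(sales_quantity), 0)
               AS weighted_avg_unit_profit
    FROM fct_sales
    GROUP BY product_key;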

The below diagram summarises the dimensional model we’ve just designed:

Now we can get to the fun part of building!

As we build each table in our dimensional model we'll create a new code branch in dbt linked to our GitHub account. In this way we can practice version control, integration and deployment as we build the model out incrementally.

Building the Date Dimension

The date dimension is special because it will be the first dimension created and it will support many models in the future. Like our other dimensions it will be 'conformed', meaning that any dimensional model that uses it will use this same source table, reusing the dimension as a whole or in a 'shrunken' form. Conforming our dimensions is about the business using the same definitions of the same data when it is used in different contexts.

Conforming the date dimension means that when the Sales Team summarises facts about ‘Weekdays in Quarter 1’, they’re including exactly the same days as the Distribution Team would for their own facts about ‘Weekdays in Quarter 1’. The business can then get insight into what proportion of the sales in that period were shipped the same day - something that couldn’t be done (accurately) if each team was preparing their own reports with different dates included for that period. This helps build up an enterprise view over time and examine whole value streams.

In line with best practice [3], our date dimension will meet the following needs:

  • use a meaningless ‘surrogate key’ to maintain referential integrity with the fact table even when a fact is not defined or is otherwise unavailable for a date
  • include a ‘date type’ attribute so that special records (such as the ‘non-date’ record discussed below) can be flagged
  • use a grain of 'one row per calendar day'

Apart from that, we adapted [4] to create the following date dimension create script in Snowflake:
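In condensed form, the idea looks roughly like this: generate one row per calendar day over a fixed range, derive calendar and ISO week attributes (the full script also derives the fiscal calendar attributes used later), and union in the special ‘non-date’ record. The table name, column names, date range and surrogate key format below are illustrative assumptions rather than the exact script:

    -- A condensed sketch of a Snowflake date dimension, adapted from the idea in [4].
    -- Table/column names, the date range and the key format are illustrative.
    CREATE OR REPLACE TABLE dim_date AS
    WITH days AS (
        SELECT DATEADD(day, ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1,
                       '2012-01-01'::date) AS calendar_date
        FROM TABLE(GENERATOR(ROWCOUNT => 5000))          -- roughly 13 years of days
    )
    SELECT TO_CHAR(calendar_date, 'YYYYMMDD')::int AS date_key,  -- one common key choice
           calendar_date,
           'date'                              AS date_type,
           YEAR(calendar_date)                 AS calendar_year,
           QUARTER(calendar_date)              AS calendar_quarter,
           MONTHNAME(calendar_date)            AS month_name,
           DAYNAME(calendar_date)              AS day_name,
           WEEKISO(calendar_date)              AS iso_week,
           DAYOFWEEKISO(calendar_date)         AS iso_day_of_week
           -- the full script also derives fiscal year/quarter attributes (omitted here)
    FROM days
    UNION ALL
    -- special record for facts where a date is inapplicable or unknown
    SELECT -1, NULL, 'non-date', NULL, NULL, 'inapplicable', 'inapplicable', NULL, NULL;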

Here we see the code for the special case where the date type is ‘non-date’. A good example of how we will use this record is the ‘returned delivery date’ attribute of the invoices table. Most invoices will never be returned - returns are an exception case - so if we ever build a model which looks at the return date for invoices, most of them will need an ‘inapplicable’ value to show that a return has not occurred. It is much easier for a businessperson to interpret the value ‘inapplicable’ for a return date than to see ‘NULL’, which could mean anything.

Speaking of nulls... As a standard practice for Kimball dimensional models, we will:

  • use meaningless surrogate keys for dimensions
  • test that fact table foreign keys (to dimensions) are never null
  • decline to use 'null' for values in dimensions wherever possible and instead replace them with descriptive, meaningful values that the business can easily interpret.

Here is a sample of the final date dimension (formatted in Excel just to make the output a little more compact):

We can also use some of Snowflake's built-in analytics to validate that our programmatically-created date dimension has been populated correctly. In the example below, a heat map shows that quarter 2 of financial year 2016 has an extra 7 days. This is great: to implement weeks in line with the ISO 8601 standard [5], we expect to see a 'leap week' here.
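A quick way to produce the underlying numbers is a simple count of days per fiscal quarter. This assumes the dimension exposes fiscal_year and fiscal_quarter columns and that the special ‘non-date’ record is excluded:

    -- Count days per fiscal quarter to spot the ISO 'leap week'.
    SELECT fiscal_year,
           fiscal_quarter,
           COUNT(*) AS days_in_quarter
    FROM dim_date
    WHERE date_type = 'date'            -- exclude the special 'non-date' record
    GROUP BY fiscal_year, fiscal_quarter
    ORDER BY fiscal_year, fiscal_quarter;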

Building the Fact Table

How we’ll build our fact table:

  1. stage the relevant data in dbt models - one per source table - bringing in the relevant attributes along with the primary and foreign keys needed to enable joining where necessary
  2. create intermediate models where necessary
  3. create references between models, ending with the sales fact table (i.e. data 'lineage')

Stage tables

We will build the fact table incrementally, starting with its facts and then adding keys to each of its dimensions as they are created. Delivering the fact table first means that we can get the business to confirm the accuracy, relevance and value of its facts as early as possible while we work to implement the other dimensions for cross-referencing.

Finally, we will add tests to the fact table to confirm that all its foreign key references to dimensions are non-null. In the dbt YAML file containing our tests, you can also see we are testing an assumption that the invoice number combined with the product represents the natural key for the fact table. This is important because it confirms that our grain is accurate. If there could be multiple lines on an invoice for the same product, then our grain might need to be refined, or those lines combined, so that calculations and aggregations are not affected during analysis. In fact, re-stating the grain as 'one row per product per invoice' might be a better logical description, but it isn't as connected to the business reality as 'one row per line item on a customer invoice'.
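As an illustration, the relevant part of such a YAML file might look roughly like this. The model and column names are assumptions, and the combined-uniqueness test relies on the dbt_utils package:

    # A sketch of dbt schema tests for the fact table (names are illustrative).
    version: 2
    models:
      - name: fct_sales
        tests:
          # invoice + product should uniquely identify a row, confirming the grain
          - dbt_utils.unique_combination_of_columns:
              combination_of_columns:
                - invoice_id
                - product_key
        columns:
          - name: date_key
            tests:
              - not_null
          - name: product_key
            tests:
              - not_null
          - name: customer_key
            tests:
              - not_null
          # ...plus not_null tests on salesperson_key and promotion_key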

When dbt builds our models, it will run these tests. Similarly, after deployment, if this were part of a scheduled job to update the data warehouse, dbt's tests could be run again to ensure our assumptions about incoming data remain correct.

To decide what data to add to the staging models, I referred to our high-level design above along with the ERDs to identify which fields were needed to generate the facts.

I first identified the source table attributes required to produce each fact:

  • Date Key: invoices.invoicedate
  • Promotion Key: specialdeals.specialdealid
  • Salesperson Key: invoices.salespersonpersonid
  • Product Key: invoicelines.stockitemid
  • Customer Key: invoices.customerid
  • Invoice ID: invoicelines.invoiceid
  • Sales Quantity: invoicelines.quantity
  • Regular Unit Price: stockitems.unitprice
  • Net Unit Price: invoicelines.unitprice
  • Unit Profit: invoicelines.lineprofit / invoicelines.quantity
  • Unit Cost: invoicelines.unitprice - [Unit Profit]
  • Discount Unit Price: [Regular Unit Price] - [Net Unit Price]
  • Extended Regular Dollar Amount: [Regular Unit Price] * [Sales Quantity]
  • Extended Discount Dollar Amount: [Discount Unit Price] * [Sales Quantity]
  • Extended Sales Dollar Amount: [Net Unit Price] * [Sales Quantity]
  • Extended Cost Dollar Amount: [Unit Cost] * [Sales Quantity]
  • Extended Gross Profit Dollar Amount: [Extended Sales Dollar Amount] - [Extended Cost Dollar Amount]
  • Extended Gross Margin Percentage: [Extended Gross Profit Dollar Amount] / [Extended Sales Dollar Amount] * 100

And then grouped the attributes by their source table:

  • invoices: invoicedate, salespersonpersonid, customerid
  • invoicelines: invoiceid, stockitemid, quantity, unitprice, lineprofit
  • stockitems: unitprice
  • specialdeals: specialdealid

Now we can stage each of the source tables, only bringing in the attributes needed.
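For example, a staging model for the invoice lines table might look something like this. The source name ‘wwi’ and the column aliases are assumptions:

    -- models/staging/stg_sales__invoicelines.sql (a sketch; names are assumptions)
    with source as (
        select * from {{ source('wwi', 'invoicelines') }}
    )
    select
        invoicelineid   as invoice_line_id,
        invoiceid       as invoice_id,
        stockitemid     as stock_item_id,
        quantity,
        unitprice       as net_unit_price,
        lineprofit      as line_profit
    from source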

Intermediate 'special deals' table

The 'special deals' table holds information about promotions but is several tables removed from the invoice lines table. To make assembling the final fact table easier, we create an intermediate table to help relate promotions to invoice lines:

Assembling the fact table

Assembling the fact table is now quite straightforward as you can see from the script below. It performs some simple calculations and transformations and gives each fact a name which will be easy for the business user to interpret - or the next data analyst. ;-)
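A condensed sketch of that script is below - the model and column names are assumptions based on the design above rather than the exact code. It joins the staged invoices, invoice lines and stock items (plus the intermediate special deals model) and derives each fact at the invoice line grain:

    -- models/marts/fct_sales.sql (a sketch of the assembly; names are assumptions)
    with invoice_lines as (
        select * from {{ ref('stg_sales__invoicelines') }}
    ),
    invoices as (
        select * from {{ ref('stg_sales__invoices') }}
    ),
    stock_items as (
        select * from {{ ref('stg_warehouse__stockitems') }}
    ),
    line_deals as (
        select * from {{ ref('int_invoiceline_specialdeals') }}
    )
    select
        -- dimension keys
        to_char(inv.invoice_date, 'YYYYMMDD')::int        as date_key,
        coalesce(ld.special_deal_id, -1)                  as promotion_key,  -- assumes a 'no promotion' row keyed -1
        inv.salesperson_person_id                         as salesperson_key,
        il.stock_item_id                                  as product_key,
        inv.customer_id                                   as customer_key,
        il.invoice_id,
        -- facts at the invoice line grain
        il.quantity                                       as sales_quantity,
        si.unit_price                                     as regular_unit_price,
        il.net_unit_price,
        il.line_profit / nullif(il.quantity, 0)           as unit_profit,
        il.net_unit_price
            - il.line_profit / nullif(il.quantity, 0)     as unit_cost,
        si.unit_price - il.net_unit_price                 as discount_unit_price,
        si.unit_price * il.quantity                       as extended_regular_dollar_amount,
        (si.unit_price - il.net_unit_price) * il.quantity as extended_discount_dollar_amount,
        il.net_unit_price * il.quantity                   as extended_sales_dollar_amount
        -- extended cost, gross profit and gross margin facts follow the same pattern
    from invoice_lines il
    join invoices inv
        on inv.invoice_id = il.invoice_id
    join stock_items si
        on si.stock_item_id = il.stock_item_id
    left join line_deals ld
        on ld.invoice_line_id = il.invoice_line_id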

dbt provides a nice Directed Acyclic Graph (DAG) view of the models we have created that the fact table depends upon. This is a great tool to understand dependencies and how our models have been constructed for future reference.

Building Other Dimensions

The process was similar for building the remaining dimensions and the highlights are given below. Here is the routine in simple terms:

  1. create a new code branch using dbt’s GitHub integration
  2. find interesting attributes related to each dimension
  3. summarise them in a table with their definition based on the source tables
  4. stage any tables and attributes that are not already staged
  5. include primary keys and relevant foreign keys for any newly staged table to ensure it can be joined as necessary
  6. write SQL in each dbt model to transform the staged data into a final dimension table
  7. test the model and then merge and delete the code branch before moving on to the next dimension

The final dimension tables and the fact table are put in their own folder in dbt. This serves the dual purpose of logically organising our models and enabling different materialisation rules for different folders. This might mean that staging and intermediate models are materialised as views (or not at all), while the dimension and fact tables are materialised as physical tables to improve performance. It should be noted, however, that any performance improvement may be unnoticeable given the exceptionally high performance of modern database technology.
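For example, folder-level materialisation rules can be set in dbt_project.yml roughly like this (the project and folder names are assumptions):

    # dbt_project.yml (excerpt) - a sketch; project and folder names are assumptions
    models:
      wwi_dw:
        staging:
          +materialized: view
        intermediate:
          +materialized: view
        marts:
          +materialized: table   # dimension and fact tables as physical tables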

Salesperson Dimension
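As an illustration of the routine above, a minimal sketch of the salesperson dimension model might look like the following. It draws on the Application.People source table, which flags which people are salespeople; the staging model name and column choices are assumptions:

    -- models/marts/dim_salesperson.sql (a sketch; names are assumptions)
    with people as (
        select * from {{ ref('stg_application__people') }}
    )
    select
        person_id       as salesperson_key,
        full_name,
        preferred_name,
        phone_number,
        email_address
    from people
    where is_salesperson = true   -- assumes the flag was migrated as a boolean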

Customer Dimension

Promotion Dimension

Product Dimension

Finalising the Implementation

Now that we have created the models for all our dimensions and linked them to our fact table we are basically done. The way that dbt works with version control means that as we finalised each model and ‘ran’ it, a table or view was created within a development schema in our Snowflake database.

If this were for a real organisation, the Sales team could take a look at the final dimensional model in the development schema (or maybe a test schema) before approving it for deployment. dbt’s integration with version control software means that the Sales team could even get incremental (Agile) delivery and approve the fact table and dimensions for deployment or raise changes against them. Very powerful.

While we’ve been working away, dbt has automatically been creating documentation for our models and graphs to show their lineage. There is a lot to explore in dbt (perhaps in the future), but for now, it is satisfying to see the lineage for our models. These lineage graphs also make it easier to explain the structure and derivation of a dimensional model to a new person who might need to maintain it.

Finally, we can call the build a success by connecting to the schema in Snowflake using Power BI. Power BI has detected all the relationships between fact and dimension tables for us and we can now see that we’ve made our star-schema Kimball dimensional model a reality! :-)

In the next article, we’ll use Power BI to:

  • develop the model further with DAX functions to handle aggregation for non-additive facts
  • implement visualisations to enable a business user to gain insight from the model

Discussion

Whew! There was a lot in there - really, we bit off more than we should have for one article that is, hopefully, still under a 10-minute read.

Experimenting with dbt and thinking about an incremental approach to dimensional model development is something to drill down on further, as it represents a lot of value to the business. Delivering a data model piece by piece, making a standard practice of unit testing data models, and passing them on in a test environment to user acceptance testing (UAT) at high frequency benefits both the data engineer - by making each delivery smaller - and the business user - by realising value early, providing more input to model development, and keeping each round of UAT small and easy to review.

Conclusion

What I took away from this exercise is what fertile ground there is in data engineering to implement standard practices. The kind of data engineering involved here - working within the data warehouse - really should be distinct from the data engineering associated with building system-to-system integrations, working with APIs, etc.

As an aspiring data analyst, this was fertile ground for me too, as it helped me to understand how I can deliver value incrementally. I certainly intend to adapt my process to deliver downstream analytics products in much smaller chunks now.

References

[1] Wide World Importers sample databases for Microsoft SQL

[2] Dataedo: List foreign keys in Azure SQL Database

[3] Design Tip #51: Latest Thinking On Time Dimension Tables

[4] Sonra: Snowflake date and time dimension

[5] ZapBI: About Configuring Fiscal Calendars

Cover Photo by Adi Goldstein on Unsplash