Wide World Importers Part 01: Loading the Data

Abstract

In part one of this series using Microsoft’s Wide World Importers sample database, we focus on loading the data into a Snowflake environment to enable future model creation and analysis. We also use dbt to establish the loaded tables as data sources, document them, add testing, and create a code base in GitHub to store transformation scripts and implement version control.

Why this series and Wide World Importers?

The author’s first attempt at an article for this website used a dataset found almost at random, with the demonstrable shortcoming that it was not a ‘rich’ source for analysis. After searching for further individual datasets, it became clear that a sample database would be better. Building a database from a collection of related individual datasets is a good idea for the future, but to get the ball rolling, a pre-made sample database was the better option.

Microsoft created the Wide World Importers (WWI) database for educational purposes. You should review the link in the footer to read more about it and understand the rules for its use. It’s a rich database including tables for historical data (which will enable the modelling of slowly-changing dimensions in the future) and stored procedures to simulate the addition of new data over time.

The base database is normalised and represents a fictitious company, but there is also a de-normalised, data-warehouse-style database available for business intelligence (BI) practice. As this series hopes to practise the creation of a data warehouse (DW) itself, we are going to ignore Microsoft’s DW version of WWI for now – and fully expect to make some mistakes with our own implementation. 😉

Microsoft describes WWI as:

“the fictitious company Wide World Importers [...] is a wholesale novelty goods importer and distributor operating from the San Francisco bay area” [1]

How will we do this?

Figure 1

The WWI database will support many analyses during the first half of 2022 as we practise Kimball data warehousing and BI reporting on its tables. We may not be done with Kimball by mid-year but, for variety, the intention is to create a Data Vault in the second half of 2022. Here’s a rough outline of the articles to come in this series, the intention being to deliver one per week.

  • Part 1 – Loading the Data (this article)
  • Part 2 – The Sales Dimensional Model
  • Part 3 – Visualising Sales Data
  • Part 4 – The Inventory Dimensional Model
  • Part 5 – Visualising Inventory Data

The domains have been chosen to align with the author’s experience, to be compatible with the available data in the WWI database, and to follow Kimball’s DW Toolkit book closely. [2]

Parts 4 and 5 are significant because they will leverage multiple fact tables joined on conformed dimensions to provide an integrated view of inventory across business functions. This is illustrated in the diagram below (Kimball, pg. 123). [2]

Kimball Data Warehousing

We will be implementing a Kimball data warehouse simply because this is what the author has the most experience with. For those unfamiliar, the Kimball architecture is characterised by:

  • two table types: dimensions and facts
  • two layers: a back office for staging data and preparing ‘master’ dimension and fact tables, and a front office/presentation area for assembling those (sometimes transformed) master tables into ‘dimensional models’, and perhaps data cubes and other BI artefacts
  • de-normalised table structures designed specifically to maximise query performance for visualisation/reporting tools (a brief sketch follows this list)
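To make those characteristics concrete, here is a minimal, hypothetical star-schema sketch in SQL. The table and column names are illustrative only and are not taken from the WWI database; they simply show a de-normalised dimension joined to a fact table via a surrogate key.

-- Hypothetical dimension: one row per customer, with de-normalised descriptive attributes
CREATE TABLE dim_customer (
    customer_key   INTEGER,       -- surrogate key referenced by the fact table
    customer_name  VARCHAR,
    city           VARCHAR,
    country        VARCHAR
);

-- Hypothetical fact: one row per order line, holding measures plus dimension keys
CREATE TABLE fact_sales (
    customer_key   INTEGER,       -- joins to dim_customer.customer_key
    order_date_key INTEGER,       -- would join to a date dimension
    quantity       INTEGER,
    line_amount    NUMBER(18,2)
);

Reporting tools then join the fact to its dimensions and aggregate the measures, which is exactly the query pattern the de-normalised structure is designed to serve.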

Ralph Kimball’s book “The Data Warehouse Toolkit” was published in 1996 and we will use it as a reference. It should be noted that Kimball did not invent the data warehouse or even the idea of dimension and fact tables, but the book is well written and available. Later this year we will examine other architectures which may be better suited to the hardware available today, which is drastically different to that available in the 90s.

The Data Vault and Inmon architectures are also well known, but others exist and some even suggest a return to flat data structures – storing all the data for an analysis domain/scenario in one table – as this may be simpler for the user, and modern hardware performs well on these simple structures.

Now, let's begin...

Loading the Data

We’ll demonstrate three key technologies during this exercise: SQL Server, the Snowflake cloud database and dbt (a data transformation tool). Each fills a different purpose, as shown in the list below:

1. SQL Server
   1.1 Restore the database
   1.2 Extract the database schema
   1.3 Export the table data
2. Snowflake
   2.1 Duplicate the schema in Snowflake
   2.2 Load the table data into an Amazon S3 bucket
   2.3 Load the table data into Snowflake
   2.4 Validate the data migration
3. dbt
   3.1 Establish a GitHub repository for transformation scripts
   3.2 Define the data source in dbt
   3.3 Document the data source in dbt
   3.4 Create tests for the data source in dbt

1 SQL Server

1.1 Restore the WWI database

The ‘.bak’ file containing a backup of Microsoft’s Wide World Importers database can be found on Microsoft’s GitHub page [3]. It is then a simple matter of downloading it and using SQL Server to restore the database, as shown in Figure 2. SQL Server is free to download and can be run on your own computer.

Figure 2
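For readers who prefer a script to the SSMS restore dialog, the same restore can be run in T-SQL along these lines; the backup path below is a placeholder for wherever you saved the downloaded file.

-- Sketch only: restore the downloaded backup (path is a placeholder)
RESTORE DATABASE WideWorldImporters
FROM DISK = N'C:\backups\WideWorldImporters-Full.bak'
WITH REPLACE;
-- If the backup's original data/log file locations don't exist on your machine,
-- run RESTORE FILELISTONLY against the .bak first and add WITH MOVE clauses accordingly.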

1.2 Extract the database schema

SQL Server will also generate scripts for database objects (Figure 3). We use this feature here to extract the table definitions alone. We don’t need foreign keys and other constraints, as we will implement testing in the data warehouse to confirm our expectations for this data source, including referential integrity, uniqueness, etc.

Figure 3

The WWI sample database also includes a number of tables with historic versions of table data. For example, the People_Archive table contains outdated records for the People table. We’ll import these archive tables into Snowflake for later use when we practise implementing slowly changing dimensions.

You can see the resulting script in Figure 4, below. We’ll tidy it up for our purposes later.

Figure 4

1.3 Export the table data

After struggling with CSV formatting and handling complex data types, there didn’t seem to be a clear, simple way to export table data from SQL Server. Fortunately our tables were not too large, and we were able to save the data directly from the query results grid. Annoying, but expedient.

2 Snowflake

2.1 Duplicate the schema in Snowflake

In Figure 5, we see the simplified schema we’ve created from our SQL Server schema export. We’ve used Snowflake’s worksheet functionality to write the table creation script. These worksheets are a simple, attractive IDE for writing Snowflake’s version of SQL, enhanced by Snowflake’s own database functionality.

Figure 5

Some renaming was required as ‘warehouse’ and ‘tag’ are reserved words in Snowflake, while WWI uses them to name one of its schemas and some columns respectively. Some SQL Server data types had to be converted to Snowflake’s versions too, e.g. datetime2 (SQL Server) => TIMESTAMP_NTZ (Snowflake). Apart from this, the schema, table and column names and types were all preserved from the source database in SQL Server.
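To illustrate the kind of conversion involved, here is an abbreviated, hypothetical sketch of one table’s Snowflake DDL; the real column lists are longer, and the script we actually ran is the one shown in Figure 5.

-- Sketch only: an abbreviated Snowflake version of a source table
CREATE TABLE application.people (
    person_id              NUMBER,
    full_name              VARCHAR,
    is_permitted_to_logon  BOOLEAN,
    valid_from             TIMESTAMP_NTZ,  -- datetime2 in SQL Server becomes TIMESTAMP_NTZ
    valid_to               TIMESTAMP_NTZ
);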

2.2 Load the table data into an Amazon S3 bucket

A simple way to load flat file data into Snowflake is to establish a connection to an Amazon S3 bucket. In Snowflake, we need to create a ‘stage’ which will connect to the S3 bucket. Stages are created at the schema level, so we must create four of them: one each for the Sales, Warehousing, Purchasing and Application schemas, which are the destinations for the data. Snowflake also requires us to define the ‘file format’ we will use to load the files (again at the schema level).

2.3 Load the table data into Snowflake

There are three steps to loading the table data into Snowflake (a sketch of all three follows the list):

1. Create a ‘file format’ so that Snowflake understands how to read the source files in the Amazon S3 bucket (Figure 6).
2. Create a ‘stage’, which is Snowflake’s way of linking to the S3 bucket so it can be used like a container (Figure 7).
3. Use a load script to move the data from the appropriate file in the S3 bucket into the corresponding table in Snowflake (Figure 8).
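The snippet below is a minimal sketch of those three steps for a single schema; the bucket URL, credentials, file name and format options are placeholders, and the scripts we actually ran are shown in Figures 6 to 8.

-- 1. File format: tell Snowflake how to parse the exported CSV files
CREATE FILE FORMAT sales.wwi_csv_format
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
    SKIP_HEADER = 1
    FIELD_OPTIONALLY_ENCLOSED_BY = '"';

-- 2. Stage: a named pointer to the S3 bucket (URL and credentials are placeholders)
CREATE STAGE sales.wwi_stage
    URL = 's3://my-wwi-bucket/sales/'
    CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
    FILE_FORMAT = sales.wwi_csv_format;

-- 3. Load: copy the staged file into the corresponding table
COPY INTO sales.orders
FROM @sales.wwi_stage/orders.csv
FILE_FORMAT = (FORMAT_NAME = 'sales.wwi_csv_format');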

2.4 Validate the data migration

The scripts we ran against the SQL Server source database and the destination Snowflake database show that the row counts match between source and destination for all tables. In a production context we would apply more rigour to properly confirm that the data has arrived completely and accurately; as this website is focused on data analysis, a lighter touch is applied to the data engineering, as you will notice here.
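As a sketch of the kind of check involved (the Snowflake database name here is a placeholder), row counts can be compared per table on each side:

-- SQL Server source: row count per table from the system catalog
SELECT s.name AS schema_name, t.name AS table_name, SUM(p.rows) AS row_count
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
JOIN sys.partitions p ON p.object_id = t.object_id AND p.index_id IN (0, 1)
GROUP BY s.name, t.name
ORDER BY schema_name, table_name;

-- Snowflake destination: row count per table from the information schema
SELECT table_schema, table_name, row_count
FROM wwi.information_schema.tables    -- 'wwi' is a placeholder database name
WHERE table_type = 'BASE TABLE'
ORDER BY table_schema, table_name;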

Note also that we followed two principles for loading data from the WWI source database:

  • load whole tables
  • no transformations – we’ll do this within the warehouse

So, we’re following an E->L->T approach as opposed to E->T->L. Extract, load, transform means we get the data into our warehouse faster, where we can expose it to the powerful transformation tools and computing power available in a modern cloud-based data warehouse.

3 dbt

3.1 Establish a GitHub repository for transformation scripts

Now that we have our data in the warehouse we can start messing around with it, right? Not quite: we’ll be using dbt next to take it from here, as it integrates with Snowflake and GitHub to provide some useful functionality. dbt can manage our environments, data sources, data models and deployment targets, all in a documented, version-controlled and shareable fashion. It also provides basic data lineage functionality and the ability to implement tests, so that we can validate our expectations for each data source and each data model we create.

The first step is to create a new GitHub repository. Then we can create a dbt project which links to that repository and to our WWI database in Snowflake.

Next, we initialise our project in dbt to inherit its default project structure (Figure 9).

And perform an initial commit to GitHub (Figure 10).

3.2 Define the data source in dbt

dbt understands that source data is special and has the means to identify sources and their properties explicitly. We create a new branch in our code using the green GitHub button. Next, as you can see below, we configure a YAML file to identify the schemas and tables that will become data sources for our project (Figure 11).
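For illustration, here is a minimal sketch of such a sources file; the source name, database and table list are placeholders and abbreviations, and the file we actually wrote is the one shown in Figure 11.

version: 2

sources:
  - name: wwi_sales            # placeholder source name
    database: wwi              # placeholder Snowflake database
    schema: sales
    tables:
      - name: orders
      - name: order_lines
      - name: customers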

3.3 Document the data source in dbt

We add description attributes to our source data schemas and tables; these are pulled directly into dbt’s documentation of our project. Markdown stored in ‘doc blocks’ can also be used to enhance the documentation, but for now we will keep it simple (Figure 12).
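Continuing the sketch above, documentation is a matter of description keys in the same YAML, roughly like this (the wording is illustrative; Figure 12 shows what we actually wrote):

sources:
  - name: wwi_sales
    description: "Wide World Importers data loaded from SQL Server via S3."
    tables:
      - name: orders
        description: "One row per customer order."
        columns:
          - name: order_id
            description: "Primary key of the order."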

3.4 Create tests for the data source in dbt

In the test scripts for dbt, our goal is to validate our expectations for our source dataset. This means that:

  • primary key columns are unique and not null
  • mandatory columns are not null
  • referential integrity exists between tables where foreign keys are used
  • columns which are constrained to particular values have their content validated

We could do more – dbt supports custom tests also – but for the purposes of this ‘in principle’ exercise this will do.
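As a hedged illustration, here is how a few of these expectations translate into dbt’s built-in generic tests; the column names and accepted values below are placeholders rather than the exact tests we wrote.

sources:
  - name: wwi_sales
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique        # primary key column is unique
              - not_null      # ...and mandatory
          - name: customer_id
            tests:
              - not_null
              - relationships:   # referential integrity to the customers source table
                  to: source('wwi_sales', 'customers')
                  field: customer_id
      - name: invoices
        columns:
          - name: delivery_method
            tests:
              - accepted_values:   # constrained column content
                  values: ['Courier', 'Post', 'Pickup']   # placeholder value list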

Now that we have successfully implemented the WWI database as a data source in dbt, we can focus on building some dimensional models in the next article!

Discussion

This exercise was focused on the data engineering necessary to begin working with the WWI database as a data source for our dimensional models/warehouse. As we implemented our infrastructure while learning the technology and using it for the first time, the need for an (E)xtraction and (L)oading tool became apparent. Such a tool would:

  • make data type conversion straightforward and automated
  • shorten time to get a new data source into the warehouse
  • reduce the likelihood of errors (such as using reserved words as table names)

Ready-made SQL scripts or another mechanism to profile the data source would also have been preferable, as time was spent writing our own to:

  • understand the data types in the WWI database
  • understand the relationships and constraints implemented
  • document row counts and other data about the database that might validate successful migration

Conclusion

We’ve successfully imported Microsoft’s Wide World Importers sample database into Snowflake and established it as a data source for dbt. This means that we can focus on the creation of a Sales dimensional model in the next article, followed by analysis and visualisation of it in the following article to complete our first end-to-end analysis.

Final thought: This exercise makes me think that the tools and skillset for loading data into a warehouse are possibly very different from the skills needed to transform the data once it’s there. I wonder if there is an argument for separating data engineering responsibilities: one engineer might focus on adding new sources, which is potentially more complex and different each time, while another focuses on transformation, which is likely more repeatable and fertile ground for increasing efficiency.

References

[1] Microsoft sample database: Wide World Importers

[2] Kimball, R, Ross, M 2013, The Data Warehouse Toolkit, John Wiley & Sons, Inc., Indianapolis, Indiana.

[3] Microsoft GitHub repository for Wide World Importers

Cover Photo by Adi Goldstein on Unsplash