# Solution of the Analytics Engineering Assignment from Vio
To execute the solution for the test:

Build the Docker image:

```sh
$ docker build -t viodotcom/assignment .
```

Run a container interactively with the image:

```sh
$ docker run -it --rm viodotcom/assignment
```
The approach to solving the first task, data ingestion, involved creating an `event_raw` table with the given schema and using the SQLite `.import` command. Since SQLite version 3.31.1 does not have an argument to skip the file header for the `.import` command (one was introduced in version 3.32.0), the solution was to pipe the file through the UNIX command `tail -n +2` to remove the header while ingesting the data.
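For illustration, the ingestion looks like the snippet below when run inside the `sqlite3` shell (a sketch; the file name `data/2020-Jan.csv` is illustrative, and the actual command lives in the `sql/` scripts):

```sql
-- Sketch: import a CSV while stripping its header line by piping the
-- file through tail; the file name below is illustrative
.mode csv
.import '| tail -n +2 data/2020-Jan.csv' event_raw
```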
Additionally, all columns in the table were created with the `TEXT` datatype. It is good practice to keep the schema of the ingestion layer of your data warehouse textual, unless there is a compelling reason to keep the same datatype as the source system (such as a partitioning column), so that your data pipeline does not break if the behaviour of the source system changes. If the source does change over time, it is also easier to maintain the pipeline when the raw data is already in your data warehouse.
Before cleaning the data, some assumptions were made about the behaviour of each field in order to generate the `event_clean` table.

The table below lists every checked assertion and the type of the issue. An error is a row that will not be inserted into `event_clean`, because it may indicate a malfunction of the source system (like a missing `event_time` record) and/or break the transformations used to build `event_clean` (like a non-numeric `user_id`). A warning is a row that may indicate a problem but was kept and treated, either because it seems to be an expected change in the behaviour of the source system (like a new `event_type` value) or because we can probably find the same information in other tables (like the `brand` record from our product data).

Ideally, we should monitor both types of assertions to keep track of the data quality in our pipeline.
PERIOD | COLUMN | ASSERTION | TYPE | # OCCURRENCES |
---|---|---|---|---|
2020-01 | price | price_with_negative_values | error | 39 |
2020-01 | event_time | NULL_or_empty_values | error | 0 |
2020-01 | product_id | product_id_non_numeric_values | error | 0 |
2020-01 | category_id | category_id_non_numeric_values | error | 0 |
2020-01 | user_id | user_id_with_non_numerical_values | error | 0 |
2020-01 | user_id | NULL_or_empty_values | error | 0 |
2020-02 | price | price_with_negative_values | error | 37 |
2020-02 | event_time | NULL_or_empty_values | error | 0 |
2020-02 | product_id | product_id_non_numeric_values | error | 0 |
2020-02 | category_id | category_id_non_numeric_values | error | 0 |
2020-02 | user_id | user_id_with_non_numerical_values | error | 0 |
2020-02 | user_id | NULL_or_empty_values | error | 0 |
2020-01 | category_code | NULL_or_empty_values | warning | 4,190,033 |
2020-01 | brand | NULL_or_empty_values | warning | 1,775,630 |
2020-01 | user_session | NULL_or_empty_values | warning | 1,314 |
2020-01 | event_time | outside_file_period | warning | 0 |
2020-01 | event_type | NULL_or_empty_values | warning | 0 |
2020-01 | event_type | event_type_outside_expected_values | warning | 0 |
2020-01 | product_id | NULL_or_empty_values | warning | 0 |
2020-01 | category_id | NULL_or_empty_values | warning | 0 |
2020-01 | price | NULL_or_empty_values | warning | 0 |
2020-01 | price | price_with_non_numerical_values | warning | 0 |
2020-02 | category_code | NULL_or_empty_values | warning | 4,079,497 |
2020-02 | brand | NULL_or_empty_values | warning | 1,825,908 |
2020-02 | user_session | NULL_or_empty_values | warning | 1,055 |
2020-02 | event_time | outside_file_period | warning | 0 |
2020-02 | event_type | NULL_or_empty_values | warning | 0 |
2020-02 | event_type | event_type_outside_expected_values | warning | 0 |
2020-02 | product_id | NULL_or_empty_values | warning | 0 |
2020-02 | category_id | NULL_or_empty_values | warning | 0 |
2020-02 | price | NULL_or_empty_values | warning | 0 |
2020-02 | price | price_with_non_numerical_values | warning | 0 |
The data above was extracted with the query in the `sql/099-clean-data-analysis.sql` file.
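As an illustration, an assertion such as `price_with_negative_values` boils down to a query of this shape (a sketch, not the literal contents of `sql/099-clean-data-analysis.sql`):

```sql
-- Sketch: count the rows that violate one assertion, per file period
SELECT
    substr(event_time, 1, 7)     AS period,        -- e.g. '2020-01'
    'price'                      AS column_name,
    'price_with_negative_values' AS assertion,
    'error'                      AS type,
    COUNT(*)                     AS occurrences
FROM event_raw
WHERE CAST(price AS REAL) < 0
GROUP BY 1;
```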
For each field, the data type and the transformations applied were determined based on the table below:
COLUMN | TRANSFORMATION |
---|---|
event_time | Removed UTC from the record; cast as DATETIME |
event_type | Possible missing values were replaced with "None" |
product_id | Possible missing values were replaced with 0; cast as INTEGER |
category_id | Possible missing values were replaced with 0; cast as INTEGER |
category_code | Possible missing values were replaced with "None" |
brand | Possible missing values were replaced with "None" |
price | Cast as REAL |
user_id | Cast as INTEGER |
user_session | Possible missing values were replaced with "None" |
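Put together, the cleaning step has roughly this shape (a sketch, not the literal contents of the cleaning script; the `CASE` and `CAST` expressions mirror the table above, and only two of the error filters are shown):

```sql
-- Sketch of the event_raw -> event_clean transformation
INSERT INTO event_clean
SELECT
    REPLACE(event_time, ' UTC', '')                                  AS event_time,
    CASE WHEN event_type IS NULL OR event_type = '' THEN 'None'
         ELSE event_type END                                         AS event_type,
    CAST(CASE WHEN product_id IS NULL OR product_id = '' THEN '0'
              ELSE product_id END AS INTEGER)                        AS product_id,
    CAST(CASE WHEN category_id IS NULL OR category_id = '' THEN '0'
              ELSE category_id END AS INTEGER)                       AS category_id,
    CASE WHEN category_code IS NULL OR category_code = '' THEN 'None'
         ELSE category_code END                                      AS category_code,
    CASE WHEN brand IS NULL OR brand = '' THEN 'None' ELSE brand END AS brand,
    CAST(price AS REAL)                                              AS price,
    CAST(user_id AS INTEGER)                                         AS user_id,
    CASE WHEN user_session IS NULL OR user_session = '' THEN 'None'
         ELSE user_session END                                       AS user_session
FROM event_raw
WHERE CAST(price AS REAL) >= 0                      -- error: negative price
  AND event_time IS NOT NULL AND event_time <> '';  -- error: missing event_time
```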
For the `daily_sales` table, the solution was to aggregate the `event_clean` data by transforming `event_time` into a `date` and summing, for each day, the `price` column.
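A minimal sketch of that aggregation (the actual query lives in `sql/030-daily-sales.sql`; the output column name is illustrative):

```sql
-- Sketch: total of the price column per day
SELECT
    DATE(event_time) AS date,
    SUM(price)       AS total_sales
FROM event_clean
GROUP BY DATE(event_time)
ORDER BY date;
```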
For the `daily_stats` table, the solution was also an aggregation of the `event_clean` data by transforming `event_time` into a `date`, counting both the occurrences of each event (`event_type = ${EVENT}`) and the distinct users (`user_id`) for each event.
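For one funnel step, the aggregation looks roughly like this (a sketch; `'view'` stands in for the `${EVENT}` parameter, and the output column names are illustrative):

```sql
-- Sketch: per-day event count and distinct-user count for one event type
SELECT
    DATE(event_time)                                               AS date,
    SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END)           AS view_count,
    COUNT(DISTINCT CASE WHEN event_type = 'view' THEN user_id END) AS view_users
FROM event_clean
GROUP BY DATE(event_time);
```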
For the `daily_funnel` table, the solution was to compute, based on the data already available in the `daily_stats` table, the ratio between each step of the funnel.

In this task, it is crucial to emphasise the importance of maintaining high data quality in our pipeline. If the expected `event_type` for any funnel step changes, we need to update our `daily_stats` table to prevent potential issues such as attempting to divide a conversion step by zero, which could break the pipeline. Also, we have to cast at least one column as `REAL` while dividing so that SQLite returns a `REAL` number.
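For example (a sketch; the columns `view_users` and `cart_users` are hypothetical names for two consecutive funnel steps in `daily_stats`):

```sql
-- Sketch: the REAL cast avoids SQLite's integer division, and the WHERE
-- clause guards against dividing by zero
SELECT
    date,
    CAST(cart_users AS REAL) / view_users AS view_to_cart_rate
FROM daily_stats
WHERE view_users > 0;
```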
The primary challenge in dealing with the `daily_ticket` table was to determine the most straightforward method for solving the percentile issue. SQLite has fewer statistical tools than other analytical data warehouses, such as BigQuery, Redshift, and Snowflake. Furthermore, a percentile is defined in the Encyclopedia Britannica as:

> Percentile, a number denoting the position of a data point within a numeric dataset by indicating the percentage of the dataset with a lesser value.

So, to calculate the 25th, 50th and 75th percentiles, `sql/060-daily-ticket.sql` was created following the algorithm below (a condensed sketch of the query follows the list):
- Calculate the `user_session_ticket_value` (the total `price` for a `user_session` and `date`);
- Count the number of tickets for each `date` (`daily_total_user_session_count`);
- Order (`ROW_NUMBER`) all tickets based on their `user_session_ticket_value`;
- Get the expected number of records for each day below the 25th, 50th and 75th percentiles (`(k_percentile/100) * daily_total_user_session_count = k_perc_rn`);
- Finally, look up, among the ordered tickets, exactly the `k_perc_rn`-th row for each day and take its `user_session_ticket_value`.
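A condensed sketch of this algorithm for a single percentile (the full version, covering the 25th, 50th and 75th percentiles, is in `sql/060-daily-ticket.sql`; edge handling such as rounding is omitted here):

```sql
-- Sketch: 50th-percentile ticket value per day, following the steps above
WITH ticket AS (        -- step 1: ticket value per user_session and date
    SELECT DATE(event_time) AS date,
           user_session,
           SUM(price)       AS user_session_ticket_value
    FROM event_clean
    GROUP BY DATE(event_time), user_session
),
ordered AS (            -- steps 2 and 3: count and order tickets per day
    SELECT date,
           user_session_ticket_value,
           ROW_NUMBER() OVER (PARTITION BY date
                              ORDER BY user_session_ticket_value) AS rn,
           COUNT(*) OVER (PARTITION BY date) AS daily_total_user_session_count
    FROM ticket
)
SELECT date,            -- steps 4 and 5: look up the row at position k_perc_rn
       user_session_ticket_value AS p50_ticket_value
FROM ordered
WHERE rn = CAST(50 / 100.0 * daily_total_user_session_count AS INTEGER);
```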
After this task, we have the generated lineage below:
In the final task, all queries were adjusted to use incremental logic for each load. The approach was to develop an agnostic solution, importing all `*.csv` files into a temporary table in `sql/010-import-data.sql` (`.import '| tail -n +2 -q data/*.csv'`) and then ingesting only the data relevant to the new period into `event_raw`.
If the database state is maintained and a CSV file with March 2020 events is placed in the `data/` folder, only the new period from the temporary table will be loaded into the database. This new period is calculated programmatically by selecting the most recent `event_time` in the `event_raw` table. The same approach was also used to load data into the `event_clean` table.
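The incremental filter boils down to a predicate like the one below (a sketch; `import_temp` is a hypothetical name for the temporary table):

```sql
-- Sketch: load only events newer than what event_raw already holds;
-- import_temp is a hypothetical name for the temporary import table
INSERT INTO event_raw
SELECT *
FROM import_temp
WHERE event_time > (SELECT COALESCE(MAX(event_time), '') FROM event_raw);
```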
Since all the assigned tables have a temporal reference (aggregating data based on `date`), the simplest solution was to process and insert the new data according to what was already present in each table, just as was done with the `event_raw` and `event_clean` tables.
This solution was created with a focus on completing the assigned request, but it may only be effective in specific scenarios:
- Low volume of data: since we are importing each file into a temporary table in the database, having many files in the `data/` folder means importing a significant amount of data that we won't use.
- 100% consistent `event_time`: if a new file has a single record with a timestamp like `2099-12-31 00:00:00`, no subsequent batch load will load anything. As we are importing CSVs, some human intervention could make this situation not so unlikely.
- Never reprocessing data: if some delayed events show up in a new `2020-Jan_v2.csv` file, we will be unable to import this data or even recreate the downstream tables with the new data. This situation is likely to happen in a real-life scenario.
What changes can we make to this data pipeline to ensure greater consistency? The diagram below shows a suggested architecture with the same tools already used in this assignment:
- Call the `010-import-data.sql` script to ingest the files from a `new/` folder (our "landing area");
- Import all files' data into the `event_raw` table. In this scenario, `event_raw` must be an insert-only table, and it is important that we alter it to add two new columns: `reference_period` and `imported_at`;
- Call both `020-clean-data.sql`, to generate the `event_clean` table, and a `move-processed-files.sh` bash script;
- The bash script will move the files from the `new/` folder to a `processed/` folder;
- Import into the `event_clean` table only the most recent events for each month. In this scenario, we do a full load of the most recent events based on the `reference_period` and `imported_at` columns, ensuring that only the most recent version of each period is used downstream (see the sketch after this list);
- Execute the `030-daily-sales.sql`, `040-daily-stats.sql` and `060-daily-ticket.sql` scripts. We can parallelise these three tasks, as there is no dependency between them;
- Update the `daily_sales`, `daily_stats` and `daily_ticket` tables;
- After finishing the update on the `daily_stats` table, execute the `050-daily-funnel.sql` script, as `daily_stats` is an upstream dependency of `daily_funnel`;
- Finally, update the `daily_funnel` table.
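The "most recent version of each period" load mentioned above could look like this (a sketch; a window function is one possible implementation):

```sql
-- Sketch: from the insert-only event_raw, keep only the rows belonging to
-- the latest import of each reference_period
WITH ranked AS (
    SELECT *,
           RANK() OVER (PARTITION BY reference_period
                        ORDER BY imported_at DESC) AS recency
    FROM event_raw
)
SELECT *
FROM ranked
WHERE recency = 1;  -- ties share the same imported_at, i.e. the same batch
```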
Of course, since we are working with a SQLite database, this approach still has limitations regarding data volume; it is more advisable to execute it in a more robust data warehouse. The approach addresses the lack of consistency when new data arrives or when we must update an already processed period. However, it also considerably increases the load on the downstream tables.
- Submitted solution
- Fixed a typo in `README`
- Fixed a typo in `Makefile` in the import data task (everything was strangely working; now it seems better)