Hum Engagement Time Machine

Griffin McCauley, Eric Tria, Theo Thormann, Jake Weinberg

UVA MSDS Capstone Project 2023

Project Overview

Using first-party data collected from the customer data platform (CDP) Hum, we developed a model that accurately classifies the online readers of an academic publisher as high- or low-quality based on their early-stage engagement profiles. Hum's relational database contains over a dozen tables and almost 100 features in total. From these tables, we engineered four new variables that illuminate the differences between high-value and low-value user behavior and serve as the basis of our analysis. Through a combination of k-means clustering to determine training labels and a multilayer perceptron (MLP) to predict which cluster each of our client's users belongs to, we identified which characteristics are indicative of high- versus low-quality engagement. We also demonstrated our model's ability to distinguish between these two profile types from only a small volume of user data.

To keep our analysis interpretable and marketable for our sponsor, we purposefully limited our classification to two clusters. This revealed striking patterns across the four features that strongly resemble the tendencies of a high-quality user: a low number of articles read per event (signifying deeper engagement), a lower percentage of content reached through Google (as opposed to a more scholarly source such as PubMed or the publisher itself), a lower percentage of content read that was an article (indicating engagement with figures and tables), and a high number of events performed per day on the platform. Our MLP model leverages these four features derived from a user's first 16 events (roughly equivalent to four article reads, given how events are tracked in the platform), and based solely on them it predicts whether a user is high- or low-quality with 95% accuracy. While the data cleaning and model performance produced through this project are valuable on their own, the engineered features and model framework can now also serve as foundational components in the burgeoning field of digital academic publisher engagement.

Primary Repository Contents

  • Code_Archive

    • Notebooks produced during the course of this project that document various iterations and previous versions of our model and codebase
  • Data_Archive

    • A collection of .csv files used in the training and testing of some of our early-stage models
  • Final

    • The code, data, and documentation for our final model
    • (Note: Hum developers should also refer to the /AWS/lib/ and /AWS/notebooks/ folders located inside for our fully integrated final models)
  • Resources

    • Documents chronicling our progress over the course of the year and supplemental administrative materials related to our team's composition and organization

Data

Data is accessed through Snowflake and Snowpark.

The project mainly uses the Event, Profile, and Content tables.
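For anyone exploring the data interactively, a minimal Snowpark connection sketch is shown below. The connection parameters are placeholders and the table names follow the Data Description below; the project's actual query logic lives in /Final/AWS/lib/snowpark_runner.py.

```python
# Minimal Snowpark connection sketch. All connection parameters are
# placeholders; the real queries live in /Final/AWS/lib/snowpark_runner.py.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account_identifier>",  # placeholder
    "user": "<user>",                   # placeholder
    "password": "<password>",           # placeholder
    "warehouse": "<warehouse>",         # placeholder
    "database": "<database>",           # placeholder
    "schema": "<schema>",               # placeholder
}

session = Session.builder.configs(connection_parameters).create()

# The three tables this project mainly uses
events = session.table("EVENT")
profiles = session.table("PROFILE")
content = session.table("CONTENT")

print(events.count())  # sanity check: total number of event rows
```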

Data Description

Event:

| Column | Type | Description |
| --- | --- | --- |
| CLIENT | VARCHAR | ID for the client |
| ID | VARCHAR | Unique ID for the event in each row |
| TAGS | VARIANT (JSON) | Tags or topics of the content |
| META | VARIANT (JSON) | Metadata for each event |
| DAY | DATE | Date when the event occurred |
| KEYWORDS | VARIANT (JSON) | Keywords used in the content |
| REFERER | VARCHAR | The source or link the event came from |
| UTM_CAMPAIGN | VARCHAR | To be discussed with Hum |
| UTM_CONTENT | VARCHAR | To be discussed with Hum |
| UTM_MEDIUM | VARCHAR | To be discussed with Hum |
| UTM_SOURCE | VARCHAR | To be discussed with Hum |
| UTM_TERM | VARCHAR | To be discussed with Hum |
| SET_PROFILE | VARCHAR | ID connecting to the Profile table |
| SET_USER | VARCHAR | User email |
| IP | VARCHAR | IP address of a user |
| USER_AGENT | VARCHAR | User agent of a user |
| SOURCE | VARCHAR | Source of the content; for this project: "rupress" |
| URL | VARCHAR | URL of the content |
| VISITOR_ID | VARCHAR | Unique ID per visitor (to be confirmed with Hum whether this is per session) |
| DATE | TIMESTAMP | Timestamp of the event |
| EVENT | VARCHAR | Event type |
| CONTENT_ID | VARCHAR | ID of the content |
| CREATED | TIMESTAMP | Timestamp of when the event was created |
| UPDATED | TIMESTAMP | Timestamp of when the event was last updated |

Profile:

| Column | Type | Description |
| --- | --- | --- |
| CLIENT | VARCHAR | ID for the client |
| ID | VARCHAR | Unique ID for the row; connects with Event SET_PROFILE |
| USER_ID | VARCHAR | Unique ID for each user |
| EMAILS | VARCHAR | Email addresses associated with a user |
| CAMPAIGNS | VARIANT (JSON) | Campaigns a user participated in |
| CREATED | TIMESTAMP | Timestamp of when a user was created |
| UPDATED | TIMESTAMP | Timestamp of when a user was last updated |
| DOMAINS | VARIANT (JSON) | Domains that a user has visited |
| FIRST_VISIT | TIMESTAMP | Timestamp of when a user first visited the platform |
| IDENTIFIED_ON | TIMESTAMP | To be discussed with Hum |
| IDENTIFYING_REFERER | VARCHAR | To be discussed with Hum |
| IDENTIFYING_UTM | VARCHAR | To be discussed with Hum |
| LAST_ACTIVE | TIMESTAMP | Timestamp of when a user was last active on the platform |
| ORGANIZATION_IDS | ARRAY | Organizations that a user is part of |
| SEGMENTS | ARRAY | To be discussed with Hum |
| PROPERTIES | VARIANT (JSON) | To be discussed with Hum |
| METRICS | VARIANT (JSON) | To be discussed with Hum |
| PERCENTILES | VARIANT (JSON) | To be discussed with Hum |
| USER_SIDS | ARRAY | To be discussed with Hum |

Content:

| Column | Type | Description |
| --- | --- | --- |
| CLIENT | VARCHAR | ID for the client |
| ID | VARCHAR | Unique ID for the row |
| CONTENT_ID | VARCHAR | Unique ID for each piece of content |
| KEYWORDS | ARRAY | Keywords associated with the content |
| DOWNLOAD_SLIDE | DOUBLE | Number of times the content had a download slide event |
| PDF_CLICK | DOUBLE | Number of times the content had a PDF click event |
| PAGEVIEW | DOUBLE | Number of times the content had a page view event |
| POST_READ | DOUBLE | Number of times the content had a post read event |
| POST_READ_MID | DOUBLE | Number of times the content had a post read mid event |
| POST_READ_START | DOUBLE | Number of times the content had a post read start event |
| POST_READ_END | DOUBLE | Number of times the content had a post read end event |
| SCROLL | DOUBLE | Number of times the content had a scroll event |
| EXCERPT | VARCHAR | Text excerpt from the content |
| CONTENT | VARCHAR | Description of the content |
| SCORE | DOUBLE | Sum of all the event count columns |
| SOURCE | VARCHAR | Source of the content |
| TITLE | VARCHAR | Title of the content |
| TYPE | VARCHAR | Type of the content |
| URL | VARCHAR | URL of the content |
| CREATED | TIMESTAMP | Timestamp of when the content was created |
| UPDATED | TIMESTAMP | Timestamp of when the content was last updated |

Query Columns

  • Final queries used can be found at /Final/AWS/lib/snowpark_runner.py

  • Difference between the classification and clustering queries (see the sketch after this list):

    • Classification: computes features using only the first X events of a user
    • Clustering: computes features using all events of a user
  • Queries cover the period from the start of 2022 to the current date
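
Continuing from the connection sketch above, the snippet below illustrates the difference between the two query styles using the documented Event columns (SET_PROFILE, DATE, DAY). It is an illustrative sketch only; the project's actual queries live in /Final/AWS/lib/snowpark_runner.py.

```python
# Illustrative sketch of the clustering vs. classification query styles,
# reusing the `events` DataFrame from the connection sketch above.
from snowflake.snowpark import Window
from snowflake.snowpark.functions import col, row_number

X = 16  # event threshold used by the final model

# Both query styles cover events from the start of 2022 onward.
recent_events = events.filter(col("DAY") >= "2022-01-01")

# Clustering-style: features are computed over ALL of a user's events.
clustering_events = recent_events

# Classification-style: features are computed over only each user's
# first X events, ordered by event timestamp.
first_x_events = (
    recent_events
    .with_column(
        "RN",
        row_number().over(
            Window.partition_by(col("SET_PROFILE")).order_by(col("DATE"))
        ),
    )
    .filter(col("RN") <= X)
)
```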

Raw Query Columns

  • REACHED_X_EVENTS: flag if a user reaches the event threshold
  • RECENT_LAST_EVENT: flag if a user's latest event is within the last 21 days
  • EVENT_CYCLES: number of active periods separated by idle periods
    • An idle period is roughly 72 hours, or 3 days
    • The idle threshold is computed from the average time between a user's events
    • Assuming events arrive according to a Poisson process, we modeled the time between events as an exponential random variable; then:
      • Estimate the rate parameter lambda as the reciprocal of the mean event gap
      • Use the 95% quantile of the CDF, -ln(0.05)/lambda (roughly 3 times the mean gap), as the idle period length (see the sketch after this list)
  • DISTINCT_ARTICLES: distinct number of articles that a user has interacted with
  • PERCENT_GOOGLE_ARTICLES: percent of a user's articles interacted with that originated from a Google search
  • PERCENT_ARTICLE_CONTENT: percent of a user's content interacted with that is an article
  • AVERAGE_CONTENT_SCORE: average score of the content that a user interacted with
  • DAYS_TO_X_EVENTS: number of days it took a user to reach X events
  • EVENTS: total number of events
  • FIRST_EVENT_TIME: timestamp of a user's first event
  • LATEST_EVENT_TIME: timestamp of a user's latest event
  • DISTINCT_DAYS: number of distinct days that a user is active on the platform
  • ARTICLES_PER_EVENT: number of distinct articles divided by the number of events
  • EVENT_DENSITY: number of events divided by the number of distinct days
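
The idle-period computation under EVENT_CYCLES admits a short closed form: if the gaps between a user's events are exponentially distributed with rate lambda, the 95% quantile of the CDF is -ln(0.05)/lambda, roughly three times the mean gap. A minimal sketch, with illustrative timestamps rather than project data:

```python
# Sketch of the EVENT_CYCLES idle threshold, assuming inter-event gaps are
# exponentially distributed. Timestamps below are illustrative only.
import numpy as np

def idle_threshold_hours(event_times_hours: np.ndarray) -> float:
    """95% quantile of an Exponential(lambda) fit to the inter-event gaps."""
    gaps = np.diff(np.sort(event_times_hours))
    mean_gap = gaps.mean()  # MLE of 1/lambda for an exponential distribution
    # Solve F(t) = 1 - exp(-lambda * t) = 0.95 for t:
    # t = -ln(0.05) / lambda = -ln(0.05) * mean_gap, about 3 * mean_gap
    return -np.log(0.05) * mean_gap

def event_cycles(event_times_hours: np.ndarray) -> int:
    """Number of active periods separated by idle-length gaps."""
    gaps = np.diff(np.sort(event_times_hours))
    return int((gaps > idle_threshold_hours(event_times_hours)).sum()) + 1

# Events roughly one day apart give a threshold of ~72 hours, one cycle.
times = np.array([0.0, 25.0, 47.0, 74.0, 96.0])
print(round(idle_threshold_hours(times), 1), event_cycles(times))
```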

Final Model Features

Final features used in the model:

  • ARTICLES_PER_EVENT
  • PERCENT_GOOGLE_ARTICLES
  • PERCENT_ARTICLE_CONTENT
  • EVENT_DENSITY

MLP Training

Models

Hum employees or individuals with access to Hum's Snowflake and AWS systems should refer to the code located in /Final/AWS/lib/models.py and /Final/AWS/notebooks/ for our fully integrated and packaged Python models and Jupyter notebooks, respectively.

For everyone else, the methodology and results of our model can be reproduced locally using the files in /Final/. The following sections outline how to sequentially perform the clustering and classification using static datasets previously extracted from Snowflake.

Clustering

To determine the training labels for our data, we applied k-means clustering to the engineered features described above, derived from the users' entire event sequences. The entire process is laid out in the notebook /Final/Clustering/Clustering.ipynb, and the data required to execute this clustering can be downloaded from /Final/Clustering/reached_16_all.csv.
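
A minimal sketch of this step with scikit-learn is below, assuming the CSV carries the four feature columns named under Final Model Features; standardization is an assumption, and the notebook's exact preprocessing may differ.

```python
# Minimal k-means sketch for deriving the two training labels.
# Assumes reached_16_all.csv contains the four final feature columns;
# standardizing first is an assumption, not necessarily the notebook's choice.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

FEATURES = [
    "ARTICLES_PER_EVENT",
    "PERCENT_GOOGLE_ARTICLES",
    "PERCENT_ARTICLE_CONTENT",
    "EVENT_DENSITY",
]

df = pd.read_csv("Final/Clustering/reached_16_all.csv")
X = StandardScaler().fit_transform(df[FEATURES])

# Two clusters, matching the high-/low-quality framing in the overview.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["CLUSTER"] = kmeans.fit_predict(X)
```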

Classification

Once the training labels were generated, we constructed an MLP model to perform user classification based on the same engineered features, but derived only from the users' first 16 events. The model building, training, and evaluation process is demonstrated in the notebook /Final/Classification/Classification.ipynb, and the data necessary to support training can be downloaded and accessed from /Final/Classification/training_labels.csv and /Final/Classification/reached_16_first_16.csv.
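
A minimal sketch of the classification step using scikit-learn's MLPClassifier follows; the hidden-layer sizes, the framework, and the "CLUSTER" label column name are assumptions, and the two CSVs are assumed to be row-aligned.

```python
# Minimal MLP classification sketch. The hidden-layer sizes, the "CLUSTER"
# label column, and row alignment of the two CSVs are assumptions.
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

FEATURES = [
    "ARTICLES_PER_EVENT",
    "PERCENT_GOOGLE_ARTICLES",
    "PERCENT_ARTICLE_CONTENT",
    "EVENT_DENSITY",
]

features = pd.read_csv("Final/Classification/reached_16_first_16.csv")
labels = pd.read_csv("Final/Classification/training_labels.csv")

X = StandardScaler().fit_transform(features[FEATURES])
y = labels["CLUSTER"]  # hypothetical label column name

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)

print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```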

Results

The final results and performance metrics from our model are summarized by the graphics below.

Our MLP model converged quickly to a stable parameterization, achieving an accuracy of 94% in fewer than 10 epochs:

MLP Training

The soft prediction scores for the hold-out validation set correspond to the following ROC curve, with an AUC of 0.96:

ROC Curve

When choosing a threshold that balances True Positives and True Negatives equally, our predictions produced the following confusion matrix and associated True/False Positive/Negative rates:

Confusion Matrix

Acknowledgment

We would like to acknowledge the contributions of the Hum staff, specifically Dr. Will Fortin, Niall Little, and Dylan DiGioia, to this project. We would also like to thank our capstone advisor, Dr. Judy Fox, for her assistance with this project.

Full Repository Manifest

  • Code_Archive/

    • eda/
      • eda.ipynb
      • eda.py
      • eda_features.ipynb
      • env-format.txt
    • resources/
      • aws_exeuction_role.png
      • aws_sagemaker_notebook.png
      • aws_tags.png
    • ClusterAnalysis.ipynb
    • FinalModel.ipynb
    • HumMLP.ipynb
    • HumMLP_kNN.ipynb
    • IdleSequenceLengths.ipynb
    • RNN_sql_cleaning.ipynb
    • RNNdatacleaning.ipynb
    • profile_event.ipynb
    • stacked_hist.ipynb
  • Data_Archive/

    • RNNdata.csv
    • data.md
    • hum_schema.png
    • new_features_40.csv
    • reached_16_all.csv
    • reached_16_first_16.csv
    • training_labels.csv
  • Final/

    • AWS
      • lib/
        • aws_helper.py
        • file_helper.py
        • models.py
        • snowpark_conn.py
        • snowpark_runner.py
      • notebooks/
        • classification.ipynb
        • clustering.ipynb
        • data_extraction.ipynb
        • de_requirements.txt
      • AWS_setup.py
    • Classification/
      • Classification.ipynb
      • reached_16_first_16.csv
      • training_labels.csv
    • Clustering/
      • Clustering.ipynb
      • reached_16_all.csv
    • Deliverables/
      • Capstone Final Presentation.pdf
      • Capstone Poster.pdf
      • Final Report.pdf
  • Resources/

    • 02-13Update.pdf
    • Budget Proposal.pdf
    • CapstoneProjectBudget.pdf
    • DS6013 Capstone Project Proposal.pdf
    • EngagementTimeMachine_ProgressReport1.pdf
    • Final Project Report.pdf
    • Introduction - Hum-UVA_EngagementTimeMachine.pptx.pdf
    • Project Proposal.md
    • Retention Model Proposal v1.pdf
    • Talent Dashboard.pdf
    • Team Charter.pdf
    • Weekly Progress.md
    • confusion_matrix.png
    • eda.png
    • literature.txt
    • methods.png
    • model_2.png
    • rates.png
    • roc.png
    • training.png
  • css/

    • Ernest&Emily.otf
    • SVG_Trade_Gothic_XBold.otf
    • controls.css
    • custom.css
    • frame.css
    • widgets.css
  • static/

    • css/
      • bulma-carousel.min.css
      • bulma-slider.min.css
      • bulma.min.css
      • fontawesome.all.min.css
      • index.css
    • images/
      • ccd16.png
      • cluster.png
      • cm.png
      • contentT.png
      • data.png
      • evals.png
      • evecs.png
      • eventT.png
      • favicon.ico
      • mlp.png
      • msds.png
      • pca.png
      • pipe.png
      • profile.png
      • profileT.png
      • roc.png
      • team.png
      • train.png
      • varprop.png
    • js/
      • bulma-carousel.min.js
      • bulma-slider.min.js
      • fontawesome.all.min.js
      • index.js
  • .gitignore

  • LICENSE

  • README.md

  • index.html