Because there were already too many GitHub repos named "timecop", and when providing reporting to clients, there's always just one more thing.
In reporting on KPIs and other project management metrics, internally and to clients, it is common to report trends, histograms, and averages of durations between project phases or other tasks. This repo provides functions that take a DataFrame and conditions marking start and end points, and report time deltas back to the user.
`chronumbo` comes in two flavours:

- `add_event_delta_single()` - to track time between two events, commonly project phases
- `add_event_delta_pairs()` - to track time between multiple pairs of events, for recurring actions
While some terminology is fairly universal, many terms are infinitely varied across different industries, companies, and corporate cultures. For the purposes of this project, we'll define here a few placeholder terms.
- Project - these sorts of asks can be for projects, services, orders, bugs, etc.; for simplicity's sake, I will refer to these generically as "projects". While the particulars of these differ widely, the collection of data is broadly the same.
- Current State Table - a table which generally has one row per project, with the latest data for that project (e.g. current status, current assignee, etc.).
- Event Log Table - a table which has one row per event, and many rows per project.
- Project ID - the unique identifier for the Current State Table.
- Event ID - the unique identifier for rows in the Event Log Table.
The test dataset `../chronumbo/test/project-event-log.csv` represents an Event Log Table, and contains the following fields:
Event logs produced by most tools will have a similar structure, and the structure is fairly straightforward to replicate for those tools that do not produce such logs explicitly (e.g. Google Sheets, Smartsheet, etc.).
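For illustration, a minimal event log built in pandas might look like the following; the column names here are assumptions chosen to match the examples below, not necessarily the test dataset's actual fields:

```python
import pandas as pd

# a toy Event Log Table: one row per event, many rows per project
# (column names are illustrative, not the test dataset's actual fields)
event_log = pd.DataFrame({
    'project_id': [101, 101, 101, 102, 102],
    'event_date': pd.to_datetime(['2023-01-02 09:00', '2023-01-03 14:30',
                                  '2023-01-05 11:00', '2023-01-02 10:15',
                                  '2023-01-04 16:45']),
    'event': ['Status'] * 5,
    'description': ['Created', 'In Progress', 'Resolved',
                    'Created', 'Resolved'],
})
```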
The most basic implementation is simply calling either `add_event_delta_single()` or `add_event_delta_pairs()` with the following general logic:

- `df` - a DataFrame containing the Event Log, ordered by ID and date
- `id_field` - the Project ID field in the Event Log
- `event_date_field` - the event date field in the Event Log
- `start_conditions` - dictionary containing key-value pairs of fields and values marking the start point
- `end_conditions` - dictionary containing key-value pairs of fields and values marking the end point
- `start_at_earliest` - if True, starts at the first instance of the start condition
- `end_at_latest` - if True, ends at the latest instance of the end condition
- `use_earliest_if_no_start` - if True, starts at the first event in the log if there is no matching start condition*
- `use_latest_if_no_end` - if True, ends at the last event in the log if there is no matching end condition*
🚨 See Issue #3; currently, `add_event_delta_pairs()` has no `use_earliest_if_no_start` or `use_latest_if_no_end` parameters. This function looks for multiple start/end sets, and was initially designed explicitly for that purpose. There will be cases where the user will want to use the earliest and latest event dates to start or end a pair; this will be added.
You can feed a DataFrame into either function on your own, but for the less tech-savvy (and the lazy), I've included the following functions in this project to get you going with minimal work. See Additional helper functions below for more information.
```python
# create SQLAlchemy connection engine
engine = create_engine(
    db=db,
    dialect=dialect,
    user=user,
    password=password,
    endpoint=endpoint,
    verbose=verbose
)

# create DataFrame from SQL query
df = db_to_df(
    query=audit_query,
    engine=engine,
    verbose=verbose
)
```
Your DataFrame will need to be sorted:
```python
# sort by IDs and date
df_sorted = df.sort_values(by=[id_field, event_date_field])
```
For both `add_event_delta_single()` and `add_event_delta_pairs()`, given `col_prefix='project_res_time'`, the input DataFrame is returned with four additional fields:

- `project_res_time_start` - a flag indicating the start point; for debugging and reporting
- `project_res_time_end` - a flag indicating the end point; for debugging and reporting
- `project_res_time_delta` - the time delta as a string, e.g. '0 days 01:30:00'
- `project_res_time_delta_sec` - the time delta in seconds, e.g. 5400.0
🚨 See Issue #4; it may be best to include optional sorting inside the function. This may be added in the future.
To find the time delta between a single starting event and a single ending event, use `add_event_delta_single()`. This is useful in determining the time it takes for a project to be completed, or the time between stages of a project.
Below, the example looks for rows where the event field is 'Status' and the description is 'Created' as the start point for calculating a time delta, and where the description is 'Resolved' as the end point.
```python
# find start, end point pair; return df with added values
final_kpi_single_df = add_event_delta_single(
    df=df_sorted,
    col_prefix='project_res_time',
    id_field=id_field,
    date_field=event_date_field,
    start_conditions={'event': 'Status', 'description': 'Created'},
    end_conditions={'event': 'Status', 'description': 'Resolved'},
    start_flag='start',
    end_flag='end',
    end_at_latest=False,
    use_earliest_if_no_start=True,
    use_latest_if_no_end=True
)
```
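From there, standard pandas aggregation gets you to reportable numbers. A minimal sketch, assuming `project_res_time_delta_sec` is populated only on rows where a delta was calculated and `NaN` elsewhere:

```python
# sketch: mean resolution time in hours, assuming delta_sec is NaN
# on rows where no delta was calculated
res = final_kpi_single_df.dropna(subset=['project_res_time_delta_sec'])
mean_hours = res['project_res_time_delta_sec'].mean() / 3600
print(f'mean resolution time: {mean_hours:.1f} hours')
```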
To find each time delta between a series of start and end events, use `add_event_delta_pairs()`. This is useful in determining the average time of recurring events, like the time between a customer's correspondence and the project manager's response.

Below, the example looks for where a non-employee sends a message in the project management system, and where an employee responds. Once the end condition has been met, the function looks for the next start condition.
```python
# find start, end point pairs; return df with added values
final_kpi_paired_df = add_event_delta_pairs(
    df=df_sorted,
    col_prefix='correspondence',
    id_field=id_field,
    date_field=event_date_field,
    start_conditions={'event_type': 'Correspondence', 'is_employee': False},
    end_conditions={'event_type': 'Correspondence', 'is_employee': True},
    start_flag='start',
    end_flag='end'
)
```
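As with the single-event variant, the `_delta_sec` column lends itself to aggregation; here is a sketch of an average response time per project, under the same assumption that the delta is `NaN` wherever no pair was completed:

```python
# sketch: average response time per project, in minutes
pairs = final_kpi_paired_df.dropna(subset=['correspondence_delta_sec'])
avg_minutes = pairs.groupby(id_field)['correspondence_delta_sec'].mean() / 60
print(avg_minutes.head())
```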
A one-stop shop for creating a SQLAlchemy engine for MSSQL, PostgreSQL, or MySQL.
Many users will have their own constructions of this, or will simply use the basic SQLAlchemy functions to do this, but this is a helpful tool for doing the connection formatting work for you.
```python
# create SQLAlchemy connection engine
engine = create_engine(
    db=db,                # name of database
    dialect=dialect,      # 'postgres', 'mysql', or 'mssql'
    user=user,
    password=password,
    endpoint=endpoint,
    verbose=True          # if True, prints status to terminal; for dev and debug
)
```
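For comparison, doing this with SQLAlchemy directly looks roughly like the following; the exact URL prefix depends on your dialect and driver:

```python
from sqlalchemy import create_engine as sa_create_engine

# roughly equivalent, using SQLAlchemy's own create_engine with a URL;
# swap 'postgresql' for your dialect/driver (e.g. 'mysql+pymysql')
engine = sa_create_engine(f'postgresql://{user}:{password}@{endpoint}/{db}')
```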
While many users will have their own construction of this, this is a variant on `pd.read_sql()` with built-in error handling. Given a SQLAlchemy engine and a SQL query, it returns the query result as a DataFrame.
```python
query = """
SELECT * FROM project_audit_trail
"""

df = db_to_df(
    query=query,
    engine=engine,
    verbose=True
)
```
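Minus the error handling, this is roughly equivalent to calling `pd.read_sql()` yourself:

```python
import pandas as pd

# the bare-bones equivalent, without db_to_df()'s error handling
df = pd.read_sql(query, con=engine)
```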
Unlike the simplicity of `db_to_df()`, this function utilises `df.to_sql()` to push a DataFrame to SQL, with the optional functionality of handling dtypes between DataFrames and SQL to ensure a successful upload.
```python
df_to_db(
    engine=engine,
    df=df,
    tbl='event_log_with_time',     # name of SQL table to upload data to
    if_tbl_exists='replace',       # as with df.to_sql()
    retrieve_dtype_from_db=True,   # if True, recasts DataFrame with SQL field types
    dtype_override=None,           # dictionary of column names and dtypes
    chunksize=10000,
    verbose=True
)
```
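If you'd rather force specific column types than retrieve them from the database, a `dtype_override` mapping might look like the following; the column names and types here are illustrative, assuming the dictionary takes SQLAlchemy types, as `df.to_sql()`'s `dtype` argument does:

```python
from sqlalchemy.types import DateTime, Integer, Text

# hypothetical override: cast selected columns explicitly rather than
# retrieving types from the existing SQL table
df_to_db(
    engine=engine,
    df=df,
    tbl='event_log_with_time',
    if_tbl_exists='replace',
    retrieve_dtype_from_db=False,
    dtype_override={'project_id': Integer(),
                    'event_date': DateTime(),
                    'description': Text()},
    chunksize=10000,
    verbose=True
)
```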
Helper function to retrieve column types from SQL tables.
🚨 See Issue #5; this function currently only supports PostgreSQL.
```python
get_sql_col_types(
    engine=engine,
    tbl=tbl,
    verbose=True
)
```
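For reference, on PostgreSQL the same information is available from `information_schema`; here is a sketch of an equivalent query (not necessarily what the helper does internally):

```python
# sketch: fetch column types from PostgreSQL's information_schema
query = """
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'event_log_with_time'
"""
col_types_df = db_to_df(query=query, engine=engine, verbose=False)
```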
`the_time_keeper()` isn't specifically used in this project; time delta calculations are handled directly in `add_event_delta_single()` and `add_event_delta_pairs()`. I include it here because many event logs can be millions of rows long, and during dev and debug it can be helpful to gauge how long your operations take.
Example usage is as follows:
```python
start_time = the_time_keeper()  # init with no value passed to t; time script begins

# do stuff

the_time_keeper(start_time)     # pass value to t; print duration of task
```
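If you'd like the same pattern in your own projects, a minimal sketch (an illustrative stand-in, not chronumbo's actual implementation) might be:

```python
import time

# illustrative stand-in for the same pattern: no argument returns the
# current time; passing a start time prints the elapsed duration
def my_time_keeper(t=None):
    now = time.time()
    if t is not None:
        print(f'elapsed: {now - t:.2f} s')
    return now
```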