Optimize extract from marketo #88

haleemur · 2023-11-08T20:10:38Z

Description of change

adds more context to some log messages. this additional context has proved useful in identifying performance issues
reduces number of function calls to str.replace & an additional layer of redirection via a generator expression when reading the saved csv file.
Go back to using csv.DictReader for better ergonomics.
extracts some immutable computation (available_fields & individual fields' output schema["type"]) from within the inner loop to outside of the loop. This reduces the complexity required to process individual rows of the csv. This is done by two new methods: get_stream_fields & get_output_typemap.
optimize format_value function (see below)
prevent a final loop of data processing due to export_start < job_started because the left side's microsecond- truncated while the right side is not.
make attribution_window user-configurable (currently this is hard-coded ATTRIBUTION_WINDOW_DAYS = 1)
avoid unnecessary setting of max_bookmark within the loop iterating over the csv file when the marketo client supports corona (max_bookmark is set outside of the loop)

optimize `format_value`

format_value is used in the inner most loop of the csv file processing. as such it is performance critical.

the existing method performs the following operations that harm performance.

uses isinstance to check for types. this is unnecessary as the input value either None or of type string.
uses str.find in parsing integers
allocates field_type on each function call and determines the value via an if-else block
performs a membership check on a list to determine the output type (rather than an equality check)
emits a log message whenever a marketo-percent field is encountered. this harms performance for any workloads where multiple percent-fields are extracted.
uses multiple nested conditions to determine the right formatting logic to apply
allocates a list & performs a membership check against this list list to determine if the resulting value should be None.

The optimized version

Uses value is None or value == '' or value == 'null' to return early if value should be None.
Eliminates isinstance checks used in conversion to integer & boolean and replaces the logic with appropriate functions that can work with both string & non string types.
In 1 specific case for integer, it avoids calling str.find and additional conditional logic with int(float(value))
The new version also avoids emitting a log event as it seems that the handling of marketo-percentage custom type is correct.
the signature of the new function accepts _type instead of schema and avoids looking up schema["type"]. _type is either a string or a tuple and this allows the method to replace conditions such as "boolean" in field_type with _type == "boolean"

Manual QA steps

Note that we're running this tap in production, and have verified that both the csv processing time is reduced and the unnecessary extract due to microsecond-truncation is avoided

Below i'm copying over some of the internal validation that we did.

I verified performance of the `format_value` refactor wrt parsing integers

This may be verified by anyone by running the following in a jupyter notebook cell.

def to_int1(x):
    n = x.find('.')
    if n > 0:
        return int(x[:n])
    return int(x)

def to_int2(x):
    return int(float(x))

print("testing to_int1('65.00')")
%timeit to_int1('65.00')
print("testing to_int2('65.00')")
%timeit to_int2('65.00')

in my case, I observed the following

testing to_int1('65.00')
153 ns ± 1.5 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
testing to_int2('65.00')
78.7 ns ± 0.463 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

verified the performance improvement with different output types.

another snippet that can be run in a jupter notebook:

import pendulum
import singer

def format_value1(value, schema):  # 7 ifs + 3 nested-ifs
    if not isinstance(schema["type"], list):
        field_type = [schema["type"]]
    else:
        field_type = schema["type"]

    if value in [None, "", 'null']:
        return None
    elif schema.get("format") == "date-time":
        return pendulum.parse(value).isoformat()
    elif "integer" in field_type:
        if isinstance(value, int):
            return value

        # Custom Marketo percent type fields can have decimals, so we drop them
        decimal_index = value.find('.')
        if decimal_index > 0:
            value = value[:decimal_index]
        return int(value)
    elif "string" in field_type:
        return str(value)
    elif "number" in field_type:
        return float(value)
    elif "boolean" in field_type:
        if isinstance(value, bool):
            return value
        return value.lower() == "true"

    return value

def format_value3(value, _type): # 5 total conditions, 0 allocations
    """
    Format the values read from the csv line based on the stream's schema.

    NOTE: This function gets called on each field of each row, and is thus its performance
    is critical to processing the downloaded csv file.

    returns any of:
        None
        datetime string
        integer
        float
        boolean
        string 
        compoud data structure, dict / list: default last option
    """
    if value is None or value == '' or value == 'null':
        return None
    if "string" == _type:
        return str(value)
    if _type == "integer":
        return int(float(value))
    if _type == "number":
        return float(value)
    if _type == "date-time":
        return pendulum.parse(value).isoformat()
    if _type == "boolean":
        return str(value).lower() == "true"  
    return value
test_cases = {
    'any_return_null_0': (None, {"type": ["null", "string"]}),
    'any_return_null_1': ('', {"type": ["null", "string"]}),
    'any_return_null_2': ('null', {"type": ["null", "string"]}),
    'boolean_return_original_0': (True, {"type": ["null", "boolean"]}),
    'boolean_return_parsed_0': ("True", {"type": ["null", "boolean"]}),
    'string_return_original_0': ("123aljdfakjflkajljlkj23rjlkjajflkj43k1j5l1j5j134lja907b0981o5hb580rjbrb", {"type": ["null", "string"]}),
    'integer_return_original_0': (123, {"type": ["null", "integer"]}),
    'integer_return_parsed_0': ("123.123", {"type": ["null", "integer"]}),
    'integer_return_parsed_1': ("123", {"type": ["null", "integer"]}),
    'number_return_original_0': (123.23, {"type": ["null", "number"]}),
    'number_return_parsed_1': ("12.3", {"type": ["null", "number"]}),
    'number_return_parsed_0': ("123", {"type": ["null", "number"]}),
    'datetime_return_parsred_0': ('2023-10-31T22:29:13.243Z', {"type": ["null", "string"], "format": "date-time"}),
    'object_return_original_0': ({"hello": "world"}, {"type": ["null", "object"]}),
}

for name, case in test_cases.items():
    print(f'format_value1,{name},', end='')
    value, value_type = case
    %timeit format_value1(value, value_type)

    value_type3 = [schema.get("format", t) for t in value_type["type"] if t != "null"][0]
    print(f'format_value3,{name},', end='')
    %timeit format_value3(value, value_type3)

This resulted in the following output for me:

format_value1,any_return_null_0,52.2 ns ± 0.218 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,any_return_null_0,22.7 ns ± 0.101 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,any_return_null_1,60.5 ns ± 0.297 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,any_return_null_1,28 ns ± 3.22 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,any_return_null_2,68.5 ns ± 0.116 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,any_return_null_2,30.4 ns ± 0.189 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,boolean_return_original_0,180 ns ± 1.28 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,boolean_return_original_0,63.8 ns ± 0.582 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,boolean_return_parsed_0,205 ns ± 1.05 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
format_value3,boolean_return_parsed_0,54.5 ns ± 0.349 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,string_return_original_0,134 ns ± 1.43 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,string_return_original_0,52.7 ns ± 0.325 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,integer_return_original_0,126 ns ± 0.574 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,integer_return_original_0,64.6 ns ± 0.38 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,integer_return_parsed_0,279 ns ± 0.959 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
format_value3,integer_return_parsed_0,53.4 ns ± 0.2 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,integer_return_parsed_1,240 ns ± 1.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
format_value3,integer_return_parsed_1,53.7 ns ± 0.135 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,number_return_original_0,167 ns ± 0.364 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,number_return_original_0,67.8 ns ± 0.158 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,number_return_parsed_1,180 ns ± 0.299 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,number_return_parsed_1,55.8 ns ± 0.298 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,number_return_parsed_0,179 ns ± 0.336 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,number_return_parsed_0,54 ns ± 0.161 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,datetime_return_parsred_0,4.79 µs ± 28 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
format_value3,datetime_return_parsred_0,53.8 ns ± 0.115 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value1,object_return_original_0,178 ns ± 0.322 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
format_value3,object_return_original_0,65.7 ns ± 0.313 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Visualizing Performance Gains

Cleaning the output and plotting the speedup yields the following chart (lower is better)

An interesting observation is that the date-time formatting takes about 5 µs, whereas the rest of the timings of the optimized version are in the range of 100 ns

This suggests that the performance of format_values is dominated by the parsing of datetime fields, and that we should look to optimize pendulum.parse(value).isoformat(). the version of pendulum used is from 2017, and the current beta version of pendulum rewrote several extensions in rust. We should update this dependency to benefit from any performance improvement.

I also tested for the equivalency between the two methods using the above test cases

for name, case in test_cases.items():
    assert (format_value1(*case) is None and format_value2(*case) is None) or (format_value1(*case) == format_value2(*case)), f'equivalence assertion failed for case: {case}'

no assertion error was raised.

Risks

i don't think there are any risks. that at this point, these changes have been tested in production in our org

Rollback steps

revert this branch

Related Issues:

closes: tap-marketo seems to suffer from several performance issues #87

Convert catalog to dict. Resolves `is not subscriptable` error

Optimize extract from marketo

haleemur added 8 commits November 8, 2023 14:15

add more context to log messages

813833d

optimize csv processing & prevent extra iteration

0bf51af

add config var for attribution_window & update README

91a1b17

update test mocks in response to code changes

f32da13

test new format_value, format_values & get_output_schema functions

ee3e0be

move filtering available fields out of the inner loop

c1a2229

fix test fixture schema breadcrumb

f667d1c

remove unused private method poll_export

1b2d5a4

haleemur marked this pull request as draft November 12, 2023 15:05

haleemur added 2 commits November 13, 2023 10:07

evaluate field types outside of for loop & optimize format_value

2d218b3

make attribution-window user-configurable

f581d75

haleemur force-pushed the optimize-extract-from-marketo branch from e2f16f0 to f581d75 Compare November 14, 2023 02:51

haleemur changed the title ~~draft: Optimize extract from marketo~~ Optimize extract from marketo Nov 14, 2023

haleemur marked this pull request as ready for review November 14, 2023 03:11

update readme

e18ca00

haleemur mentioned this pull request Nov 16, 2023

Readme incomplete #80

Open

Hal Ali and others added 5 commits November 15, 2023 22:33

Convert catalog to dict. Resolves is not subscriptable error

0f865cc

Merge pull request #1 from haleemur/fix/catalog-attribute-access

7ae9314

Convert catalog to dict. Resolves `is not subscriptable` error

Merge pull request #2 from haleemur/optimize-extract-from-marketo

7d67c63

Optimize extract from marketo

alias import of sync in __init__ to prevent name collision

8896953

Merge branch 'master' into optimize-extract-from-marketo

846ba06

edgarrmondragon mentioned this pull request Oct 2, 2024

tap-marketo to Singer SDK MeltanoLabs/Singer-Most-Wanted#103

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize extract from marketo #88

Optimize extract from marketo #88

haleemur commented Nov 8, 2023 •

edited

Loading

Optimize extract from marketo #88

Are you sure you want to change the base?

Optimize extract from marketo #88

Conversation

haleemur commented Nov 8, 2023 • edited Loading

Description of change

optimize format_value

Manual QA steps

I verified performance of the format_value refactor wrt parsing integers

verified the performance improvement with different output types.

Visualizing Performance Gains

Risks

Rollback steps

Related Issues:

haleemur commented Nov 8, 2023 •

edited

Loading

optimize `format_value`

I verified performance of the `format_value` refactor wrt parsing integers