-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize extract from marketo #88
Open
haleemur
wants to merge
16
commits into
singer-io:master
Choose a base branch
from
haleemur:optimize-extract-from-marketo
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Optimize extract from marketo #88
haleemur
wants to merge
16
commits into
singer-io:master
from
haleemur:optimize-extract-from-marketo
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
haleemur
force-pushed
the
optimize-extract-from-marketo
branch
from
November 14, 2023 02:51
e2f16f0
to
f581d75
Compare
haleemur
changed the title
draft: Optimize extract from marketo
Optimize extract from marketo
Nov 14, 2023
Convert catalog to dict. Resolves `is not subscriptable` error
Optimize extract from marketo
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description of change
str.replace
& an additional layer of redirection via a generator expression when reading the saved csv file.csv.DictReader
for better ergonomics.available_fields
& individual fields' outputschema["type"]
) from within the inner loop to outside of the loop. This reduces the complexity required to process individual rows of the csv. This is done by two new methods:get_stream_fields
&get_output_typemap
.format_value
function (see below)export_start < job_started
because the left side's microsecond- truncated while the right side is not.attribution_window
user-configurable (currently this is hard-codedATTRIBUTION_WINDOW_DAYS = 1
)max_bookmark
within the loop iterating over the csv file when the marketo client supports corona (max_bookmark
is set outside of the loop)optimize
format_value
format_value
is used in the inner most loop of the csv file processing. as such it is performance critical.the existing method performs the following operations that harm performance.
isinstance
to check for types. this is unnecessary as the inputvalue
eitherNone
or of typestring
.str.find
in parsing integersfield_type
on each function call and determines the value via an if-else blockNone
.The optimized version
value is None or value == '' or value == 'null'
to return early if value should beNone
.isinstance
checks used in conversion to integer & boolean and replaces the logic with appropriate functions that can work with both string & non string types.str.find
and additional conditional logic withint(float(value))
_type
instead ofschema
and avoids looking upschema["type"]
._type
is either astring
or atuple
and this allows the method to replace conditions such as"boolean" in field_type
with_type == "boolean"
Manual QA steps
Note that we're running this tap in production, and have verified that both the csv processing time is reduced and the unnecessary extract due to microsecond-truncation is avoided
Below i'm copying over some of the internal validation that we did.
I verified performance of the
format_value
refactor wrt parsing integersThis may be verified by anyone by running the following in a jupyter notebook cell.
in my case, I observed the following
verified the performance improvement with different output types.
another snippet that can be run in a jupter notebook:
This resulted in the following output for me:
Visualizing Performance Gains
Cleaning the output and plotting the speedup yields the following chart (lower is better)
An interesting observation is that the date-time formatting takes about
5 µs
, whereas the rest of the timings of the optimized version are in the range of100 ns
This suggests that the performance of
format_values
is dominated by the parsing of datetime fields, and that we should look to optimizependulum.parse(value).isoformat()
. the version of pendulum used is from 2017, and the current beta version of pendulum rewrote several extensions in rust. We should update this dependency to benefit from any performance improvement.I also tested for the equivalency between the two methods using the above test cases
no assertion error was raised.
Risks
Rollback steps
Related Issues: