Airflow process - CBS data backfill #9
We are not using S3 at the moment; we have all the files in the airflow data directory - https://airflow-data.anyway.co.il/cbs/files/
There is no point in setting load_start_year to 2020, since that is the default (current year - 1).
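For context, a minimal sketch of how such a "current year - 1" default might be computed; this is illustrative only, not the actual anyway-etl code:

```python
from datetime import datetime
from typing import Optional

# Illustrative sketch only: the default load_start_year is simply the previous
# calendar year, so explicitly passing 2020 (while running in 2021) is redundant.
def default_load_start_year(now: Optional[datetime] = None) -> int:
    now = now or datetime.now()
    return now.year - 1
```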
development complete, deployed to dev
Initiated 2 DAG runs on dev for testing; assigning to @atalyaalon to test and make a release.
@OriHoch it seems that 2020 data is not in our DB, even though the default load_start_year is 2020 as you mentioned. It looks like data from 2020 and 2021 is deleted in that case, but only data from 2021 is loaded in the CBS process (from s3). Can you take a look?
How do you check this?
@OriHoch we are not loading any data from s3 - how did you check that data from 2020 is not loaded into the DB?
Why aren't we loading from s3?
All the data is available in our storage; S3 is not used at all at the moment - https://airflow-data.anyway.co.il/cbs/files/
Could you share the query you used?
https://app.redash.io/hasadna/queries/1008428 |
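For reference, the kind of per-year check being discussed could look roughly like the sketch below. The connection string, the table name markers_hebrew, and the column accident_year are placeholders for illustration and may not match the actual anyway schema or the Redash query above.

```python
# Rough sketch of a per-year row count, to spot a missing 2020 backfill.
# The DSN, table name, and column name below are placeholders, not the real schema.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host/anyway")  # placeholder DSN
counts = pd.read_sql(
    "SELECT accident_year, COUNT(*) AS n "
    "FROM markers_hebrew GROUP BY accident_year ORDER BY accident_year",
    engine,
)
print(counts)  # a missing or tiny 2020 row would indicate the gap discussed here
```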
@OriHoch we would like to extract updated data by tomorrow evening (Monday) for a specific report. Is it possible for you to take a look tomorrow morning? Before moving to Airflow, the CBS process used the load_start_year var to extract the relevant data from s3 - all the data starting from that year - so we didn't have holes in the data.
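For context on the behaviour described above - selecting everything in s3 from load_start_year onward - a rough sketch is shown below. The bucket name, prefix, and year-in-key layout are assumptions for illustration, not necessarily the layout used by the old process.

```python
# Rough sketch: keep only s3 keys whose year component is >= load_start_year.
# Bucket/prefix names and the "<prefix>/<year>/<file>" key layout are assumptions.
import boto3

def list_keys_from_year(bucket: str, prefix: str, load_start_year: int):
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            first_part = key[len(prefix):].lstrip("/").split("/")[0]
            try:
                year = int(first_part)
            except ValueError:
                continue  # key does not follow the assumed year layout
            if year >= load_start_year:
                yield key
```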
I don't think I'll be able to do it that soon |
@OriHoch no worries - I'm taking care of it for now using Jenkins :) |
|
Hi @OriHoch,
@atalyaalon this separation exists - each step of the DAG is completely independent, as you wrote in your comment: https://github.com/hasadna/anyway-etl/blob/main/airflow_server/dags/cbs.py#L17
Regarding S3 - we replaced it with the local data storage, which is available here; if you want, we can copy that over to S3 too.
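To illustrate the point about step independence - this is only a schematic sketch, not the actual contents of cbs.py; the task ids and commands are made up:

```python
# Schematic sketch: each CBS step is its own Airflow task, and no dependency
# (e.g. parse_files >> import_to_db) is declared, so each task can be
# triggered and re-run on its own. Task ids and commands are illustrative.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("cbs_sketch", start_date=datetime(2021, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    parse_files = BashOperator(task_id="parse_files",
                               bash_command="python main.py process cbs --steps parse")
    import_to_db = BashOperator(task_id="import_to_db",
                                bash_command="python main.py process cbs --steps import")
```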
@OriHoch Thanks. If so, perhaps it's better to copy it only to S3 and not to local storage. The reason is that we might want to reload multiple previous years (and the email process only loads the latest years), and it's important for us to maintain all files in S3.
Let me clarify - I think the consistent location for saving the file data should be s3, not local storage. That's why I think local storage should not be used, for the sake of data consistency.
ok, please open a new issue for this and we can discuss it there, it's not related to the CBS data backfill |
Thanks! Opened #17
fixed in #16 (pending merge & deploy) |
Create an Airflow process that allows CBS data backfill from s3 (without importing from email), with a load_start_year parameter that can be changed by the Airflow user. The relevant command: `python main.py process cbs --source s3 --load_start_year 2020`
We had a Jenkins process that enabled such backfill.
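A minimal sketch of what such a parameterized backfill DAG could look like, using the command above and reading load_start_year from the manual-trigger configuration; this is only an illustration, not the implementation that was actually deployed:

```python
# Sketch of a manually-triggered backfill DAG; an Airflow user can override
# load_start_year via "Trigger DAG w/ config", e.g. {"load_start_year": 2015}.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "cbs_backfill_sketch",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # manual trigger only
    catchup=False,
) as dag:
    backfill = BashOperator(
        task_id="cbs_backfill_from_s3",
        bash_command=(
            "python main.py process cbs --source s3 "
            "--load_start_year {{ dag_run.conf.get('load_start_year', 2020) }}"
        ),
    )
```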