
Airflow process - CBS data backfill #9

Open

atalyaalon opened this issue Sep 21, 2021 · 21 comments · Fixed by #10
@atalyaalon
Contributor

Create an airflow process that allows CBS data backfill from s3 (without importing from email), with a load_start_year parameter that can be changed by the airflow user. The relevant command:

python main.py process cbs --source s3 --load_start_year 2020

We had a Jenkins process that enabled such a backfill.

@OriHoch
Contributor

OriHoch commented Sep 21, 2021

we are not using S3 at the moment, we have all the files in the airflow data directory - https://airflow-data.anyway.co.il/cbs/files/

@OriHoch
Contributor

OriHoch commented Sep 21, 2021

there is no point in using load_start_year 2020, as that is already the default (current year - 1)
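
(For illustration, the default presumably derives from the current date along these lines; this is an assumption, not the anyway-etl code:)

import datetime

# "current year - 1", i.e. 2020 while running during 2021 (assumed derivation)
default_load_start_year = datetime.date.today().year - 1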

@OriHoch
Contributor

OriHoch commented Sep 21, 2021

development complete, deployed to dev
to run a backfill, do a manual dag run with the following example json:

{"load_start_year": 2019}

initiated 2 dag runs on dev for testing:

  1. default run without any parameters
  2. run with load_start_year 2019

assigning to @atalyaalon to test and make a release

OriHoch assigned atalyaalon and unassigned OriHoch Sep 21, 2021
@atalyaalon
Contributor Author

@OriHoch it seems 2020 data is not in our DB, even though the default load_start_year is 2020 as you mentioned. It looks like data from 2020 and 2021 is deleted, but only data from 2021 is loaded in the CBS process (from s3). Can you take a look?

@OriHoch
Contributor

OriHoch commented Oct 17, 2021

how do you check this?

OriHoch assigned OriHoch and unassigned atalyaalon Oct 17, 2021
@atalyaalon
Contributor Author

@OriHoch
When load_start_year is 2020, data from 2020 onward is deleted.
Now the question is what data is loaded from s3? I assume data from 2020 is not loaded for some reason

@OriHoch
Contributor

OriHoch commented Oct 17, 2021

we are not loading any data from s3

how did you check that the 2020 data is not loaded into the DB?

@atalyaalon
Contributor Author

Why aren't we loading from s3?
I queried the markers table in our DB; there are no accidents in 2020
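
(For illustration, a quick check along these lines; the actual query used is posted later in the thread:)

-- hypothetical spot check; table and column names as used elsewhere in this thread
select count(*) from markers_hebrew where accident_year = 2020;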

@OriHoch
Contributor

OriHoch commented Oct 17, 2021

all the data is available in our storage, S3 is not used at all at the moment - https://airflow-data.anyway.co.il/cbs/files/

could you write the query you used?


@atalyaalon
Contributor Author

@OriHoch we would like to extract updated data by tomorrow evening (Monday) for a specific report. Is it possible for you to take a look tomorrow morning? Before moving to airflow, the cbs process used the load_start_year var to extract the relevant data from s3 (all the data from that year onward), so we didn't have holes in the data.

@OriHoch
Contributor

OriHoch commented Oct 18, 2021

I don't think I'll be able to do it that soon

atalyaalon added this to the v.0.11.0 milestone Oct 19, 2021
@atalyaalon
Contributor Author

@OriHoch no worries - I'm taking care of it for now using Jenkins :)
I will soon add here a description of the steps that need to be done - but no rush - I'm working on a quick solution for now, until you're available to fix airflow (I disabled the airflow cbs process for now).
Thanks for everything!

@OriHoch
Contributor

OriHoch commented Nov 21, 2021

select
    accident_year,
    accident_severity_hebrew,
    count(*)
from markers_hebrew
where accident_severity_hebrew is not null
group by
    accident_year,
    accident_severity_hebrew
order by accident_severity_hebrew, accident_year
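
(A year that is entirely missing from this query's output, e.g. 2020, is what indicates the gap discussed above.)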

@atalyaalon
Contributor Author

Hi @OriHoch,
the anyway-etl implementation is not ready - there is a logical bug there:
There should be a full separation between:

  • The email extraction and loading-into-S3 process - so all data is saved in S3, as in the previous anyway process implementation (in the anyway repo)
  • Loading into the DB from s3 - so we can reload the relevant data from any load_start_year (in the code we first delete the data for the relevant years and then reload the new data; see the sketch below)
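
(For reference, a minimal sketch of that delete-then-reload step, assuming SQLAlchemy and the table/column names used earlier in this thread; reload_year_from_storage is a hypothetical placeholder, not anyway-etl code:)

import datetime

from sqlalchemy import create_engine, text


def reload_year_from_storage(conn, year: int) -> None:
    """Hypothetical placeholder: parse the stored CBS files for `year` and insert its rows."""
    raise NotImplementedError


def reload_from_year(db_url: str, load_start_year: int) -> None:
    engine = create_engine(db_url)
    current_year = datetime.date.today().year
    with engine.begin() as conn:
        # First delete every row for the years being reloaded...
        conn.execute(
            text("DELETE FROM markers WHERE accident_year >= :y"),
            {"y": load_start_year},
        )
        # ...then reload the same range from the stored files, so the
        # deleted years are refilled within the same transaction.
        for year in range(load_start_year, current_year + 1):
            reload_year_from_storage(conn, year)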

@OriHoch
Contributor

OriHoch commented Dec 18, 2021

@atalyaalon this separation exists, each step of the dag is completely independent as you wrote in your comment - https://github.com/hasadna/anyway-etl/blob/main/airflow_server/dags/cbs.py#L17

regarding S3 - we replaced it with the local data storage (the directory linked above); if you want we can copy that over to S3 too
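
(For context, independent steps in one Airflow DAG look roughly like this; a sketch only, with the dag id, task ids and commands assumed rather than copied from airflow_server/dags/cbs.py:)

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cbs",  # assumed dag id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Step 1: fetch new CBS files (e.g. from email) into the shared storage.
    import_from_email = BashOperator(
        task_id="import_from_email",
        bash_command="python main.py process cbs --source email",  # assumed command
    )
    # Step 2: load files from storage into the DB; a manual dag run can
    # override load_start_year via dag_run.conf (the JSON shown earlier).
    load_to_db = BashOperator(
        task_id="load_to_db",
        bash_command=(
            "python main.py process cbs --source s3 "
            "--load_start_year {{ (dag_run.conf or {}).get('load_start_year', 2020) }}"
        ),
    )
    import_from_email >> load_to_db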

@atalyaalon
Contributor Author

atalyaalon commented Dec 18, 2021

@OriHoch Thanks.

If so - perhaps it's better to copy the data only to S3, not to local storage, and then load data into the DB from the S3 files rather than from local storage. Meaning: not using local storage in either of these 2 processes.

The reason is that we might want to reload multiple previous years (and the email process only loads the latest years), AND it's important for us to maintain all files in S3.

@atalyaalon
Contributor Author

Let me clarify - I think the consistent location for saving the files' data is s3, not local storage. That's why I think local storage should not be used, for the sake of data consistency

@OriHoch
Contributor

OriHoch commented Dec 18, 2021

ok, please open a new issue for this and we can discuss it there, it's not related to the CBS data backfill

@atalyaalon
Contributor Author

Thanks! opened #17

@OriHoch
Contributor

OriHoch commented Dec 27, 2021

fixed in #16 (pending merge & deploy)

OriHoch assigned atalyaalon and unassigned OriHoch Dec 27, 2021