[ETL] Issues with monitoring/testing #110

Closed
Tracked by #109
mahalakshme opened this issue Sep 30, 2024 · 7 comments
@mahalakshme
Contributor

mahalakshme commented Sep 30, 2024

Issue:

The ETL for rwbngos2023 completed in about a minute, but the database records suggest it took 15 minutes, which gives a misleading picture.

Image

Image

As the image below also shows, the start times of some jobs are earlier than the end times of other jobs. So either the start time of the next job or the end time of the previous job is being recorded incorrectly. This makes it hard to monitor the ETL jobs.

Image

AC:

  • started_at needs to be the time when the ETL process actually started for that org, and ended_at needs to be the time when the ETL job finished running.
  • Extend this report to send a notification mail (to Maha alone) in the cases below:
    - When the ETL of an organisation with 'Organisation Category' - production or UAT and 'Organisation Status' - Live fails.
    - When one round of ETL takes more than 1.5 hours to complete.
  • Add a way to trigger the ETL of an org immediately, overriding the other orgs in the queue. Currently, testing ETL stories is very difficult and time-consuming: even after disabling and re-enabling the ETL, it usually takes around 20-30 minutes to trigger for that organisation. The immediate trigger should fire when a 'Trigger' button is clicked on the org page of super admin. The usual enable/disable need not trigger immediately. This is to prevent accidental disabling and to leave the audit unchanged.

Image

Technical analysis/suggestions:

  • Capturing the times in the appropriate callback methods (job listener events) should help fix the start and end times.
  • To trigger immediately: currently we are triggering with a scheduler, so it might not take effect even when we specify a start time. Try triggering it once without the scheduler, like below, and then trigger it with the scheduler.
    // One-shot Quartz trigger that fires as soon as it is scheduled
    Trigger trigger = TriggerBuilder.newTrigger()
        .withIdentity("triggerName", "triggerGroup")
        .startNow()
        .build();
  • One way to check whether all ETL jobs were triggered within an hour and a half is to cross-check the entries of the scheduled_job_run table against qrtz_job_details.
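
The cross-check in the last point can be sketched roughly as below. This is only an illustration in plain Java, not the actual query: it assumes the registered job names (as they would come from qrtz_job_details) and the latest start time per job (as they would come from scheduled_job_run) have already been loaded. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; names are illustrative and not taken from the Avni codebase.
class EtlRoundCheck {

    // Returns the registered jobs whose latest run did not start at or after (now - window).
    // An empty result means every registered job was triggered inside the window.
    static Set<String> jobsNotTriggered(Set<String> registeredJobs,
                                        Map<String, Instant> lastStartedAt,
                                        Instant now,
                                        Duration window) {
        Instant cutoff = now.minus(window);
        Set<String> missing = new TreeSet<>();
        for (String job : registeredJobs) {
            Instant started = lastStartedAt.get(job);
            // No run recorded at all, or the latest run is older than the cutoff.
            if (started == null || started.isBefore(cutoff)) {
                missing.add(job);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Set<String> registered = Set.of("org-a", "org-b");
        Map<String, Instant> lastStartedAt = Map.of("org-a", now.minus(Duration.ofMinutes(20)));
        // org-b has no recorded run, so it is reported as not triggered.
        System.out.println(jobsNotTriggered(registered, lastStartedAt, now, Duration.ofMinutes(90)));
    }
}
```

Note that this only works once started_at is recorded accurately, which is what the first AC is about.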

Ignore:

What:

  • ETL failures
  • time taken for an ETL if it exceeds 15 mins
  • one run completes in 1:30 hours
  • disabled for unneeded orgs

Who:

  • can setup report - get alert - Maha to look into it
  • generate bundle from UAT - store ETL status in the bundle
  • check with implementation team
@mahalakshme mahalakshme converted this from a draft issue Sep 30, 2024
@mahalakshme mahalakshme mentioned this issue Nov 4, 2024
@mahalakshme mahalakshme moved this from In Analysis to In Analysis Review in Avni Product Nov 5, 2024
@mahalakshme mahalakshme changed the title ETL taking long time issue [ETL] Issues with monitoring Nov 5, 2024
@mahalakshme mahalakshme moved this from In Analysis Review to Ready in Avni Product Nov 29, 2024
@mahalakshme mahalakshme changed the title [ETL] Issues with monitoring [ETL] Issues with monitoring/testing Nov 29, 2024
@mahalakshme mahalakshme moved this from Ready to In Analysis in Avni Product Dec 2, 2024
@himeshr
Contributor

himeshr commented Dec 2, 2024

Joy's Comment

The ACs look like they will be costly to implement, since we are leveraging Spring Batch here and the queuing and the tables are managed by it.

For AC1 (monitoring), we can rely on logs as the source of truth and ignore the DB.
For AC2, we don't really have a concept of a 'round' of ETL, so this might again be costly/complicated to determine. Per org should be easier to do.
For AC3, would it be sufficient to have an endpoint that disables ETL for all orgs, so we can focus on the org we want to test? Implementing priority within the queue is again going to be costly.

@himeshr
Contributor

himeshr commented Dec 2, 2024

Himesh's Comment

In general, I agree with the issues that we aim to resolve here, but I differ on the approach to resolving them.

  • For AC1, would recommend introducing an additional ETL-JOB-AUDIT table with info like ORGANISATION_UUID, ETL_TRIGGER_TYPE (org/orgGroup), ETL_START_TIME, ETL_END_TIME, ETL_JOB_STATUS, ETL_JOB_RUNTIME.
  • For AC2, create a Metabase alert on the ETL-JOB-AUDIT table as per the requirement.
  • For AC3, introduce an ad hoc ETL job that runs in parallel to the Quartz-scheduled ETL jobs. It would run only once, scheduled immediately or within a day based on the queue of ad hoc triggers, and would not make any change to the Quartz-based periodic execution of ETL (a precedent exists in Avni-Integration-service for doing this).

@mahalakshme
Contributor Author

Vivek's comment:

We are using Quartz, by the way, not Spring Batch, for ETL.
I think this line may be the problem:

scheduledJobRun.startedAt = trigger.getNextFireTime();

We should perhaps use new Date() here.
We can try to figure out why we are getting this issue, because the database entries are managed by us using the JobListener.
If the job listener is not the best way, we can hook into the actual execution callback that we get and record the times there.
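
To illustrate the suggestion above, here is a minimal sketch in plain Java. It does not use Quartz types and is not the actual Avni code. The method names only mirror the Quartz JobListener callbacks jobToBeExecuted and jobWasExecuted, and JobRunRecord is a hypothetical stand-in for a scheduled_job_run row. The point is that both timestamps come from the clock at the moment the callback runs, not from trigger.getNextFireTime().

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a scheduled_job_run row; field names are assumptions.
class JobRunRecord {
    final String jobName;
    Instant startedAt;
    Instant endedAt;
    boolean success;

    JobRunRecord(String jobName) {
        this.jobName = jobName;
    }

    Duration runtime() {
        return Duration.between(startedAt, endedAt);
    }
}

// Mirrors the two listener callbacks by name only; a real listener would receive Quartz context objects.
class JobRunRecorder {
    private final Clock clock;
    private final Map<String, JobRunRecord> inFlight = new ConcurrentHashMap<>();

    JobRunRecorder(Clock clock) {
        this.clock = clock;
    }

    // Called just before the job runs: stamp the current time, not the trigger's next fire time.
    void jobToBeExecuted(String jobName) {
        JobRunRecord record = new JobRunRecord(jobName);
        record.startedAt = clock.instant();
        inFlight.put(jobName, record);
    }

    // Called right after the job returns: stamp the end time and the outcome.
    JobRunRecord jobWasExecuted(String jobName, boolean success) {
        JobRunRecord record = inFlight.remove(jobName);
        if (record == null) {
            throw new IllegalStateException("No start recorded for " + jobName);
        }
        record.endedAt = clock.instant();
        record.success = success;
        return record;
    }
}

class EtlTimingDemo {
    public static void main(String[] args) {
        JobRunRecorder recorder = new JobRunRecorder(Clock.systemUTC());
        recorder.jobToBeExecuted("etl-sync-org-a");
        JobRunRecord run = recorder.jobWasExecuted("etl-sync-org-a", true);
        System.out.println(run.jobName + " took " + run.runtime().toMillis() + " ms");
    }
}
```

Passing in a Clock is just a convenience for testing the recorder with fixed times; using new Date() directly in the listener would serve the same purpose.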

@mahalakshme mahalakshme moved this from In Analysis to Ready in Avni Product Dec 4, 2024
@1t5j0y 1t5j0y moved this from Ready to QA Failed in Avni Product Dec 10, 2024
@1t5j0y 1t5j0y moved this from QA Failed to In Progress in Avni Product Dec 10, 2024
@1t5j0y 1t5j0y self-assigned this Dec 10, 2024
1t5j0y added a commit that referenced this issue Dec 13, 2024
… actual job execution and add a higher priority trigger for first run of ETL Sync job for an org
@1t5j0y
Contributor

1t5j0y commented Dec 13, 2024

AC1 fixed as per Vivek's input above and seems to work well.
AC3 fixed by adding an additional higher priority trigger for the first run after enabling ETL for an org.

AC2 (metabase report) pending. Moving to code review ready so AC1 and AC3 can be tested.
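
For anyone testing AC3, a rough model of why a higher priority trigger helps is sketched below. It is plain Java and does not reproduce the commit or Quartz internals. It only assumes Quartz's documented rule that priority is compared between triggers that share the same fire time, with 5 as the default priority. PendingTrigger and TriggerOrdering are hypothetical names.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a trigger waiting to fire; not a Quartz type.
class PendingTrigger {
    final String name;
    final Instant fireTime;
    final int priority;

    PendingTrigger(String name, Instant fireTime, int priority) {
        this.name = name;
        this.fireTime = fireTime;
        this.priority = priority;
    }
}

class TriggerOrdering {
    // Earlier fire time first; for equal fire times, higher priority first.
    static final Comparator<PendingTrigger> FIRE_ORDER = (a, b) -> {
        int byTime = a.fireTime.compareTo(b.fireTime);
        if (byTime != 0) {
            return byTime;
        }
        return Integer.compare(b.priority, a.priority);
    };

    // Returns trigger names in the order this model would fire them.
    static List<String> firingOrder(List<PendingTrigger> triggers) {
        List<PendingTrigger> sorted = new ArrayList<>(triggers);
        sorted.sort(FIRE_ORDER);
        List<String> names = new ArrayList<>();
        for (PendingTrigger trigger : sorted) {
            names.add(trigger.name);
        }
        return names;
    }

    public static void main(String[] args) {
        Instant due = Instant.parse("2024-12-13T10:00:00Z");
        List<PendingTrigger> pending = new ArrayList<>();
        pending.add(new PendingTrigger("org-a-periodic", due, 5));
        pending.add(new PendingTrigger("org-b-periodic", due, 5));
        pending.add(new PendingTrigger("org-c-first-run", due, 10));
        System.out.println(firingOrder(pending));
    }
}
```

In this model a higher priority does not let a trigger jump ahead of one with an earlier fire time, so the first-run trigger only wins against triggers that are due at the same moment.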

@1t5j0y 1t5j0y moved this from In Progress to Code Review Ready in Avni Product Dec 13, 2024
@1t5j0y
Contributor

1t5j0y commented Dec 13, 2024

AC2 Metabase Reports:
https://reporting.avniproject.org/question/4840-latest-etl-run-failures-for-live-uat-and-prod-orgs-production-environment

https://reporting.avniproject.org/question/4841-etl-round-completed-in-90-minutes

Alerts can be enabled once this change is promoted; until then, the start/end times in scheduled_job_run are inaccurate.

@himeshr himeshr moved this from Code Review Ready to In Code Review in Avni Product Dec 13, 2024
@himeshr
Contributor

himeshr commented Dec 13, 2024

AC2 Metabase Reports: https://reporting.avniproject.org/question/4840-latest-etl-run-failures-for-live-uat-and-prod-orgs

https://reporting.avniproject.org/question/4841-etl-round-completed-in-90-minutes

Alerts can be enabled once this change is promoted; until then, the start/end times in scheduled_job_run are inaccurate.

Made slight additions to the first report to filter by "SyncJobs" job_group and show OrgCategory and OrgStatus values in readable format.

Code review didn't result in any other issues of concern.

@himeshr himeshr moved this from In Code Review to Code Review Ready in Avni Product Dec 13, 2024
@himeshr himeshr moved this from Code Review Ready to QA Ready in Avni Product Dec 13, 2024
@himeshr
Contributor

himeshr commented Dec 13, 2024

@AchalaBelokar AchalaBelokar moved this from QA Ready to In QA in Avni Product Dec 16, 2024
@AchalaBelokar AchalaBelokar moved this from In QA to Done in Avni Product Dec 16, 2024