
Want to merge functionality of events into combine #150

Open
FaroutYLq opened this issue May 30, 2024 · 11 comments
Labels: bug, enhancement, PRIORITY

@FaroutYLq
Collaborator

It is known that, with the current structure, in SR1 there is a ~10% chance that plugins computed at the peaklets level (like veto_interval and peaklets) end up with different lengths when processing (peaks, peaklet_classification) versus (event_info, veto_proximity).

We suspect something tricky happens when combining. To make these runs fail immediately when the issue appears, even before we upload their combined peaklets, we want to run a load test in the combine job after computing:

st.get_array(run, ("peaks", "peak_basics", "peak_positions")) 
st.get_array(run, ("event_info", "cut_daq_veto"))

If the load fails, nothing will be uploaded.
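A minimal sketch of such a load test (pure numpy; `check_lengths` is a hypothetical helper standing in for the real combine-job code, which would feed it arrays from `st.get_array(run, ...)`): it raises before anything is uploaded if index-aligned targets disagree in length.

```python
import numpy as np

def check_lengths(arrays):
    """Hypothetical load test: `arrays` maps target name -> loaded array.
    Targets that must be index-aligned (e.g. peaklets and
    peaklet_classification) have to agree in length; raise before upload."""
    lengths = {name: len(a) for name, a in arrays.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Length mismatch across targets: {lengths}")
    return lengths

# In the real job these arrays would come from st.get_array(run, ...);
# here we fake two aligned targets.
ok = check_lengths({"peaklets": np.zeros(5),
                    "peaklet_classification": np.zeros(5)})
```

Calling `check_lengths` on a broken pair (say, 83481 vs 83482 items) raises the ValueError, so the job dies before the upload step.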

@FaroutYLq
Collaborator Author

FaroutYLq commented May 30, 2024

If something tricky indeed happens in combine, shouldn't the problematic runs' peaklets and lone_hits always trigger trouble? Let's try erasing the failed runs down to peaklets+lone_hits, and then reprocess on RCC.

  • If it succeeds, then peaklets+lone_hits are OK, and after we merge events into combine we will solve the length-mismatch problem for good
  • If it keeps hitting the same length problem, it means the peaklets+lone_hits themselves are somehow broken. We then want to compare the loaded length against the length promised in the metadata.
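The loaded-vs-promised comparison can be sketched like this (assuming the strax-style metadata layout, where the JSON carries a "chunks" list whose entries have an item count "n"; check the schema of your strax version):

```python
import json
import os
import tempfile

def promised_length(metadata_path):
    """Total item count promised by a strax-style metadata.json
    (assumed layout: a "chunks" list with an "n" field per chunk)."""
    with open(metadata_path) as f:
        md = json.load(f)
    return sum(chunk["n"] for chunk in md["chunks"])

# Fake metadata promising 83482 items over two chunks:
fake = {"chunks": [{"n": 50000}, {"n": 33482}]}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(fake, f)
    path = f.name
n_promised = promised_length(path)
os.unlink(path)
# In the real check, compare n_promised against len(st.get_array(...)).
```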

@FaroutYLq
Collaborator Author

Additionally, such merging will benefit us by wasting less time transferring data between sites and searching for nodes. The downside is losing the flexibility of computing peaks/events only. I would suggest we add a super_events level instead of overwriting the existing combine workflow.

@FaroutYLq
Collaborator Author

FaroutYLq commented Jun 2, 2024

Test: some outputs from combine jobs keep hitting the same error; even after erasing everything above peaklets and reprocessing on dali, it keeps failing. Example: run 049374, using /scratch/midway2/yuanlq/corruption_museum/. This means something tricky indeed happened in combine. The error:

ValueError: Cannot merge chunks with different number of items: [[049374.peaklets: 1670689659sec 188225560 ns - 1670689702sec 137898520 ns, 83481 items, 8.2 MB/s], [049374.peaklet_classification: 1670689659sec 188225560 ns - 1670689702sec 137898520 ns, 83482 items, 0.0 MB/s]]
Python script failed with exit code 1

Edit: it turns out to be an old peaklet_classification causing the trouble:

yuanlq@dali003:/dali/lgrandi/xudc/scratch-midway2/bk/bk-0602_del/nton/Make/logs$ ls -lh /gpfs3/cap/.snapshots/weekly-2024-06-02.04h07/dali/lgrandi/xenonnt/processed/049374-peaklet_classification-p3m6pr2fhz
total 2.5K
-rw-rwxr--+ 1 yuanlq yuanlq  92M Feb  4 23:57 peaklet_classification-p3m6pr2fhz-000000
-rw-rwxr--+ 1 yuanlq yuanlq  93M Feb  4 23:57 peaklet_classification-p3m6pr2fhz-000001
-rw-rwxr--+ 1 yuanlq yuanlq  93M Feb  4 23:57 peaklet_classification-p3m6pr2fhz-000002
-rw-rwxr--+ 1 yuanlq yuanlq  52M Feb  4 23:57 peaklet_classification-p3m6pr2fhz-000003
-rw-rwxr--+ 1 yuanlq yuanlq 6.1K Feb  4 23:57 peaklet_classification-p3m6pr2fhz-metadata.json

@FaroutYLq
Collaborator Author

Now we want to download the raw_records, process the same run from scratch, and compare the peaklets.

@FaroutYLq
Collaborator Author

FaroutYLq commented Jun 4, 2024

We ran some tests on 049374. Starting from raw_records, processing on OSG gives a total peaklets length of 37572935, while on DaLI it is 37572938. Both can compute ("peaklets", "peaklet_classification") without problems. The peaklets missing from OSG show no pattern in their timing relative to chunk boundaries. We see lots of truncation at the end, but no other waveform feature. Are they near the DAQ veto?
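To pin down which peaklets differ between the two processings, one can diff the loaded arrays on their timestamps. A sketch with fake data (`missing_from` is a hypothetical helper; it assumes each entry's `time` is unique):

```python
import numpy as np

def missing_from(a, b, key="time"):
    """Entries of `b` whose `key` value never appears in `a`.
    A crude diff; assumes `key` values are unique per entry."""
    return b[~np.isin(b[key], a[key])]

dtype = [("time", np.int64), ("length", np.int32)]
osg = np.array([(10, 5), (30, 7)], dtype=dtype)
dali = np.array([(10, 5), (20, 6), (30, 7)], dtype=dtype)
extra = missing_from(osg, dali)  # the peaklet at time=20 is OSG-only missing
```

With the real data, inspecting `extra["time"]` relative to chunk boundaries is how one checks for the timing pattern mentioned above.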

@FaroutYLq
Collaborator Author

Example
(four attached images)

@FaroutYLq
Collaborator Author

FaroutYLq commented Jun 4, 2024

Two things need to be understood:

  • In a peaklets job, do we rechunk? No
  • In a combine job, how do we rechunk? Yes, via copy_to_frontend
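Whatever the rechunking does to the chunk boundaries, it must preserve the total item count; that invariant is exactly what the failing runs violate. A toy stand-in for combine-level rechunking (not the real copy_to_frontend, which is a strax Context method) makes the invariant explicit:

```python
import numpy as np

def rechunk(chunks, target_size):
    """Toy rechunk: concatenate, then re-split into ~target_size pieces.
    The chunk boundaries change, but the total item count must not."""
    flat = np.concatenate(chunks)
    out = [flat[i:i + target_size] for i in range(0, len(flat), target_size)]
    assert sum(len(c) for c in out) == len(flat), "rechunk lost items"
    return out

chunks = [np.arange(7), np.arange(5), np.arange(9)]   # 21 items total
new_chunks = rechunk(chunks, target_size=6)           # chunks of 6, 6, 6, 3
total = sum(len(c) for c in new_chunks)
```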

@FaroutYLq
Collaborator Author

More investigation shows that peak splitting performs differently on different machines. Suspected to be a floating-point issue. Example here.

Processed on DaLI: max_goodness_of_split
[0.3362855  0.6615487  0.19245173 0.33158863 0.7406279  0.79679215
 0.80455685 0.         0.         0.         0.5374566  0.59021497
 0.40632284 0.53420705 0.         0.3954491  0.         0.8004727
 0.46013355 0.79623634 0.48184547 0.4689006 ]
Processed on OSG: max_goodness_of_split
[0.4330755  0.18698996 0.3486709  0.7406338  0.79679585 0.8045561
 0.         0.         0.         0.5374318  0.5903075  0.40630853
 0.53418773 0.         0.3954804  0.         0.8004847  0.46006623
 0.7962359  0.48178786 0.46880245]

(two attached images)
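The suspected mechanism can be shown in two lines: float32 addition is not associative, and compiled code (numba included) may legally pick different summation orders, SIMD widths, or FMA usage on different CPUs. Flipped low bits are enough to nudge a value like max_goodness_of_split across a split threshold:

```python
import numpy as np

# Same three float32 numbers, two summation orders, two answers.
a, b, c = np.float32(1e8), np.float32(1.0), np.float32(-1e8)
forward = (a + b) + c    # the 1.0 is absorbed into 1e8 first -> 0.0
reordered = (a + c) + b  # the big terms cancel first -> 1.0
```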

@FaroutYLq
Collaborator Author

FaroutYLq commented Jun 4, 2024

Maybe this architecture requirement is still not enough? This part of peak splitting has too much numba magic and might be vulnerable, especially the nogil parts. We expect the single-threaded processor to help if this is the crux. However, keep in mind that we already require a single CPU core when processing on OSG. We might also want to check whether there is some hyper-threading effect in strax.
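One way to rule out threading as a variable is to pin every layer of the numerical stack to a single thread via the standard environment variables. These are read at import time, so they must be set before importing numpy/numba (a sketch; the variable names are the usual OpenMP/MKL/OpenBLAS/numba ones):

```python
import os

# Pin the numerical stack to one thread *before* importing numpy/numba;
# these env vars are only consulted at import time.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
            "OPENBLAS_NUM_THREADS", "NUMBA_NUM_THREADS"):
    os.environ[var] = "1"

pinned = {v: os.environ[v]
          for v in ("OMP_NUM_THREADS", "NUMBA_NUM_THREADS")}
```

Note this only controls thread counts; it does not remove the reduction-order differences between CPU architectures, so it isolates threading from the floating-point effect above.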

@FaroutYLq
Collaborator Author

Given the machine dependence, the following scenario will trigger the issue:

  • You processed peaklets on OSG in try 1 and got length 10086
  • You processed peaklet_classification on DaLI or OSG in try 1 and got length 10086
  • You found the data corrupted in rucio and erased everything. However, the peaklet_classification happened to not be erased, and was later downloaded from somewhere. The two then happen to have different lengths

@FaroutYLq
Collaborator Author

More details in the tests
