
Want to merge functionality of events into combine #150

Open
FaroutYLq opened this issue May 30, 2024 · 11 comments
Labels: bug, enhancement, PRIORITY

@FaroutYLq
Collaborator

It is known that, with the current structure, in SR1 there is a ~10% chance that plugins computed at the peaklets level (like veto_interval and peaklets) end up with different lengths when processing (peaks, peaklet_classification) versus (event_info, veto_proximity).

We suspect something tricky happens when combining. To make these runs fail immediately when the issue appears, even before we upload their combined peaklets, we want to run a load test in the combine job after computing:

st.get_array(run, ("peaks", "peak_basics", "peak_positions")) 
st.get_array(run, ("event_info", "cut_daq_veto"))

If the load fails, nothing will be uploaded.
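A minimal sketch of such a load test (pure numpy; `check_lengths` is a hypothetical helper standing in for the real combine-job code, which would feed it arrays from `st.get_array(run, ...)`): it raises before anything is uploaded if index-aligned targets disagree in length.

```python
import numpy as np

def check_lengths(arrays):
    """Hypothetical load test: `arrays` maps target name -> loaded array.
    Targets that must be index-aligned (e.g. peaklets and
    peaklet_classification) have to agree in length; raise before upload."""
    lengths = {name: len(a) for name, a in arrays.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Length mismatch across targets: {lengths}")
    return lengths

# In the real job these arrays would come from st.get_array(run, ...);
# here we fake two aligned targets.
ok = check_lengths({"peaklets": np.zeros(5),
                    "peaklet_classification": np.zeros(5)})
```

Calling `check_lengths` on a broken pair (say, 83481 vs 83482 items) raises the ValueError, so the job dies before the upload step.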

@FaroutYLq
Collaborator Author

FaroutYLq commented May 30, 2024

If something tricky indeed happens in combine, shouldn't the problematic runs' peaklets and lone_hits always trigger trouble? Let's try erasing the failed runs down to peaklets+lone_hits, and then reprocess on RCC.

  • If it succeeds, then peaklets+lone_hits are OK, and after we merge events into combine we will solve the length-mismatch problem for good
  • If it keeps hitting the same length problem, it means the peaklets+lone_hits themselves are somehow broken. We then want to compare the loaded length against the length promised in the metadata.
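The loaded-vs-promised comparison can be sketched like this (assuming the strax-style metadata layout, where the JSON carries a "chunks" list whose entries have an item count "n"; check the schema of your strax version):

```python
import json
import os
import tempfile

def promised_length(metadata_path):
    """Total item count promised by a strax-style metadata.json
    (assumed layout: a "chunks" list with an "n" field per chunk)."""
    with open(metadata_path) as f:
        md = json.load(f)
    return sum(chunk["n"] for chunk in md["chunks"])

# Fake metadata promising 83482 items over two chunks:
fake = {"chunks": [{"n": 50000}, {"n": 33482}]}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(fake, f)
    path = f.name
n_promised = promised_length(path)
os.unlink(path)
# In the real check, compare n_promised against len(st.get_array(...)).
```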

@FaroutYLq
Collaborator Author

Additionally, such merging will benefit us by wasting less time transferring data between sites and searching for nodes. The downside is losing the flexibility of computing peaks/events only. I would suggest we add a super_events level instead of overwriting the existing combine workflow.

@FaroutYLq
Collaborator Author

FaroutYLq commented Jun 2, 2024

Test: some outputs from combine jobs keep hitting the same error; even after erasing everything above peaklets and reprocessing on dali, it keeps failing. Example: run 049374, using /scratch/midway2/yuanlq/corruption_museum/. This means something tricky indeed happened in combine. The error:

ValueError: Cannot merge chunks with different number of items: [[049374.peaklets: 1670689659sec 188225560 ns - 1670689702sec 137898520 ns, 83481 items, 8.2 MB/s], [049374.peaklet_classification: 1670689659sec 188225560 ns - 1670689702sec 137898520 ns, 83482 items, 0.0 MB/s]]
Python script failed with exit code 1

Edit: it turns out to be an old peaklet_classification causing the trouble:

yuanlq@dali003:/dali/lgrandi/xudc/scratch-midway2/bk/bk-0602_del/nton/Make/logs$ ls -lh /gpfs3/cap/.snapshots/weekly-2024-06-02.04h07/dali/lgrandi/xenonnt/processed/049374-peaklet_classification-p3m6pr2fhz
total 2.5K
-rw-rwxr--+ 1 yuanlq yuanlq  92M Feb  4 23:57 peaklet_classification-p3m6pr2fhz-000000
-rw-rwxr--+ 1 yuanlq yuanlq  93M Feb  4 23:57 peaklet_classification-p3m6pr2fhz-000001
-rw-rwxr--+ 1 yuanlq yuanlq  93M Feb  4 23:57 peaklet_classification-p3m6pr2fhz-000002
-rw-rwxr--+ 1 yuanlq yuanlq  52M Feb  4 23:57 peaklet_classification-p3m6pr2fhz-000003
-rw-rwxr--+ 1 yuanlq yuanlq 6.1K Feb  4 23:57 peaklet_classification-p3m6pr2fhz-metadata.json

@FaroutYLq
Collaborator Author

Now we want to download the raw_records, process the same run from scratch, and compare the peaklets.

@FaroutYLq
Collaborator Author

FaroutYLq commented Jun 4, 2024

We ran some tests on 049374. Starting from raw_records, processing on OSG gives a total peaklets length of 37572935, while on DaLI it is 37572938. Both can compute ("peaklets", "peaklet_classification") without problems. The peaklets missing from OSG show no pattern in their timing relative to chunk boundaries. We see lots of truncation at the end, but no other waveform feature. Are they near the DAQ veto?
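To pin down which peaklets differ between the two processings, one can diff the loaded arrays on their timestamps. A sketch with fake data (`missing_from` is a hypothetical helper; it assumes each entry's `time` is unique):

```python
import numpy as np

def missing_from(a, b, key="time"):
    """Entries of `b` whose `key` value never appears in `a`.
    A crude diff; assumes `key` values are unique per entry."""
    return b[~np.isin(b[key], a[key])]

dtype = [("time", np.int64), ("length", np.int32)]
osg = np.array([(10, 5), (30, 7)], dtype=dtype)
dali = np.array([(10, 5), (20, 6), (30, 7)], dtype=dtype)
extra = missing_from(osg, dali)  # the peaklet at time=20 is OSG-only missing
```

With the real data, inspecting `extra["time"]` relative to chunk boundaries is how one checks for the timing pattern mentioned above.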

@FaroutYLq
Collaborator Author

Example
(four attached images)

@FaroutYLq
Collaborator Author

FaroutYLq commented Jun 4, 2024

Two things need to be understood:

  • In a peaklets job, do we rechunk? No
  • In a combine job, how do we rechunk? Yes, via copy_to_frontend
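Whatever the rechunking does to the chunk boundaries, it must preserve the total item count; that invariant is exactly what the failing runs violate. A toy stand-in for combine-level rechunking (not the real copy_to_frontend, which is a strax Context method) makes the invariant explicit:

```python
import numpy as np

def rechunk(chunks, target_size):
    """Toy rechunk: concatenate, then re-split into ~target_size pieces.
    The chunk boundaries change, but the total item count must not."""
    flat = np.concatenate(chunks)
    out = [flat[i:i + target_size] for i in range(0, len(flat), target_size)]
    assert sum(len(c) for c in out) == len(flat), "rechunk lost items"
    return out

chunks = [np.arange(7), np.arange(5), np.arange(9)]   # 21 items total
new_chunks = rechunk(chunks, target_size=6)           # chunks of 6, 6, 6, 3
total = sum(len(c) for c in new_chunks)
```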

@FaroutYLq
Collaborator Author

More investigation shows that peak splitting performs differently on different machines. Suspected to be a floating-point issue. Example here.

Processed on DaLI: max_goodness_of_split
[0.3362855  0.6615487  0.19245173 0.33158863 0.7406279  0.79679215
 0.80455685 0.         0.         0.         0.5374566  0.59021497
 0.40632284 0.53420705 0.         0.3954491  0.         0.8004727
 0.46013355 0.79623634 0.48184547 0.4689006 ]
Processed on OSG: max_goodness_of_split
[0.4330755  0.18698996 0.3486709  0.7406338  0.79679585 0.8045561
 0.         0.         0.         0.5374318  0.5903075  0.40630853
 0.53418773 0.         0.3954804  0.         0.8004847  0.46006623
 0.7962359  0.48178786 0.46880245]

(two attached images)
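The suspected mechanism can be shown in two lines: float32 addition is not associative, and compiled code (numba included) may legally pick different summation orders, SIMD widths, or FMA usage on different CPUs. Flipped low bits are enough to nudge a value like max_goodness_of_split across a split threshold:

```python
import numpy as np

# Same three float32 numbers, two summation orders, two answers.
a, b, c = np.float32(1e8), np.float32(1.0), np.float32(-1e8)
forward = (a + b) + c    # the 1.0 is absorbed into 1e8 first -> 0.0
reordered = (a + c) + b  # the big terms cancel first -> 1.0
```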

@FaroutYLq
Collaborator Author

FaroutYLq commented Jun 4, 2024

Maybe this architecture requirement is still not enough? This part of peak splitting has too much numba magic and might be vulnerable, especially the nogil parts. We expect the single-threaded processor to help if this is the crux. However, keep in mind that we already require a single CPU core when processing on OSG. We might also want to check whether there is some hyper-threading effect in strax.
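One way to rule out threading as a variable is to pin every layer of the numerical stack to a single thread via the standard environment variables. These are read at import time, so they must be set before importing numpy/numba (a sketch; the variable names are the usual OpenMP/MKL/OpenBLAS/numba ones):

```python
import os

# Pin the numerical stack to one thread *before* importing numpy/numba;
# these env vars are only consulted at import time.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
            "OPENBLAS_NUM_THREADS", "NUMBA_NUM_THREADS"):
    os.environ[var] = "1"

pinned = {v: os.environ[v]
          for v in ("OMP_NUM_THREADS", "NUMBA_NUM_THREADS")}
```

Note this only controls thread counts; it does not remove the reduction-order differences between CPU architectures, so it isolates threading from the floating-point effect above.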

@FaroutYLq
Collaborator Author

Given the machine dependence, the following scenario will trigger the issue:

  • You processed peaklets on OSG in try 1 and got length 10086
  • You processed peaklet_classification on DaLI or OSG in try 1 and got length 10086
  • You found the data corrupted in rucio and erased everything. However, the peaklet_classification happened to not be erased, and was later downloaded from somewhere. The two then happen to have different lengths

@FaroutYLq
Collaborator Author

More details in the tests
