Switching the dataloader build_entry_list from pandas to polars #527

Draft
wants to merge 62 commits into
base: main
62 commits
bd46dc3
updated dspeed version
ggmarshall Sep 29, 2023
b3a44e7
added more possible length combinations for getting parameters
ggmarshall Sep 29, 2023
adf62ae
updated for changes to calibrations
ggmarshall Sep 29, 2023
6ec95eb
rewrote fitting to fit in stages dropping tail if unnecessary with pr…
ggmarshall Sep 29, 2023
d4899c3
changes for new cal fitting, added high stats fitting for super calib…
ggmarshall Sep 29, 2023
a1b1f17
rewrite of aoe routines, better handling of guesses, improved clarity…
ggmarshall Sep 29, 2023
8935d8e
split loading routine into own file as well as function to handle fai…
ggmarshall Sep 29, 2023
7614f06
fixed dspeed version
ggmarshall Sep 29, 2023
96f2e29
style: pre-commit fixes
pre-commit-ci[bot] Sep 29, 2023
e0d2cdc
bugfix on selection when nan values from fit
ggmarshall Sep 30, 2023
767643e
added ability to change tail weighting and changed binning on high st…
ggmarshall Sep 30, 2023
4e5a9b2
merges
ggmarshall Sep 30, 2023
5fcf9ba
style: pre-commit fixes
pre-commit-ci[bot] Sep 30, 2023
c5f34c1
fix merge
ggmarshall Sep 30, 2023
8645583
remove merge lines
ggmarshall Sep 30, 2023
c566771
style: pre-commit fixes
pre-commit-ci[bot] Sep 30, 2023
576b38d
pre commits
ggmarshall Oct 3, 2023
331c924
merges
ggmarshall Oct 3, 2023
a70bf3f
precommit
ggmarshall Oct 3, 2023
4355095
bugfix for partitiona ecal
ggmarshall Oct 3, 2023
433252b
fixed pars to params to eres fit
ggmarshall Oct 3, 2023
da8cbc7
partition ecal naming fix
ggmarshall Oct 3, 2023
a58be77
pre-commit
ggmarshall Oct 4, 2023
fcb7dcd
updated units in fwhm to convention
ggmarshall Oct 9, 2023
3a804c1
corrected units to _in_keV
ggmarshall Oct 12, 2023
b77b34d
corrected units to _in_keV
ggmarshall Oct 12, 2023
9b972b0
moved aoe_calibration function to dataflow
ggmarshall Oct 31, 2023
8dd724a
moved top level funcs to dataflow added pulser field to plot arguments
ggmarshall Oct 31, 2023
e182d8f
added option to pass pulser mask to event selection if not calculate …
ggmarshall Oct 31, 2023
3c60ea1
added pulser mask to load data, modified to have the data loading ext…
ggmarshall Oct 31, 2023
5388d19
switched fit escale and ecal to iminuit and add errors as outputs
ggmarshall Oct 31, 2023
c2847e2
Merge branch 'main' of https://github.com/ggmarshall/pygama
ggmarshall Oct 31, 2023
28277da
Merge branch 'main' of https://github.com/ggmarshall/pygama
ggmarshall Oct 31, 2023
2a20d0f
removed tag_pulser and cut import as had circular dependencies, renam…
ggmarshall Oct 31, 2023
518d3f9
added default arguments, changed timestamp to run_timestamp to differ…
ggmarshall Oct 31, 2023
8795051
cleaned up imports, removing * imports and removing unnecessary argum…
ggmarshall Oct 31, 2023
d0567e1
bugfix for tcm pulser where channel was incorrectly hardcoded
ggmarshall Nov 2, 2023
e84fc4a
added more possible length combinations for getting parameters
ggmarshall Sep 29, 2023
06bef2d
updated for changes to calibrations
ggmarshall Sep 29, 2023
0c36b33
rewrote fitting to fit in stages dropping tail if unnecessary with pr…
ggmarshall Sep 29, 2023
84fa64b
changes for new cal fitting, added high stats fitting for super calib…
ggmarshall Sep 29, 2023
572867c
rewrite of aoe routines, better handling of guesses, improved clarity…
ggmarshall Sep 29, 2023
218dffc
split loading routine into own file as well as function to handle fai…
ggmarshall Sep 29, 2023
2afc5b2
bugfix on selection when nan values from fit
ggmarshall Sep 30, 2023
770b020
added ability to change tail weighting and changed binning on high st…
ggmarshall Sep 30, 2023
3cbf615
bugfix for partitiona ecal
ggmarshall Oct 3, 2023
fec31a7
fixed pars to params to eres fit
ggmarshall Oct 3, 2023
f89c17e
partition ecal naming fix
ggmarshall Oct 3, 2023
b13410e
updated units in fwhm to convention
ggmarshall Oct 9, 2023
b04f20d
corrected units to _in_keV
ggmarshall Oct 12, 2023
56fb95e
moved aoe_calibration function to dataflow
ggmarshall Oct 31, 2023
860b5fd
moved top level funcs to dataflow added pulser field to plot arguments
ggmarshall Oct 31, 2023
db8f747
added option to pass pulser mask to event selection if not calculate …
ggmarshall Oct 31, 2023
0abf21d
added pulser mask to load data, modified to have the data loading ext…
ggmarshall Oct 31, 2023
3d63879
switched fit escale and ecal to iminuit and add errors as outputs
ggmarshall Oct 31, 2023
fafc424
removed tag_pulser and cut import as had circular dependencies, renam…
ggmarshall Oct 31, 2023
1678f6b
added default arguments, changed timestamp to run_timestamp to differ…
ggmarshall Oct 31, 2023
0df88e7
cleaned up imports, removing * imports and removing unnecessary argum…
ggmarshall Oct 31, 2023
522859b
bugfix for tcm pulser where channel was incorrectly hardcoded
ggmarshall Nov 2, 2023
551cb85
fix merge conflict
ggmarshall Nov 2, 2023
5a717d8
Merge branch 'main' of https://github.com/legend-exp/pygama into rebase
ggmarshall Nov 3, 2023
007ba07
switch for pandas to polars in dl
ggmarshall Nov 8, 2023
2 changes: 1 addition & 1 deletion setup.cfg
@@ -32,7 +32,7 @@ project_urls =
packages = find:
install_requires =
colorlog
- dspeed>=1.1
+ dspeed>=1.2
h5py>=3.2
iminuit
legend-daq2lh5>=1.0
76 changes: 45 additions & 31 deletions src/pygama/flow/data_loader.py
@@ -13,6 +13,7 @@

import numpy as np
import pandas as pd
import polars as pl
from dspeed.vis import WaveformBrowser
from lgdo import Array, LH5Iterator, LH5Store, Struct, Table, lgdo_utils
from lgdo.types.vectorofvectors import build_cl, explode_arrays, explode_cl
@@ -572,37 +573,39 @@ def build_entry_list(
)
else:
tcm_tb = tcm_tables[0]

tcm_path = os.path.join(
self.data_dir,
self.filedb.tier_dirs[tcm_tier].lstrip("/"),
self.filedb.df.iloc[file][f"{tcm_tier}_file"].lstrip("/"),
)

if not os.path.exists(tcm_path):
raise FileNotFoundError(f"Can't find TCM file for {tcm_level}")

tcm_table_name = self.filedb.get_table_name(tcm_tier, tcm_tb)
try:
tcm_lgdo, _ = sto.read_object(tcm_table_name, tcm_path)
except KeyError:
log.warning(f"Cannot find table {tcm_table_name} in file {tcm_path}")
continue

tcm_lgdo, _ = sto.read_object(tcm_table_name, tcm_path)

# Have to do some hacky stuff until I get a get_dataframe() method
tcm_lgdo[self.tcms[tcm_level]["tcm_cols"]["child_idx"]] = Array(
nda=explode_cl(tcm_lgdo["cumulative_length"].nda)
)
tcm_lgdo["index"] = Array(
nda=np.arange(
len(tcm_lgdo[self.tcms[tcm_level]["tcm_cols"]["child_idx"]])
)
)
tcm_lgdo.pop("cumulative_length")
tcm_tb = Table(col_dict=tcm_lgdo)
f_entries = tcm_tb.get_dataframe()
f_entries = pl.DataFrame(tcm_tb.get_dataframe())
renaming = {
self.tcms[tcm_level]["tcm_cols"]["child_idx"]: f"{child}_idx",
self.tcms[tcm_level]["tcm_cols"]["parent_tb"]: f"{parent}_table",
self.tcms[tcm_level]["tcm_cols"]["parent_idx"]: f"{parent}_idx",
}
f_entries.rename(columns=renaming, inplace=True)
f_entries = f_entries.rename(renaming)
if self.merge_files:
f_entries["file"] = file
# f_entries.with_columns("file") = file
f_entries = f_entries.with_columns(pl.lit(file).alias("file"))
# At this point, should have a list of all available hits/evts joined by tcm

if mode == "any":
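A note on the `explode_cl` step in this hunk: the sketch below shows, in plain Python, how a cumulative-length array can be expanded into one child index per flattened entry, followed by the running `index` column that this PR adds. It assumes `explode_cl` repeats each row number once per entry in that row, which is my reading of the helper and is not confirmed by this diff; `explode_cumulative_length` is a hypothetical stand-in, not the lgdo function.

```python
def explode_cumulative_length(cumulative_length):
    """Hypothetical stand-in: repeat each row number once per entry in that row."""
    out = []
    start = 0
    for row, stop in enumerate(cumulative_length):
        # Row `row` owns the flattened entries in the half-open range [start, stop).
        out.extend([row] * (stop - start))
        start = stop
    return out


# Three rows holding 2, 1 and 3 entries respectively.
cumulative_length = [2, 3, 6]

child_idx = explode_cumulative_length(cumulative_length)
# Running row number, analogous to the "index" column added in this PR.
index = list(range(len(child_idx)))

print(child_idx)  # [0, 0, 1, 2, 2, 2]
print(index)      # [0, 1, 2, 3, 4, 5]
```

Keeping an explicit `index` column matters here because, unlike a pandas DataFrame, a polars DataFrame carries no implicit row index to fall back on.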
@@ -615,7 +618,6 @@ def build_entry_list(
if self.cuts is not None:
if level in self.cuts.keys():
cut = self.cuts[level]

col_tiers = col_tiers_dict[level]

# Tables in first tier of event should be the same for all tiers in one level
@@ -630,21 +632,23 @@ def build_entry_list(

# Cut any rows of TCM not relating to requested tables
if level == parent:
f_entries.query(f"{level}_table in {tables}", inplace=True)
f_entries = f_entries.filter(pl.col(f"{level}_table").is_in(tables))

for tb in tables:
tb_table = None
if level == parent:
tcm_idx = f_entries.query(f"{level}_table == {tb}").index
f_entries_filtered = f_entries.filter(
pl.col(f"{level}_table") == tb
)
else:
tcm_idx = f_entries.index
idx_mask = f_entries.loc[tcm_idx, f"{level}_idx"]
f_entries_filtered = f_entries
for tier in self.tiers[level]:
tier_path = os.path.join(
self.data_dir,
self.filedb.tier_dirs[tier].lstrip("/"),
self.filedb.df.loc[file, f"{tier}_file"].lstrip("/"),
)
# print(type(f_entries_filtered.select(f"{level}_idx")).to_series())#.to_list()
if tier in col_tiers[file]["tables"].keys():
if tb in col_tiers[file]["tables"][tier]:
table_name = self.filedb.get_table_name(tier, tb)
@@ -653,7 +657,9 @@ def build_entry_list(
table_name,
tier_path,
field_mask=cut_cols[level],
idx=idx_mask.tolist(),
idx=f_entries_filtered.select(f"{level}_idx")
.to_series()
.to_list(),
)
except KeyError:
log.warning(
@@ -667,21 +673,31 @@ def build_entry_list(
if tb_table is None:
continue
tb_df = tb_table.get_dataframe()
tb_df.query(cut, inplace=True)
idx_match = f_entries.query(f"{level}_idx in {list(tb_df.index)}")
tb_df.query(cut, inplace=True, engine="numexpr")

idx_match = f_entries.filter(
pl.col(f"{level}_idx").is_in(list(tb_df.index))
)
if level == parent:
idx_match = idx_match.query(f"{level}_table == {tb}")
idx_match = idx_match.filter(pl.col(f"{level}_table") == tb)
if mode == "only":
keep_idx = idx_match.index
drop_idx = set.symmetric_difference(
set(tcm_idx), list(keep_idx)
set(f_entries_filtered.select("index").to_series()),
list(idx_match.select("index").to_series()),
)
# print(drop_idx)
f_entries = f_entries.filter(
pl.col("index").is_in(drop_idx).not_()
)
f_entries.drop(drop_idx, inplace=True)
elif mode == "any":
evts = list(idx_match[f"{child}_idx"].unique())
keep_idx = f_entries.query(f"{child}_idx in {evts}").index
evts = list(
idx_match.select(f"{child}_idx").to_series().unique()
)
keep_idx = f_entries.query(f"{child}_idx in {evts}").select(
"index"
)
drop = set.symmetric_difference(
set(f_entries.index), list(keep_idx)
set(f_entries_filtered.select("index")), list(keep_idx)
)
if drop_idx is None:
drop_idx = drop
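The `mode == "only"` branch derives the rows to drop with `set.symmetric_difference`. Below is a small plain-Python sketch of that step, using made-up index values. As long as the kept rows are a subset of the candidate rows, the result is simply the candidates that failed the cut.

```python
# Hypothetical index values, for illustration only.
tcm_idx = [0, 1, 2, 3, 4]  # candidate rows for the current table
keep_idx = [1, 3]          # rows whose entries passed the cut

# The unbound method accepts any iterable as its second argument.
drop_idx = set.symmetric_difference(set(tcm_idx), keep_idx)
print(sorted(drop_idx))  # [0, 2, 4]

# With keep_idx a subset of tcm_idx this equals a plain set difference.
assert drop_idx == set(tcm_idx) - set(keep_idx)

# Caveat: an index present only on the "keep" side also lands in the result.
stray = set.symmetric_difference({0, 1}, [1, 5])
print(sorted(stray))  # [0, 5]
```

If the subset relationship cannot be guaranteed, a plain difference (`set(tcm_idx) - set(keep_idx)`) may state the intent more directly; that is a suggestion, not something this diff does.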
@@ -695,14 +711,12 @@ def build_entry_list(
if col in for_output:
f_entries.loc[keep_idx, col] = tb_df[col].tolist()

if mode == "any":
if drop_idx is not None:
f_entries.drop(index=drop_idx, inplace=True)

f_entries.reset_index(inplace=True, drop=True)
f_entries = f_entries.with_columns(
pl.Series(np.arange(len(f_entries))).alias("index")
)

if in_memory:
entries[file] = f_entries
entries[file] = f_entries.drop("index").to_pandas()
if output_file:
# Convert f_entries DataFrame to Struct
f_dict = f_entries.to_dict("list")