Specify data types for dataframes #692

Open
benedikt-voelkel opened this issue May 7, 2020 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@benedikt-voelkel
Contributor

benedikt-voelkel commented May 7, 2020

If the data types of a pandas.DataFrame are not specified explicitly, a query operation involving negative values will most likely fail (see also: https://stackoverflow.com/questions/50400843/using-negative-numbers-in-pandas-dataframe-query-expression).

Hence, we cannot directly apply custom cuts with negative values using pandas.DataFrame.query, which is a major problem.
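A minimal sketch of the kind of cut in question, using a small synthetic frame rather than real candidate data. The column names are borrowed from this issue; the values and thresholds are made up. On pandas versions affected by the problem in the linked question the `query` line may raise, while on recent versions both selections should agree. The boolean mask does not go through the query parser at all, so it can serve as a fallback.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a candidate table (values are invented).
# The columns are float32, the type uproot produces for C++ floats.
df = pd.DataFrame({
    "signd0": np.array([-0.5, -0.1, 0.2, 0.7], dtype=np.float32),
    "pt_cand": np.array([1.0, 2.5, 4.0, 8.0], dtype=np.float32),
})

# Custom cut written as a query string containing a negative literal.
selected_query = df.query("signd0 > -0.2 and pt_cand > 2.0")

# The same cut as a boolean mask, which bypasses the query parser.
mask = (df["signd0"] > -0.2) & (df["pt_cand"] > 2.0)
selected_mask = df[mask]

print(selected_query.equals(selected_mask))
```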

Furthermore, specifying data types might save disk space!

This needs to be done at the first processing stage, when the TTrees are converted to DataFrames and pickled.
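One way this could look, as a sketch only: a fixed column-to-dtype mapping applied with `DataFrame.astype` right after conversion and before pickling. The mapping `COLUMN_DTYPES` and the helper `enforce_dtypes` are hypothetical names, the column names are borrowed from the branch list in this issue, and the chosen types and sample values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical schema: the types assigned here are assumptions.
COLUMN_DTYPES = {
    "pt_cand": np.float32,
    "inv_mass": np.float32,
    "run_number": np.int32,
    "cand_type": np.int32,
}

def enforce_dtypes(frame, dtypes):
    """Cast every column of `frame` that is listed in `dtypes`."""
    present = {name: dtype for name, dtype in dtypes.items() if name in frame.columns}
    return frame.astype(present)

# Invented values; without explicit types pandas picks float64 and int64.
raw = pd.DataFrame({
    "pt_cand": [1.5, 3.2],
    "inv_mass": [1.86, 1.87],
    "run_number": [282343, 282344],
    "cand_type": [1, 3],
})
typed = enforce_dtypes(raw, COLUMN_DTYPES)
print(typed.dtypes)
```

Columns missing from the frame are skipped, so the same mapping can be reused for trees that contain only a subset of the branches.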

@benedikt-voelkel benedikt-voelkel added the bug Something isn't working label May 7, 2020
@benedikt-voelkel benedikt-voelkel self-assigned this May 7, 2020
@benedikt-voelkel
Contributor Author

benedikt-voelkel commented May 7, 2020

Update:
Most likely no disk space will be saved. With uproot, integers in TTrees automatically become int32, bools stay bool, and C++ float values become float32 (see below).

Converting all float32 columns to float64 makes the pickled pandas.DataFrames ~1.3 times as large, with lz4 compression used in both cases.

Trying with an average candidate TTree:

float32: 48 MB
float64: 65 MB
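The comparison can be sketched on synthetic data as follows. This is not the measurement above: it uses random numbers instead of a real TTree and the standard-library `zlib` instead of lz4, so the compressed ratio will differ from the quoted one. The helper `compressed_pickle_size` is a hypothetical name.

```python
import pickle
import zlib

import numpy as np
import pandas as pd

# Synthetic float32 table standing in for a converted TTree.
rng = np.random.default_rng(seed=0)
df32 = pd.DataFrame(
    rng.normal(size=(10_000, 5)).astype(np.float32),
    columns=[f"col{i}" for i in range(5)],
)

# Cast all float32 columns to float64.
float_cols = df32.select_dtypes(include="float32").columns
df64 = df32.astype({name: np.float64 for name in float_cols})

def compressed_pickle_size(frame):
    """Size in bytes of the zlib-compressed pickle of `frame` (zlib, not lz4)."""
    return len(zlib.compress(pickle.dumps(frame, protocol=pickle.HIGHEST_PROTOCOL)))

mem32 = int(df32.memory_usage(index=False).sum())
mem64 = int(df64.memory_usage(index=False).sum())
size32 = compressed_pickle_size(df32)
size64 = compressed_pickle_size(df64)

print(f"in memory:  float32 {mem32} B, float64 {mem64} B")
print(f"compressed: float32 {size32} B, float64 {size64} B")
```

In memory the float64 frame is exactly twice as large. After compression the ratio is usually smaller, because the upcast values carry zero-padded mantissa bits that compress well.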

Example code

import uproot  # uproot 3 API, which provides tree.pandas.df

file_path = "/data/TTree/D0DsLckINT7withJets/vAN-20200201_ROOT6-1/pp_2018_data/352_20200202-0239/merged/child_2/pack_7/AnalysisResults.root"

branches_cand = ["cos_t_star", "dca_K0s", "signd0", "imp_par_K0s", "d_len_K0s", "armenteros_K0s", "ctau_K0s", "cos_p_K0s", "pt_prong0", "pt_prong1", "pt_prong2", "imp_par_prong0", "imp_par_prong1", "imp_par_prong2", "inv_mass", "pt_cand", "phi_cand", "eta_cand", "inv_mass_K0s", "pt_K0s", "cand_type", "y_cand", "run_number", "ev_id", "nsigTPC_Pr_0", "nsigTOF_Pr_0", "spdhits_prong0", "spdhits_prong1", "spdhits_prong2",
"pt_jet", "eta_jet", "phi_jet", "delta_eta_jet", "delta_phi_jet", "delta_r_jet", "pt_gen_jet", "eta_gen_jet", "phi_gen_jet", "delta_eta_gen_jet", "delta_phi_gen_jet", "delta_r_gen_jet", "pt_gen_cand",
"p_prong0", "p_prong1", "p_prong2"]

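# treename_name is the name of the TTree inside the file; it is not given here
tree = uproot.open(file_path)[treename_name]
# read only the selected branches into a pandas.DataFrame
df = tree.pandas.df(branches=branches_cand)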

# pickling

# casting all float32 to float64

# pickling again

@benedikt-voelkel
Contributor Author

This will be fixed by bumping the pandas version, as mentioned in #693.
