Ability to save multi-animal pose tracks to single-animal files #71
@@ -11,72 +11,133 @@
logger = logging.getLogger(__name__)

def to_dlc_df(ds: xr.Dataset) -> pd.DataFrame:
    """Convert an xarray dataset containing pose tracks into a
    DeepLabCut-style pandas DataFrame with multi-index columns.
def to_dlc_df(
    ds: xr.Dataset, split_individuals: bool = True
) -> Union[pd.DataFrame, dict[str, pd.DataFrame]]:
    """Convert an xarray dataset containing pose tracks into a DeepLabCut-style
    pandas DataFrame with multi-index columns for each individual or a
    dictionary of DataFrames for each individual based on the
    'split_individuals' argument.

    Parameters
    ----------
    ds : xarray Dataset
        Dataset containing pose tracks, confidence scores, and metadata.
    split_individuals : bool, optional
        If True, return a dictionary of pandas DataFrames, one for each
        individual. If False, return a single pandas DataFrame with
        multi-index columns for all individuals.
        Default is True.

    Returns
    -------
    pandas DataFrame
    pandas DataFrame or dict
        DeepLabCut-style pandas DataFrame or dictionary of DataFrames.

    Notes
    -----
    The DataFrame will have a multi-index column with the following levels:
    "scorer", "individuals", "bodyparts", "coords" (even if there is only
    one individual present). Regardless of the provenance of the
    points-wise confidence scores, they will be referred to as
    "likelihood", and stored in the "coords" level (as DeepLabCut expects).
    The DataFrame(s) will have a multi-index column with the following levels:
    "scorer", "individuals", "bodyparts", "coords"
    (if multi_individual is True),
    or "scorer", "bodyparts", "coords" (if multi_individual is False).
    Regardless of the provenance of the points-wise confidence scores,
    they will be referred to as "likelihood", and stored in
    the "coords" level (as DeepLabCut expects).

    See Also
    --------
    to_dlc_file : Save the xarray dataset containing pose tracks directly
        to a DeepLabCut-style ".h5" or ".csv" file.
    """

    if not isinstance(ds, xr.Dataset):
        error_msg = f"Expected an xarray Dataset, but got {type(ds)}. "
        logger.error(error_msg)
        raise ValueError(error_msg)

    ds.poses.validate()  # validate the dataset

    # Concatenate the pose tracks and confidence scores into one array
    tracks_with_scores = np.concatenate(
        (
            ds.pose_tracks.data,
            ds.confidence.data[..., np.newaxis],
        ),
        axis=-1,
    )

    # Create the DLC-style multi-index columns
    # Use the DLC terminology: scorer, individuals, bodyparts, coords
    scorer = ["movement"]
    individuals = ds.coords["individuals"].data.tolist()
    bodyparts = ds.coords["keypoints"].data.tolist()
    # The confidence scores in DLC are referred to as "likelihood"
    coords = ds.coords["space"].data.tolist() + ["likelihood"]

    index_levels = ["scorer", "individuals", "bodyparts", "coords"]
    columns = pd.MultiIndex.from_product(
        [scorer, individuals, bodyparts, coords], names=index_levels
    )
    df = pd.DataFrame(
        data=tracks_with_scores.reshape(ds.dims["time"], -1),
        index=np.arange(ds.dims["time"], dtype=int),
        columns=columns,
        dtype=float,
    )
    logger.info("Converted PoseTracks dataset to DLC-style DataFrame.")
    return df


def to_dlc_file(ds: xr.Dataset, file_path: Union[str, Path]) -> None:
    if split_individuals:
You also need to check whether there is actually more than one individual in the data here — we should only care about "split_individuals" when there are actually many of them to be split. In that case, the docstring also has to be updated to reflect this behavior.

Hi @niksirbi,

If split_individuals == True and we have only one individual, it will just output one single-animal DataFrame (with "scorer", "bodyparts", "coords"). If split_individuals == False and we have only one individual, it will output one DataFrame in the multi-animal format (with "scorer", "individuals", "bodyparts", "coords"). Although in the second case the "individuals" level will contain just one individual, it might still be important to have this feature. It could be useful if users wanted to merge the DataFrames with other multi-individual DataFrames, where pandas would want the DataFrames to have the same format. It also might be unexpected if they ran the function with split_individuals == False on a set of data with a mixture of single- and multi-individual xarrays and saw the output being a mixture of single-animal and multi-animal DataFrames, as the single-individual xarrays would automatically turn into single-individual DataFrames with no way to make them multi-individual. The auto option in to_dlc_file handles cases where the user wants all single-individual xarrays to be stored as single-individual DataFrames and multi-individual xarrays to be stored as multi-individual DataFrames; I can make a separate function for this auto feature and also use it for to_dlc_df if preferable.

Hm, I actually like this suggestion and the flexibility it gives to the user. The arguments you make for it are convincing, so let's go ahead and do this! We just have to be careful to write the docstrings in an understandable way.
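For illustration (not part of the diff), a minimal usage sketch of to_dlc_df with the split_individuals behaviour described above; the file path is made up:

from movement.io import load_poses, save_poses

# Load a (possibly multi-animal) pose dataset
ds = load_poses.from_sleap("/path/to/file_sleap.analysis.h5")

# split_individuals=True: a dict of single-animal DataFrames keyed by
# individual name, each with ("scorer", "bodyparts", "coords") columns
df_per_individual = save_poses.to_dlc_df(ds, split_individuals=True)
for name, df in df_per_individual.items():
    print(name, df.columns.names)

# split_individuals=False: one DataFrame in the multi-animal format, with
# ("scorer", "individuals", "bodyparts", "coords") columns
df_all = save_poses.to_dlc_df(ds, split_individuals=False)
print(df_all.columns.names)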
        individuals = ds.coords["individuals"].data.tolist()
        result = {}

        for individual in individuals:
            # Select data for the current individual
            individual_data = ds.sel(individuals=individual)

            # Concatenate the pose tracks and confidence scores into one array
            tracks_with_scores = np.concatenate(
                (
                    individual_data.pose_tracks.data,
                    individual_data.confidence.data[..., np.newaxis],
                ),
                axis=-1,
            )

            # Create the DLC-style multi-index columns
            index_levels = ["scorer", "bodyparts", "coords"]
            columns = pd.MultiIndex.from_product(
                [scorer, bodyparts, coords], names=index_levels
            )

            # Create DataFrame for the current individual
            df = pd.DataFrame(
                data=tracks_with_scores.reshape(
                    individual_data.dims["time"], -1
                ),
                index=np.arange(individual_data.dims["time"], dtype=int),
                columns=columns,
                dtype=float,
            )

            """ Add the DataFrame to the result
            dictionary with individual's name as key """
            result[individual] = df

        logger.info(
            """Converted PoseTracks dataset to
            DLC-style DataFrames for each individual."""
        )
        return result
    else:
        """Concatenate the pose tracks and
        confidence scores into one array for all individuals"""
        tracks_with_scores = np.concatenate(
            (
                ds.pose_tracks.data,
                ds.confidence.data[..., np.newaxis],
            ),
            axis=-1,
        )
Comment on lines +108 to +114: This snippet is repeated twice, once with individual_data and once with ds.
        # Create the DLC-style multi-index columns
        index_levels = ["scorer", "individuals", "bodyparts", "coords"]
There is also some repetition here. You can define `index_levels = ["scorer", "bodyparts", "coords"]` near the top (before the if statements), and then add the "individuals" level in the second position here, only when it's needed.
        individuals = ds.coords["individuals"].data.tolist()
        columns = pd.MultiIndex.from_product(
            [scorer, individuals, bodyparts, coords], names=index_levels
        )

        """ Create a single DataFrame with
        multi-index columns for each individual """
        df = pd.DataFrame(
This bit is also repeated and can be refactored into its own function that takes a numpy array and the columns as arguments (see the helper sketch after this function).
            data=tracks_with_scores.reshape(ds.dims["time"], -1),
            index=np.arange(ds.dims["time"], dtype=int),
            columns=columns,
            dtype=float,
        )

        logger.info("Converted PoseTracks dataset to DLC-style DataFrame.")
        return df

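Not part of the diff — a minimal sketch of the kind of helper suggested in the review comments above, factoring out the repeated DataFrame construction (the name _dlc_df_from_array and its exact signature are assumptions):

import numpy as np
import pandas as pd


def _dlc_df_from_array(
    tracks_with_scores: np.ndarray, columns: pd.MultiIndex
) -> pd.DataFrame:
    """Build a DLC-style DataFrame from a (time, ...) array and ready-made columns."""
    n_frames = tracks_with_scores.shape[0]
    return pd.DataFrame(
        data=tracks_with_scores.reshape(n_frames, -1),
        index=np.arange(n_frames, dtype=int),
        columns=columns,
        dtype=float,
    )

Both branches of to_dlc_df could then build their columns and call this helper with the corresponding array.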
def to_dlc_file(
    ds: xr.Dataset,
    file_path: Union[str, Path],
    split_individuals: Union[bool, None] = None,
I would change this to use "auto" as the default, as described in the docstring — "auto" is more explicit and informative than None in this case. You would also have to modify the corresponding if statement, of course.
) -> None:
    """Save the xarray dataset containing pose tracks to a
    DeepLabCut-style ".h5" or ".csv" file.

@@ -87,11 +148,32 @@ def to_dlc_file(ds: xr.Dataset, file_path: Union[str, Path]) -> None:
    file_path : pathlib Path or str
        Path to the file to save the DLC poses to. The file extension
        must be either ".h5" (recommended) or ".csv".
    split_individuals : bool, optional
        Format of the DeepLabCut output file.
        - If True, the file will be formatted as in a single-animal
        DeepLabCut project: no "individuals" level, and each individual will be
        saved in a separate file. The individual's name will be appended to the
        file path, just before the file extension, i.e.
        "/path/to/filename_individual1.h5".
        - If False, the file will be formatted as in a multi-animal
        DeepLabCut project: the columns will include the
        "individuals" level and all individuals will be saved to the same file.
        - If "auto", the format will be determined based on the number of
        individuals in the dataset: True if there are more than one, and
        False if there is only one. This is the default.

    See Also
    --------
    to_dlc_df : Convert an xarray dataset containing pose tracks into a
        DeepLabCut-style pandas DataFrame with multi-index columns.
        DeepLabCut-style pandas DataFrame with multi-index columns
        for each individual or a dictionary of DataFrames for each individual
        based on the 'split_individuals' argument.

    Examples
    --------
    >>> from movement.io import save_poses, load_poses
    >>> ds = load_poses.from_sleap("/path/to/file_sleap.analysis.h5")
    >>> save_poses.to_dlc_file(ds, "/path/to/file_dlc.h5")
    """

    try:

@@ -104,9 +186,50 @@ def to_dlc_file(ds: xr.Dataset, file_path: Union[str, Path]) -> None:
        logger.error(error)
        raise error

    df = to_dlc_df(ds)  # convert to pandas DataFrame
    if file.path.suffix == ".csv":
        df.to_csv(file.path, sep=",")
    else:  # file.path.suffix == ".h5"
        df.to_hdf(file.path, key="df_with_missing")
    logger.info(f"Saved PoseTracks dataset to {file.path}.")
    # Sets default behaviour for the function
    if split_individuals is None:
        individuals = ds.coords["individuals"].data.tolist()
        if len(individuals) == 1:
The splitting is needed when there are more than one individual (not when there is only one).

You could also write this as a one-liner, for example:

split_individuals = True if len(individuals) > 1 else False

We also may want to throw an error if the user passes an invalid type, something like:

if split_individuals == "auto":
    individuals = ds.coords["individuals"].data.tolist()
    split_individuals = True if len(individuals) > 1 else False
elif not isinstance(split_individuals, bool):
    error_msg = (
        f"Expected 'split_individuals' to be a boolean or 'auto', but got "
        f"{type(split_individuals)}."
    )
    log_error(ValueError, error_msg)

For the auto feature, would we want it to save a single-individual xarray as a single-individual dataframe and a multi-individual xarray as a multi-individual dataframe, or save both as single-individual dataframes?

I would say this one — I think that's what the users would expect.
            split_individuals = True
        else:
            split_individuals = False
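To make the file-splitting behaviour concrete, a brief usage sketch of to_dlc_file with the split_individuals argument proposed in this PR (paths and file names are illustrative):

from movement.io import load_poses, save_poses

ds = load_poses.from_sleap("/path/to/file_sleap.analysis.h5")

# split_individuals=True: one single-animal file per individual, e.g.
# /path/to/file_dlc_individual1.h5, /path/to/file_dlc_individual2.h5, ...
save_poses.to_dlc_file(ds, "/path/to/file_dlc.h5", split_individuals=True)

# split_individuals=False: a single multi-animal file that keeps the
# "individuals" column level
save_poses.to_dlc_file(ds, "/path/to/file_dlc.h5", split_individuals=False)

# Default: the format is chosen automatically from the number of
# individuals in the dataset
save_poses.to_dlc_file(ds, "/path/to/file_dlc.h5")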

    """If split_individuals is True then it will split the file into a
    dictionary of pandas dataframes for each individual."""
    if split_individuals:
Again here, as in the above function, we have to check whether there is more than one individual to split; otherwise output only one single-animal file.

If split_individuals is True and there is only one individual to split, it should already automatically output only one single-animal file.
        dfdict = to_dlc_df(ds, True)
In general, I would always explicitly provide the keyword arguments (i.e. to_dlc_df(ds, split_individuals=True)), so people don't have to look up the signature to understand the meaning of this boolean.
        if file.path.suffix == ".csv":
            for (
                key,
                df,
            ) in dfdict.items():
                """Iterates over dictionary, the key is the name of the
                individual and the value is the corresponding df."""
                filepath = (
I find f-strings more readable, so I would rewrite this as:

filepath = f"{file.path.with_suffix('')}_{key}.csv"
df.to_csv(Path(filepath), sep=",")
                    str(file.path.with_suffix("")) + "_" + str(key) + ".csv"
                )
                # Convert the string back to a PosixPath object
                filepath_posix = Path(filepath)
                df.to_csv(filepath_posix, sep=",")

        else:  # file.path.suffix == ".h5"
            for key, df in dfdict.items():
                filepath = (
                    str(file.path.with_suffix("")) + "_" + str(key) + ".h5"
                )
                # Convert the string back to a PosixPath object
                filepath_posix = Path(filepath)
                df.to_hdf(filepath, key="df_with_missing")

        logger.info(f"Saved PoseTracks dataset to {file.path}.")

    """If split_individuals is False then it will save the file as a dataframe
    with multi-index columns for each individual."""
    if not split_individuals:
        dataframe = to_dlc_df(ds, False)  # convert to pandas DataFrame
        if isinstance(dataframe, pd.DataFrame):  # checking it's a dataframe
            if file.path.suffix == ".csv":
                dataframe.to_csv(file.path, sep=",")
            else:  # file.path.suffix == ".h5"
                dataframe.to_hdf(file.path, key="df_with_missing")
            logger.info(f"Saved PoseTracks dataset to {file.path}.")
This bit is outdated. We no longer have a "multi_individual" argument; it has to be rewritten to reflect the current arguments.
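A possible rewording of the to_dlc_df Notes section consistent with the current split_individuals argument (a suggestion only, not the PR's final wording):

    Notes
    -----
    The DataFrame(s) will have a multi-index column with the following levels:
    "scorer", "bodyparts", "coords" (if split_individuals is True),
    or "scorer", "individuals", "bodyparts", "coords"
    (if split_individuals is False).
    Regardless of the provenance of the points-wise confidence scores,
    they will be referred to as "likelihood", and stored in
    the "coords" level (as DeepLabCut expects).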