[Feat] OP-wise Insight Mining #516

Merged
merged 24 commits into from
Dec 20, 2024

Conversation

HYLcool
Collaborator

@HYLcool HYLcool commented Dec 19, 2024

  • Support OP-wise Insight Mining, which uses a T-Test measure to check whether the changes in stats/meta before and after each OP are significant.
  • This helps users understand how each OP influences their datasets.
  • Add a new measure, RelatedTTestMeasure, which runs a T-Test on two related distributions using their histograms built with the same bins.
  • It analyzes small batches of the original dataset and of the current dataset after each OP is applied. After all OPs have been processed, it collects all the analysis results and mines insights about the significance of the stats/meta changes from them.
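The idea behind a related T-Test on histograms with shared bins can be sketched as follows. This is a minimal standard-library illustration, not the actual RelatedTTestMeasure implementation: the function names, the equal-width binning, and the default of 10 bins are all assumptions made for the example. It computes only the paired t statistic; a p-value would additionally need the t-distribution (for instance via `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev


def shared_histograms(before, after, num_bins=10):
    """Bin both samples on the same equal-width edges (hypothetical helper)."""
    lo = min(min(before), min(after))
    hi = max(max(before), max(after))
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / num_bins or 1.0

    def counts(values):
        hist = [0] * num_bins
        for v in values:
            # Clamp the maximum value into the last bin.
            idx = min(int((v - lo) / width), num_bins - 1)
            hist[idx] += 1
        # Raw counts are kept on purpose: normalized frequencies would
        # always have a mean per-bin difference of zero.
        return hist

    return counts(before), counts(after)


def paired_t_statistic(hist_a, hist_b):
    """Paired (related) t statistic over the per-bin count differences."""
    diffs = [a - b for a, b in zip(hist_a, hist_b)]
    sd = stdev(diffs)
    if sd == 0:
        # All bins changed by the same amount (possibly zero).
        return 0.0 if mean(diffs) == 0 else math.inf
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Example: a filter-like OP that drops the lower half of the values.
before = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
after = [6, 7, 8, 9, 10]
h_before, h_after = shared_histograms(before, after, num_bins=5)
t_stat = paired_t_statistic(h_before, h_after)
```

A large absolute t statistic suggests the OP shifted the distribution of that stat; comparing it against a significance threshold is what turns it into an "insight".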

Core related:

  • Analyzer: Support analysis of string tags/categories/stats in the stats and meta fields; support passing an existing Dataset object to the analyzer's run method instead of loading the dataset from disk.
  • Monitor: Remove error messages from the resource monitor process that are caused by errors in the main process, which makes the logs clearer.
  • Exporter: Export meta along with stats if export_stats is True.
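As a rough illustration of what analyzing string tags/categories involves, the sketch below counts how often each string value occurs in a field. It is not the Analyzer's code: the `count_categories` helper and the `"meta"`, `"tags"`, and `"lang"` keys are hypothetical names chosen for the example.

```python
from collections import Counter


def count_categories(samples, field, key):
    """Count string values stored under samples[i][field][key].

    A value may be a single string (one category) or a list of strings
    (several tags); both are folded into one frequency table.
    """
    counter = Counter()
    for sample in samples:
        value = sample.get(field, {}).get(key)
        if value is None:
            continue  # this sample carries no value for the key
        if isinstance(value, str):
            counter[value] += 1
        else:
            counter.update(value)
    return counter


samples = [
    {"meta": {"tags": ["cat", "dog"]}},
    {"meta": {"tags": ["cat"]}},
    {"meta": {"lang": "en"}},
]
tag_counts = count_categories(samples, "meta", "tags")
lang_counts = count_categories(samples, "meta", "lang")
```

Frequency tables like these can be compared before and after an OP in the same way as numeric histograms.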

Others

  • Add two registries, Non-stats Filters and Tagging Operators, to mark the Filters that don't produce any stats and the OPs that produce tags or other information in the meta field. For now, they are:
    • Non-stats Filters: specified_field_filter, specified_numeric_field_filter, suffix_filter, video_tagging_from_frames_filter (produces tags instead of stats).
    • Tagging Operators: video_tagging_from_frames_filter, image_tagging_mapper, video_tagging_from_audio_mapper, video_tagging_from_frames_mapper
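The registry-as-decorator pattern described above might look roughly like this. The `Registry` class, its `register_module` method, and the placeholder OP classes are assumptions written for this sketch and should not be read as Data-Juicer's actual API; only the OP names come from the lists above.

```python
class Registry:
    """Minimal name-to-class registry (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.modules = {}

    def register_module(self, module_name):
        """Return a class decorator that records the class under module_name."""
        def decorator(cls):
            self.modules[module_name] = cls
            return cls  # the class itself is left unchanged
        return decorator


NON_STATS_FILTERS = Registry("Non-stats Filters")
TAGGING_OPS = Registry("Tagging Operators")


@NON_STATS_FILTERS.register_module("suffix_filter")
class SuffixFilter:
    """Placeholder for a Filter that produces no stats."""


# An OP can belong to both registries, as video_tagging_from_frames_filter does.
@TAGGING_OPS.register_module("video_tagging_from_frames_filter")
@NON_STATS_FILTERS.register_module("video_tagging_from_frames_filter")
class VideoTaggingFromFramesFilter:
    """Placeholder for a Filter that produces tags instead of stats."""


def produces_stats(op_name):
    """A Filter is assumed to produce stats unless it is registered otherwise."""
    return op_name not in NON_STATS_FILTERS.modules
```

With such registries, code that mines insights can look up an OP by name to decide whether to compare stats, tags in meta, or nothing at all.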

⚠️ Notice

This PR is based on PR #512. Please review that PR before reviewing this one.

@HYLcool HYLcool added the enhancement (New feature or request) and dj:core (issues/PRs about the core functions of Data-Juicer) labels Dec 19, 2024
@HYLcool HYLcool requested review from BeachWang and yxdyc December 19, 2024 13:24
@HYLcool HYLcool self-assigned this Dec 19, 2024
# Conflicts:
#	data_juicer/ops/__init__.py
#	data_juicer/ops/base_op.py
# Conflicts:
#	data_juicer/config/config.py
#	data_juicer/core/analyzer.py
Collaborator

@yxdyc yxdyc left a comment

LGTM, plz see the minor comments.

data_juicer/config/config.py (resolved)
data_juicer/core/adapter.py (outdated, resolved)
data_juicer/core/adapter.py (resolved)
@yxdyc yxdyc merged commit b6f89a9 into main Dec 20, 2024
3 checks passed
@HYLcool HYLcool deleted the feat/insight_mining branch December 22, 2024 10:21
3 participants