batch op in developer guidance (#220)
* text action filter

* text action filter

* text entity dependency filter

* complete config_all.yaml and Operators.md

* move spacy-pkuseg to science_requires.txt

* fix typo in comments

* image diffusion mapper

* pre-commit done

* support cuda

* fix executor with_rank & num_proc

* fix unmatch device

* pre-commit done

* update simhash.num_differing_bits

* consistent with HF arg name

* Temporarily skip op fusion due to memory limit

* fix arg renaming

* pip install no cache

* prompt before caption & unit test

* precommit done

* unit test

* unit test

* fix doc

* remove torch dependency

* add gpu develop guide

* one caption all images

* available check

* pre-commit done

* fix typo

* diffuser on gpu

* pre-commit done

* prepare diffusion model and test

* pre-commit done

* unit-test

* unit-test

* unit-test

* unit test rm cache

* unit test rm dj cache

* set up python science only

* set up python science only

* set up python pip no cache

* batch op in developer guidance

---------

Co-authored-by: gece.gc <[email protected]>
Co-authored-by: null <[email protected]>
3 people authored Feb 26, 2024
1 parent 3888388 commit 104feaa
Showing 2 changed files with 42 additions and 0 deletions.
21 changes: 21 additions & 0 deletions docs/DeveloperGuide.md
@@ -126,6 +126,27 @@ class StatsKeys(object):
# ... (same as above)
```

- If the operator processes a batch of samples per call rather than a single sample, declare `self._batched_op = True` in its `__init__` method.
```python
# ... (same as above)

@OPERATORS.register_module('text_length_filter')
class TextLengthFilter(Filter):
    def __init__(self,
                 min_len: PositiveInt = 10,
                 max_len: PositiveInt = sys.maxsize,
                 *args,
                 **kwargs):
        # ... (same as above)
        # Mark this operator as taking a batch of samples per call
        self._batched_op = True

    def compute_stats(self, sample, rank=None):
        # ... (same as above)

    def process(self, sample, rank=None):
        # ... (same as above)
```
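
The guide does not show what a batched input looks like, so the following is only a minimal sketch under stated assumptions. It assumes a batch arrives as a column-oriented dict of lists (the layout Hugging Face `datasets` uses for batched `map` calls). The key names `text`, `stats`, and `text_len`, and the free functions `compute_stats_batched` and `process_batched`, are hypothetical stand-ins for illustration, not Data-Juicer APIs.

```python
from typing import Dict, List

# Hypothetical field names, chosen only for this sketch.
TEXT_KEY = "text"
STATS_KEY = "stats"
TEXT_LEN = "text_len"


def compute_stats_batched(samples: Dict[str, list]) -> Dict[str, list]:
    """Compute the text length for every sample in a dict-of-lists batch."""
    for text, stats in zip(samples[TEXT_KEY], samples[STATS_KEY]):
        stats[TEXT_LEN] = len(text)
    return samples


def process_batched(samples: Dict[str, list], min_len: int,
                    max_len: int) -> List[bool]:
    """Return one keep/drop decision per sample in the batch."""
    return [
        min_len <= stats[TEXT_LEN] <= max_len
        for stats in samples[STATS_KEY]
    ]


# Assumed batch layout: one list per column, aligned by index.
batch = {
    TEXT_KEY: ["short", "a considerably longer sentence"],
    STATS_KEY: [{}, {}],
}
batch = compute_stats_batched(batch)
keep = process_batched(batch, min_len=10, max_len=100)
print(keep)
```

The difference from the single-sample case is that each method loops over the columns of the batch and returns one result per sample, instead of reading one text and returning one boolean.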

3. After implementation, add it to the OP dictionary in the `__init__.py` file in the `data_juicer/ops/filter/` directory.

```python
21 changes: 21 additions & 0 deletions docs/DeveloperGuide_ZH.md
@@ -121,6 +121,27 @@ class StatsKeys(object):
# ... (same as above)
```

- If the operator processes data in batches, so that its input is a batch rather than a single sample, declare `self._batched_op = True`:
```python
# ... (same as above)

@OPERATORS.register_module('text_length_filter')
class TextLengthFilter(Filter):
    def __init__(self,
                 min_len: PositiveInt = 10,
                 max_len: PositiveInt = sys.maxsize,
                 *args,
                 **kwargs):
        # ... (same as above)
        # Mark this operator as taking a batch of samples per call
        self._batched_op = True

    def compute_stats(self, sample, rank=None):
        # ... (same as above)

    def process(self, sample, rank=None):
        # ... (same as above)
```

3. After implementation, add it to the operator dictionary in the `__init__.py` file under the `data_juicer/ops/filter` directory:

```python