update op desc in overview scan
zhijianma committed Oct 31, 2023
1 parent 263910f commit 90ef094
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion demos/overview_scan/app.py
@@ -88,7 +88,7 @@
| Type | Number | Description |
|-----------------------------------|:------:|-------------------------------------------------|
| Formatter | 7 | Discovers, loads, and canonicalizes source data |
-| Mapper | 19 | Edits and transforms samples |
+| Mapper | 21 | Edits and transforms samples |
| Filter | 16 | Filters out low-quality samples |
| Deduplicator | 3 | Detects and removes duplicate samples |
| Selector | 2 | Selects top samples based on ranking |
@@ -111,6 +111,7 @@
'''
| Operator | Domain | Lang | Description |
|-----------------------------------------------|--------------------|--------|----------------------------------------------------------------------------------------------------------------|
+| chinese_convert_mapper | General | zh | Converts Chinese text between Traditional Chinese, Simplified Chinese, and Japanese Kanji (via [opencc](https://github.com/BYVoid/OpenCC)) |
| clean_copyright_mapper | Code | en, zh | Removes copyright notice at the beginning of code files (:warning: must contain the word *copyright*) |
| clean_email_mapper | General | en, zh | Removes email information |
| clean_html_mapper | General | en, zh | Removes HTML tags and returns plain text of all the nodes |
@@ -125,6 +126,7 @@
| remove_comments_mapper | LaTeX | en, zh | Removes the comments of TeX documents |
| remove_header_mapper | LaTeX | en, zh | Removes the running headers of TeX documents, e.g., titles, chapter or section numbers/names |
| remove_long_words_mapper | General | en, zh | Removes words with length outside the specified range |
+| remove_non_chinese_character_mapper | General | en, zh | Removes non-Chinese characters from text samples |
| remove_specific_chars_mapper | General | en, zh | Removes any user-specified characters or substrings |
| remove_table_text_mapper | General, Financial | en | Detects and removes possible table contents (:warning: relies on regular expression matching and thus fragile) |
| remove_words_with_incorrect_<br />substrings_mapper | General | en, zh | Removes words containing specified substrings |
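
For readers unfamiliar with the newly listed `remove_non_chinese_character_mapper`, the behavior described in its table row can be approximated as below. This is a minimal sketch, not the operator's actual implementation: the function name `remove_non_chinese` and the choice of the CJK Unified Ideographs block (U+4E00 to U+9FFF) as the definition of "Chinese character" are assumptions made for illustration.

```python
import re

# Assumption: "Chinese character" means the CJK Unified Ideographs block
# (U+4E00 to U+9FFF). The real operator may use a different range or options.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def remove_non_chinese(text: str) -> str:
    """Hypothetical helper: drop every character outside the assumed CJK range."""
    return NON_CHINESE.sub("", text)


if __name__ == "__main__":
    print(remove_non_chinese("你好, world! 123"))
```

Under this assumption, Latin letters, digits, punctuation, and whitespace are all stripped, so a purely English sample becomes an empty string.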