v2.2.0
- New feature: Prune XML tags and their contents from the input before tokenization, either via the command line option `--prune TAGNAME1 --prune TAGNAME2 …` or by passing `prune_tags=["TAGNAME1", "TAGNAME2", …]` to `tokenize_xml` or `tokenize_xml_file`. This can be useful when processing HTML files, e.g. for removing any `<script>` and `<style>` tags from the input.
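
To illustrate what pruning means here, the following is a minimal standard-library sketch, not the tokenizer's own implementation. The helper `prune_elements` and the `SAMPLE` document are hypothetical names introduced only for this example. It assumes well-formed XML input and that the root element itself is never pruned.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (well-formed XHTML-like markup).
SAMPLE = """<html>
<head><style>p { color: red; }</style></head>
<body>
<p>Hello world!</p>
<script>alert("hi");</script>
<p>Bye.</p>
</body>
</html>"""


def prune_elements(xml_text, prune_tags):
    """Return the whitespace-normalized text of xml_text without the pruned elements.

    Illustration only: removes every element whose tag is in prune_tags,
    together with its contents, and keeps the text that follows it.
    """
    root = ET.fromstring(xml_text)
    doomed = set(prune_tags)
    # Walk a snapshot of all elements so that removals do not disturb iteration.
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag in doomed:
                # The text after the pruned element (its tail) is not part of
                # the element's contents, so move it to the preceding node.
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    # Collapse runs of whitespace in the remaining text.
    return " ".join("".join(root.itertext()).split())


if __name__ == "__main__":
    print(prune_elements(SAMPLE, ["script", "style"]))
```

With the new option, the equivalent effect would be requested by naming the tags, e.g. `--prune script --prune style` on the command line or `prune_tags=["script", "style"]` in the API call; the sketch above only mimics the outcome on the text that reaches the tokenizer.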