Available Modules


This page explains all the modules available in appjsonify and shows sample usage. The Prerequisites column in each table lists the modules that must be executed before the target module.
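The prerequisite rule can be checked mechanically before a pipeline is run. The following is a minimal sketch and not part of appjsonify: the `PREREQUISITES` mapping is copied by hand from a few rows of the tables on this page, and `missing_prerequisites` is a hypothetical helper name.

```python
# Hypothetical helper, not part of appjsonify: checks that every module in a
# pipeline is preceded by the prerequisites listed in the tables on this page.
PREREQUISITES = {
    "load_docs": [],
    "load_objects_with_ml": ["load_docs"],
    "extract_lines": ["load_docs"],
    "extract_paragraphs": ["load_docs", "extract_lines"],
    "detect_sections": ["load_docs", "extract_lines", "extract_paragraphs"],
    "concat_pages": ["load_docs", "extract_lines", "extract_paragraphs"],
    "dump_formatted_doc": [
        "load_docs", "extract_lines", "extract_paragraphs",
        "detect_sections", "concat_pages",
    ],
}


def missing_prerequisites(pipeline):
    """Return (module, missing prerequisite) pairs for a pipeline, in order."""
    seen = set()
    problems = []
    for module in pipeline:
        # Modules absent from the mapping are treated as having no prerequisites.
        for required in PREREQUISITES.get(module, []):
            if required not in seen:
                problems.append((module, required))
        seen.add(module)
    return problems


ok = ["load_docs", "extract_lines", "extract_paragraphs",
      "detect_sections", "concat_pages", "dump_formatted_doc"]
bad = ["load_docs", "extract_paragraphs", "extract_lines"]

print(missing_prerequisites(ok))   # []
print(missing_prerequisites(bad))  # [('extract_paragraphs', 'extract_lines')]
```

An empty result means the order given to `--pipeline` satisfies every prerequisite known to the mapping.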

Document loading related modules

| Module name | Description | Parameters | Prerequisites |
| --- | --- | --- | --- |
| load_docs | load_docs loads the tokens in the given documents and adds them to each Document instance as Token instances. | x_tolerance: A threshold value to determine whether adjacent characters belong to the same word. Defaults to 3.5. | None |
| load_objects_with_ml | load_objects_with_ml loads objects such as tables, figures, and captions, and adds them to each Page instance in its meta dictionary. | tablebank_threshold: A threshold value for the TableBank detection model. Defaults to 0.75.<br>publaynet_threshold: A threshold value for the PubLayNet detection model. Defaults to 0.75.<br>docbank_threshold: A threshold value for the DocBank detection model. Defaults to 0.75.<br>detectron_device_mode: The device type for Detectron2-based models. Defaults to cpu.<br>save_image: Set this to save object images. Defaults to False.<br>output_image_dir: Specify an image path if save_image is True. | load_docs |

Sample usage

The following will output all tokens contained in a PDF document as a JSON file.

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir \
    --pipeline load_docs dump_doc_with_tokens \
    --x_tolerance 1.2
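To build intuition for what `x_tolerance` controls, here is a minimal sketch that merges characters into words whenever the horizontal gap between neighbours is within the tolerance. The gap-based rule, the `(text, x0, x1)` tuple layout, and the function name are assumptions made for illustration; appjsonify's actual logic may differ.

```python
# Illustrative sketch only: the gap-based rule and the (text, x0, x1) layout
# are assumptions, not appjsonify's actual implementation.
def group_chars_into_words(chars, x_tolerance=3.5):
    """Merge characters on one line into words by horizontal gap."""
    words = []
    current = ""
    prev_x1 = None
    for text, x0, x1 in sorted(chars, key=lambda c: c[1]):
        # Start a new word when the gap to the previous character is too wide.
        if prev_x1 is not None and x0 - prev_x1 > x_tolerance:
            words.append(current)
            current = ""
        current += text
        prev_x1 = x1
    if current:
        words.append(current)
    return words


# Gaps between neighbours: 1, 1, 3, 1
chars = [("P", 0, 5), ("D", 6, 11), ("F", 12, 17), ("t", 20, 23), ("o", 24, 29)]
print(group_chars_into_words(chars, x_tolerance=3.5))  # ['PDFto']
print(group_chars_into_words(chars, x_tolerance=1.2))  # ['PDF', 'to']
```

Under this assumption, lowering the tolerance (as `--x_tolerance 1.2` does in the sample) splits tightly set text into more words, while a larger value merges them.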

Editing-related modules

Extraction

| Module name | Description | Parameters | Prerequisites |
| --- | --- | --- | --- |
| extract_lines | extract_lines forms Line instances from tokens. Each instance groups the tokens that should be on the same line. | y_tolerance: A threshold value to determine whether different tokens are on the same line. Defaults to 3.0. | load_docs |
| extract_footnotes | extract_footnotes extracts Line instances from the body that should be footnotes and saves them as supplementary information. | footnote_offset: A threshold value to determine whether an element is a footnote. Defaults to 150.<br>use_horizontal_lines: Set this to use horizontal line object information to judge whether a token is a footnote. Defaults to False. | load_docs, extract_lines |
| extract_paragraphs | extract_paragraphs concatenates Line instances and forms a Paragraph instance when the conditions are met. | x_offset: An x-axis threshold value to determine whether different lines are in the same paragraph. Defaults to 18.<br>y_offset: A y-axis threshold value to determine whether different lines are in the same paragraph. Defaults to 4.<br>consider_font_size: Consider font size when making judgements. Defaults to False.<br>indent_offset: A threshold x-axis offset for indents. Defaults to 25.<br>listing_offset: A threshold x-axis offset for listing. Defaults to 40. | load_docs, extract_lines |
| extract_captions_with_ml | extract_captions_with_ml extracts Line instances that are highly likely to be captions using outputs from load_objects_with_ml. | caption_overlap_threshold: A threshold value to determine whether a line is a caption. Defaults to 0.5.<br>table_start_ptn_str: Specify a regex pattern to detect a caption for a table. Defaults to `Table [0-9]+:`.<br>figure_start_ptn_str: Specify a regex pattern to detect a caption for a figure. Defaults to `Figure [0-9]+:`.<br>preset_table_caption_pos: Specify the assumed position of table captions: one of standard, below, or above. Defaults to standard.<br>preset_figure_caption_pos: Specify the assumed position of figure captions: one of standard, below, or above. Defaults to standard.<br>caption_assignment_threshold: Specify a threshold distance to determine whether a caption should be assigned to a table or image. Defaults to 75.0. | load_docs, load_objects_with_ml, extract_lines |
| extract_footnotes_with_ml | extract_footnotes_with_ml extracts Line instances that are highly likely to be footnotes using outputs from load_objects_with_ml. | footnote_overlap_threshold: A threshold value to determine whether a line is a footnote. Defaults to 0.5. | load_docs, load_objects_with_ml, extract_lines |

Sample usage

The following will not only output all lines contained in a PDF document but also extract captions and footnotes as supplementary information.

Note that the following assumes the use of a GPU. If you do not have one, remove --detectron_device_mode cuda or set --detectron_device_mode cpu.

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir \
    --pipeline load_docs load_objects_with_ml extract_lines extract_captions_with_ml extract_footnotes_with_ml dump_doc_with_lines \
    --x_tolerance 1.2 \
    --tablebank_threshold 0.9 \
    --publaynet_threshold 0.9 \
    --docbank_threshold 0.9 \
    --detectron_device_mode cuda \
    --y_tolerance 6.0 \
    --caption_overlap_threshold 0.5 \
    --table_start_ptn_str "Table [0-9]+:" \
    --figure_start_ptn_str "Figure [0-9]+:" \
    --preset_table_caption_pos below \
    --preset_figure_caption_pos below \
    --caption_assignment_threshold 75.0 \
    --footnote_overlap_threshold 0.5
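Before passing custom values for `--table_start_ptn_str` or `--figure_start_ptn_str`, it can help to try the patterns on a few caption lines with Python's `re` module. This sketch assumes the pattern is matched against the beginning of a line; how appjsonify applies it internally is not stated on this page, and `caption_kind` is a hypothetical helper.

```python
import re

# Default patterns as documented on this page.
table_start_ptn = re.compile(r"Table [0-9]+:")
figure_start_ptn = re.compile(r"Figure [0-9]+:")

lines = [
    "Table 1: Dataset statistics.",
    "Figure 12: Overview of the pipeline.",
    "Table 2 shows the results.",   # no colon after the number
    "Fig. 3: Ablation study.",      # abbreviated label
]


# Assumption: a caption is a line that *starts* with the pattern.
def caption_kind(line):
    if table_start_ptn.match(line):
        return "table"
    if figure_start_ptn.match(line):
        return "figure"
    return None


kinds = [caption_kind(line) for line in lines]
print(kinds)  # ['table', 'figure', None, None]
```

A paper that labels figures as `Fig. 3:` would not match the default, so a broader pattern such as `(Figure|Fig\.) [0-9]+:` would be needed in that case.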

Editing

| Module name | Description | Parameters | Prerequisites |
| --- | --- | --- | --- |
| detect_sections | detect_sections classifies each Paragraph as a section, subsection, or body. The results are used for formatting. | headline_names: Specify known headline names that do not have section numbers so that they can be easily recognized as sections.<br>max_headline_len: Specify the maximum number of words for an element to be considered a headline. Defaults to 30.<br>paper_type: If this is set, detect_sections detects sections in a more fine-grained manner on the basis of font size and font name. | load_docs, extract_lines, extract_paragraphs |
| tailor_references | tailor_references tailors reference paragraphs that are fragmented. | listing_offset: A threshold x-axis offset for listing. Defaults to 40. | load_docs, extract_lines, extract_paragraphs, detect_sections |
| concat_columns | concat_columns concatenates Paragraph instances, each of which is in a different column, and forms a new Paragraph instance if they meet the criteria. | column_offset: Specify a threshold value to determine whether a paragraph is in a different column from the previous one. Defaults to 300.<br>consider_font_size: Consider font size when making judgements. Defaults to False. | load_docs, extract_lines, extract_paragraphs |
| concat_pages | concat_pages concatenates Paragraph instances on different pages if they meet the conditions. | consider_font_size: Consider font size when making judgements. Defaults to False. | load_docs, extract_lines, extract_paragraphs |

Sample usage

The following will list all paragraphs contained in a PDF document with section and subsection information.

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir \
    --pipeline load_docs extract_lines extract_paragraphs detect_sections concat_pages dump_formatted_doc \
    --x_tolerance 1.2 \
    --y_tolerance 6.0 \
    --x_offset 10 \
    --consider_font_size \
    --indent_offset 25 \
    --listing_offset 45 \
    --max_headline_len 10 \
    --headline_names Abstract References Limitations Appendix
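The interplay of `--headline_names` and `--max_headline_len` can be pictured with a small heuristic. This is only a sketch under stated assumptions: the section-number regex, the exact-match rule for known names, and the function name `looks_like_headline` are all hypothetical, and detect_sections may decide differently (for example, when `paper_type` is set).

```python
import re

# Illustrative heuristic only; appjsonify's detect_sections may work differently.
SECTION_NUMBER = re.compile(r"^[0-9]+(\.[0-9]+)*\.?\s")  # e.g. "3 ", "3.1 "


def looks_like_headline(text, headline_names=(), max_headline_len=30):
    """Guess whether a paragraph is a headline candidate."""
    words = text.split()
    # Too long (in words) to be a headline.
    if not words or len(words) > max_headline_len:
        return False
    # Known unnumbered headlines, e.g. "Abstract" or "References".
    if text.strip() in headline_names:
        return True
    # Otherwise require a leading section number.
    return bool(SECTION_NUMBER.match(text))


names = ("Abstract", "References", "Limitations", "Appendix")
print(looks_like_headline("Abstract", names, 10))                # True
print(looks_like_headline("3.1 Experimental Setup", names, 10))  # True
print(looks_like_headline("We evaluate our method on two datasets.", names, 10))  # False
```

In this picture, a small `max_headline_len` such as 10 keeps long numbered sentences from being mistaken for headlines, and `headline_names` covers headings that carry no number.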

Removal

| Module name | Description | Parameters | Prerequisites |
| --- | --- | --- | --- |
| remove_meta | remove_meta excludes header and footer Token instances from the body. | header_offset: A threshold value to determine whether an element is a header. Defaults to 80.<br>footer_offset: A threshold value to determine whether an element is a footer. Defaults to 80.<br>left_side_offset: A threshold value to determine whether an element is in the left-hand side margin area. Defaults to 40.<br>right_side_offset: A threshold value to determine whether an element is in the right-hand side margin area. Defaults to 40. | load_docs |
| remove_lines_by_objects | remove_lines_by_objects removes Line instances that can be captions and elements inside figures or tables from the body. | remove_by_obj: Set this to remove captions based on bounding box information on objects. Defaults to False.<br>remove_by_line: Set this to remove captions based on bounding box information on line objects. Defaults to False.<br>remove_by_rect: Set this to remove captions based on bounding box information on rect objects. Defaults to False.<br>remove_by_curve: Set this to remove captions based on bounding box information on curve objects. Defaults to False. | load_docs, extract_lines |
| remove_figures_with_ml | remove_figures_with_ml excludes Line instances inside figures from the body. | object_bbox_offset: Specify a threshold value to determine whether a line element is within the margin area of an object. Defaults to 25. | load_docs, load_objects_with_ml, extract_lines |
| remove_tables_with_ml | remove_tables_with_ml excludes Line instances inside tables from the body. | object_bbox_offset: Specify a threshold value to determine whether a line element is within the margin area of an object. Defaults to 25. | load_docs, load_objects_with_ml, extract_lines |
| remove_equations_with_ml | remove_equations_with_ml removes Line instances that are judged to be equations on the basis of the outputs from the ML-based bounding box detectors. | equation_overlap_threshold: Specify a threshold value to determine whether a line is an equation. Defaults to 0.5. | load_docs, load_objects_with_ml, extract_lines |

Sample usage

The following will list all paragraphs contained in a PDF document with section and subsection information, while filtering out non-body content.

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir \
    --pipeline load_docs remove_illegal_tokens remove_meta extract_lines extract_footnotes remove_lines_by_objects extract_paragraphs detect_sections concat_columns concat_pages dump_formatted_doc \
    --x_tolerance 1.2 \
    --header_offset 75 \
    --footer_offset 80 \
    --left_side_offset 40 \
    --right_side_offset 40 \
    --y_tolerance 6.0 \
    --footnote_offset 200 \
    --use_horizontal_lines \
    --object_bbox_offset 25 \
    --remove_by_line \
    --remove_by_curve \
    --x_offset 10 \
    --consider_font_size \
    --indent_offset 25 \
    --listing_offset 45 \
    --max_headline_len 10 \
    --headline_names Abstract References Limitations Appendix
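One way to picture the four remove_meta offsets is as margins that carve a body region out of the page. The sketch below is an illustration under assumptions, not appjsonify's implementation: it assumes coordinates in points with the origin at the top-left corner, a `(x0, top, x1, bottom)` bounding box per token, and a hypothetical helper name `in_body`.

```python
# Illustrative sketch only: the coordinate system (origin at the top-left,
# units in points) and the comparison rules are assumptions.
def in_body(bbox, page_width, page_height,
            header_offset=80, footer_offset=80,
            left_side_offset=40, right_side_offset=40):
    """Return True if a token's bounding box lies inside the body region."""
    x0, top, x1, bottom = bbox
    if top < header_offset:                      # header area
        return False
    if bottom > page_height - footer_offset:     # footer area
        return False
    if x0 < left_side_offset:                    # left-hand margin
        return False
    if x1 > page_width - right_side_offset:      # right-hand margin
        return False
    return True


page_w, page_h = 595, 842  # roughly A4 in points
print(in_body((72, 100, 300, 112), page_w, page_h))   # True: body text
print(in_body((72, 30, 300, 42), page_w, page_h))     # False: header
print(in_body((290, 800, 305, 812), page_w, page_h))  # False: footer (page number)
```

Under these assumptions, raising `--header_offset` or `--footer_offset` shrinks the body region, which helps with running heads and page numbers but risks dropping the first or last body line on a page.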

Output-related modules

| Module name | Description | Parameters | Prerequisites |
| --- | --- | --- | --- |
| dump_doc_with_tokens | dump_doc_with_tokens exports all Token instances in a PDF document. | output_dir: Specify an output directory. | load_docs |
| dump_doc_with_lines | dump_doc_with_lines exports all Line instances in a PDF document. | output_dir: Specify an output directory. | load_docs, extract_lines |
| dump_doc_with_paragraphs | dump_doc_with_paragraphs exports all Paragraph instances in a PDF document. | output_dir: Specify an output directory. | load_docs, extract_lines, extract_paragraphs |
| dump_doc_with_sections | dump_doc_with_sections exports all Paragraph instances with section information in a PDF document. | output_dir: Specify an output directory. | load_docs, extract_lines, extract_paragraphs, detect_sections |
| dump_formatted_doc | dump_formatted_doc exports all Paragraph instances with section information in a PDF document. This should be used after concat_pages. | output_dir: Specify an output directory. | load_docs, extract_lines, extract_paragraphs, detect_sections, concat_pages |