Available Modules


This page describes every module available in appjsonify, along with sample usage. The Prerequisites column in each table lists the modules that must be executed before the target module.
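As an illustration of the prerequisite rule, the following minimal Python sketch checks a pipeline order before it is passed to `--pipeline`. The `PREREQUISITES` dictionary copies a subset of the tables on this page, and `missing_prerequisites` is a hypothetical helper written for this example; it is not part of appjsonify.

```python
# Prerequisites copied from the tables on this page (a subset, for brevity).
PREREQUISITES = {
    "load_docs": [],
    "load_objects_with_ml": ["load_docs"],
    "extract_lines": ["load_docs"],
    "extract_paragraphs": ["load_docs", "extract_lines"],
    "detect_sections": ["load_docs", "extract_lines", "extract_paragraphs"],
    "concat_pages": ["load_docs", "extract_lines", "extract_paragraphs"],
    "dump_doc_with_tokens": ["load_docs"],
    "dump_formatted_doc": [
        "load_docs", "extract_lines", "extract_paragraphs",
        "detect_sections", "concat_pages",
    ],
}


def missing_prerequisites(pipeline):
    """Return (module, missing prerequisite) pairs for a pipeline order.

    Hypothetical helper, not part of appjsonify: a prerequisite counts as
    satisfied only if it appears earlier in the list than its dependant.
    Modules absent from PREREQUISITES are treated as having none.
    """
    seen = set()
    problems = []
    for module in pipeline:
        for required in PREREQUISITES.get(module, []):
            if required not in seen:
                problems.append((module, required))
        seen.add(module)
    return problems


print(missing_prerequisites(["load_docs", "dump_doc_with_tokens"]))  # []
print(missing_prerequisites(["extract_lines", "load_docs"]))
# [('extract_lines', 'load_docs')]
```

An empty list means every module in the pipeline runs after its prerequisites.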

Document loading-related modules

Module name	Description	Parameters	Prerequisites
`load_docs`	`load_docs` loads the tokens in the given documents and adds them to each `Document` instance as `Token` instances.	`x_tolerance`: A threshold value to determine whether adjacent characters belong to the same word. Defaults to 3.5.	None
`load_objects_with_ml`	`load_objects_with_ml` loads objects such as tables, figures, and captions, and adds them to the meta dictionary of each `Page` instance.	`tablebank_threshold`: A threshold value for a TableBank detection model. Defaults to 0.75. `publaynet_threshold`: A threshold value for a PubLayNet detection model. Defaults to 0.75. `docbank_threshold`: A threshold value for a DocBank detection model. Defaults to 0.75. `detectron_device_mode`: The device type for Detectron2-based models. Defaults to `cpu`. `save_image`: Set this to save object images. Defaults to False. `output_imgae_dir`: Specify an image path if `save_image` is True.	`load_docs`

Sample usage

The following will output all tokens contained in a PDF document as a JSON file.

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir \
    --pipeline load_docs dump_doc_with_tokens \
    --x_tolerance 1.2
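To give a feel for what `x_tolerance` controls, here is a minimal Python sketch of gap-based word grouping. It assumes that a character joins the current word when the horizontal gap to the previous character is at most `x_tolerance`; the actual rule used by `load_docs` may differ. `group_chars_into_words` and the character coordinates are hypothetical.

```python
def group_chars_into_words(chars, x_tolerance=3.5):
    """Group (text, x0, x1) characters on one line into words.

    Assumption: a character joins the current word when the gap between its
    left edge and the previous character's right edge is at most x_tolerance.
    """
    words = []
    current = ""
    prev_x1 = None
    for text, x0, x1 in sorted(chars, key=lambda c: c[1]):
        # A gap wider than the tolerance closes the current word.
        if prev_x1 is not None and x0 - prev_x1 > x_tolerance:
            words.append(current)
            current = ""
        current += text
        prev_x1 = x1
    if current:
        words.append(current)
    return words


# Hypothetical characters: the gap between "F" and "t" is 2.5 units.
chars = [("P", 0.0, 5.0), ("D", 5.5, 10.5), ("F", 11.0, 16.0),
         ("t", 18.5, 21.0), ("o", 21.5, 26.0)]
print(group_chars_into_words(chars, x_tolerance=3.5))  # ['PDFto']
print(group_chars_into_words(chars, x_tolerance=1.2))  # ['PDF', 'to']
```

Under this assumption, a smaller value such as the `1.2` used in the sample above splits words more aggressively than the default of 3.5.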

Editing-related modules

Extraction

Module name	Description	Parameters	Prerequisites
`extract_lines`	`extract_lines` forms `Line` instances from tokens. Each instance groups the tokens that belong to the same line.	`y_tolerance`: A threshold value to determine if different tokens are in the same line. Defaults to 3.0.	`load_docs`
`extract_footnotes`	`extract_footnotes` extracts `Line` instances from the body that should be footnotes and saves them as supplementary information.	`footnote_offset`: A threshold value to determine if an element is a footnote or not. Defaults to 150. `use_horizontal_lines`: Set this to use horizontal line object information to judge whether a token is a footnote or not. Defaults to False.	`load_docs`, `extract_lines`
`extract_paragraphs`	`extract_paragraphs` concatenates `Line` instances and forms a `Paragraph` instance when the conditions are met.	`x_offset`: An x-axis threshold value to determine if different lines are in the same paragraph. Defaults to 18. `y_offset`: A y-axis threshold value to determine if different lines are in the same paragraph. Defaults to 4. `consider_font_size`: Consider font size in making judgements. Defaults to False. `indent_offset`: A threshold x-axis offset for indents. Defaults to 25. `listing_offset`: A threshold x-axis offset for listing. Defaults to 40.	`load_docs`, `extract_lines`
`extract_captions_with_ml`	`extract_captions_with_ml` extracts `Line` instances that are highly likely to be captions using outputs from `load_objects_with_ml`.	`caption_overlap_threshold`: A threshold value to determine if a line is a caption. Defaults to 0.5. `table_start_ptn_str`: Specify a regex pattern to detect a caption for a table. Defaults to `Table [0-9]+:`. `figure_start_ptn_str`: Specify a regex pattern to detect a caption for a figure. Defaults to `Figure [0-9]+:`. `preset_table_caption_pos`: Specify the assumed position of table captions, one of `standard`, `below`, and `above`. Defaults to `standard`. `preset_figure_caption_pos`: Specify the assumed position of figure captions, one of `standard`, `below`, and `above`. Defaults to `standard`. `caption_assignment_threshold`: Specify a threshold distance value to determine if a caption should be assigned to a table or image. Defaults to 75.0.	`load_docs`, `load_objects_with_ml`, `extract_lines`
`extract_footnotes_with_ml`	`extract_footnotes_with_ml` extracts `Line` instances that are highly likely to be footnotes using outputs from `load_objects_with_ml`.	`footnote_overlap_threshold`: A threshold value to determine if a line is a footnote. Defaults to 0.5.	`load_docs`, `load_objects_with_ml`, `extract_lines`
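The default caption patterns can be tried out with Python's `re` module before passing custom values to `--table_start_ptn_str` or `--figure_start_ptn_str`. This is a minimal sketch: `caption_kind` is a hypothetical helper, and it assumes the pattern is anchored at the start of the line, which may not be how appjsonify applies it.

```python
import re

# Default patterns quoted from the table above.
TABLE_START = re.compile(r"Table [0-9]+:")
FIGURE_START = re.compile(r"Figure [0-9]+:")


def caption_kind(line_text):
    """Return "table", "figure", or None for a line of text.

    Assumption: the pattern is matched at the start of the line (re.match);
    appjsonify may apply it differently.
    """
    if TABLE_START.match(line_text):
        return "table"
    if FIGURE_START.match(line_text):
        return "figure"
    return None


print(caption_kind("Table 2: Main results"))    # table
print(caption_kind("Figure 10: Overview"))      # figure
print(caption_kind("Table 2. Main results"))    # None (period, not colon)
```

The last call shows why a document that uses `Table 2.` style captions would need a different pattern, for example `Table [0-9]+\.`.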

Sample usage

The following will output all lines contained in a PDF document and also extract captions and footnotes as supplementary information.

Note that the following assumes the use of a GPU. If you do not have one, remove `--detectron_device_mode cuda` or set `--detectron_device_mode cpu`.

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir \
    --pipeline load_docs load_objects_with_ml extract_lines extract_captions_with_ml extract_footnotes_with_ml dump_doc_with_lines \
    --x_tolerance 1.2 \
    --tablebank_threshold 0.9 \
    --publaynet_threshold 0.9 \
    --docbank_threshold 0.9 \
    --detectron_device_mode cuda \
    --y_tolerance 6.0 \
    --caption_overlap_threshold 0.5 \
    --table_start_ptn_str "Table [0-9]+:" \
    --figure_start_ptn_str "Figure [0-9]+:" \
    --preset_table_caption_pos below \
    --preset_figure_caption_pos below \
    --caption_assignment_threshold 75.0 \
    --footnote_overlap_threshold 0.5
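The overlap thresholds used above (`caption_overlap_threshold`, `footnote_overlap_threshold`) compare a line against a detected object region. The following minimal Python sketch shows one plausible way such a check could work. It assumes overlap is measured as the fraction of the line's bounding box covered by the detected box; both helper functions are hypothetical and appjsonify may define overlap differently.

```python
def overlap_ratio(line_bbox, object_bbox):
    """Fraction of the line's area covered by the object's bounding box.

    Boxes are (x0, y0, x1, y1). Assumption: overlap is measured relative to
    the line's own area; appjsonify may define it differently.
    """
    lx0, ly0, lx1, ly1 = line_bbox
    ox0, oy0, ox1, oy1 = object_bbox
    # Width and height of the intersection rectangle (zero if disjoint).
    width = max(0.0, min(lx1, ox1) - max(lx0, ox0))
    height = max(0.0, min(ly1, oy1) - max(ly0, oy0))
    line_area = (lx1 - lx0) * (ly1 - ly0)
    return (width * height) / line_area if line_area > 0 else 0.0


def is_caption(line_bbox, caption_bbox, caption_overlap_threshold=0.5):
    """Hypothetical check: enough of the line lies inside the caption box."""
    return overlap_ratio(line_bbox, caption_bbox) >= caption_overlap_threshold


line = (0.0, 0.0, 100.0, 10.0)
print(is_caption(line, (0.0, 0.0, 60.0, 10.0)))  # True: 60% of the line is covered
print(is_caption(line, (0.0, 0.0, 40.0, 10.0)))  # False: only 40% is covered
```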

Editing

Module name	Description	Parameters	Prerequisites
`detect_sections`	`detect_sections` classifies each `Paragraph` as a section, subsection, or body. Results will be used for formatting.	`headline_names`: Specify one or more known headline names that do not have section numbers so that they can be easily recognized as sections. `max_headline_len`: Specify a maximum number of words to determine if an element is a headline. Defaults to 30. `paper_type`: If this is set, `detect_sections` will detect sections in a more fine-grained manner on the basis of font size and font name.	`load_docs`, `extract_lines`, `extract_paragraphs`
`tailor_references`	`tailor_references` tailors reference paragraphs that are fragmented.	`listing_offset`: A threshold x-axis offset for listing. Defaults to 40.	`load_docs`, `extract_lines`, `extract_paragraphs`, `detect_sections`
`concat_columns`	`concat_columns` concatenates `Paragraph` instances, each of which is in a different column, and forms a new `Paragraph` instance if they meet the criteria.	`column_offset`: Specify a threshold value to determine if a paragraph is in a different column from a previous one. Defaults to 300. `consider_font_size`: Consider font size in making judgements. Defaults to False.	`load_docs`, `extract_lines`, `extract_paragraphs`
`concat_pages`	`concat_pages` concatenates `Paragraph` instances in different pages if they meet the conditions.	`consider_font_size`: Consider font size in making judgements. Defaults to False.	`load_docs`, `extract_lines`, `extract_paragraphs`

Sample usage

The following will list all paragraphs contained in a PDF document with section and subsection information.

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir \
    --pipeline load_docs extract_lines extract_paragraphs detect_sections concat_pages dump_formatted_doc \
    --x_tolerance 1.2 \
    --y_tolerance 6.0 \
    --x_offset 10 \
    --consider_font_size \
    --indent_offset 25 \
    --listing_offset 45 \
    --max_headline_len 10 \
    --headline_names Abstract References Limitations Appendix
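The interplay of `--max_headline_len` and `--headline_names` in the sample above can be pictured with a small Python sketch. It assumes a paragraph is a headline candidate when it is short enough and either starts with a section number or exactly matches a known headline name. `looks_like_headline` and the `SECTION_NUMBER` pattern are hypothetical; the real `detect_sections` logic (including the font-based `paper_type` mode) is not reproduced here.

```python
import re

# Hypothetical pattern for numbered headings such as "1 Introduction" or "3.1 Setup".
SECTION_NUMBER = re.compile(r"^[0-9]+(\.[0-9]+)*\.?\s+\S")


def looks_like_headline(text, headline_names=(), max_headline_len=30):
    """Rough headline test under the assumptions stated above."""
    stripped = text.strip()
    words = stripped.split()
    # Too long (in words) or empty: treat as body text.
    if not words or len(words) > max_headline_len:
        return False
    # Unnumbered headings are only recognized when listed explicitly.
    if stripped in headline_names:
        return True
    return bool(SECTION_NUMBER.match(stripped))


names = ("Abstract", "References", "Limitations", "Appendix")
print(looks_like_headline("Abstract", names, max_headline_len=10))                # True
print(looks_like_headline("3.1 Experimental Setup", names, max_headline_len=10))  # True
print(looks_like_headline("Abstract", (), max_headline_len=10))                   # False
```

Under these assumptions, unnumbered headings such as "Abstract" are only picked up when they are listed in `headline_names`.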

Removal

Module name	Description	Parameters	Prerequisites
`remove_meta`	`remove_meta` excludes header and footer `Token` instances from the body.	`header_offset`: A threshold value to determine if an element is a header. Defaults to 80. `footer_offset`: A threshold value to determine if an element is a footer. Defaults to 80. `left_side_offset`: A threshold value to determine if an element is in a left-hand side margin area. Defaults to 40. `right_side_offset`: A threshold value to determine if an element is in a right-hand side margin area. Defaults to 40.	`load_docs`
`remove_lines_by_objects`	`remove_lines_by_objects` removes `Line` instances that can be captions and elements inside figures or tables from the body.	`remove_by_obj`: Set this to remove captions based on bounding box information on objects. Defaults to False. `remove_by_line`: Set this to remove captions based on bounding box information on line objects. Defaults to False. `remove_by_rect`: Set this to remove captions based on bounding box information on rect objects. Defaults to False. `remove_by_curve`: Set this to remove captions based on bounding box information on curve objects. Defaults to False.	`load_docs`, `extract_lines`
`remove_figures_with_ml`	`remove_figures_with_ml` excludes `Line` instances inside figures from the body.	`object_bbox_offset`: Specify a threshold value to determine if a line element is within the margin area of an object. Defaults to 25.	`load_docs`, `load_objects_with_ml`, `extract_lines`
`remove_tables_with_ml`	`remove_tables_with_ml` excludes `Line` instances inside tables from the body.	`object_bbox_offset`: Specify a threshold value to determine if a line element is within the margin area of an object. Defaults to 25.	`load_docs`, `load_objects_with_ml`, `extract_lines`
`remove_equations_with_ml`	`remove_equations_with_ml` removes `Line` instances that are judged to be equations on the basis of the outputs from the ML-based bounding box detectors.	`equation_overlap_threshold`: Specify a threshold value to determine if a line is an equation. Defaults to 0.5.	`load_docs`, `load_objects_with_ml`, `extract_lines`
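The four `remove_meta` offsets can be read as margins around the page body. The Python sketch below illustrates that reading. It assumes a top-left origin with y increasing downward and offsets measured inward from each page edge; `in_margin` is a hypothetical helper and the coordinate convention used by appjsonify may differ.

```python
def in_margin(token_bbox, page_width, page_height,
              header_offset=80, footer_offset=80,
              left_side_offset=40, right_side_offset=40):
    """Return True if a token lies entirely inside a margin area.

    token_bbox is (x0, top, x1, bottom). Assumptions: origin at the top-left
    corner, y grows downward, and each offset is measured from its page edge.
    """
    x0, top, x1, bottom = token_bbox
    return (
        bottom <= header_offset                    # header area
        or top >= page_height - footer_offset      # footer area
        or x1 <= left_side_offset                  # left-hand margin
        or x0 >= page_width - right_side_offset    # right-hand margin
    )


# A US Letter page is 612 x 792 points.
print(in_margin((100, 30, 200, 42), 612, 792))    # True: running header
print(in_margin((100, 400, 200, 412), 612, 792))  # False: body text
print(in_margin((100, 750, 200, 762), 612, 792))  # True: footer / page number
```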

Sample usage

The following will list all paragraphs contained in a PDF document with section and subsection information, while filtering out non-body contents.

appjsonify /path/to/pdf/dir/or/path /path/to/output/dir \
    --pipeline load_docs remove_illegal_tokens remove_meta extract_lines extract_footnotes remove_lines_by_objects extract_paragraphs detect_sections concat_columns concat_pages dump_formatted_doc \
    --x_tolerance 1.2 \
    --header_offset 75 \
    --footer_offset 80 \
    --left_side_offset 40 \
    --right_side_offset 40 \
    --y_tolerance 6.0 \
    --footnote_offset 200 \
    --use_horizontal_lines \
    --object_bbox_offset 25 \
    --remove_by_line \
    --remove_by_curve \
    --x_offset 10 \
    --consider_font_size \
    --indent_offset 25 \
    --listing_offset 45 \
    --max_headline_len 10 \
    --headline_names Abstract References Limitations Appendix

Output-related modules

Module name	Description	Parameters	Prerequisites
`dump_doc_with_tokens`	`dump_doc_with_tokens` exports all `Token` instances in a PDF document.	`output_dir`: Specify an output directory.	`load_docs`
`dump_doc_with_lines`	`dump_doc_with_lines` exports all `Line` instances in a PDF document.	`output_dir`: Specify an output directory.	`load_docs`, `extract_lines`
`dump_doc_with_paragraphs`	`dump_doc_with_paragraphs` exports all `Paragraph` instances in a PDF document.	`output_dir`: Specify an output directory.	`load_docs`, `extract_lines`, `extract_paragraphs`
`dump_doc_with_sections`	`dump_doc_with_sections` exports all `Paragraph` instances in a PDF document together with their section information.	`output_dir`: Specify an output directory.	`load_docs`, `extract_lines`, `extract_paragraphs`, `detect_sections`
`dump_formatted_doc`	`dump_formatted_doc` exports all `Paragraph` instances in a PDF document together with their section information. This should be used after `concat_pages`.	`output_dir`: Specify an output directory.	`load_docs`, `extract_lines`, `extract_paragraphs`, `detect_sections`, `concat_pages`
