This page describes every module available in appjsonify, with sample usage.
The Prerequisites column in each of the following tables lists the modules that must be executed before the target module in the pipeline.
Module name | Description | Parameters | Prerequisites |
---|---|---|---|
load_docs | load_docs loads tokens in given documents and adds them to each Document instance as Token instances. | | None |
load_objects_with_ml | load_objects_with_ml loads objects such as tables, figures, and captions, and adds them to each Page instance as its meta dictionary. | | load_docs |
The following will output all tokens contained in a PDF document as a JSON file.
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir \
--pipeline load_docs dump_doc_with_tokens \
--x_tolerance 1.2
Module name | Description | Parameters | Prerequisites |
---|---|---|---|
extract_lines | extract_lines forms Line instances from tokens. Each instance groups the tokens that belong to the same line. | | load_docs |
extract_footnotes | extract_footnotes extracts Line instances from the body that should be footnotes and saves them as supplementary information. | | load_docs, extract_lines |
extract_paragraphs | extract_paragraphs concatenates Line instances and forms a Paragraph instance when the conditions are met. | | load_docs, extract_lines |
extract_captions_with_ml | extract_captions_with_ml extracts Line instances that are highly likely to be captions using outputs from load_objects_with_ml. | | load_docs, load_objects_with_ml, extract_lines |
extract_footnotes_with_ml | extract_footnotes_with_ml extracts Line instances that are highly likely to be footnotes using outputs from load_objects_with_ml. | | load_docs, load_objects_with_ml, extract_lines |
The following will not only output all lines contained in a PDF document but also extract captions and footnotes as supplementary information.
Note that the following assumes the use of a GPU. If you do not have one, remove --detectron_device_mode cuda or set --detectron_device_mode cpu.
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir \
--pipeline load_docs load_objects_with_ml extract_lines extract_captions_with_ml extract_footnotes_with_ml dump_doc_with_lines \
--x_tolerance 1.2 \
--tablebank_threshold 0.9 \
--publaynet_threshold 0.9 \
--docbank_threshold 0.9 \
--detectron_device_mode cuda \
--y_tolerance 6.0 \
--caption_overlap_threshold 0.5 \
--table_start_ptn_str "Table [0-9]+:" \
--figure_start_ptn_str "Figure [0-9]+:" \
--preset_table_caption_pos below \
--preset_figure_caption_pos below \
--caption_assignment_threshold 75.0 \
--footnote_overlap_threshold 0.5
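If the same script has to run on machines with and without a GPU, the device mode can be chosen at run time instead of being hard-coded. The following is a minimal POSIX shell sketch, not part of appjsonify: detect_device is a hypothetical helper, and it assumes that a working nvidia-smi is a reasonable proxy for a usable CUDA device, which may not hold on every setup.

```shell
# Hypothetical helper: prints "cuda" if nvidia-smi exists and runs
# successfully, otherwise prints "cpu".
# Assumption: a working nvidia-smi implies a usable CUDA device.
detect_device() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    echo cuda
  else
    echo cpu
  fi
}

device=$(detect_device)

# Print the flag that would be passed to appjsonify (nothing is executed here).
echo "--detectron_device_mode $device"
```

In the command above, --detectron_device_mode cuda can then be replaced with --detectron_device_mode "$device".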
Module name | Description | Parameters | Prerequisites |
---|---|---|---|
detect_sections | detect_sections classifies each Paragraph as a section, subsection, or body. Results will be used for formatting. | | load_docs, extract_lines, extract_paragraphs |
tailor_references | tailor_references tailors reference paragraphs that are fragmented. | | load_docs, extract_lines, extract_paragraphs, detect_sections |
concat_columns | concat_columns concatenates Paragraph instances, each of which is in a different column, and forms a new Paragraph instance if they meet the criteria. | | load_docs, extract_lines, extract_paragraphs |
concat_pages | concat_pages concatenates Paragraph instances on different pages if they meet the conditions. | | load_docs, extract_lines, extract_paragraphs |
The following will list all paragraphs contained in a PDF document with section and subsection information.
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir \
--pipeline load_docs extract_lines extract_paragraphs detect_sections concat_pages dump_formatted_doc \
--x_tolerance 1.2 \
--y_tolerance 6.0 \
--x_offset 10 \
--consider_font_size \
--indent_offset 25 \
--listing_offset 45 \
--max_headline_len 10 \
--headline_names Abstract References Limitations Appendix
Module name | Description | Parameters | Prerequisites |
---|---|---|---|
remove_meta | remove_meta excludes header and footer Token instances from the body. | | load_docs |
remove_lines_by_objects | remove_lines_by_objects removes Line instances that can be captions and elements inside figures or tables from the body. | | load_docs, extract_lines |
remove_figures_with_ml | remove_figures_with_ml excludes Line instances inside figures from the body. | | load_docs, load_objects_with_ml, extract_lines |
remove_tables_with_ml | remove_tables_with_ml excludes Line instances inside tables from the body. | | load_docs, load_objects_with_ml, extract_lines |
remove_equations_with_ml | remove_equations_with_ml removes Line instances that are judged to be equations on the basis of the outputs from the ML-based bounding box detectors. | | load_docs, load_objects_with_ml, extract_lines |
The following will list all paragraphs contained in a PDF document with section and subsection information, while filtering out non-body content.
appjsonify /path/to/pdf/dir/or/path /path/to/output/dir \
--pipeline load_docs remove_illegal_tokens remove_meta extract_lines extract_footnotes remove_lines_by_objects extract_paragraphs detect_sections concat_columns concat_pages dump_formatted_doc \
--x_tolerance 1.2 \
--header_offset 75 \
--footer_offset 80 \
--left_side_offset 40 \
--right_side_offset 40 \
--y_tolerance 6.0 \
--footnote_offset 200 \
--use_horizontal_lines \
--object_bbox_offset 25 \
--remove_by_line \
--remove_by_curve \
--x_offset 10 \
--consider_font_size \
--indent_offset 25 \
--listing_offset 45 \
--max_headline_len 10 \
--headline_names Abstract References Limitations Appendix
Module name | Description | Parameters | Prerequisites |
---|---|---|---|
dump_doc_with_tokens | dump_doc_with_tokens exports all Token instances in a PDF document. | | load_docs |
dump_doc_with_lines | dump_doc_with_lines exports all Line instances in a PDF document. | | load_docs, extract_lines |
dump_doc_with_paragraphs | dump_doc_with_paragraphs exports all Paragraph instances in a PDF document. | | load_docs, extract_lines, extract_paragraphs |
dump_doc_with_sections | dump_doc_with_sections exports all Paragraph instances in a PDF document with section information. | | load_docs, extract_lines, extract_paragraphs, detect_sections |
dump_formatted_doc | dump_formatted_doc exports all Paragraph instances in a PDF document with section information. This should be used after concat_pages. | | load_docs, extract_lines, extract_paragraphs, detect_sections, concat_pages |
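
The prerequisites listed in the tables on this page can be checked mechanically before launching a long run. The following is a minimal POSIX shell sketch, not part of appjsonify: prereqs_of and check_pipeline are hypothetical helper names, and the prerequisite lists are copied from the tables on this page. Modules that are not documented here (such as remove_illegal_tokens) are treated as having no known prerequisites.

```shell
# Hypothetical helper: prints the prerequisites of a module, as listed in the
# tables on this page. Undocumented modules print an empty list.
prereqs_of() {
  case "$1" in
    load_objects_with_ml|extract_lines|remove_meta|dump_doc_with_tokens)
      echo "load_docs" ;;
    extract_footnotes|extract_paragraphs|remove_lines_by_objects|dump_doc_with_lines)
      echo "load_docs extract_lines" ;;
    extract_captions_with_ml|extract_footnotes_with_ml|remove_figures_with_ml|remove_tables_with_ml|remove_equations_with_ml)
      echo "load_docs load_objects_with_ml extract_lines" ;;
    detect_sections|concat_columns|concat_pages|dump_doc_with_paragraphs)
      echo "load_docs extract_lines extract_paragraphs" ;;
    tailor_references|dump_doc_with_sections)
      echo "load_docs extract_lines extract_paragraphs detect_sections" ;;
    dump_formatted_doc)
      echo "load_docs extract_lines extract_paragraphs detect_sections concat_pages" ;;
    *)
      echo "" ;;
  esac
}

# Hypothetical helper: takes the module names in pipeline order, prints one
# message per prerequisite that does not appear earlier in the pipeline, and
# returns 1 if any message was printed (0 otherwise).
# The body runs in a subshell so its variables do not leak to the caller.
check_pipeline() (
  seen=" "
  status=0
  for module in "$@"; do
    for req in $(prereqs_of "$module"); do
      case "$seen" in
        *" $req "*) ;;
        *)
          echo "$module: prerequisite $req must appear earlier in the pipeline"
          status=1 ;;
      esac
    done
    seen="$seen$module "
  done
  exit "$status"
)
```

For example, check_pipeline load_docs dump_doc_with_lines reports that extract_lines is missing and returns a non-zero status, while the pipelines used in the sample commands on this page pass the check.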