GitHub - ThanGrove/OCRProcessing: Python scripts for processing Tibetan OCR

###################### OCRProcessing Scripts ##############################

Author: Than Grove
Date: Feb 8, 2013

These are scripts I am creating to process the XML output of the Tibetan OCR scanning of the NGB done by Zach. The OCR output comes as one XML file (with a .txt extension) per volume of a collection.

The goals of these scripts are:

1. To create a process that, given the catalog data, breaks up the individual volume files into text files, each containing the XML markup for one text. This process will assign each text a unique sequential id.
2. To create the individual bibl records for each text, each named with the text id.
3. To create an XML file that encodes the catalog hierarchy (cat->vol->text) in the TEI Tibbibl markup devised for the THL system and that references the text files and bibl files made above.
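As a rough illustration of the first goal, here is a minimal sketch of splitting one volume into per-text chunks with sequential ids. It is not the code in this repository: the `<volume>`/`<page n="...">` markup, the catalog fields (`title`, `start`, `end`), and the function name `split_volume` are all assumptions made up for the example, and the real OCR output and catalog format may look quite different.

```python
import xml.etree.ElementTree as ET

# Hypothetical volume markup; the real OCR output format may differ.
SAMPLE_VOLUME = """<volume n="1">
  <page n="1">text A, page 1</page>
  <page n="2">text A, page 2</page>
  <page n="3">text B, page 1</page>
</volume>"""

# Hypothetical catalog data: one entry per text with an inclusive page range.
SAMPLE_CATALOG = [
    {"title": "Text A", "start": 1, "end": 2},
    {"title": "Text B", "start": 3, "end": 3},
]


def split_volume(volume_xml, catalog, first_id=1):
    """Return {text_id: XML string}, one entry per catalog text."""
    root = ET.fromstring(volume_xml)
    # Index the pages by their (assumed) page-number attribute.
    pages = {int(p.get("n")): p for p in root.iter("page")}
    texts = {}
    for offset, entry in enumerate(catalog):
        text_id = first_id + offset  # unique sequential id
        text_el = ET.Element("text", {"id": str(text_id), "title": entry["title"]})
        for n in range(entry["start"], entry["end"] + 1):
            if n in pages:
                text_el.append(pages[n])
        texts[text_id] = ET.tostring(text_el, encoding="unicode")
    return texts


if __name__ == "__main__":
    for text_id, xml_string in split_volume(SAMPLE_VOLUME, SAMPLE_CATALOG).items():
        print(text_id, len(xml_string))
```

In a real run each returned string would be written to a file named after its text id, which is also the id the bibl record for that text would use.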

At the initial commit, not all of the functionality has been created yet, but what exists is contained in a single script, XMLCat.py.

This script creates three types of objects:

- XMLCat: reads in a simple XML file of a catalog, parses it, and holds data on (a) the volumes in the catalog and (b) the texts in the catalog.
- XMLText: an object for quickly accessing text information; it basically creates a dictionary from the XML.
- OCRVol: the object for controlling the OCR volume document, which is read in as XML.
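To give a feel for the "dictionary from the XML" idea behind XMLText, here is a small sketch. It is only an illustration under assumptions: the element names (`title`, `vol`, `startpage`, `endpage`) and the helper `text_to_dict` are hypothetical and are not taken from XMLCat.py.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog entry; the element and field names are assumptions.
SAMPLE_TEXT = """<text id="12">
  <title>Example title</title>
  <vol>3</vol>
  <startpage>45</startpage>
  <endpage>61</endpage>
</text>"""


def text_to_dict(text_xml):
    """Flatten one text element into a plain dict of strings."""
    el = ET.fromstring(text_xml)
    info = dict(el.attrib)  # start with the attributes, e.g. the id
    for child in el:
        # Each child element becomes a key; its text becomes the value.
        info[child.tag] = (child.text or "").strip()
    return info


if __name__ == "__main__":
    info = text_to_dict(SAMPLE_TEXT)
    print(info["title"], info["vol"])
```

A dictionary like this makes lookups such as `info["startpage"]` cheap, at the cost of keeping every value as a string until it is converted.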

NOTE: I am a newbie to Python, so any feedback on coding style, easier ways to do things, etc. is appreciated.

