marc2bib is a Python package that allows to convert bibliographic records from MARC 21 format to BibTeX entries.
marc2bib is available on PyPI:
$ pip install marc2bib
Or for development:
$ git clone https://github.com/xymaxim/marc2bib
$ pip install -e marc2bib
The package works along with pymarc utilized to read MARC files. If you have not used it before, nothing to worry about. Let's read some data from a MARC file and convert it to a BibTeX entry:
>>> from pymarc import MARCReader
>>> from marc2bib import convert
>>> with open("file.mrc", "rb") as f:
... reader = MARCReader(f)
... record = next(reader) # read the first record
... print(convert(record)) # convert it to a BibTeX entry
...
@book{author2022,
author = {Author, Name},
. . .
}
And that is it!
The main function of this package is convert()
. It combines two
steps: (1) mapping MARC fields to BiBTex tags, map_tags()
,
and (2) converting the tags to BibTeX string,
tags_to_bibtex()
. However, instead of converting MARC data to
BibTeX string in one call, you can first map it to a dictionary of
BibTeX tags, for example, for inspection or post-processing (step 1):
>>> from marc2bib import map_tags
>>> tags = map_tags(record)
>>> print(tags["author"])
Author, Name
Then, you can convert these mapped tags to a BibTeX string (step 2):
>>> from marc2bib import tags_to_bibtex
>>> new_bibkey = tags["author"].split(",")[0] + tags["year"]
>>> # By the way, the indentation is customizable.
>>> bibtex = tags_to_bibtex(tags, bibkey=new_bibkey, indent=4)
>>> print(bibtex)
@book{Author2022,
author = {Author, Name},
. . .
}
Of course, the example below can be coded with convert()
function and the choice depends on your needs:
# The bibkey argument can be callable.
def new_bibkey(tags):
return tags["author"].split(",")[0] + tags["year"]
convert(record, bibkey=new_bibkey, indent=4)
To parse a value of BibTeX tags (fields), we use so-called
tag-functions which operate on one tag per function. Currently
marc2bib
fully supports book BibTeX entries—the tag-functions are
defined for the related required and optional tags. The user can
extend or override them easily:
from marc2bib import BOOK_REQ_TAGFUNCS
def title_title(record):
return BOOKS_REQ_TAGFUNCS["title"](record).title()
convert(record, tagfuncs={"title": title_title})
The returned tags can be either all (required and optional), only
required (default), or required with user-provided ones (include
argument):
# Return required tags and "pages".
convert(record, include=["pages"]) # or "all", "required"
A note: if you use tag-functions, no need to specify these tags for including separately.
While tag-functions work with parsing MARC 21 fields, post-hooks run at the end of translation of the parsed fields to BibTeX tags. There are default and user-defined post-hooks which execute in the presented order.
The hook's function may look as follows:
def hook(tag: str, value: str) -> str:
return do_something(value)
Every hook will be called with two arguments: the tag currently processing and its value.
See Cookbook section for examples on mastering hooks.
The default hooks include two hooks which execution can be controlled
via the corresponding arguments (in parentheses) of convert
and
map_tags
functions:
marc2bib.hooks.remove_isbd_punctuation_hook
(remove_punctuation
, default: True) — remove terminal periods and separating punctuation corresponding to MARC 21 format (see the below section for details).marc2bib.hooks.latexify_hook
(latexify
, default: True) — convert tag value to make it suitable for LaTeX. Currently, it escapes LaTeX special characters and normalizes number ranges by replacing hyphens with en-dashes.
After default hooks, the user-defined ones are executed. The
post_hooks
argument accepts a list of these hooks:
convert(record, post_hooks=[hook1, hook2])
In addition to the default hooks, the pre-defined hooks for some common cases are supplied with the package:
marc2bib.hooks.strip_outer_square_brackets_hook
— remove square brackets used to mark the additions made by cataloger;marc2bib.hooks.protect_uppercase_letters_hook
— enclose uppercase letters in curly braces to protect the case from changes.
In the MARC 21 format, the fields and subfields historically may contain and be separated by terminal periods and various punctuation marks [https://www.loc.gov/aba/pcc/documents/isbdmarc2016.pdf]. However, in BibTeX entries we do not need it. To clean up such punctuation we partially follow rules described in the link above (Appendix C). The initials, ordinal numbers and some common abbreviations from AACR2R [https://www.worldcat.org/title/847471922] (Appendix B) are kept.
For testing the package we use pytest. In order to run tests, check out the repository and type:
$ pytest
By default, it runs all tests excluding validation test to do quick testing. The validation test was added for testing the removal of ISBD punctuation on two NLM's test record sets, with some punctuation removed and not removed, from [https://www.loc.gov/aba/pcc/documents/test-records-punctuation.html]. For all tests, do:
$ pytest --runall
Sometimes you will need to apply a hook not for all tags. With
apply_not_for()
it is possible to make an existing hook
tag-conditional. Taking the conditional statement out of a hook could
be useful, for example, to (temporarily) exclude it from the
post-processing of certain record(s).
Let us illustrate it with the following example. Suppose a title of a
bibliographic work ends with a dot and you need to keep it. To do it,
we can turn off the default hook (remove_isbd_punctuation_hook
)
and instead make it applying later for all tags except title tag:
from marc2bib.hooks import apply_not_for
def hook(tag: str, value: str) -> str:
...
convert(record, post_hooks=[apply_not_for(hook, ["tag"])])
It is possible to customize hooks by adding keywords arguments to a
hook's function and passing them later with
functools.partial()
. Suppose a title of a bibliographic work ends
with an abbreviation and you need to keep a period. Here is an example
of customizing remove_isbd_punctuation_hook(tag, value, *,
abbreviations)
hook by providing an extended list of abbreviations:
from functools import partial
from marc2bib.core import COMMON_ABBREVIATIONS
from marc2bib.hooks import remove_isbd_punctuation_hook
abbreviations = COMMON_ABBREVIATIONS + ("abbrev.",)
remove_punctuation_call = partial(
remove_isbd_punctuation_hook, abbreviations=abbreviations
)
convert(record, remove_punctuation=False, post_hooks=[remove_punctuation_call])
As an alternative, especially when working interactively, you can just modify the translated value of title tag:
from marc2bib import map_tags, tags_to_bibtex
tags = map_tags(record)
tags["title"] = tags["title"] + "."
tags_to_bibtex(tags)
Thanks go to all the authors and contributors of pymarc package.