Introduction

librarian is a tool for managing a library of documents and other media. It is typically used for documents (such as PDF, epub, etc.), video files, and webpage archives, though it will work with any file (plain text or binary) or directory. Its primary goals are to:

  1. standardize file names and make the naming pattern declarative and easy to change,
  2. make document hierarchical organization declarative and compatible with a tags-based approach to classifying resources,
  3. provide a persistent path for each resource,
  4. be exportable to a bibtex file format (and potentially others),
  5. and support extensive metainformation search capabilities.

These goals and librarian’s means of accomplishing them are elaborated below.

Basic Concepts

TODO config file syntax needs updating.

librarian works on a directory conforming to a basic structure, which it calls a “library”. Let’s assume this library is stored in a directory named library, though it can be given any valid directory name. A library’s structure should conform to:

library
├── catalog.json
└── resources

catalog.json is a JSON configuration file where librarian looks for declarative information about the library’s resources.

resources is a directory containing a flat hierarchy of files and directories. Each file or directory in resources is called a resource, and one resource cannot contain another. Some resources (e.g., archived web pages) are directories themselves and may also contain directories. That’s fine. A directory and all its contained files (traversed recursively) are considered a single resource and in many respects will be treated identically to a single-file document.

The catalog.json file contains 5 major sections. A simple example file is shown below.

{
    "tags": [
        {
            "name": "math",
            "subtags": [
                {
                    "name": "calculus",
                    "subtags": null
                },
                {
                    "name": "algebra",
                    "subtags": null
                }
            ]
        },
        {
            "name": "science",
            "subtags": [
                {
                    "name": "physics",
                    "subtags": [
                        {
                            "name": "quantum mechanics",
                            "subtags": null
                        }
                    ]
                },
                {
                    "name": "chemistry",
                    "subtags": null
                },
                {
                    "name": "biology",
                    "subtags": null
                }
            ]
        },
        {
            "name": "engineering",
            "subtags": [
                {
                    "name": "computing",
                    "subtags": [
                        {
                            "name": "algorithms",
                            "subtags": null
                        }
                    ]
                },
                {
                    "name": "electronics",
                    "subtags": null
                }
            ]
        }
    ],

    "resource_types": [
        {
            "bibtex": "TechReport",
            "name": "application note"
        },
        {
            "bibtex": "Article",
            "name": "article"
        },
        {
            "bibtex": "Book",
            "name": "book"
        },
        {
            "bibtex": "Manual",
            "name": "datasheet"
        },
        {
            "bibtex": "Manual",
            "name": "manual"
        },
        {
            "bibtex": "Miscellaneous",
            "name": "presentation"
        },
        {
            "bibtex": "Manual",
            "name": "standard"
        },
        {
            "bibtex": "Book",
            "name": "textbook"
        },
        {
            "bibtex": "Online",
            "name": "website"
        }
    ],

    "document_types": [
        {
            "name": "PDF",
            "extension": "pdf"
        },
        {
            "name": "website",
            "extension": ""
        }
    ],

    "instances": [
        {
            "name": "primary",
            "filter": "",
            "file_name_pattern": "@title@ (@authors[0]:last@, @edition@e - @year@).@extension@",
            "directory_name_space_delimeter": "_",
            "instantiate_tags": "primary"
        },
        {
            "name": "deduplicating",
            "filter": {
                "size": "< 500",
                "extension": "pdf",
                "tags": "*"
            },
            "file_name_pattern": "@title@ (@authors[0]:last@, @edition@e - @year@).@extension@",
            "directory_name_space_delimeter": " ",
            "instantiate_tags": "all"
        }
    ],

    "resources": [
        {
            "title": "Microelectronic Circuits",
            "authors": [
                {
                    "last": "Sedra",
                    "middle": "S.",
                    "first": "Adel"
                },
                {
                    "last": "Smith",
                    "middle": "C.",
                    "first": "Kenneth"
                }
            ],
            "date": {
                "day": 0,
                "month": 0,
                "year": 2014
            },
            "edition": 7,
            "version": null,
            "publisher": "Oxford University Press",
            "organization": "organization",
            "tags": [ "electronics" ],
            "checksum": "1f41a02ac620f0388a8e40454b48f67137820dcb",
            "historical_checksums": [
                "1f41a02ac620f0388a8e40454b48f67137820dcb"
            ],
            "document_type": "PDF",
            "resource_type": "textbook"
        },
        {
            "title": "BFG591",
            "authors": [
                {
                    "last": "name",
                    "middle": "name",
                    "first": "name"
                }
            ],
            "date": {
                "day": 4,
                "month": 9,
                "year": 1995
            },
            "edition": 0,
            "version": "version",
            "publisher": "publisher",
            "organization": "NXP Semiconductors",
            "tags": [ "electronics" ],
            "checksum": "0e7cddd8f41639bc486c9d95843ceb9db8c06299",
            "historical_checksums": [
                "0e7cddd8f41639bc486c9d95843ceb9db8c06299"
            ],
            "document_type": "PDF",
            "resource_type": "datasheet"
        }
    ]
}

The first section is a hierarchy of tags. Zero or more tags are associated with each resource. librarian uses this hierarchy along with the tags associated with each resource to construct hierarchical directories of library files, called “instances”. librarian can create an instance anywhere within the filesystem. In an instance, the tag hierarchy is translated into a directory hierarchy with each directory getting the name of a tag. For example, in the catalog above, one of the top level instance directories would be “math”, with subdirectories of “calculus” and “algebra”.

To a first approximation, a resource is placed at each location where one of its tags appears in the hierarchy. There are several important exceptions to this, and other qualifiers worth mentioning. First, a resource is only placed at a tag location if it does not also have another tag that is a subtag of this tag. So, in our catalog above, if a resource is associated with “math” and “calculus”, it will only appear under “calculus”. If, on the other hand, it is associated with tags “math”, “calculus”, and “algebra”, it will appear under “calculus” and “algebra”. Additionally, the instance configuration can further limit placement. The “instantiate_tags” key can take values of “primary” and “all”. If “all” is given, the resource will be placed at the directory locations for all tags. If “primary” is given, the resource will only be placed at the location of the first tag. Instances can also filter resources based on metainformation, such as type of resource and file size. Finally, resources are always hardlinked to each instance location in order to avoid storage waste on filesystems lacking built-in deduplication facilities.
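The subtag rule above can be sketched in Rust. This is a minimal illustration under my own assumptions, not librarian’s actual code: the Tag type and the placements, has_tagged_descendant, and math_hierarchy functions are hypothetical names, only the “all” behavior of “instantiate_tags” is modelled, and “subtag” is taken to mean any descendant rather than only a direct child.

```rust
/// A node in the tag hierarchy. Hypothetical type mirroring the "tags"
/// section of catalog.json.
struct Tag {
    name: String,
    subtags: Vec<Tag>,
}

impl Tag {
    fn new(name: &str, subtags: Vec<Tag>) -> Tag {
        Tag { name: name.to_string(), subtags }
    }

    /// True if any descendant of this tag is one of the resource's tags.
    fn has_tagged_descendant(&self, resource_tags: &[&str]) -> bool {
        self.subtags.iter().any(|t| {
            resource_tags.contains(&t.name.as_str()) || t.has_tagged_descendant(resource_tags)
        })
    }
}

/// Append to `out` every directory (relative to the instance root) where a
/// resource carrying `resource_tags` would be placed.
fn placements(tags: &[Tag], resource_tags: &[&str], prefix: &str, out: &mut Vec<String>) {
    for tag in tags {
        let path = if prefix.is_empty() {
            tag.name.clone()
        } else {
            format!("{}/{}", prefix, tag.name)
        };
        // Place the resource here only if it carries this tag and none of
        // this tag's descendants.
        let tagged = resource_tags.contains(&tag.name.as_str());
        if tagged && !tag.has_tagged_descendant(resource_tags) {
            out.push(path.clone());
        }
        placements(&tag.subtags, resource_tags, &path, out);
    }
}

/// The "math" branch of the example catalog.
fn math_hierarchy() -> Vec<Tag> {
    vec![Tag::new(
        "math",
        vec![Tag::new("calculus", vec![]), Tag::new("algebra", vec![])],
    )]
}
```

With the “math” branch of the example catalog, a resource tagged “math” and “calculus” lands only in math/calculus, while one tagged “math”, “calculus”, and “algebra” lands in both math/calculus and math/algebra.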

Instances also provide declarative file and directory naming syntaxes. These are specified in the “file_name_pattern” and “directory_name_space_delimeter” keys. This permits easy migration between file naming patterns and directory word separators.
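To make the naming pattern concrete, here is a rough Rust sketch of how a pattern could be expanded. It assumes every placeholder is written as @key@ and is looked up verbatim in a map, so “authors[0]:last” is simply a key here rather than an indexed field access. The expand_pattern and delimit_directory_name functions are hypothetical and are not taken from librarian’s source.

```rust
use std::collections::HashMap;

/// Expand every `@key@` placeholder in `pattern` using `fields`.
/// Returns `None` if a placeholder is unterminated or has no value.
fn expand_pattern(pattern: &str, fields: &HashMap<&str, String>) -> Option<String> {
    let mut out = String::new();
    let mut rest = pattern;
    while let Some(start) = rest.find('@') {
        // Copy the literal text that precedes the placeholder.
        out.push_str(&rest[..start]);
        let after = &rest[start + 1..];
        // The closing '@' ends the placeholder key.
        let end = after.find('@')?;
        let key = &after[..end];
        out.push_str(fields.get(key)?);
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Some(out)
}

/// Replace spaces in a directory name with the configured delimiter.
fn delimit_directory_name(name: &str, delimiter: &str) -> String {
    name.replace(' ', delimiter)
}
```

Given the first resource in the example catalog, the “primary” pattern would expand to “Microelectronic Circuits (Sedra, 7e - 2014).pdf”, and a “_” delimiter would turn the tag directory “quantum mechanics” into “quantum_mechanics”.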

The “resource_types” section enumerates zero or more resource types and associates each type with a BibTeX type. This information is used when generating BibTeX files.

The “document_types” section specifies document types and associates each with a file extension that can be used as part of the file naming in an instance.

The final section, “resources”, is where all the resources and their metainformation are enumerated. When a new resource is placed in the “resources” directory, we can use librarian to “catalog” that resource. Cataloging performs several functions. First, it iterates through each resource and computes a (SHA-1) checksum of that resource’s contents (but not its name or position in the filesystem). If a resource is a directory, librarian computes a checksum of the full, recursive contents. Again, this checksum is computed relative to the directory in which it is stored. This makes it trivial to move a library without upsetting checksum values. librarian then looks through the existing catalog file and creates a new entry template for each new resource (one without an existing entry). It finds existing entries by comparing the resource’s file name to the first historical checksum value of each entry. librarian then renames the new resource to the checksum it computed for it. When the contents of a resource change, librarian updates its checksum and appends the new checksum to “historical_checksums”. It does not, however, rename the resource. This is because one of the principal goals of librarian is to provide persistent resource naming (for at least one copy; resource names within instances will obviously change). librarian will also delete all but one copy of a resource (as indicated by its checksum).
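The bookkeeping part of cataloging can be sketched as follows. This is only an illustration under stated assumptions: the Entry struct is a hypothetical stand-in that keeps just the two checksum fields, catalog_resource is a made-up name, and the checksum is passed in as a string rather than computed, since SHA-1 is not part of Rust’s standard library.

```rust
/// Minimal stand-in for a catalog entry (hypothetical; real entries carry
/// many more fields).
struct Entry {
    checksum: String,
    historical_checksums: Vec<String>,
}

/// Update `entries` for one resource and return the name the resource
/// should have afterwards.
///
/// `file_name` is the resource's current name in the resources directory
/// and `computed` is the checksum just computed from its contents.
fn catalog_resource(entries: &mut Vec<Entry>, file_name: &str, computed: &str) -> String {
    // A resource already has an entry if its file name equals the first
    // historical checksum of that entry.
    let existing = entries
        .iter()
        .position(|e| e.historical_checksums.first().map(|s| s.as_str()) == Some(file_name));

    match existing {
        Some(i) => {
            let entry = &mut entries[i];
            if entry.checksum != computed {
                // Contents changed: record the new checksum, keep the name.
                entry.checksum = computed.to_string();
                entry.historical_checksums.push(computed.to_string());
            }
            file_name.to_string()
        }
        None => {
            // New resource: create an entry template and rename the
            // resource to its checksum.
            entries.push(Entry {
                checksum: computed.to_string(),
                historical_checksums: vec![computed.to_string()],
            });
            computed.to_string()
        }
    }
}
```

Because the lookup uses the first historical checksum, the name assigned when a resource is first cataloged stays valid no matter how many times its contents change.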

Some of the resource fields are required, but most are optional. Keep in mind that while librarian is fine with null fields, BibTeX may not be.

There are very good existing tools for searching file names within a hierarchy. librarian will not duplicate this functionality. However, it will provide a rich syntax for querying resources within the “resources” directory, which are otherwise very inconveniently named for normal searching strategies. The syntax for this is not yet decided, but it will include regex (within limitation) and other convenient searches (e.g., return matches for resources whose metainformation contains all of the words in a search query).

Catalog

Resources

Data Fields

An important consideration when designing librarian was which metadata should be stored for each resource. Unfortunately, several of librarian’s features and implementation details preclude leaving this decision up to the user. The difficulty with defining these fields myself is that I don’t want to place artificial limits on the use cases for this tool. In particular, it would have been easy to have this tool accommodate my own needs and to inadvertently neglect those of others.

My solution to this problem is to use the data fields specified by BibLaTeX. BibLaTeX and its predecessor BibTeX have existed for a long time and have seen extensive use. It is therefore reasonable to assume that BibLaTeX’s fields accommodate most use cases. Modelling the data fields on BibLaTeX has the additional benefit that it makes exporting a BibLaTeX file trivial. And, since I anticipate some overlap in the user base of these two tools, many users will not need to learn a new set of data fields.

I intentionally stated that librarian’s data fields are based off those of BibLaTeX, not that they are identical. This is because librarian omits some of BibLaTeX’s fields and makes small changes in the naming of others. The essential reason for this is that these tools serve different purposes: librarian is a resource management tool, whereas BibLaTeX is a tool for generating a bibliography from a collection of resources. For example, BibLaTeX has some fields that are related to bibliographic styling and are therefore not applicable to librarian. One might argue that it could be useful to store these fields anyway to support BibLaTeX export. However, I expect that the value of these fields could change between different instances of a bibliography and therefore it is not useful and even somewhat counterproductive to store these values in a central catalog.

It is also important to note that some data fields have slightly different meanings when used in librarian than the meanings they are assigned in BibLaTeX. This, again, stems from the fact that librarian is designed for resource management, not bibliographic generation. One example of this is the “pages” field. In BibLaTeX this defines the pages that are relevant to the citation, whereas in librarian it signifies the pages contained in the resource. This might occur, for example, if a scan of a textbook only contains a subset of the textbook’s entire content.

The following table documents all BibLaTeX data fields and specifies whether they are included in librarian, omitted, or modified. The justification column states why the field was omitted or what it was modified to. A justification provided for an included field describes a field that carries a slightly different meaning in librarian than it does in BibLaTeX.

| field | (i)ncluded / (o)mitted / (m)odified | justification |
|-------+-------------------------------------+---------------|
| abstract | o | This is not used by bibliographic backends and I don’t see the need for it, though it’s not thematically inconsistent with the ideas of librarian. I may add support for it in the future if there is interest. |
| addendum | o | Related to bibliographic styling. |
| afterword | i | |
| annotation | o | Related to bibliographic styling. |
| annotator | i | |
| author | i | |
| authortype | o | Unused by standard bibliographic backends. |
| bookauthor | i | |
| bookpagination | i | |
| booksubtitle | i | |
| booktitle | i | |
| booktitleaddon | o | Title addons appear to be more about bibliographic style than about content distinctions. Therefore, all addon fields are omitted. |
| chapter | i | If the stored resource is only one chapter from a larger work, this specifies the stored chapter. This differs from the meaning given in BibLaTeX where it signifies the cited chapter. Nonetheless, it will be passed to BibLaTeX. |
| commentator | i | |
| date | i | |
| doi | i | |
| edition | i | |
| editor | i | |
| editora | i | |
| editorb | i | |
| editorc | i | |
| editortype | i | |
| editoratype | i | |
| editorbtype | i | |
| editorctype | i | |
| eid | i | |
| entrysubtype | o | This field is auto-populated from the “content_type”. |
| eprint | i | |
| eprintclass | i | |
| eprinttype | i | |
| eventdate | i | |
| eventtitle | i | |
| eventtitleaddon | o | See justification for “booktitleaddon”. |
| file | o | This is automatically populated by librarian. |
| foreword | i | |
| holder | i | |
| howpublished | i | |
| indextitle | o | Related to bibliographic styling. |
| institution | o | I don’t understand the difference between organization and institution. Additionally, there are no BibLaTeX entries that use both. So, librarian only supports organization, and in the BibLaTeX export institution and organization are both populated with the value from organization. |
| introduction | i | |
| isan | i | |
| isbn | i | |
| ismn | i | |
| isrn | i | |
| issn | i | |
| issue | i | |
| issuesubtitle | i | |
| issuetitle | i | |
| issuetitleaddon | o | See justification for “booktitleaddon”. |
| iswc | i | |
| journalsubtitle | i | |
| journaltitle | i | |
| journaltitleaddon | o | See justification for “booktitleaddon”. |
| label | o | Related to bibliographic styling. |
| language | i | |
| library | o | It doesn’t seem like this field relates to the resource itself, but rather how/where the resource was acquired. |
| location | i | |
| mainsubtitle | i | |
| maintitle | i | |
| maintitleaddon | o | See justification for “booktitleaddon”. |
| month | o | This information should be recorded in “date”. |
| nameaddon | o | See justification for “booktitleaddon”. |
| note | i | This is included because it may be useful to store additional information about the resource. |
| number | i | |
| organization | i | |
| origdate | i | |
| origlanguage | i | |
| origlocation | i | |
| origpublisher | i | |
| origtitle | i | |
| pages | i | Like chapter, this means the pages of the resource, not simply those cited. Also like chapter, it will be passed to BibLaTeX. |
| pagetotal | i | |
| pagination | i | |
| part | i | |
| publisher | i | |
| pubstate | i | |
| reprinttitle | o | I don’t understand when you’d want to use this over title. Additionally, it’s ignored by the standard bibliographic styles. |
| series | i | |
| shortauthor | o | Related to bibliographic styling. |
| shorteditor | o | Related to bibliographic styling. |
| shorthand | o | Related to bibliographic styling. |
| shorthandintro | o | Related to bibliographic styling. |
| shortjournal | o | Related to bibliographic styling. |
| shortseries | o | Related to bibliographic styling. |
| shorttitle | o | Related to bibliographic styling. |
| subtitle | i | |
| title | i | |
| titleaddon | o | See justification for “booktitleaddon”. |
| translator | i | |
| type | o | Accommodated by “content_type”. |
| url | i | |
| urldate | i | |
| venue | i | |
| version | i | |
| volume | i | |
| volumes | i | |
| year | o | This information should be recorded in “date”. |
None of the special fields are supported.

FAQ

If resource data fields are inherited from BibLaTeX, why use JSON instead of BibLaTeX for the catalog?

The BibLaTeX format does not support features provided by librarian. For example, I cannot think of a way to provide the tagging and hierarchical instantiation features provided by librarian. The separate format also gives us the liberty to add new features in the future that the BibLaTeX format would not support.

The last, and least compelling, reason is that there are existing robust and fast solutions for serializing/deserializing JSON, whereas I don’t know of similar solutions in Rust for BibTeX/BibLaTeX. But, if it weren’t for the previous explanations I probably would have created one for the Serde framework.

Cataloging

Cataloging refers to the process of:

  1. adding new resources to the catalog,
  2. removing deleted resources from the catalog,
  3. updating the checksum of a resource when its content changes, and
  4. formatting the catalog.

cache file

Librarian uses SHA-1 checksums of each resource to identify the content of that resource and to determine when that content changes. Moreover, it conservatively uses every byte of content in the resource to compute the checksum rather than some subset of the content. Reading all resource bytes and computing a checksum from them is quite expensive and can result in long cataloging times, especially for large resource collections.

To address these performance issues, librarian maintains a cache for each library that records the last time the resource’s checksum was verified. It can use this information to only compute the checksum of resources that have been modified (as reported by the operating system) since the resource’s checksum was last verified. This results in dramatic performance improvements for cataloging and is thus enabled by default. Moreover, while the shortcut is not foolproof, it should produce correct results under most circumstances. It is possible to ignore the cache while cataloging, and it may make sense to do this on occasion in order to ensure the continued validity of the cache. Also, while you can choose to ignore the cache, the cache timestamps will still be updated. Therefore, if a cache is somehow invalidated, cataloging while ignoring the cache will return the cache to a valid state.
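The cache decision described above boils down to one comparison per resource. A minimal Rust sketch, assuming the cache stores one “last verified” timestamp per resource and that the modification time comes from the operating system (e.g., via std::fs::Metadata::modified). The needs_checksum function and its parameters are hypothetical names.

```rust
use std::time::SystemTime;

/// Decide whether a resource's checksum must be recomputed.
///
/// `modified` is the modification time reported by the operating system,
/// `last_verified` is the cached time of the last checksum verification
/// (`None` if the resource has no cache record), and `ignore_cache` forces
/// recomputation.
fn needs_checksum(
    modified: SystemTime,
    last_verified: Option<SystemTime>,
    ignore_cache: bool,
) -> bool {
    if ignore_cache {
        return true;
    }
    match last_verified {
        // Recompute only if the resource changed after the last verification.
        Some(verified) => modified > verified,
        // No cache record: the checksum has never been verified.
        None => true,
    }
}
```

Whenever this returns true, the caller would recompute the checksum and then store the current time as the new verification time, which is why cataloging with the cache ignored also repairs the cache.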

Finally, librarian always employs UTC-aware timestamps, so (assuming your computer time is properly synchronized to UTC time) the cache will not be invalidated by a change in location.

why not include the verification time in the catalog itself?

This was a bit of a debate for me, but ultimately I decided to maintain a separate cache rather than to include the information within the catalog file. I did this for two primary reasons. The first is that the catalog is intended to store metadata relevant to the end user. That is, the catalog is designed as much for the end user as for the librarian program that processes and modifies it. In my opinion, the last verification time of a checksum does not seem like user-relevant information. Additionally, I expect that some users will version-control their catalog. Recording this information has the potential to create a lot of “noise” in the version-control history.

The primary motivation for me not to use a separate cache file is that I hate it when tools unnecessarily pollute your directories with files. Ultimately, a single cache file in the library directory seemed to me like a lower cost than the result of including the information in the catalog.

Another question that might come up is why I chose to store this cache file in the library directory rather than under ~/.config/librarian. One of my goals for librarian is that you should be free to move around your libraries without affecting the function of the tool. It was not immediately apparent to me how to accomplish this without the cache being in the library directory. Another motivating factor is that the cache is human-readable (it’s also JSON) and it might be useful to version-control it. Maintaining it within the library directory makes this possible.

Configuration File

authors

I think probably the best syntax for this is:

  • “last”
  • “first last”
  • “first middle last”

I think it’s probably ok to not permit only specifying the first name, etc.
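A small Rust sketch of parsing the three proposed forms. The Author struct and parse_author function are hypothetical, and the sketch assumes name parts never contain spaces (so a multi-word last name such as “van Gogh” would be misparsed), which the final syntax may need to address.

```rust
#[derive(Debug, PartialEq)]
struct Author {
    first: Option<String>,
    middle: Option<String>,
    last: String,
}

/// Parse "last", "first last", or "first middle last".
/// Any other number of whitespace-separated parts is rejected.
fn parse_author(s: &str) -> Option<Author> {
    let parts: Vec<&str> = s.split_whitespace().collect();
    match parts.as_slice() {
        [last] => Some(Author {
            first: None,
            middle: None,
            last: last.to_string(),
        }),
        [first, last] => Some(Author {
            first: Some(first.to_string()),
            middle: None,
            last: last.to_string(),
        }),
        [first, middle, last] => Some(Author {
            first: Some(first.to_string()),
            middle: Some(middle.to_string()),
            last: last.to_string(),
        }),
        _ => None,
    }
}
```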

date

Use an ISO 8601 date (probably a subset of it). This should be easy to provide custom serialize/deserialize implementations for.

  • “YYYY” (e.g., “1988”) means the year.
  • “YYYY-MM” means the year and month (month must use 2 digits, e.g., 02).
  • “YYYY-MM-DD” means year, month, and day.
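The three date forms above are easy to parse by hand. The following Rust sketch is one possible approach, not the planned implementation: the Date struct and parse_date function are hypothetical names, and the day is only range-checked against 1 to 31 rather than against the length of the given month.

```rust
#[derive(Debug, PartialEq)]
struct Date {
    year: u16,
    month: Option<u8>,
    day: Option<u8>,
}

/// Parse "YYYY", "YYYY-MM", or "YYYY-MM-DD".
fn parse_date(s: &str) -> Option<Date> {
    let parts: Vec<&str> = s.split('-').collect();
    if parts.len() > 3 {
        return None;
    }
    // Enforce fixed widths: 4-digit year, 2-digit month, 2-digit day.
    let widths = [4usize, 2, 2];
    for (part, width) in parts.iter().zip(widths.iter()) {
        if part.len() != *width || !part.bytes().all(|b| b.is_ascii_digit()) {
            return None;
        }
    }
    let year: u16 = parts[0].parse().ok()?;
    let month: Option<u8> = match parts.get(1) {
        Some(m) => Some(m.parse().ok()?),
        None => None,
    };
    let day: Option<u8> = match parts.get(2) {
        Some(d) => Some(d.parse().ok()?),
        None => None,
    };
    // Reject out-of-range months and days.
    if let Some(m) = month {
        if !(1..=12).contains(&m) {
            return None;
        }
    }
    if let Some(d) = day {
        if !(1..=31).contains(&d) {
            return None;
        }
    }
    Some(Date { year, month, day })
}
```

Missing components stay as None, so “1988” and “1988-01” remain distinguishable after parsing.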

How should time syntax work? Check the ISO 8601 syntax. But, probably:

  • “YYYY-MM-DD HH” (hour is 0 to 23, obviously)
  • “YYYY-MM-DD HH:MM”
  • “YYYY-MM-DD HH:MM:SS”
  • “YYYY-MM-DD HH:MM:SS mS:uS:nS:pS” (I don’t know about this, and how much should I support?)

MIME type

MIME type should be “type/subtype”.

Searching

TODO I probably can’t use the quotes as they’re used below (e.g., r”something” probably won’t work) since this won’t work with argument parsing and bash input. Maybe use single quotes? Or, choose another syntax. Can also use clap raw, but this would require search goes after --.

Librarian provides a rich query syntax for retrieving resource metadata. A simple search has the syntax

librarian search string

This will return resource metadata as JSON if “string” fuzzy matches any of the resource fields. A string search with a space must be quoted. For example,

librarian search "some string"

A field qualifier can be prepended to a query string to restrict the match to the corresponding resource field. The field qualifier uses the syntax field:query. For example,

librarian search title:"some title"

would return a resource if the title matches “some title”.

Librarian assumes fuzzy matching by default, but regular expression and exact matching are also supported. An exact string match uses the syntax e"exact" and a regular expression string match uses the syntax r"regex".

The values of some fields (e.g., tags) are arrays. Librarian handles this by matching each element of the array individually. For example,

librarian search tags:electronics

would return a resource if one of its tags matches “electronics”.

Multiple queries can be combined to specify that librarian should match the queries using some combination of “and” and “or”. “And” combinations are made by separating the queries with a space, while “or” combinations use a comma.

For example,

librarian search title:micro tags:electronics

places an implicit and between “title:micro” and “tags:electronics”. Therefore, a resource will be returned if title matches micro and at least one of the tags matches electronics.

To borrow from the terminology of math and computer science, “or” has higher precedence than “and”, so that

librarian search tags:electronics title:"phase noise",title:oscillator

would be treated logically like tags:electronics AND (title:"phase noise" OR title:oscillator).

You can specify that a resource must not match a query by prefixing it with “-”. This precedes the field specifier if there is one. E.g.,

librarian search -tags:electronics

We are free to mix matching algorithms (e.g., regex, exact, and fuzzy) in multi-match queries. Therefore, a previous query could have been instead

librarian search tags:electronics title:r"phase[\- ]noise",title:oscillator

(TODO verify that regex query is syntactically correct).

Finally, parentheses can be used to override operator precedence and to negate combinations of matches. For example,

librarian search -(tags:electronics title:"phase noise"),title:oscillator

would return a resource if its title matched “oscillator”, or if the resource did not match both tags:electronics and title:"phase noise" at once.

implementation

A field-unqualified match is identical to an implicit OR of the same match applied to every field. That is, electronics is the same as title:electronics,authors:electronics,.... In the parse tree we should probably replace the former with the latter since it’s easier to process.

grammar

Consider the highest-level element as a “query”.

<query> ::= <match>
        | <combination>

<combination> ::= <match> <operator> <match>
              | <match> <operator> <opt-neg-lparen> <combination> ")"
              | <opt-neg-lparen> <combination> ")" <operator> <match>
              | <opt-neg-lparen> <combination> ")" <operator> <opt-neg-lparen> <combination> ")"

<opt-neg-lparen> ::= "("
                 | "-" "("

<operator> ::= <whitespace>
           | <opt-whitespace> "," <opt-whitespace>

<whitespace> ::= " " <opt-whitespace>
             | "\t" <opt-whitespace>

<opt-whitespace> ::= " " <opt-whitespace>
                 | "\t" <opt-whitespace>
                 | ""

<opt-neg-match> ::= <match>
                | "-" <match>

<match> ::= <string>
        | <field> ":" <string>

; TODO str needs clarification
<string> ::= str
         | \"str\"
         | <string-modifier> \"str\"

<string-modifier> ::= "r"
                  | "e"

<field> ::= "title"
        | "authors"
        | "date"
        | "edition"
        | "version"
        | "publisher"
        | "organization"
        | "tags"
        | "document_type"
        | "content_type"
        | "url"
        | "checksum"
        | "historical_checksums"

query parser

It probably makes sense to define a formal grammar and have some external library perform this step. The trick may be how to get it into the binary tree I want.

binary tree

A binary tree is a very natural data structure for this query language. Each leaf node contains a “match string”, a “match type”, a “field qualifier” and a “logical modifier”. The match string is a string to match against. For example, “electronics”, or “quantum mechanics”. The match type specifies how that string should be matched against the resource. For example, using fuzzy matching, or a regular expression. The field qualifier optionally restricts the match to a single resource field (otherwise, it is an implicit OR of all resource fields). The logical modifier can optionally negate the result of a match.

Each branch (i.e., non-leaf) node has its two children plus a “logical combiner”, which specifies how to combine two child nodes (i.e., with AND or OR).

Each query corresponds to a single binary tree in which every branch node has exactly two children. The resource matches the query if the root node evaluates to true. In general we do need to evaluate the child nodes in order to know the value of the root node. However, we don’t always need to evaluate all child nodes. For example, if a parent node uses an OR logical combiner and the first child evaluates to true, we do not need to evaluate the other child node.

To implement this, we must:

  • implement the data structure for each node (how do we handle the fact that leaf and branch nodes are different types?)
  • be able to evaluate whether a match evaluates to true or false given a node and a resource
  • be able to “reduce” a branch to a leaf node (this is obviously a recursive call from the root node)
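One way to handle leaf and branch nodes being different types is a single Rust enum with two variants. The sketch below is an assumption-laden illustration rather than a design decision: Node, MatchType, Combiner, Resource, leaf, and evaluate are hypothetical names, a resource is reduced to a map from field name to values, a case-insensitive substring test stands in for fuzzy matching, and regex matching is left out because it needs an external crate.

```rust
use std::collections::HashMap;

/// A resource reduced to field name -> values (hypothetical representation;
/// single-valued fields hold one element).
type Resource = HashMap<String, Vec<String>>;

#[derive(Clone, Copy)]
enum MatchType {
    Exact,
    /// Case-insensitive substring match, standing in for fuzzy matching.
    Substring,
}

#[derive(Clone, Copy)]
enum Combiner {
    And,
    Or,
}

/// Leaf and branch nodes are variants of one enum, so a tree is just a `Node`.
enum Node {
    Leaf {
        field: Option<String>,
        needle: String,
        match_type: MatchType,
        negate: bool,
    },
    Branch {
        combiner: Combiner,
        left: Box<Node>,
        right: Box<Node>,
    },
}

/// Convenience constructor for a substring-matching leaf.
fn leaf(field: Option<&str>, needle: &str, negate: bool) -> Node {
    Node::Leaf {
        field: field.map(|f| f.to_string()),
        needle: needle.to_string(),
        match_type: MatchType::Substring,
        negate,
    }
}

/// True if any element of `values` matches `needle`.
fn any_value_matches(values: &[String], needle: &str, match_type: MatchType) -> bool {
    values.iter().any(|v| match match_type {
        MatchType::Exact => v == needle,
        MatchType::Substring => v.to_lowercase().contains(&needle.to_lowercase()),
    })
}

/// Recursively reduce a tree to a boolean for one resource.
fn evaluate(node: &Node, resource: &Resource) -> bool {
    match node {
        Node::Leaf { field, needle, match_type, negate } => {
            let matched = match field {
                // Qualified: look only at the named field.
                Some(name) => resource
                    .get(name)
                    .map_or(false, |vs| any_value_matches(vs, needle, *match_type)),
                // Unqualified: implicit OR over every field.
                None => resource
                    .values()
                    .any(|vs| any_value_matches(vs, needle, *match_type)),
            };
            matched != *negate
        }
        // `&&` and `||` short-circuit, so the right child is skipped
        // whenever the left child already decides the result.
        Node::Branch { combiner, left, right } => match combiner {
            Combiner::And => evaluate(left, resource) && evaluate(right, resource),
            Combiner::Or => evaluate(left, resource) || evaluate(right, resource),
        },
    }
}
```

Rust’s short-circuiting boolean operators give the “skip the other child” optimization for free, and array-valued fields such as tags are covered by storing every field as a list of values.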

Tags

qualified tags

TODO I’m not sure if this is a good idea. It may be better to place files directly within the electronics and math hierarchies than in “general” subdirectories of them.

There may be instances in which we want a tag to be a qualification of another tag. For example, perhaps we want one file to appear under “electronics/general” (call this file1) and some other file to appear under “math/general” (call this file2). If we give file1 the tags [“electronics”, “general”] and file2 the tags [“math”, “general”], we’ll wind up with the directory structure

├── electronics
│   └── general
│       ├── file1
│       └── file2
└── math
    └── general
        ├── file1
        └── file2

which is not what we want. Instead, we want

├── electronics
│   └── general
│       └── file1
└── math
    └── general
        └── file2

To accomplish this, we can qualify a tag. So, instead of giving file1 the tags [“electronics”, “general”], we’d give it [“electronics:general”].
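If qualified tags are adopted, mapping one to a directory could be as simple as splitting on the colon. A tiny Rust sketch, assuming “:” never appears inside a tag name (qualified_tag_directory is a hypothetical helper):

```rust
use std::path::PathBuf;

/// Map a qualified tag such as "electronics:general" to the relative
/// directory it designates. An unqualified tag maps to a single component.
fn qualified_tag_directory(tag: &str) -> PathBuf {
    // Collecting path components into a `PathBuf` joins them with the
    // platform's separator.
    tag.split(':').collect()
}
```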

make tag hierarchy instance-specific

There should probably be a list of acceptable tags and then a tag hierarchy in each instance. It seems reasonable that someone might want different hierarchies for different instances.

Arguments

This section is out of date. In any event, it should probably be removed in favor of topical sections. Argument/subcommand information can be gleaned from the command help feature.

Subcommands

register

librarian register performs several tasks.

First, it iterates through all files and directories in resources. If that file does not have an entry in config.json (this is determined by checking if the file stem (file name minus extension) matches the first entry of "historical_checksums") it is added.

For files that do have an existing entry, librarian checks if the checksum still matches the checksum in config.json. If the checksum has changed, the config.json "checksum" field is set to the new checksum and that new checksum is also appended to "historical_checksums".

It should be clear that this satisfies librarian’s goal of persistent file naming, even with changes in file contents.

rename to update?

instantiate

librarian instantiate instantiates one or more instances from the configuration file. If no additional arguments are given, this instantiates all instances. All additional positional arguments after instantiate will be treated as instances to instantiate. More than one instance can be specified. If at least one instance is provided, no other instances will be instantiated.

info

Query info about a file (e.g., get author, title, etc.).

search

Get file from info. For example, you might type:

librarian search --title "Microelectronic Circuits"

and this would print the file path for a file matching those criteria.

There will be additional options for case insensitivity, regex, etc.

Options

directory

--directory or -d. Specifies the library directory. If the value is a relative path, it is relative to the current working directory. It is an absolute path if the value is an absolute path. If omitted, it defaults to the current working directory.

config

--config or -c. Config file path. This defaults to config.json relative to the specified directory (see directory) if omitted. If the value is a relative path it is relative to the specified directory. If the value is an absolute path, it is interpreted as an absolute path.

resources

--resources or -r. Resources directory path. This defaults to resources relative to the specified directory if omitted. If the value is a relative path it is relative to the specified directory. If the value is an absolute path, it is interpreted as an absolute path.
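The path rules for --config and --resources match the behavior of std::path::Path::join, which returns its argument unchanged when that argument is absolute. A short Rust sketch, with resolve_option as a hypothetical helper and Unix-style paths assumed in the example values:

```rust
use std::path::{Path, PathBuf};

/// Resolve an optional path option against the library directory.
///
/// `value` is what the user passed (if anything) and `default` is the
/// fallback name, e.g. "config.json" or "resources".
fn resolve_option(directory: &Path, value: Option<&str>, default: &str) -> PathBuf {
    // `join` appends a relative path to `directory` and replaces
    // `directory` entirely when the argument is absolute.
    directory.join(value.unwrap_or(default))
}
```

With a library directory of /home/user/library, omitting --config yields /home/user/library/config.json, passing other/cfg.json yields /home/user/library/other/cfg.json, and passing /etc/librarian.json yields that absolute path unchanged.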

File Naming

Standardized and declarative file names mean that you specify a pattern for the name of a file (e.g., title (author, edition - year).extension) and librarian will instantiate the corresponding file name for each file (and directory).

file name pattern construction using Rust functions

It would be useful to be able to call a user-defined Rust function on a string in the file name pattern. For example, @first_character(title)@ .... This would provide a lot more flexibility.

Bibliography Generation

librarian can automatically generate a BibTeX file for your library.

Sorting a Config File

librarian can sort a config file for you. This will sort each resource in the contents field in alphanumeric order.

Programming

API

passing around files

Before a file is opened, it should be passed around as a PathBuf. After it has been opened, it should be passed around as a std::fs::File.

Task List

initialize field values to information provided by the document

For example, with PDF use metadata.

provide a summary of changes after registering new resources

Something like:

New resources:
PDF 32000 Standard (v1.7, 2008).pdf -> 1da235fe14c82f0a1bcdb3cc309b7b714d881b8c

Modified resources:
(None)

Deleted catalog resources (orphans):
542b4e6da11c31dc94f81105583784a8ac365e0e (title: Oscillator design guide for STM8AF/AL/S and STM32 microcontrollers)

titles can have slashes, which should be replaced in instantiations

add a config file that records the location of the library so you don’t need to pass it when invoking librarian

should I support other checksum formats than sha1?

rename contents to resources

should “original resource” be renamed to “primary resource”

If so, we may want to change “clone resource” to “secondary resource”.

does anything need to be changed to handle other binary files such as firmware?

The current conception of this tool should technically work, but the question is whether the abstraction is still a nice one for binary files. For example, does the somewhat rigid field structure for resources (title, author, year, edition, publisher, etc.) work poorly for other kinds of files?

this tool is a natural way of more generally organizing content declaratively

use wget2 instead of wget

Task List Before I (Personally) Start Using This

This section is a personal note. It probably won’t be relevant to anyone else.

add an elisp package to interface with the librarian command line tool

open a file based on useful information

For instance, open a file by title. Practically, this probably means implementing some subset of the “search” subcommand. Then, adding an interactive elisp function to invoke it.

open an archived webpage

This isn’t really a blocker, since I don’t have a convenient way to do this currently anyway.

This should be an extension of the previous item. And, it’s not really a task for librarian. It’s more a task for the elisp function that invokes it.

If opening a file leads to a directory, then query the resource type. If it’s a webpage then get the HTML page with

find . -name "*.html"

open that, and then invoke shr-render-buffer.