This library provides a generic interface to XML treebanks. An XML treebank is a collection of XML files representing parser output, such as dependency structures. Three types of XML treebanks are supported:
- Simple directory-based treebanks.
- Indexed treebanks, each consisting of a data file and an index file. The data file is a concatenation of chunks of data, such as XML documents or compressed derivation trees. The index file lists the name of each chunk, along with the chunk's offset and size, both encoded in base64 format.
- Treebanks stored in Berkeley DB XML databases.
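The indexed layout described above can be illustrated with a minimal sketch. It is not the library's implementation: `ChunkLocation`, `ChunkIndex`, and `readChunk` are hypothetical names, the data file is modelled as an in-memory string, and the offsets and sizes are assumed to have been decoded from base64 into integers already.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical index entry: where one chunk lives inside the data file.
struct ChunkLocation {
    std::size_t offset; // byte offset of the chunk in the data
    std::size_t size;   // length of the chunk in bytes
};

// Hypothetical in-memory index: chunk name -> location.
using ChunkIndex = std::map<std::string, ChunkLocation>;

// Return the chunk stored under `name`. Throws std::out_of_range if the
// name is unknown or the index entry points outside the data.
std::string readChunk(const std::string &data, const ChunkIndex &index,
                      const std::string &name)
{
    auto it = index.find(name);
    if (it == index.end())
        throw std::out_of_range("unknown chunk: " + name);

    const ChunkLocation &loc = it->second;
    if (loc.offset > data.size() || loc.size > data.size() - loc.offset)
        throw std::out_of_range("index entry exceeds data size: " + name);

    const std::string chunk = data.substr(loc.offset, loc.size);
    // The bounds were checked above, so the chunk has exactly the indexed size.
    assert(chunk.size() == loc.size);
    return chunk;
}
```

Because the index stores an offset and a size per name, a single chunk can be fetched without scanning the rest of the data file.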
Treebanks can be iterated by file or by query result.
This library evolved from the libcorpus library of the Alpino parser, adding query-based iteration, support for Berkeley DB XML treebanks, and a Qt-ish API.
Nearly all functionality is modelled as C++ classes using RAII, meaning that memory is managed by virtue of construction/destruction. Where necessary, errors are reported as exceptions. Language-specific wrappers can catch exceptions and translate them to the language's native error reporting method.
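As a rough illustration of the RAII and exception-translation pattern described above, the following sketch uses a stand-in resource class. `Handle` and `tryOpen` are hypothetical names that are not part of the alpinocorpus API, and no file is actually opened; the point is only to show how a wrapper might turn an exception into an error code while the destructor handles cleanup.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical resource standing in for an open treebank.
class Handle {
public:
    // Acquire the resource; report failure by throwing.
    explicit Handle(const std::string &path) : d_path(path)
    {
        if (path.empty())
            throw std::runtime_error("cannot open treebank: empty path");
        ++counter();
    }

    // Release the resource; runs automatically when the object leaves scope.
    ~Handle()
    {
        assert(counter() > 0);
        --counter();
    }

    Handle(const Handle &) = delete;
    Handle &operator=(const Handle &) = delete;

    // Number of Handle objects currently alive.
    static int openCount() { return counter(); }

private:
    static int &counter()
    {
        static int count = 0;
        return count;
    }

    std::string d_path;
};

// Wrapper-style entry point: returns 0 on success and -1 on failure,
// storing the message in `error` instead of letting the exception escape.
int tryOpen(const std::string &path, std::string &error)
{
    try {
        Handle handle(path);
        // ... work with the handle here ...
        return 0;
    } catch (const std::exception &e) {
        error = e.what();
        return -1;
    }
}
```

On both the success and the failure path, no explicit cleanup call is needed: the handle is released when it goes out of scope, or it was never acquired because the constructor threw.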
Documentation for this library can be obtained by running 'doxygen' in the root of the source archive.
The alpinocorpus command-line tools can be easily installed with Nix. On systems without Nix, follow the build instructions below.
On recent versions of Nix that support flakes, you can spawn a shell with the alpinocorpus utilities:
$ nix shell github:rug-compling/alpinocorpus
If you want the utilities to be permanently available, you can install them in your profile:
$ nix profile install github:rug-compling/alpinocorpus
On Nix versions that do not support flakes, you can use the following command to install alpinocorpus into your profile:
$ nix-env \
    -f https://github.com/rug-compling/alpinocorpus/archive/master.tar.gz \
    -i
Requirements
- A C++ compiler.
- Meson
- Boost 1.47.0.
- Berkeley DB XML 6.1.4 (with a small patch to correct a query processing bug, see #131).
- libxml2
- libxslt
If Berkeley DB XML is installed as a system package, alpinocorpus can be built as follows:
$ meson builddir
$ ninja -C builddir
# If you want to install the library:
$ ninja -C builddir install
If Berkeley DB XML, XQilla, Xerces-C, and Berkeley DB are installed as a bundle using the upstream Berkeley DB XML distribution, there are two options for building alpinocorpus.
Assuming that the DB XML bundle is installed in /opt/dbxml, the first option is to build alpinocorpus as follows:
$ meson builddir -D dbxml_bundle=/opt/dbxml
$ ninja -C builddir
# If you want to install the library:
$ ninja -C builddir install
This embeds the DB, Xerces-C, XQilla, and DB XML library paths in the alpinocorpus library, so the library and utilities can be used without any further environment setup.
The second option is to instead define some variables before building to point Meson to the headers and libraries:
$ export LD_LIBRARY_PATH=/opt/dbxml/lib
$ export LIBRARY_PATH=/opt/dbxml/lib
$ export CPATH=/opt/dbxml/include
$ meson builddir
$ ninja -C builddir
# If you want to install the library:
$ ninja -C builddir install
This does not embed the DB XML library paths into the alpinocorpus shared library. Consequently, LD_LIBRARY_PATH should always be set when using alpinocorpus (LIBRARY_PATH and CPATH only have to be set at build time).
Bindings for Python 2 and 3 are available from:
http://github.com/rug-compling/alpinocorpus-python
Bindings for Go are available from:
http://github.com/rug-compling/alpinocorpus-go
- Daniël de Kok
- Jelmer van der Linde
- Lars Buitinck
- Peter Kleiweg
Copyright 2010-2017 Daniël de Kok
Copyright 2010-2012 University of Groningen
This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with this library; if not, write to the
Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
Boston, MA 02110-1301 USA