Mercury is a semantic-assisted, cross-text text labeling tool.
- semantic-assisted: when you select a text span, semantically related text segments will be highlighted -- so you don't have to eyeball through lengthy texts.
- cross-text: you are labeling text spans from two different texts.
Therefore, Mercury is very efficient for labeling NLP tasks that involve comparing spans across two lengthy documents, such as hallucination detection or factual consistency/faithfulness in RAG systems. Semantic assistance not only saves time and reduces fatigue but also helps avoid mistakes.
Currently, Mercury only supports labeling inconsistencies between the source and summary for summarization in RAG.
> [!NOTE]
> You need Python and Node.js. Mercury uses `sqlite-vec` to store and search embeddings.
- `pip3 install -r requirements.txt && python3 -m spacy download en_core_web_sm`
- If you don't have `pnpm` installed, please install it with `npm install -g pnpm` (you may need `sudo`). If you don't have `npm`, try `sudo apt install npm`.
- To use `sqlite-vec` via Python's built-in `sqlite3` module, you must have SQLite > 3.41 installed (otherwise `LIMIT` or `k=?` will not work properly with `rowid IN (?)` for vector search) and ensure Python's built-in `sqlite3` module is built for SQLite > 3.41. Note that Python's built-in `sqlite3` module uses its own binary library that is independent of the OS's SQLite, so upgrading the OS's SQLite will not upgrade Python's `sqlite3` module. To manually upgrade Python's `sqlite3` module to use SQLite > 3.41, here are the steps:
  - Download and compile SQLite > 3.41.0 from source:
    ```bash
    wget https://www.sqlite.org/2024/sqlite-autoconf-3460100.tar.gz
    tar -xvf sqlite-autoconf-3460100.tar.gz
    cd sqlite-autoconf-3460100
    ./configure
    make
    ```
  - Set Python's built-in `sqlite3` module to use the compiled SQLite. Suppose you are currently at path `$SQLITE_Compile`. Then set this environment variable (feel free to replace `$SQLITE_Compile` with the actual absolute/relative path):
    ```bash
    export LD_PRELOAD=$SQLITE_Compile/.libs/libsqlite3.so
    ```
    You may add the above line to `~/.bashrc` to make it permanent.
  - To verify that Python's `sqlite3` module is using the correct SQLite, run this Python code:
    ```bash
    python3 -c "import sqlite3; print(sqlite3.sqlite_version)"
    ```
    If the output is the version of SQLite you just compiled, you are good to go.
  - If you are using a Mac and run into trouble, please follow sqlite-vec's instructions.
- To use `sqlite-vec` directly in the `sqlite` prompt, simply compile `sqlite-vec` from source and load the compiled `vec0.o`. The usage can be found in sqlite-vec's README.
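To double-check that `sqlite-vec` loads correctly in Python, a quick sanity check like the sketch below can help (assuming the `sqlite-vec` Python package was installed via `requirements.txt`):

```python
import sqlite3
import sqlite_vec  # Python bindings for the sqlite-vec extension

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)   # required to load the extension
sqlite_vec.load(db)
db.enable_load_extension(False)

# Should print the SQLite version (>3.41) and the sqlite-vec version
print(db.execute("SELECT sqlite_version(), vec_version()").fetchone())
```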
- Ingest data for labeling. Run `python3 ingester.py -h` to see the options. The ingester takes a CSV, JSON, or JSONL file and loads texts from two text columns (configurable via the options `ingest_column_1` and `ingest_column_2`, which default to `source` and `summary`) of the file. After ingestion, the data will be stored in a SQLite database, denoted as `MERCURY_DB` in the following steps. (A consolidated walkthrough of this and the remaining steps is sketched after this list.)
- `pnpm install && pnpm build` (You need to recompile the frontend each time the UI code changes.)
- Manually set the labels for annotators to choose from in the `labels.yaml` file. Mercury supports hierarchical labels.
- Generate and set a JWT secret key: `export SECRET_KEY=$(openssl rand -base64 32)`. You can rerun the command above to generate a new secret key when needed, especially when the old one is compromised. Note that changing the JWT secret will log out all users. Optionally, you can also set `EXPIRE_MINUTES` to change the expiration time of the JWT token. The default is 7 days (10080 minutes).
- Administer the users: `python3 user_utils.py -h`. You need to create users before they can work on the annotation task. You can register new users, reset passwords, and delete users. User credentials are stored in a separate SQLite database, denoted as `USER_DB` in the following steps.
- Start the Mercury annotation server: `python3 server.py --mercury_db {MERCURY_DB} --user_db {USER_DB}`. Be sure to set the candidate labels to choose from in the `labels.yaml` file.
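Putting the steps above together, a typical session might look like the sketch below. The exact ingester and user-utility flags are assumptions for illustration; check `python3 ingester.py -h` and `python3 user_utils.py -h` for the actual options, and replace the database file names with your own.

```bash
# 1. Ingest a corpus (flags here are hypothetical; see `python3 ingester.py -h`)
python3 ingester.py my_corpus.jsonl --ingest_column_1 source --ingest_column_2 summary

# 2. Build the frontend
pnpm install && pnpm build

# 3. Set the JWT secret (and optionally the token lifetime in minutes)
export SECRET_KEY=$(openssl rand -base64 32)
export EXPIRE_MINUTES=10080

# 4. Create annotator accounts (see `python3 user_utils.py -h` for the exact subcommands)
python3 user_utils.py -h

# 5. Start the server (assuming mercury.sqlite and users.sqlite as the database files)
python3 server.py --mercury_db mercury.sqlite --user_db users.sqlite
```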
The annotations are stored in the `annotations` table in a SQLite database (hardcoded name `mercury.sqlite`). See the section on the `annotations` table for the schema.
The dumped human annotations are stored in a JSON format like this:
```python
[
    { # first sample
        'sample_id': int,
        'source': str,
        'summary': str,
        'annotations': [ # a list of annotations from many human annotators
            {
                'annot_id': int,
                'sample_id': int, # relative to the ingestion file
                'annotator': str, # the annotator unique id
                'annotator_name': str, # the annotator name
                'label': list[str],
                'note': str,
                'summary_span': str, # the text span in the summary
                'summary_start': int,
                'summary_end': int,
                'source_span': str, # the text span in the source
                'source_start': int,
                'source_end': int,
            }
        ],
        'meta_field_1': Any, # whatever meta info about the sample
        'meta_field_2': Any,
        ...
    },
    { # second sample
        ...
    },
    ...
]
```
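As a quick illustration of consuming this format, the sketch below prints each annotator's labels and spans (it assumes the dump was saved as `annotations.json`, a hypothetical file name):

```python
import json

# Hypothetical path to the dumped annotation file
with open("annotations.json") as f:
    samples = json.load(f)

for sample in samples:
    for annot in sample["annotations"]:
        print(
            annot["annotator_name"],
            annot["label"],
            repr(annot["summary_span"]),
            "<->",
            repr(annot["source_span"]),
        )
```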
You can view the exported data at `http://[your_host]/viewer`.
```bash
python3 migrator.py export --workdir {DIR_OF_SQLITE_FILES} --csv unified_users.csv
python3 migrator.py register --csv unified_users.csv --db unified_users.sqlite
```
Terminology:
- A sample is a pair of source and summary.
- A document is either a source or a summary.
- A chunk is a sentence in a document.
> [!NOTE]
> SQLite uses 1-indexing for `autoincrement` columns while the rest of the code uses 0-indexing.
Mercury needs two SQLite databases, denoted as `MERCURY_DB`, which stores a corpus for annotation, and `USER_DB`, which stores login credentials. One `USER_DB` can be reused for multiple `MERCURY_DB`s so that the same group of users can annotate different corpora.
| user_id | user_name | email | hashed_password |
|---|---|---|---|
| add93a266ab7484abdc623ddc3bf6441 | Alice | [email protected] | super_safe |
| 68d41e465458473c8ca1959614093da7 | Bob | [email protected] | my_password |
- The column `user_name` in the `users` table is not unique and is not used as part of the login credentials. An annotator logs in using a combination of `email` and `hashed_password`.
- Passwords are hashed by `argon2` with parameters `time_cost=2, memory_cost=19456, parallelism=1`.
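For reference, the hashing above corresponds to the following `argon2-cffi` usage (a minimal sketch of the parameters only; the actual user management lives in `user_utils.py`):

```python
from argon2 import PasswordHasher

# Same parameters as stated above
ph = PasswordHasher(time_cost=2, memory_cost=19456, parallelism=1)

hashed = ph.hash("my_password")   # value stored in the hashed_password column
ph.verify(hashed, "my_password")  # raises argon2.exceptions.VerifyMismatchError on failure
```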
Tables: `chunks`, `embeddings`, `annotations`, `config`. All powered by SQLite. In particular, `embeddings` is powered by `sqlite-vec`.
Each row in the `chunks` table is a chunk.
A JSONL file like this:

```jsonl
# test.jsonl
{"source": "The quick brown fox. Jumps over a lazy dog. ", "summary": "26 letters."}
{"source": "We the people. Of the U.S.A. ", "summary": "The U.S. Constitution. It is great. "}
```

will be ingested into the `chunks` table as below:
| chunk_id | text | text_type | sample_id | char_offset | chunk_offset |
|---|---|---|---|---|---|
| 0 | "The quick brown fox." | source | 0 | 0 | 0 |
| 1 | "Jumps over a lazy dog." | source | 0 | 21 | 1 |
| 2 | "We the people." | source | 1 | 0 | 0 |
| 3 | "Of the U.S.A." | source | 1 | 15 | 1 |
| 4 | "26 letters." | summary | 0 | 0 | 0 |
| 5 | "The U.S. Constitution." | summary | 1 | 0 | 0 |
| 6 | "It is great." | summary | 1 | 23 | 1 |
Meaning of selected columns:
- `char_offset` is the offset of a chunk in its parent document, measured by the starting character of the chunk. It allows us to find the chunk in the document.
- `chunk_offset` is the index of a chunk in its parent document. It is used to find the chunk in the document.
- `text_type` takes its value from the ingestion file: `source` and `summary` for now.
- All columns are 0-indexed.
- The `sample_id` is the index of the sample in the ingestion file. Because the ingestion file could be randomly sampled from a bigger dataset, the `sample_id` is not necessarily global.
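For intuition, the sketch below shows how such rows could be derived with spaCy sentence splitting (the actual ingester may differ in details such as whitespace handling):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # installed during setup

def chunk(document: str, text_type: str, sample_id: int):
    """Yield rows shaped like the chunks table."""
    for chunk_offset, sent in enumerate(nlp(document).sents):
        yield {
            "text": sent.text,
            "text_type": text_type,
            "sample_id": sample_id,
            "char_offset": sent.start_char,  # starting character in the document
            "chunk_offset": chunk_offset,    # index of the sentence in the document
        }

for row in chunk("The quick brown fox. Jumps over a lazy dog. ", "source", 0):
    print(row)
```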
| rowid | embedding |
|---|---|
| 1 | [0.1, 0.2, ..., 0.9] |
| 2 | [0.2, 0.3, ..., 0.8] |
`rowid` here and `chunk_id` in the `chunks` table have a one-to-one correspondence. `rowid` is 1-indexed due to `sqlite-vec`; we cannot do anything about it. So when aligning the `chunks` and `embeddings` tables, remember to subtract 1 from `rowid` to get `chunk_id`.
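The sketch below shows how such a table can be created and populated with `sqlite-vec`'s `vec0` virtual table (embedding dimension 4, as in the example config; the serialization helper is from the `sqlite-vec` Python package, and the exact schema used by Mercury may differ):

```python
import sqlite3
import sqlite_vec
from sqlite_vec import serialize_float32

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)

# rowid is implicit and 1-indexed; chunk_id = rowid - 1
db.execute("CREATE VIRTUAL TABLE embeddings USING vec0(embedding float[4])")
db.execute(
    "INSERT INTO embeddings(rowid, embedding) VALUES (?, ?)",
    (1, serialize_float32([0.1, 0.2, 0.3, 0.9])),  # embedding of chunk_id 0
)
```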
| annot_id | sample_id | annot_spans | annotator | label | note |
|---|---|---|---|---|---|
| 1 | 1 | {'source': [1, 10], 'summary': [7, 10]} | 2fe9bb69 | ["ambivalent"] | "I am not sure." |
| 2 | 1 | {'summary': [2, 8]} | a24cb15c | ["extrinsic"] | "No connection to the source." |
- `sample_id` corresponds to the `sample_id` column in the `chunks` table.
- `annot_spans` is a JSON text field that stores the text spans selected by the annotator. Each entry is a dictionary whose keys must be values of the `text_type` column in the `chunks` table (hardcoded to `source` and `summary` for now) and whose values are lists of two integers: the start and end indices of the text span in the chunk. For extrinsic hallucinations (no connection to the source at all), only the `summary` key is present. The reason we use JSON here is that SQLite does not support array types.
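For example, decoding the `annot_spans` field of the first row above with Python's `json` module gives the per-document spans (a minimal sketch; note that the stored text must be valid JSON, i.e. double-quoted, even though the table above displays it Python-style):

```python
import json

# annot_spans as stored in the database
annot_spans = json.loads('{"source": [1, 10], "summary": [7, 10]}')

for text_type, (start, end) in annot_spans.items():
    print(text_type, start, end)  # e.g. "source 1 10"
```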
For example:

| key | value |
|---|---|
| embdding_model | "openai/text-embedding-3-small" |
| embdding_dimension | 4 |
| sample_id | json_meta |
|---|---|
| 0 | {"model":"meta-llama/Meta-Llama-3.1-70B-Instruct","HHEMv1":0.43335,"HHEM-2.1":0.39717,"HHEM-2.1-English":0.90258,"trueteacher":1,"true_nli":0.0,"gpt-3.5-turbo":1,"gpt-4-turbo":1,"gpt-4o":1, "sample_id":727} |
| 1 | {"model":"openai/GPT-3.5-Turbo","HHEMv1":0.43003,"HHEM-2.1":0.97216,"HHEM-2.1-English":0.92742,"trueteacher":1,"true_nli":1.0,"gpt-3.5-turbo":1,"gpt-4-turbo":1,"gpt-4o":1, "sample_id": 1018} |
The `sample_id` column (0-indexed) is the `sample_id` in the `chunks` table. It is local to the ingestion file. `json_meta` holds whatever info is in the ingestion file other than the ingestion columns (source and summary).
Mercury implements simple OAuth2 authentication. The user logs in with email and password, the server returns a signed JWT token, and the server verifies the token on each request. The token expires in 7 days.
`sqlite-vec` uses Euclidean distance for vector search, so all embeddings must be normalized to unit length. Fortunately, OpenAI's and Sentence-BERT's embeddings are already normalized.
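If you plug in an embedder whose outputs are not unit-length, you can normalize them yourself before insertion, e.g. (a minimal numpy sketch):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale an embedding to unit length so Euclidean distance ranks like cosine similarity."""
    return v / np.linalg.norm(v)

print(np.linalg.norm(normalize(np.array([3.0, 4.0]))))  # 1.0
```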
- Suppose the user selects a text span in a chunk of global chunk ID `x`. Assume that the text span selection cannot cross sentence boundaries.
- Get `x`'s `sample_id` and `text_type` from the `chunks` table.
- Get `x`'s embedding from the `embeddings` table by `WHERE rowid = {x + 1}` (recall that `rowid` is `chunk_id` plus 1). Denote it as `x_embedding`.
- Get the `chunk_id`s of all chunks in the opposite document (source if `x` is in the summary, and vice versa) by `WHERE sample_id = {sample_id} AND text_type = {opposite_text_type}`. Denote their corresponding `rowid`s (again, `chunk_id` plus 1) as `y1, y2, ..., yn`.
- Send a query to SQLite like this:
  ```sql
  SELECT rowid, distance FROM embeddings WHERE embedding MATCH '{x_embedding}' AND rowid IN ({y1, y2, ..., yn}) ORDER BY distance LIMIT 5
  ```
  This will find the 5 most similar chunks to `x` in the opposite document. It limits the vector search to the opposite document by `rowid IN (y1, y2, ..., yn)`. Note that `rowid`, `embedding`, and `distance` are predefined by `sqlite-vec`.
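A minimal Python sketch of these steps follows, assuming a `db` connection with `sqlite-vec` loaded and the two tables populated as described above; the actual server implementation may differ:

```python
def similar_chunks(db, x: int, top_k: int = 5):
    """Given a global chunk_id x, return (rowid, distance) of the most similar
    chunks in the opposite document of the same sample."""
    sample_id, text_type = db.execute(
        "SELECT sample_id, text_type FROM chunks WHERE chunk_id = ?", (x,)
    ).fetchone()
    opposite = "source" if text_type == "summary" else "summary"

    # rowids of the opposite document's chunks; rowid = chunk_id + 1
    rowids = [
        chunk_id + 1
        for (chunk_id,) in db.execute(
            "SELECT chunk_id FROM chunks WHERE sample_id = ? AND text_type = ?",
            (sample_id, opposite),
        )
    ]

    (x_embedding,) = db.execute(
        "SELECT embedding FROM embeddings WHERE rowid = ?", (x + 1,)
    ).fetchone()

    placeholders = ",".join("?" * len(rowids))
    return db.execute(
        f"SELECT rowid, distance FROM embeddings "
        f"WHERE embedding MATCH ? AND rowid IN ({placeholders}) "
        f"ORDER BY distance LIMIT ?",
        (x_embedding, *rowids, top_k),
    ).fetchall()
```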
Here is a running example (using the data above):

- Suppose the data has been ingested. The embedder is `openai/text-embedding-3-small` and the embedding dimension is 4.
- Suppose the user selects `sample_id = 1` and `chunk_id = 5`: "The U.S. Constitution." The `text_type` of `chunk_id = 5` is `summary` -- so the opposite document is the source.
- Let's get the chunk IDs of the source document:
  ```sql
  SELECT chunk_id FROM chunks WHERE sample_id = 1 AND text_type = 'source'
  ```
  The return is `2, 3`.
- The embedding of "The U.S. Constitution." can be obtained from the `embeddings` table by `WHERE rowid = 6`. Note that because SQLite is 1-indexed, we need to add 1 to `chunk_id` to get `rowid`.
  ```sql
  SELECT embedding FROM embeddings WHERE rowid = 6
  ```
  The return is `[0.08553484082221985, 0.21519172191619873, 0.46908700466156006, 0.8522521257400513]`.
- Now we search for its nearest neighbors among the corresponding source chunks of `rowid` 3 and 4 -- again, obtained by adding 1 to `chunk_id`s 2 and 3 obtained above.
  ```sql
  SELECT rowid, distance FROM embeddings WHERE embedding MATCH '[0.08553484082221985, 0.21519172191619873, 0.46908700466156006, 0.8522521257400513]' AND rowid IN (3, 4) ORDER BY distance
  ```
  The return is `[(3, 0.3506483733654022), (4, 1.1732779741287231)]`.
- Translate the `rowid`s back to `chunk_id`s by subtracting 1 to get 2 and 3. The closest source chunk is "We the people." (`rowid` = 3 while `chunk_id` = 2), which is the most famous three words in the US Constitution.
- OpenAI's embedding endpoint can only embed up to 8192 tokens in each call.
- `embdding_dimension` is only useful for OpenAI models. Most other models do not support changing the embedding dimension.
- `multi-qa-mpnet-base-dot-v1` takes about 0.219 seconds on an x86 CPU to embed one sentence when `batch_size` is 1. The embedding dimension is 768.
- `BAAI/bge-small-en-v1.5` takes about 0.202 seconds on an x86 CPU to embed one sentence when `batch_size` is 1. The embedding dimension is 384.