refactor: enhance duplicate annotation detection criteria
Expand the criteria for identifying and removing duplicate annotations
in the workbook to include `predicate` and `predicate_id` in addition to
`element_xpath`, `object`, and `object_id`. With this stricter key, only
rows that match on all five columns are treated as duplicates, so
annotations that differ only in their predicate are no longer removed.
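
The effect of the wider key can be illustrated with a small, self-contained sketch. It is not the project's code: `dedupe`, `DUPLICATE_KEYS`, and the sample values are hypothetical stand-ins, and only the column names and the sort-then-`drop_duplicates` pattern come from the diff.

```python
import pandas as pd

# Columns that define a duplicate after this commit (names taken from the diff).
DUPLICATE_KEYS = ["element_xpath", "predicate", "predicate_id", "object", "object_id"]


def dedupe(workbook: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: keep the most recent row for each key."""
    wb = workbook.sort_values("date", ascending=False)
    return wb.drop_duplicates(subset=DUPLICATE_KEYS, keep="first")


# Made-up sample data: rows 0 and 1 share every key column, row 2 differs
# only in its predicate columns.
rows = pd.DataFrame(
    {
        "element_xpath": ["/eml:eml/dataset"] * 3,
        "predicate": ["contains", "contains", "about"],
        "predicate_id": ["p1", "p1", "p2"],
        "object": ["lake"] * 3,
        "object_id": ["o1"] * 3,
        "date": pd.to_datetime(["2024-11-01", "2024-11-12", "2024-11-05"]),
    }
)

result = dedupe(rows)
```

In this sketch the older of the two `contains` rows is dropped and the `about` row survives. Under the previous key (`element_xpath`, `object`, `object_id`) all three rows would have collapsed into one.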
clnsmth authored Nov 12, 2024
1 parent 89ae337 commit 31067f8
Showing 2 changed files with 10 additions and 12 deletions.
10 changes: 6 additions & 4 deletions src/spinneret/workbook.py
@@ -225,15 +225,17 @@ def delete_duplicate_annotations(
     :param workbook: The annotation workbook
     :returns: The workbook with duplicate annotations removed
     :notes: The function removes duplicate annotations based on the
-        following columns: `element_xpath`, `object`, `object_id`, `date`. The
-        most recent annotation is preferred to allow improvements to other
-        fields set by the annotator.
+        following columns: `element_xpath`, `predicate`, `predicate_id`,
+        `object`, and `object_id`. The most recent annotation, based on `date`,
+        is preferred to allow improvements to other fields set by the
+        annotator.
     """
     wb = workbook.sort_values("date", ascending=False)
     wb = wb.drop_duplicates(
         subset=[
             "element_xpath",
+            "predicate",
+            "predicate_id",
             "object",
             "object_id",
         ],
12 changes: 4 additions & 8 deletions tests/test_workbook.py
@@ -135,18 +135,14 @@ def test_delete_duplicate_annotations():
         elements=["dataset"],
     )
     # Row 1
-    wb.loc[0, "predicate"] = "predicate_1"
-    wb.loc[0, "predicate_id"] = "predicate_id_1"
-    wb.loc[0, "object"] = "object_1"
-    wb.loc[0, "object_id"] = "object_id_1"
+    wb.loc[0, "predicate"] = "predicate"
+    wb.loc[0, "predicate_id"] = "predicate_id"
+    wb.loc[0, "object"] = "object"
+    wb.loc[0, "object_id"] = "object_id"
     wb.loc[0, "date"] = pd.Timestamp.now()
     # Row 2 is a duplicate annotation of row 1
     row = wb.iloc[0].copy()
     wb.loc[len(wb)] = row
-    wb.loc[1, "predicate"] = "predicate_2"
-    wb.loc[1, "predicate_id"] = "predicate_id_2"
-    wb.loc[1, "object"] = "object_1"
-    wb.loc[1, "object_id"] = "object_id_1"
     sleep(1)  # pause for 1 second to ensure the datetime is different
     wb.loc[1, "date"] = pd.Timestamp.now()
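
The test above relies on `sleep(1)` to make the two timestamps differ. As a possible alternative, the same check can be sketched with explicit timestamps. This is an illustration under assumptions, not the repository's test: `dedupe`, `KEYS`, and the sample values are hypothetical, and the workbook is built by hand rather than through the project's helper.

```python
import pandas as pd

KEYS = ["element_xpath", "predicate", "predicate_id", "object", "object_id"]


def dedupe(wb: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the function under test."""
    return wb.sort_values("date", ascending=False).drop_duplicates(subset=KEYS)


def test_most_recent_duplicate_is_kept():
    # Row 1, built by hand with made-up values
    wb = pd.DataFrame(
        [
            {
                "element_xpath": "/eml:eml/dataset",
                "predicate": "predicate",
                "predicate_id": "predicate_id",
                "object": "object",
                "object_id": "object_id",
                "date": pd.Timestamp("2024-11-12 00:00:00"),
            }
        ]
    )
    # Row 2 is a copy of row 1 with a date exactly one second later
    wb.loc[len(wb)] = wb.iloc[0].copy()
    wb.loc[1, "date"] = wb.loc[0, "date"] + pd.Timedelta(seconds=1)

    result = dedupe(wb)

    # Only the newer of the two identical annotations should remain
    assert len(result) == 1
    assert result.iloc[0]["date"] == wb.loc[1, "date"]
```

Fixed timestamps keep the test deterministic and avoid the one-second pause, at the cost of no longer exercising `pd.Timestamp.now()`.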
