label between two distinct datasets? #884

Open
havardox opened this issue Aug 17, 2024 · 2 comments
Labels: question (Further information is requested)

Comments

havardox commented Aug 17, 2024

I have two datasets: a "corpus" and a "query" database. The records within each dataset are already distinct, so I only need active labeling on pairs that span the two datasets. Is that possible? Here's my current code:

from zingg.client import *
from zingg.pipes import *
import os  # needed for the os.getenv() calls below
import sys

# Set up arguments for Zingg
args = Arguments()

# Phase name to be passed as a command line argument
phase_name = sys.argv[1]

# Define fields that correspond to the SQL table columns
query_id = FieldDefinition("query_id", "string", MatchType.DONT_USE)
corpus_id = FieldDefinition("corpus_id", "string", MatchType.DONT_USE)
title = FieldDefinition("title", "string", MatchType.FUZZY)
year_published = FieldDefinition("year_published", "string", MatchType.NUMERIC, MatchType.EXACT, MatchType.NULL_OR_BLANK)
authors = FieldDefinition("authors", "string", MatchType.FUZZY, MatchType.NULL_OR_BLANK)
part_number = FieldDefinition("part_number", "string", MatchType.FUZZY, MatchType.NULL_OR_BLANK)
isbn = FieldDefinition("isbn", "string", MatchType.FUZZY, MatchType.NULL_OR_BLANK)

# Group fields into a list
fieldDefs = [query_id, corpus_id, title, year_published, authors, part_number, isbn]

# Set field definitions in the arguments
args.setFieldDefinition(fieldDefs)

# Define the input pipe with the `query` table
queryData = Pipe("queryData", "jdbc")
queryData.addProperty(
    "url",
    f"jdbc:postgresql://{os.getenv('DATABASE_HOST')}:{os.getenv('DATABASE_PORT')}/book_linker_test",
)
queryData.addProperty("dbtable", "query")
queryData.addProperty("driver", "org.postgresql.Driver")
queryData.addProperty("user", os.getenv("DATABASE_USER"))
queryData.addProperty("password", os.getenv("DATABASE_PASSWORD"))

# Define the input pipe with the `corpus` table
corpusData = Pipe("corpusData", "jdbc")
corpusData.addProperty(
    "url",
    f"jdbc:postgresql://{os.getenv('DATABASE_HOST')}:{os.getenv('DATABASE_PORT')}/book_linker_test",
)
corpusData.addProperty("dbtable", "corpus")
corpusData.addProperty("driver", "org.postgresql.Driver")
corpusData.addProperty("user", os.getenv("DATABASE_USER"))
corpusData.addProperty("password", os.getenv("DATABASE_PASSWORD"))

# Add the input pipes
args.setData(queryData, corpusData)

# Define the output pipe
booksIdentitiesResolved = Pipe("booksIdentitiesResolved", "jdbc")
booksIdentitiesResolved.addProperty(
    "url",
    f"jdbc:postgresql://{os.getenv('DATABASE_HOST')}:{os.getenv('DATABASE_PORT')}/book_linker_test",
)
booksIdentitiesResolved.addProperty("dbtable", "books_unified")
booksIdentitiesResolved.addProperty("driver", "org.postgresql.Driver")
booksIdentitiesResolved.addProperty("user", os.getenv("DATABASE_USER"))
booksIdentitiesResolved.addProperty("password", os.getenv("DATABASE_PASSWORD"))

# Add the output pipe to arguments
args.setOutput(booksIdentitiesResolved)

# Model and execution settings
args.setModelId("books_model")
args.setZinggDir("test_models")
args.setNumPartitions(4)
args.setLabelDataSampleSize(0.5)

# Zingg execution options
options = ClientOptions([ClientOptions.PHASE, phase_name])

# Execute Zingg with the provided phase
zingg = Zingg(args, options)
zingg.initAndExecute()

Running zingg.sh {zingg.conf} --run {python_file} label selects samples only from the "corpus" dataset, since the corpus has about 100k records and the query dataset only 9k. That's not what I want: I only care about the differences between the query and corpus databases.
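
Because the script reads the phase from its first argument, the same file can be reused for every phase. A minimal sketch of generating the command per phase, assuming the invocation layout quoted above; `build_command` and the example file names are hypothetical and not part of Zingg:

```python
import shlex


def build_command(conf_path, script_path, phase, launcher="zingg.sh"):
    """Return the argument list for one phase, mirroring the command quoted above."""
    if not phase:
        raise ValueError("phase must be a non-empty string")
    # The phase is passed last so the script receives it as sys.argv[1].
    return [launcher, conf_path, "--run", script_path, phase]


if __name__ == "__main__":
    # Hypothetical file names; print the commands instead of executing them.
    for phase in ("findTrainingData", "label"):
        print(shlex.join(build_command("zingg.conf", "books.py", phase)))
```

Printing rather than executing keeps the sketch safe to try; the list form can be handed to `subprocess.run` once the paths are real.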

havardox added the question label Aug 17, 2024
sonalgoyal (Member) commented

Currently, Zingg does not distinguish between datasets when selecting pairs for labelling. However, if you run findTrainingData, then label, and then link, you should be able to get the results you want. You can also force-feed some trainingSamples that pair records across the two datasets by using pre-existing training data.
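
The pre-existing training data mentioned above could be prepared from known cross-dataset pairs along these lines. This is a minimal sketch, not Zingg's own tooling. It assumes the training file uses a `z_cluster` column shared by both records of a pair and a `z_isMatch` column (1 for a match, 0 for a non-match), followed by the same columns as the field definitions in the question; check the Zingg documentation on pre-existing training data for the exact schema. `pairs_to_training_rows` and `write_training_csv` are hypothetical helper names.

```python
import csv
import io

# Assumed layout: two Zingg bookkeeping columns, then the fields from the question.
DATA_FIELDS = ["query_id", "corpus_id", "title", "year_published",
               "authors", "part_number", "isbn"]
FIELDS = ["z_cluster", "z_isMatch"] + DATA_FIELDS


def pairs_to_training_rows(pairs):
    """Turn (query_record, corpus_record, is_match) tuples into training rows.

    Each pair yields two rows that share one z_cluster value, so every
    labelled pair spans the query and corpus datasets.
    """
    rows = []
    for index, (query_rec, corpus_rec, is_match) in enumerate(pairs):
        for record in (query_rec, corpus_rec):
            row = {"z_cluster": str(index), "z_isMatch": 1 if is_match else 0}
            for field in DATA_FIELDS:
                row[field] = record.get(field, "")  # missing values stay blank
            rows.append(row)
    return rows


def write_training_csv(rows, file_handle):
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(file_handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Made-up example records, for illustration only.
    example = [({"query_id": "q1", "title": "Dune"},
                {"corpus_id": "c7", "title": "Dune (1965)"}, True)]
    buffer = io.StringIO()
    write_training_csv(pairs_to_training_rows(example), buffer)
    print(buffer.getvalue())
```

The resulting file would then be wrapped in a pipe and registered as training samples on the arguments (in the Python client this appears to be `args.setTrainingSamples(...)`; treat that call and the column layout as things to verify against the Zingg version in use).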

sonalgoyal (Member) commented

@sania-16 can you start looking at this?
