I have two datasets: a "corpus" and a "query" database. I need to do active labeling only between those two datasets as the values themselves are already distinct for each dataset. Is that possible? Here's my current code:
from zingg.client import *
from zingg.pipes import *
import os
import sys

# Set up arguments for Zingg
args = Arguments()

# Phase name to be passed as a command line argument
phase_name = sys.argv[1]

# Define fields that correspond to the SQL table columns
query_id = FieldDefinition("query_id", "string", MatchType.DONT_USE)
corpus_id = FieldDefinition("corpus_id", "string", MatchType.DONT_USE)
title = FieldDefinition("title", "string", MatchType.FUZZY)
year_published = FieldDefinition("year_published", "string", MatchType.NUMERIC, MatchType.EXACT, MatchType.NULL_OR_BLANK)
authors = FieldDefinition("authors", "string", MatchType.FUZZY, MatchType.NULL_OR_BLANK)
part_number = FieldDefinition("part_number", "string", MatchType.FUZZY, MatchType.NULL_OR_BLANK)
isbn = FieldDefinition("isbn", "string", MatchType.FUZZY, MatchType.NULL_OR_BLANK)

# Group fields into a list
fieldDefs = [query_id, corpus_id, title, year_published, authors, part_number, isbn]

# Set field definitions in the arguments
args.setFieldDefinition(fieldDefs)

# Define the input pipe with the `query` table
queryData = Pipe("queryData", "jdbc")
queryData.addProperty(
    "url",
    f"jdbc:postgresql://{os.getenv('DATABASE_HOST')}:{os.getenv('DATABASE_PORT')}/book_linker_test",
)
queryData.addProperty("dbtable", "query")
queryData.addProperty("driver", "org.postgresql.Driver")
queryData.addProperty("user", os.getenv("DATABASE_USER"))
queryData.addProperty("password", os.getenv("DATABASE_PASSWORD"))

# Define the input pipe with the `corpus` table
corpusData = Pipe("corpusData", "jdbc")
corpusData.addProperty(
    "url",
    f"jdbc:postgresql://{os.getenv('DATABASE_HOST')}:{os.getenv('DATABASE_PORT')}/book_linker_test",
)
corpusData.addProperty("dbtable", "corpus")
corpusData.addProperty("driver", "org.postgresql.Driver")
corpusData.addProperty("user", os.getenv("DATABASE_USER"))
corpusData.addProperty("password", os.getenv("DATABASE_PASSWORD"))

# Add the input pipes
args.setData(queryData, corpusData)

# Define the output pipe
booksIdentitiesResolved = Pipe("booksIdentitiesResolved", "jdbc")
booksIdentitiesResolved.addProperty(
    "url",
    f"jdbc:postgresql://{os.getenv('DATABASE_HOST')}:{os.getenv('DATABASE_PORT')}/book_linker_test",
)
booksIdentitiesResolved.addProperty("dbtable", "books_unified")
booksIdentitiesResolved.addProperty("driver", "org.postgresql.Driver")
booksIdentitiesResolved.addProperty("user", os.getenv("DATABASE_USER"))
booksIdentitiesResolved.addProperty("password", os.getenv("DATABASE_PASSWORD"))

# Add the output pipe to arguments
args.setOutput(booksIdentitiesResolved)

# Model and execution settings
args.setModelId("books_model")
args.setZinggDir("test_models")
args.setNumPartitions(4)
args.setLabelDataSampleSize(0.5)

# Zingg execution options
options = ClientOptions([ClientOptions.PHASE, phase_name])

# Execute Zingg with the provided phase
zingg = Zingg(args, options)
zingg.initAndExecute()
Running
zingg.sh {zingg.conf} --run {python_file} label
only selects samples from the "corpus", since the corpus has about 100k records and the query dataset only about 9k. That's not what I want: I only care about the differences between the query and corpus databases.
Currently, Zingg does not distinguish between datasets when selecting pairs for labelling. However, if you run findTrainingData and label, and then run link, you should be able to get the results you want. You can also force-feed some training samples that pair records across the two datasets by supplying pre-existing training data (trainingSamples).
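To illustrate the pre-existing training data suggestion, here is a minimal sketch that builds cross-dataset training rows using only the Python standard library. It assumes the training samples are flat records in which both members of a pair share a z_cluster value and carry a z_isMatch flag (1 for a match, 0 for a non-match); please verify these column names and the expected column order against the Zingg documentation. The helper names (build_training_rows, to_csv) and the example records are hypothetical and not part of Zingg.

```python
import csv
import io

# Field columns copied from the FieldDefinition list in the question.
FIELDS = ["query_id", "corpus_id", "title", "year_published",
          "authors", "part_number", "isbn"]


def build_training_rows(pairs):
    """Turn (query_record, corpus_record, is_match) triples into flat rows.

    Assumption: both records of a pair share one z_cluster value, and
    z_isMatch is 1 for a match and 0 for a non-match.
    """
    rows = []
    for cluster_id, (query_rec, corpus_rec, is_match) in enumerate(pairs):
        for rec in (query_rec, corpus_rec):
            row = {"z_cluster": str(cluster_id),
                   "z_isMatch": 1 if is_match else 0}
            # Missing fields are written as empty strings.
            row.update({name: rec.get(name, "") for name in FIELDS})
            rows.append(row)
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["z_cluster", "z_isMatch"] + FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical, hand-labelled pairs: each one has a query record and a corpus record.
example_pairs = [
    ({"query_id": "q1", "title": "Moby Dick", "year_published": "1851"},
     {"corpus_id": "c7", "title": "Moby-Dick; or, The Whale", "year_published": "1851"},
     True),
    ({"query_id": "q2", "title": "Emma", "year_published": "1815"},
     {"corpus_id": "c9", "title": "Persuasion", "year_published": "1817"},
     False),
]

training_csv = to_csv(build_training_rows(example_pairs))
```

The resulting text could be written to a file and registered as training samples through a CSV pipe (for instance via args.setTrainingSamples in the Python API, if your Zingg version provides it), so that at least some labelled pairs are guaranteed to span the query and corpus datasets.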