This issue is being made to identify possible improvements to the training procedure in light of a newfound domain name bias.
Training the record ID model with and without the full URLs results in a ~15% difference in evaluation accuracy (less accurate without URLs). I believe this is due to a learned association between domain name (e.g. longbeach.gov) and record type.
For example:
- for longbeach.gov samples, 95% are 'Agency-Published Resources'
- for delcopa.gov samples, 61% are 'Not Criminal Justice Related'
- for cityprotect.com samples, 96% are 'Agency-Published Resources'
Full dataset here: domain-name-label-count.csv. Each row is a unique domain name; each entry is the fraction of times the column's label appears with that domain in the dataset. The last column is the total number of samples for the domain.
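For reference, a table like this can be produced with a few lines of pandas. A minimal sketch, where the file path and the column names `domain` and `label` are assumptions about the raw data, not the actual schema:

```python
import pandas as pd

# One row per sample; "domain" and "label" column names are assumed.
df = pd.read_csv("samples.csv")

# Fraction of each label per domain, plus a total-samples column.
counts = pd.crosstab(df["domain"], df["label"])
fractions = counts.div(counts.sum(axis=1), axis=0)
fractions["total_samples"] = counts.sum(axis=1)
fractions.to_csv("domain-name-label-count.csv")
```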
A similar pattern holds across a majority of domain names: one record type dominates. That wouldn't be so bad if each domain truly contained that proportion of each record type, but that is likely not the case. As it stands, including the domain name introduces unwanted bias, meaning the model will not generalize well to new samples. In my view, the ideal model should identify record type regardless of domain name, so I would suggest dropping the domain name entirely in training (while keeping everything after .gov, .com, etc.).
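A minimal sketch of that suggestion, assuming the URLs are well-formed enough for the standard library to parse (the function name here is hypothetical):

```python
from urllib.parse import urlparse

def strip_domain(url: str) -> str:
    """Keep only the path (and query) so the model cannot key on the domain name."""
    parsed = urlparse(url)
    return parsed.path + ("?" + parsed.query if parsed.query else "")

# strip_domain("https://longbeach.gov/police/records?id=7") -> "/police/records?id=7"
```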
Initial thoughts on possible improvements:
1. more samples, with an emphasis on an even distribution of record types. We do not know the true distribution of record types available to us, so an equal distribution of each is our best way to generalize (see the sketch after this list)
2. reducing noise in scraped text
3. different model architecture
4. clustering analysis for ideas on alternative labeling schemes
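On point 1, one cheap interim step while new samples come in would be to rebalance what we already have, e.g. by upsampling each record type to the size of the largest class. A sketch, where the file path and the `label` column name are assumptions:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.read_csv("training_data.csv")  # path and "label" column assumed
target = df["label"].value_counts().max()

# Upsample every record type to the majority-class count so the
# training set has an even label distribution.
balanced = pd.concat(
    resample(group, replace=True, n_samples=target, random_state=0)
    for _, group in df.groupby("label")
)
```

Note this only evens out the label counts; it does not add new information the way genuinely new samples would.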
@bonjarlow
With 1, we are already at least partially on that path with the common crawler.
With 2, we have a range of options, from eliminating stopwords to identifying which parts of the text contribute the most information for classification.
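As a concrete example of the stopword end of that range, something like the following could work (scikit-learn assumed; the corpus and thresholds are illustrative placeholders, not tuned values):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

scraped_texts = [
    "City of Long Beach police department annual report",
    "Meeting agenda for the parks and recreation board",
]  # placeholder corpus

# Drop English stopwords, a common source of noise in scraped text.
# On a real corpus, raising min_df would also prune rare tokens.
vectorizer = TfidfVectorizer(stop_words="english", min_df=1, max_features=20000)
X = vectorizer.fit_transform(scraped_texts)
```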
With 3, the simplest alternative (but also the most expensive) would be using an LLM to categorize. On a per-item basis, such categorization would cost a fraction of a cent, but it would stack up. The quality, though, would be higher.
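A rough sketch of what that could look like, assuming the OpenAI API; the model name, prompt, and truncation limit are all placeholders, and the label list would need to be the project's full set:

```python
from openai import OpenAI

client = OpenAI()

def categorize(text: str) -> str:
    # Truncate the input to bound the per-item cost.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Classify this record into exactly one of: "
                        "Agency-Published Resources, Not Criminal Justice Related, ..."},
            {"role": "user", "content": text[:4000]},
        ],
    )
    return response.choices[0].message.content.strip()
```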