fastText format for text classification #9

Open
Hironsan opened this issue May 15, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@Hironsan
Member

Example:

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What’s the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
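A minimal sketch of how lines in this format could be produced, assuming each example is a list of label names plus its text. `to_fasttext_line` is a hypothetical helper for illustration, not an existing function in this project. Whitespace inside a label name is replaced with a hyphen because fastText splits tokens on whitespace.

```python
def to_fasttext_line(labels, text, prefix="__label__"):
    """Format one example as a single fastText supervised-learning line."""
    # fastText tokenizes on whitespace, so a label must not contain any.
    tags = [prefix + "-".join(label.split()) for label in labels]
    # Collapse newlines and runs of spaces so the example stays on one line.
    body = " ".join(text.split())
    return " ".join(tags + [body])


print(to_fasttext_line(["sauce", "cheese"],
                       "How much does potato starch affect a cheese sauce recipe?"))
```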
@prokotg

prokotg commented May 16, 2020

Hi! I've started working on this one. I'm also going to use the label metadata to get the label names. Would that be all right?

@Hironsan
Member Author

I am going to also use label metadata in order to get label names. Would that be all right?

I agree with you. Where and how would the label metadata be passed in?

@prokotg

prokotg commented May 16, 2020

I have a couple of ideas, but here's what comes to mind:

Personally, as a user, I would prefer to call a class method on each Dataset directly, so instead of

dataset = read_jsonl(filepath='example.jsonl', dataset=NERDataset, encoding='utf-8')

I would suggest using

dataset = NERDataset.from_jsonl(filepath='example.jsonl', encoding='utf-8')

and when it comes to TextClassificationDataset (working name), I would just add another optional argument (via **kwargs) ...

dataset = TextClassificationDataset.from_jsonl(annotations_filepath='example.jsonl', labels_filepath='project_1_labels.jsonl', encoding='utf-8')

...optional because, without the label metadata filepath, annotations could still be converted with the label id appended instead (plus a warning for information), like this: __label__1, although I am not sure this is a valid fastText label (I have to check that)
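The class-method idea with an optional labels file could be sketched roughly as follows. Everything here is an assumption for illustration: the class body, the `to_fasttext` generator, and especially the record layouts (annotation records shaped like `{"text": ..., "labels": [ids]}` and label records shaped like `{"id": ..., "text": ...}`), which may not match the real export files.

```python
import json
import warnings


class TextClassificationDataset:
    """Hypothetical dataset class; names and record layout are assumptions."""

    def __init__(self, examples, label_names=None):
        self.examples = examples        # list of {"text": str, "labels": [int, ...]}
        self.label_names = label_names  # {label id: label name}, or None

    @classmethod
    def from_jsonl(cls, annotations_filepath, labels_filepath=None, encoding="utf-8"):
        with open(annotations_filepath, encoding=encoding) as f:
            examples = [json.loads(line) for line in f if line.strip()]
        label_names = None
        if labels_filepath is not None:
            # The label metadata is read once here, not on every conversion.
            with open(labels_filepath, encoding=encoding) as f:
                records = [json.loads(line) for line in f if line.strip()]
            label_names = {r["id"]: r["text"] for r in records}
        return cls(examples, label_names)

    def to_fasttext(self):
        """Yield one fastText-formatted line per example."""
        if self.label_names is None:
            warnings.warn("No label metadata given; using label ids in place of names.")
        for example in self.examples:
            names = [
                str(i) if self.label_names is None else self.label_names[i]
                for i in example["labels"]
            ]
            tags = [f"__label__{name}" for name in names]
            yield " ".join(tags + [example["text"]])
```

With the labels file the output uses label names; without it, the same call falls back to ids such as `__label__1` and emits a warning.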

If you decide to stay with the current implementation, the labels path could either be passed as **kwargs to the read_jsonl function and forwarded to the Dataset constructor, or be passed directly to the TextClassificationDataset.to_fasttext method. The latter requires reading the label metadata every time you want to perform a conversion, so I am not a fan of that solution.
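For comparison, the **kwargs forwarding variant might look like the sketch below. The `read_jsonl` here is a stand-in with the same call shape as in this thread, not the project's actual code, and the constructor signature is assumed.

```python
class TextClassificationDataset:
    """Hypothetical dataset class that accepts an optional labels path."""

    def __init__(self, filepath, encoding="utf-8", labels_filepath=None):
        self.filepath = filepath
        self.encoding = encoding
        # Stored once at construction, so conversions need not re-read it.
        self.labels_filepath = labels_filepath


def read_jsonl(filepath, dataset, encoding="utf-8", **kwargs):
    # Stand-in reader: extra keyword arguments are forwarded unchanged
    # to the constructor of whichever dataset class was requested.
    return dataset(filepath, encoding, **kwargs)


dataset = read_jsonl(
    filepath="example.jsonl",
    dataset=TextClassificationDataset,
    encoding="utf-8",
    labels_filepath="project_1_labels.jsonl",
)
```

Dataset classes that do not accept `labels_filepath` would raise a `TypeError` if it were passed, which is one trade-off of forwarding keyword arguments blindly.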

Let me know what you think

Hironsan added the enhancement label on Nov 13, 2020