fastText format for text classification #9

Hironsan · 2020-05-15T09:55:39Z

Example:

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What’s the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
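
To make the layout of the lines above explicit: each line carries zero or more whitespace-separated tokens starting with `__label__`, followed by the document text. A minimal sketch of reading one such line back follows; the function name `parse_fasttext_line` is hypothetical and not part of any library.

```python
from typing import List, Tuple


def parse_fasttext_line(line: str, prefix: str = "__label__") -> Tuple[List[str], str]:
    """Split one fastText-style line into (label names, text)."""
    tokens = line.split()
    # Every token carrying the prefix is treated as a label.
    labels = [t[len(prefix):] for t in tokens if t.startswith(prefix)]
    # The remaining tokens, in order, form the document text.
    text = " ".join(t for t in tokens if not t.startswith(prefix))
    return labels, text


labels, text = parse_fasttext_line(
    "__label__baking __label__oven __label__convection Fan bake vs bake"
)
print(labels, text)
```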

prokotg · 2020-05-16T12:06:45Z

Hi! I've started working on this one. I am also going to use the label metadata in order to get label names. Would that be all right?

Hironsan · 2020-05-16T13:10:46Z

> I am also going to use the label metadata in order to get label names. Would that be all right?

I agree with you. Where and how would the label metadata be passed in?

prokotg · 2020-05-16T16:33:27Z

I have a couple of ideas, but here's what comes to mind:

Personally, as a user, I would prefer to call a class method on each Dataset directly, so instead of using

dataset = read_jsonl(filepath='example.jsonl', dataset=NERDataset, encoding='utf-8')

I would suggest using this directly:

dataset = NERDataset.from_jsonl(filepath='example.jsonl', encoding='utf-8')

and when it comes to TextClassificationDataset (working name), I would just add another optional argument (via **kwargs) ...

dataset = TextClassificationDataset.from_jsonl(annotations_filepath='example.jsonl', labels_filepath='project_1_labels.jsonl', encoding='utf-8')

...optional because, without the label metadata filepath, annotations could still be converted with the label id appended (plus an informational warning), like this: __label__1, although I am not sure this is a valid fastText label (I have to check that).
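
The fallback described above can be sketched as follows. This is only an illustration under assumptions: `to_fasttext_line` and its parameters are hypothetical names, and label metadata is assumed to be available as a plain `id -> name` mapping. As far as I know, fastText recognises labels by the `__label__` prefix (its `-label` option), so a numeric suffix such as `__label__1` should be accepted, but that is still worth verifying.

```python
import warnings
from typing import Dict, Iterable, Optional


def to_fasttext_line(text: str, label_ids: Iterable[int],
                     label_names: Optional[Dict[int, str]] = None) -> str:
    """Format one annotated document as a fastText-style line."""
    if label_names is None:
        # No label metadata: fall back to the raw ids and tell the user.
        warnings.warn("No label metadata given; using label ids as names.")
        names = [str(i) for i in label_ids]
    else:
        names = [label_names[i] for i in label_ids]
    # Labels are whitespace-separated tokens, so a name must not contain spaces.
    prefixed = ["__label__" + n.strip().replace(" ", "-") for n in names]
    # Collapse newlines and repeated spaces so the document stays on one line.
    return " ".join(prefixed + [" ".join(text.split())])


print(to_fasttext_line("Fan bake vs bake", [1, 2], {1: "baking", 2: "oven"}))
print(to_fasttext_line("Fan bake vs bake", [1, 2]))
```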

If you decide to stay with the current implementation, the labels path could either be passed as **kwargs to the read_jsonl function and forwarded to the Dataset constructor, or be passed directly to the TextClassificationDataset.to_fasttext method. (The latter requires reading the label metadata every time you want to perform a conversion, so I am not a fan of that solution.)
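
A rough sketch of the class-method variant, where the label metadata is read once at load time and reused by every conversion. Everything here is an assumption for illustration: the class body, the annotation record shape (`{"text": ..., "annotations": [{"label": <id>}]}`) and the label record shape (`{"id": ..., "text": ...}`, one per line) are guesses and not the library's actual implementation.

```python
import json
from typing import Dict, Iterator, List, Optional


class TextClassificationDataset:
    """Hypothetical sketch of the proposal; not the library's actual class."""

    def __init__(self, examples: List[dict],
                 label_names: Optional[Dict[int, str]] = None):
        self.examples = examples
        self.label_names = label_names  # None means "fall back to label ids"

    @classmethod
    def from_jsonl(cls, annotations_filepath: str,
                   labels_filepath: Optional[str] = None,
                   encoding: str = "utf-8") -> "TextClassificationDataset":
        with open(annotations_filepath, encoding=encoding) as f:
            examples = [json.loads(line) for line in f if line.strip()]
        label_names = None
        if labels_filepath is not None:
            # Label metadata is read once here, not on every conversion.
            with open(labels_filepath, encoding=encoding) as f:
                records = [json.loads(line) for line in f if line.strip()]
            label_names = {rec["id"]: rec["text"] for rec in records}
        return cls(examples, label_names)

    def to_fasttext(self) -> Iterator[str]:
        """Yield one fastText-style line per example."""
        for example in self.examples:
            ids = [a["label"] for a in example["annotations"]]
            if self.label_names is None:
                names = [str(i) for i in ids]
            else:
                names = [self.label_names[i] for i in ids]
            yield " ".join(["__label__" + n for n in names] + [example["text"]])
```

With this shape, `to_fasttext()` needs no extra arguments, which avoids re-reading the label file on each conversion.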

Let me know what you think

This was referenced May 16, 2020

Add TextClassificationDataset with to_fastext() method #10

Closed

Text classification doccano #11

Draft

Hironsan added the enhancement label Nov 13, 2020
