Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow non-translation tasks #25

Merged
merged 10 commits into from
Nov 22, 2024
Merged

allow non-translation tasks #25

merged 10 commits into from
Nov 22, 2024

Conversation

lannelin
Copy link
Collaborator

  • move seeding to utils for reuse
  • allow specification of list of languages with no concern for source/target terminology
  • creation of specific fn for translation loading
  • refactor of existing script to use new translation-specific fn

@lannelin lannelin requested a review from J-Dymond November 20, 2024 17:14
@lannelin
Copy link
Collaborator Author

ok @J-Dymond should be good to go, sorry for the false start!


from arc_spice.data import multieurlex_utils

# def extract_articles(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to remove, sorry

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole file? or the `extract_articles'?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just the commented out sig

data_dir="data", level=1, lang_pair=lang_pair
)
train = dataset_dict["train"]
multi_onehot = MultiHot(metadata_params["n_classes"])
test_row = get_test_row(train)
class_labels = multi_onehot(test_row["class_labels"])
return test_row, class_labels, metadata_params


def get_test_row(train_data):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be appropriate to split these functionalities into two functions, or pass a debug_flag argument?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've simply removed the manually entered data here. I assume this script will be superseded in time by something that goes over more than 1 sample



def test_extract_articles_single_lang():
langs = ["en"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we loop over all languages here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only have test data for single lang for english (as the loader works a bit differently). I can create it for others if we want to expand the tests but it should be the same functionality. Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok is this addressed by the other comment re. testing all languages?

Copy link
Collaborator

@J-Dymond J-Dymond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good to merge! I just had one question in there about get_test_row. Would appreciate a chat on how to use the test multieurlex, I think I'll need to use/adapt that for the inference tests.

tests/test_multieurlex_utils.py Show resolved Hide resolved
@J-Dymond J-Dymond merged commit 92b39c2 into main Nov 22, 2024
5 checks passed
@J-Dymond J-Dymond deleted the refactor_data_preproc branch November 22, 2024 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants