feat: Machine Learning Spam Detection #666

Closed
wants to merge 38 commits into from

Conversation

Contributor

@aimura09 aimura09 commented Aug 9, 2023

No description provided.

CharlesSheelam and others added 30 commits August 8, 2023 17:44
feat: rewrite UserPipeline to include user id

feat: correct user pipeline for user id

feat: fix user id column in dataframes
… 'email', 'affiliations', and 'bio' of MemberProfile
fix: small fix in all_users_df().

   - Convert df.value from markup to string

   - Fix the name of df.columns
This spam classifier only looks at the MemberProfile's bio and nothing else.
feat:retrieve data from database
Aiko had the idea of using None instead of an extra field which specified if a model has been labelled before.
So, we are going to switch to that.

chore:cleaned up unused code

chore:cleaned up unused code
chore:migrations for altering SpamRecommendation
  - load_labels() will take the filepath of a dataset and load a dataframe consisting of user__id and is_spam columns. Then it will initialize the SpamRecommendation table.
… by an actual dataset with correct labels

add 'TODO' comments on the parts to be fixed.
…e functions names in SpamClassifier

chore:removed print statements
chore:todo noel

chore:organized imports
…r, change filenames, UserPipeline to UserSpamStatusProcessor, and SpamRecommendation to UserSpamStatus
feat: added model validation in TextClassifier
  manual tests on curator_spam_detection management command passed only for UserMetadataSpamClassifier.
Aiko Muraishi and others added 2 commits August 8, 2023 18:38
…m spam_detection_models.py to spam.py, dataset.csv added under django/curator/

# Create a new UserSpamStatus whenever a new MemberProfile is created
@receiver(post_save, sender=MemberProfile)
def sync_member_profile_spam(sender, instance, **kwargs):
Contributor

I believe we need to check the created argument (https://docs.djangoproject.com/en/4.2/ref/signals/#post-save); otherwise this will try, and fail, to create a second object whenever a MemberProfile gets updated.
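A minimal sketch of the check being suggested, written so it runs without Django. The `FakeSpamStatusTable` class and the `spam_statuses` name are hypothetical stand-ins for the UserSpamStatus table, and the assumption that a second row for the same profile is rejected is inferred from the comment above, not taken from the PR's code.

```python
from types import SimpleNamespace


# Hypothetical in-memory stand-in for the UserSpamStatus table, used only so
# this sketch runs without Django. It is assumed to reject a second row for
# the same profile, the way a unique or one-to-one constraint would.
class FakeSpamStatusTable:
    def __init__(self):
        self.rows = {}

    def create(self, member_profile_id):
        if member_profile_id in self.rows:
            raise ValueError(f"status already exists for {member_profile_id}")
        # None means "not labelled yet", as described in the commit notes.
        self.rows[member_profile_id] = {"is_spam": None}


spam_statuses = FakeSpamStatusTable()


# In the app this handler would keep its
# @receiver(post_save, sender=MemberProfile) decorator. post_save passes
# created=True only when the save inserted a new record.
def sync_member_profile_spam(sender, instance, created=False, **kwargs):
    if not created:
        return  # an update: the status row already exists, nothing to do
    spam_statuses.create(instance.id)


profile = SimpleNamespace(id=1)
sync_member_profile_spam(sender=None, instance=profile, created=True)   # first save
sync_member_profile_spam(sender=None, instance=profile, created=False)  # later update
```

Without the `created` guard, the second call would reach `create()` again and raise, which is the failure described above.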

self.processor = self.detection.processor
self.user_meta_classifier = self.detection.user_meta_classifier
self.text_classifier = self.detection.text_classifier

Contributor

I think this may be better done with subcommands/subparsers (https://docs.python.org/3/library/argparse.html#sub-commands), since it seems the intent is to run just one action. We may also want to consider combining common workflows.
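For reference, a rough sketch of what the subparser layout could look like with plain argparse. The subcommand names (`load_labels`, `fit`, `predict`), the `--path` and `--classifier` options, and their defaults are assumptions made for illustration; they are not the command's actual interface, and wiring this into a Django management command's `add_arguments` is left out.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="curator_spam_detection")
    # dest="action" records which subcommand was chosen; required=True makes
    # argparse reject an invocation that names no subcommand at all.
    subparsers = parser.add_subparsers(dest="action", required=True)

    load = subparsers.add_parser("load_labels", help="load labels from a CSV file")
    load.add_argument("--path", default="dataset.csv")  # hypothetical default

    subparsers.add_parser("fit", help="train the classifiers")

    predict = subparsers.add_parser("predict", help="run predictions")
    predict.add_argument(
        "--classifier", choices=["text", "user_meta"], default="text"
    )
    return parser


args = build_parser().parse_args(["predict", "--classifier", "user_meta"])
```

Each subcommand carries only its own options, so exactly one action runs per invocation and options that make no sense for that action are rejected by the parser.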

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still trying to figure out what to fix for this suggestion. I'll ask you again today at our meeting.

predict_text = options["predict_text"]

load_directory = pathlib.Path(DATASET_FILE_PATH)

Contributor

could potentially shorten this to use dynamic method calls, e.g. getattr(self, f"handle_{action}")()
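A small sketch of that dispatch pattern, assuming the command has one `handle_<action>` method per action. The class name `SpamCommand` and the method names are hypothetical, and the unknown-action check is an addition that the one-liner above does not have.

```python
class SpamCommand:
    """Hypothetical command class; the handle_* names are illustrative."""

    def handle(self, action):
        # Look up the method named after the action instead of branching
        # through an if/elif chain.
        handler = getattr(self, f"handle_{action}", None)
        if handler is None:
            raise ValueError(f"unknown action: {action}")
        return handler()

    def handle_load_labels(self):
        return "labels loaded"

    def handle_fit(self):
        return "model fitted"


result = SpamCommand().handle("fit")
```

Adding a new action then only means adding another `handle_<action>` method.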

refactor: create new file 'spam_processor.py' for UserSpamStatusProcessor. change name from dataset.csv to spam_detaset.csv
Member

alee commented Aug 18, 2023

apologies for the pivot, but after starting to go through the https://course.fast.ai I'm leaning towards using fast.ai for much of this (and pytorch as well) instead of tensorflow + scikit...

I think it should be a straightforward replacement or additional strategy tbh, and perhaps we can compare the results from both?

alee and others added 3 commits August 22, 2023 16:40
set SPAM_DIR_PATH as a pathlib.Path

Co-authored-by: Aiko Muraishi <[email protected]>
tests should use SpamDetector entrypoint instead of instantiating
individual components to ensure proper initialization

move initial training dataset path into settings
move UserSpamStatusProcessor from detected file into curator/models.py
as a collaborating class of UserSpamStatus. Should consider integrating
more tightly into the UserSpamStatus objects manager
could also convert to sets because there shouldn't be any duplicates
@aimura09 aimura09 closed this Oct 11, 2023

5 participants