feat: Machine Learning Spam Detection #666
feat: rewrite UserPipeline to include user id
feat: correct user pipeline for user id
feat: fix user id column in dataframes
… 'email', 'affiliations', and 'bio' of MemberProfile
fix: small fix in all_users_df().
- Convert df.value from markup to string
- Fix the name of df.columns
This spam classifier only looks at the MemberProfile's bio and nothing else.
feat:retrieve data from database
Aiko had the idea of using None instead of an extra field that specified whether a model has been labelled before, so we are going to switch to that.
chore:cleaned up unused code
chore:cleaned up unused code
chore:migrations for altering SpamRecommendation
- load_labels() takes the filepath of a dataset and loads a dataframe consisting of the user__id and is_spam columns. It then initializes the SpamRecommendation table.
… by an actual dataset with correct labels add 'TODO' comments on the parts to be fixed.
…e functions names in SpamClassifier chore:removed print statements
chore:todo noel
chore:organized imports
chore:fixed messy imports
…r, change filenames, UserPipeline to UserSpamStatusProcessor, and SpamRecommendation to UserSpamStatus
feat: added model validation in TextClassifier
manual tests on curator_spam_detection management command passed only for UserMetadataSpamClassifier.
… spam detection related files
…m spam_detection_models.py to spam.py, dataset.csv added under django/curator/
# Create a new UserSpamStatus whenever a new MemberProfile is created
@receiver(post_save, sender=MemberProfile)
def sync_member_profile_spam(sender, instance, **kwargs):
I believe we need to check the created argument (https://docs.djangoproject.com/en/4.2/ref/signals/#post-save); otherwise this will try, and fail, to create a second object whenever a MemberProfile is updated.
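A minimal sketch of the guard suggested in this comment. Django passes created=True to post_save receivers only on the first save of an instance. So that the example runs without Django, a plain dict (spam_status_by_profile, a hypothetical stand-in for the UserSpamStatus table) replaces the ORM call, and the profile is a SimpleNamespace rather than a real MemberProfile. The is_spam=None default is an assumption based on the "None means unlabelled" idea mentioned in the commits.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the UserSpamStatus table (profile id -> status row).
spam_status_by_profile = {}


# In the real project this function would keep its decorator:
# @receiver(post_save, sender=MemberProfile)
def sync_member_profile_spam(sender, instance, created=False, **kwargs):
    """Create a spam-status row only when the profile itself was just created."""
    if not created:
        # post_save also fires on every update; there is nothing to create then.
        return
    # Assumption: None marks a profile that has not been labelled yet.
    spam_status_by_profile[instance.id] = {"is_spam": None}


profile = SimpleNamespace(id=1)
sync_member_profile_spam(sender=None, instance=profile, created=True)   # first save
spam_status_by_profile[1]["is_spam"] = True                             # labelled later
sync_member_profile_spam(sender=None, instance=profile, created=False)  # plain update
```

With the guard in place, the second call leaves the existing row (and its label) untouched instead of attempting a duplicate insert.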
self.processor = self.detection.processor
self.user_meta_classifier = self.detection.user_meta_classifier
self.text_classifier = self.detection.text_classifier
I think this may be better done with subcommands/subparsers (https://docs.python.org/3/library/argparse.html#sub-commands), since the intent seems to be to run just one action. It may also be worth combining common workflows.
I'm still trying to figure out what to fix for this suggestion. I'll ask you again today at our meeting.
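For the sub-command suggestion in this thread, a rough sketch using plain argparse may help. The sub-command names (load_labels, train, predict) and the --classifier option are hypothetical and chosen only for illustration; the actual command may need different actions.

```python
import argparse


def build_parser():
    """Build a parser where exactly one action (sub-command) must be chosen."""
    parser = argparse.ArgumentParser(prog="curator_spam_detection")
    subparsers = parser.add_subparsers(dest="action", required=True)

    # Hypothetical sub-commands; the real command may expose different actions.
    subparsers.add_parser("load_labels", help="initialize labels from the dataset")

    for name, help_text in (("train", "fit a classifier"), ("predict", "flag likely spam")):
        sub = subparsers.add_parser(name, help=help_text)
        # Hypothetical option: which classifier the action applies to.
        sub.add_argument("--classifier", choices=["user", "text"], default="user")

    return parser


args = build_parser().parse_args(["train", "--classifier", "text"])
print(args.action, args.classifier)
```

In a Django management command the same add_subparsers calls would go inside add_arguments(self, parser), since the parser Django passes in is an argparse.ArgumentParser subclass; options["action"] would then hold the chosen sub-command.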
predict_text = options["predict_text"]

load_directory = pathlib.Path(DATASET_FILE_PATH)
could potentially shorten this to use dynamic method calls, e.g. getattr(self, f"handle_{action}")()
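A small sketch of the dynamic dispatch idea, assuming one handle_<action> method per action. The class and method names here are illustrative and not taken from the PR; the allow-list is an added precaution so that arbitrary attribute names cannot be invoked.

```python
class SpamCommand:
    """Toy command object; the handle_* names are illustrative only."""

    ACTIONS = ("train", "predict")  # allow-list of actions that may be dispatched

    def handle_train(self):
        return "trained"

    def handle_predict(self):
        return "predicted"

    def handle(self, action):
        if action not in self.ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        # Look up handle_<action> by name and call it.
        return getattr(self, f"handle_{action}")()


result = SpamCommand().handle("train")
```

This replaces a chain of if/elif branches with a single lookup, at the cost of making the available actions slightly less obvious when reading the code.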
refactor: create new file 'spam_processor.py' for UserSpamStatusProcessor. change name from dataset.csv to spam_detaset.csv
Apologies for the pivot, but after starting to go through https://course.fast.ai I'm leaning towards using fast.ai (and PyTorch as well) for much of this instead of TensorFlow + scikit. I think it should be a straightforward replacement or additional strategy tbh, and perhaps we can compare the results from both?
set SPAM_DIR_PATH as a pathlib.Path Co-authored-by: Aiko Muraishi <[email protected]>
- tests should use SpamDetector entrypoint instead of instantiating individual components to ensure proper initialization
- move initial training dataset path into settings
- move UserSpamStatusProcessor from detected file into curator/models.py as a collaborating class of UserSpamStatus. Should consider integrating more tightly into the UserSpamStatus objects manager
could also convert to sets because there shouldn't be any duplicates
5c2d38a to 83c6830
No description provided.