feat: Machine Learning Spam Detection #666

aimura09 · 2023-08-09T01:40:23Z

No description provided.

sgfost · 2023-08-14T23:00:36Z

django/curator/models.py

+
+# Create a new UserSpamStatus whenever a new MemberProfile is created
+@receiver(post_save, sender=MemberProfile)
+def sync_member_profile_spam(sender, instance, **kwargs):


I believe we need to check the created argument https://docs.djangoproject.com/en/4.2/ref/signals/#post-save otherwise this will try and fail to create a second object when a memberprofile gets updated
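A minimal sketch of the `created` check described above, written so it runs without Django. `FakeProfile` and the `spam_statuses` dict are hypothetical stand-ins for `MemberProfile` and the `UserSpamStatus` table, and the `is_spam` key is illustrative. In the real module the handler would keep its `@receiver(post_save, sender=MemberProfile)` decorator.

```python
# Hypothetical stand-in for the UserSpamStatus table: profile pk -> status record.
spam_statuses = {}


class FakeProfile:
    """Hypothetical substitute for MemberProfile, for illustration only."""

    def __init__(self, pk):
        self.pk = pk


def sync_member_profile_spam(sender, instance, created, **kwargs):
    # Django's post_save passes created=True only when the row was inserted,
    # so updates to an existing profile are skipped here.
    if not created:
        return
    if instance.pk in spam_statuses:
        # Mimics the failure a second create would cause on a one-to-one relation.
        raise ValueError(f"status already exists for profile {instance.pk}")
    spam_statuses[instance.pk] = {"is_spam": None}  # illustrative field name


profile = FakeProfile(pk=1)
sync_member_profile_spam(FakeProfile, profile, created=True)   # initial save
sync_member_profile_spam(FakeProfile, profile, created=False)  # later update, no-op
```

Without the `created` guard, the second call would attempt another insert for the same profile and raise.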

sgfost · 2023-08-14T23:15:04Z

django/curator/management/commands/curator_spam_detection.py

+        self.processor = self.detection.processor
+        self.user_meta_classifier = self.detection.user_meta_classifier
+        self.text_classifier = self.detection.text_classifier
+


I think this may be better done with subcommands/subparsers (https://docs.python.org/3/library/argparse.html#sub-commands) since it seems the intent is to just run one action. Also may want to consider combining common workflows
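The subparser idea could look roughly like the sketch below, using plain `argparse`. The action names (`load_labels`, `fit`, `predict`) and the `--classifier` option are assumptions for illustration, not the command's actual interface. In a Django management command the same calls would presumably go inside `add_arguments(self, parser)`.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="curator_spam_detection")
    # Exactly one action must be chosen; its name is stored in args.action.
    subparsers = parser.add_subparsers(dest="action", required=True)

    # Hypothetical actions, one subparser each.
    subparsers.add_parser("load_labels", help="load curator labels from the dataset")

    fit = subparsers.add_parser("fit", help="train a classifier")
    fit.add_argument("--classifier", choices=["usermeta", "text"], default="usermeta")

    predict = subparsers.add_parser("predict", help="predict with a trained classifier")
    predict.add_argument("--classifier", choices=["usermeta", "text"], default="usermeta")

    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["fit", "--classifier", "text"])
```

Each action gets its own options and help text, and `argparse` rejects invocations that name no action or more than one.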

aimura09 · 2023-08-22

I'm still trying to figure out what to fix for this suggestion. I'll ask you again today at our meeting.

sgfost · 2023-08-14T23:17:57Z

django/curator/management/commands/curator_spam_detection.py

+        predict_text = options["predict_text"]
+
+        load_directory = pathlib.Path(DATASET_FILE_PATH)
+


could potentially shorten this to use dynamic method calls, e.g. getattr(self, f"handle_{action}")()
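A small sketch of that dynamic dispatch, assuming the command defines one `handle_<action>` method per action. The `Command` class and its method names here are hypothetical and do not derive from Django's `BaseCommand`.

```python
class Command:
    """Hypothetical stand-in for the management command class."""

    def handle_fit(self):
        return "fit"

    def handle_predict(self):
        return "predict"

    def handle(self, action):
        # Look up handle_<action> by name instead of branching on each flag.
        handler = getattr(self, f"handle_{action}", None)
        if handler is None:
            raise ValueError(f"unknown action: {action}")
        return handler()


result = Command().handle("predict")
```

Passing a default of `None` to `getattr` lets an unknown action produce a clear error rather than a bare `AttributeError`.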

alee · 2023-08-18T07:34:52Z

Apologies for the pivot, but after starting to go through https://course.fast.ai I'm leaning towards using fast.ai (and PyTorch as well) for much of this instead of TensorFlow + scikit-learn...

I think it should be a straightforward replacement or additional strategy tbh, and perhaps we can compare the results from both?

- tests should use SpamDetector entrypoint instead of instantiating individual components to ensure proper initialization
- move initial training dataset path into settings
- move UserSpamStatusProcessor from detected file into curator/models.py as a collaborating class of UserSpamStatus. Should consider integrating more tightly into the UserSpamStatus objects manager

CharlesSheelam and others added 30 commits August 8, 2023 17:44

feat: create user pipeline for spam detection

824daeb

feat: rewrite UserPipeline to include user id
feat: correct user pipeline for user id
feat: fix user id column in dataframes

feat: spam detection based on 'first_name', 'last_name', 'is_active', 'email', 'affiliations', and 'bio' of MemberProfile

151a167

fix: modify the partial_train to use the correct tokenizer

edf4419

fix: update user pipeline to fix latency issues

b5897ff

fix: small fix in all_users_df().
- Convert df.value from markup to string
- Fix the name of df.columns

feat:created a model for storing spam recommendations

aff5fef

feat:created a spam classifier which looks at MemberProfile bio

ee48d9d

This spam classifier only looks at the MemberProfile's bio and nothing else.

feat:retrieve data from database

e3d7bd3

feat:retrieve data from database

feat:get all unlabelled users in dataframe

e6d8ad0

feat:used charles' pipeline

2b20dfc

fix:using None instead of an extra column in SpamRecommendation

1087b94

Aiko had the idea of using None instead of an extra field which specified if a model has been labelled before. So, we are going to switch to that.
chore:cleaned up unused code
chore:cleaned up unused code

fix:fixed SpamRecommendation __str__ function

5219aef

chore:migrations for altering SpamRecommendation

feat:save recommendations

bce8021

feat:save to database from df

3a0ac5d

feat: add load_labels() function to curator/spam_detect.py

fa982a3

- load_labels() takes the filepath of the dataset and loads a dataframe consisting of user__id and is_spam columns. It then initializes the SpamRecommendation table.

add a command for spam detection

540e38c

add stub dataset for initial training. This dataset should be replaced by an actual dataset with correct labels

88fe78e

add 'TODO' comments on the parts to be fixed.

refactor: Move BioSpamClassifier to spam_detection_model.py and change function names in SpamClassifier

0db71b4

chore:removed print statements

feat:added extra field in SpamRecommendation for user classifier

248f9c1

fix:don't replace data in database

1bcce0d

chore:todo noel
chore:organized imports

feat:create a new SpamRecommendation whenever a new MemberProfile is created

fe0bead

feat:updated spam classifier

de40c30

feat:fit text spam classifier

8789477

feat:prediction on ML

ea3d88f

fix: fix conflicts.

71952da

fix: UserPipeline functions and add an abstract class SpamClassifier

8c75156

chore:fixed messy imports

fix: refactor UserMetadataSpamClassifier, integrate TextSpamClassifier, change filenames, UserPipeline to UserSpamStatusProcessor, and SpamRecommendation to UserSpamStatus

ec28169

feat: added model metrics saving feature

8a3309f

feat: added model validation in TextClassifier

fix: fix typing issue in df[labelled_by_curator] column

8235be2

manual tests on curator_spam_detection management command passed only for UserMetadataSpamClassifier.

fix: dataset.csv replaced and create a directory in shared folder for spam detection related files

b8c6617

fix: fixed positional argument bug

75a044c

Aiko Muraishi and others added 2 commits August 8, 2023 18:38

feat: unit tests added, comments added, SpamDetection class moved from spam_detection_models.py to spam.py, dataset.csv added under django/curator/

cd051d0

fix: fixed KeyError bug in TextSpamClassifier

f0f52b8

sgfost reviewed Aug 14, 2023

fix: small fix on textClassifier

8c38e09

refactor: create new file 'spam_processor.py' for UserSpamStatusProcessor. change name from dataset.csv to spam_detaset.csv

alee and others added 3 commits August 22, 2023 16:40

fix: move SPAM_DIR_PATH into settings

5935e24

set SPAM_DIR_PATH as a pathlib.Path
Co-authored-by: Aiko Muraishi

fix: use assertCountEqual for order independent comparison

83c6830

could also convert to sets because there shouldn't be any duplicates

alee force-pushed the spam_aiko__clean_commits branch from 5c2d38a to 83c6830 on August 23, 2023 19:12

alee added 2 commits August 31, 2023 13:34

fix: adjust test curator labelling references

670146a

fix: remove last reference to update_labelled_by_curator

996d8b5

aimura09 closed this Oct 11, 2023
