Feat: Spam Detection Feature #693
Closed
aimura09 force-pushed the feat_spam_detection branch from 100b49a to a2c8118 on January 16, 2024 22:14
aimura09 force-pushed the feat_spam_detection branch 2 times, most recently from a2c8118 to 2215d7e on January 25, 2024 17:02
aimura09 force-pushed the feat_spam_detection branch from e5c16cd to e3f2f52 on May 6, 2024 05:10
alee force-pushed the feat_spam_detection branch from 590a4c4 to 3623378 on June 17, 2024 23:30
feat: rewrite UserPipeline to include user id
feat: correct user pipeline for user id
feat: fix user id column in dataframes
- use 'first_name', 'last_name', 'is_active', 'email', 'affiliations', and 'bio' from MemberProfile
- update user pipeline to fix latency issues
- small fix in all_users_df()
- convert df.value from markup to string
- fix name of df.columns
fix: modify the partial_train to use the correct tokenizer
feat:
- save to database from df
- save recommendations
- add load_labels() function to curator/spam_detect.py
- load_labels() takes the filepath of a dataset and loads a dataframe consisting of user__id and is_spam columns, then initializes the SpamRecommendation table
- get all unlabelled users in dataframe
chore:
- migrations for altering SpamRecommendation
fix: fixed SpamRecommendation __str__ function
fix:
- use None instead of an extra column in SpamRecommendation
- Aiko had the idea of using None instead of an extra field that specifies whether a model has been labelled before, so we are going to switch to that
refactor: move BioSpamClassifier to spam_detection_model.py and change function names in SpamClassifier
chore: removed print statements
feat: add stub dataset for initial training. This dataset should be replaced by an actual dataset with correct labels
add 'TODO' comments on the parts to be fixed.
chore: todo noel
chore: organized imports
fix: fix the issue that data in database is not updated
…ile is created
feat:
- fit text spam classifier
- prediction function in classifiers
…lass SpamClassifier
refactor: refactored UserMetadataSpamClassifier, integrated TextSpamClassifier, changed filenames, and renamed UserPipeline to UserSpamStatusProcessor and SpamRecommendation to UserSpamStatus
feat: added model validation in TextClassifier
fix: fixed positional argument bug
- dataset.csv replaced, and a directory created in the shared folder for spam detection related files
fix: fix typing issue in df[labelled_by_curator] column
- manual tests on the curator_spam_detection management command passed only for UserMetadataSpamClassifier
…m spam_detection_models.py to spam.py, dataset.csv added under django/curator/
fix: fixed KeyError bug in TextSpamClassifier
refactor:
- create new file 'spam_processor.py' for UserSpamStatusProcessor
- change name from dataset.csv to spam_dataset.csv
fix:
- move SPAM_DIR_PATH into settings
- set SPAM_DIR_PATH as a pathlib.Path
- remove last reference to update_labelled_by_curator
- adjust test curator labelling references
- use assertCountEqual for order independent comparison
- could also convert to sets because there shouldn't be any duplicates
refactor: restructure code and tests
- tests should use SpamDetector entrypoint instead of instantiating individual components to ensure proper initialization
- move initial training dataset path into settings
- move UserSpamStatusProcessor from detected file into curator/models.py as a collaborating class of UserSpamStatus. Should consider integrating more tightly into the UserSpamStatus objects manager
Co-Authored-By: Allen Lee <[email protected]>
…al_fit() because CountVectorizer requires a model to fit the entire training dataset. Fix the management commands and tests accordingly. Also re-generated migration files
also clean up duplicate / dead imports
… to the spam feature.
- adding headline comments for the functions.
- cleaning up the management command code and clarifying code responsibilities.
- improving execution messages.
…d_by_curator=None
- using get_all_users_df() instead of get_unlabelled_by_curator_df() to obtain the dataframe, because a user previously labelled as ham may turn into spam.
- improved the management command messages.
- added exception handling for file operations.
- replaced MultinomialNB with XGBoost.
…ated architecture.
alee force-pushed the feat_spam_detection branch from 3623378 to 0074230 on June 17, 2024 23:52
Going to move to an LLM-based approach for spam detection
Attempts to close https://github.com/comses/planning/issues/113
Squashed commits and resolved merge conflicts.
Summary
Management commands for machine learning spam detection.
Features
Before running the commands, make sure spam_dataset.csv is located in the curator folder.
spam_dataset.csv consists of user_id and is_spam columns. The is_spam column contains 1 (spam) or 0 (ham).
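For reference, a minimal sketch of reading a file in that format with the standard library. The helper name read_spam_labels and the sample rows are made up for illustration; the PR's own load_labels() works on a dataframe and is not reproduced here.

```python
import csv
import io


def read_spam_labels(csv_file):
    """Map each user_id to True (spam) or False (ham).

    Hypothetical helper, not the PR's load_labels(); it only assumes the
    two columns described above: user_id and is_spam (1 or 0).
    """
    labels = {}
    for row in csv.DictReader(csv_file):
        raw = row["is_spam"].strip()
        if raw not in ("0", "1"):
            raise ValueError(f"is_spam must be 0 or 1, got {raw!r}")
        labels[int(row["user_id"])] = raw == "1"
    return labels


# Made-up rows standing in for spam_dataset.csv.
sample = io.StringIO("user_id,is_spam\n101,1\n102,0\n")
labels = read_spam_labels(sample)
print(labels)
```

Rejecting anything other than 0 or 1 up front keeps mislabelled rows out of the training data instead of silently treating them as ham.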
XGBoostClassifier() ... Uses XGBoost as a classifier. Takes a data frame with the columns "user_id" and "input_data". The "input_data" column is a numerical vector in which the selected fields are encoded by an encoder.
CountVectEncoder() ... Uses CountVectorizer as an encoder. Takes selected fields from "user_id", "labelled_by_curator", "first_name", "last_name", "is_active", "email", "affiliations", "bio", and "research_interests" of the MemberProfiles as input.
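To illustrate what the "input_data" vector looks like, here is a rough plain-Python sketch of bag-of-words count encoding, the idea behind CountVectorizer. It is not the PR's CountVectEncoder: the function names, the simplified tokenizer, the choice of text fields, and the sample profiles are all assumptions for illustration, and the XGBoost training step is left out.

```python
import re
from collections import Counter

# Assumed subset of the text fields listed above.
TEXT_FIELDS = ("first_name", "last_name", "email", "affiliations", "bio",
               "research_interests")


def tokenize(text):
    """Lowercase and split on non-alphanumerics (simpler than CountVectorizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def profile_tokens(profile, fields=TEXT_FIELDS):
    """Collect tokens from every selected field; missing or None counts as empty."""
    tokens = []
    for field in fields:
        tokens.extend(tokenize(str(profile.get(field) or "")))
    return tokens


def fit_vocabulary(profiles):
    """Assign a column index to every token seen in the training profiles."""
    seen = set()
    for profile in profiles:
        seen.update(profile_tokens(profile))
    return {token: index for index, token in enumerate(sorted(seen))}


def encode(profile, vocabulary):
    """Return the count vector for one profile; unknown tokens are ignored."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(profile_tokens(profile)).items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector


# Invented profiles, not data from the project.
profiles = [
    {"user_id": 1, "bio": "Buy cheap pills, cheap!"},
    {"user_id": 2, "bio": "Agent-based modeling researcher",
     "affiliations": "State University"},
]
vocabulary = fit_vocabulary(profiles)
vector = encode(profiles[0], vocabulary)
print(vector)
```

Because the vocabulary is built from the whole training set, the encoder has to see all training rows before any vector can be produced, which matches the commit note above about CountVectorizer needing to fit the entire training dataset.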
Run the following command to get a list of spam users.
./manage.py curator_spam_detection --predict
Other options:
./manage.py curator_spam_detection --fit
./manage.py curator_spam_detection --get_model_metrics
./manage.py curator_spam_detection --load_labels
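The command's implementation is not shown in this description. As a sketch only, the four flags could be parsed with argparse, which is also what Django management commands expose through add_arguments(). Treating the flags as mutually exclusive, the help texts, and the function names are assumptions, not confirmed behavior.

```python
import argparse

# Flag names taken from the commands above; everything else is assumed.
ACTIONS = ("fit", "predict", "get_model_metrics", "load_labels")


def build_parser():
    """Build a parser that accepts exactly one of the four action flags."""
    parser = argparse.ArgumentParser(prog="curator_spam_detection")
    # Assumption: one action per invocation.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--fit", action="store_true",
                       help="train the classifier (assumed meaning)")
    group.add_argument("--predict", action="store_true",
                       help="list users predicted to be spam")
    group.add_argument("--get_model_metrics", action="store_true",
                       help="report model metrics (assumed meaning)")
    group.add_argument("--load_labels", action="store_true",
                       help="load labels from the dataset (assumed meaning)")
    return parser


def selected_action(argv):
    """Return the name of the action flag present in argv."""
    args = build_parser().parse_args(argv)
    for name in ACTIONS:
        if getattr(args, name):
            return name


print(selected_action(["--predict"]))
```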
Tests
Wrote 16 unit tests using the Django test framework.