Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[24.1] Add username and email deduplication scripts #18492

Closed

Conversation

jdavcs
Copy link
Member

@jdavcs jdavcs commented Jul 3, 2024

Ref #18487

These are 2 scripts that can be run to remove duplicate emails and duplicate usernames from the galaxy_user table.
I am not sure if we want them in 24.1 (this PR) or in dev. It's a feature, not a bug fix; but this feature can be used to fix a data problem on main that's running 24.1, and, potentially, any other database that is sufficiently ancient.

I'll have a separate PR against dev with migrations that add unique constraints to both fields.

I've tested this manually. Automated tests are a pain here, since there is no way that I know of to populate a sqlite database with data that violates a unique constraint (other than removing the constraint, which would require creating a new galaxy_user table just for the test).

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@jdavcs jdavcs added area/database Galaxy's database or data access layer area/scripts labels Jul 3, 2024
@jdavcs
Copy link
Member Author

jdavcs commented Jul 4, 2024

Failures relevant: imports. I'll fix this.

@jdavcs jdavcs marked this pull request as draft July 4, 2024 01:59
self.engine = engine

def deduplicate_emails(self) -> None:
with self.engine.begin() as connection:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't do this, users won't control the new email address.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as deduplication goes, there is not much else we can do: there should be no duplicate emails in the database, and if there are, we can either delete the duplicate records or edit the duplicate values. Another option is to expect admins to do this manually (deleting rows, contacting users, whatever), and add this to the admin release notes. This script modifies the emails on the older duplicate accounts, leaving the newest intact, which of course is not ideal.

Copy link
Member

@mvdbeek mvdbeek Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is not much else we can do:

At a minimum prefix/suffix the email address to something not guessable ? Yes, if there are duplicates it's better to check which account actually has data, it's not unlikely only one of them will have anything at all, since we probably pick one on login, or we've always been denying / failing at the login stage, at which point toggling the delete flag is sufficient.

@jdavcs
Copy link
Member Author

jdavcs commented Jul 4, 2024

Upon further consideration, I think this belongs in dev. I'll add this to #18487.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/database Galaxy's database or data access layer area/scripts
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants