Deduplicate by content #128

Open
ccstan99 opened this issue Aug 11, 2023 · 0 comments


@ccstan99
Collaborator

  • Cross-posted blogs and PDFs hosted on multiple domains may have the same content under different URLs. We want some way to check for and flag duplicate content.

  • A good canonical public-facing URL may be different from the source URL used for extracting content.

  • We've discussed this before. It will be a challenge, but it is something to be aware of.
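
As a starting point for discussion, the points above could be sketched roughly as follows. This is only a minimal illustration, not the project's implementation: the `Article` record, its field names (`source_url`, `canonical_url`, `text`), and the helpers `content_fingerprint` and `find_duplicates` are all hypothetical. It keeps the source URL and the canonical URL as separate fields and flags items whose text hashes to the same value after simple normalization.

```python
import hashlib
import re
import unicodedata
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical record; these field names are assumptions, not the project's schema.
    source_url: str      # where the content was extracted from
    canonical_url: str   # preferred public-facing link
    text: str


def content_fingerprint(text: str) -> str:
    """Hash the text after normalizing Unicode form, case, and whitespace."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(articles):
    """Group articles by fingerprint and return only groups with more than one member."""
    groups = defaultdict(list)
    for article in articles:
        groups[content_fingerprint(article.text)].append(article)
    return [group for group in groups.values() if len(group) > 1]


# Example: the first two entries differ only in case and whitespace.
articles = [
    Article("https://example.org/paper.pdf", "https://example.org/paper", "Same  Content\n"),
    Article("https://blog.example.com/post", "https://blog.example.com/post", "same content"),
    Article("https://example.net/other", "https://example.net/other", "Different content"),
]
duplicate_groups = find_duplicates(articles)
for group in duplicate_groups:
    print([a.source_url for a in group])
```

An exact hash like this only catches copies that are identical after normalization. A PDF and a blog post of the same piece usually differ in extraction details (headers, footnotes, line breaks), so flagging those would likely need a fuzzy comparison such as shingling with a similarity threshold, which is not shown here.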

Labels
None yet
Projects
Status: In Discussion
Development

No branches or pull requests

1 participant