link_id missing #1

Open
datanizing opened this issue Aug 5, 2021 · 2 comments

@datanizing

This is a very interesting project.

Could you also provide the link_id in addition to the id itself? This would allow construction of a valid Reddit URL for each comment.
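A minimal sketch of the URL construction meant here, assuming the link_id follows Reddit's fullname convention (a "t3_" prefix plus the submission id) and that Reddit accepts "_" in place of the thread's title slug. The function name and the ids in the usage note are hypothetical.

```python
def _strip_prefix(fullname: str, prefix: str) -> str:
    """Remove a Reddit type prefix such as "t3_" if it is present."""
    return fullname[len(prefix):] if fullname.startswith(prefix) else fullname


def comment_url(link_id: str, comment_id: str) -> str:
    """Build a URL for one comment from its link_id and its own id.

    Assumption: link_id may carry the "t3_" (submission) prefix and
    comment_id may carry the "t1_" (comment) prefix; both are stripped.
    """
    submission = _strip_prefix(link_id, "t3_")
    comment = _strip_prefix(comment_id, "t1_")
    # "_" is a placeholder for the title slug, which is not known here.
    return f"https://www.reddit.com/comments/{submission}/_/{comment}/"
```

For example, `comment_url("t3_abc123", "def456")` would yield a link into the thread with id `abc123`, focused on the comment `def456`.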

@ausgerechnet
Member

Hi datanizing,

Thank you for your interest!

We do indeed provide the link_id in the corpus:

<text id="…" permalink="…" …>
<submission id="…" created_utc="…">
…
</submission>
<comment id="…" link_id="…" …>
…
</comment>
<comment …>
…
</comment>
…
</text>

If you are looking for a link to the original thread, you can also just use www.reddit.com + text['permalink'].
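As a rough sketch of the permalink approach, assuming a single text element is available as well-formed XML and that its permalink starts with "/": the sample below mirrors the structure shown above, but every attribute value in it is made up, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the attribute values are invented for illustration.
SAMPLE = """<text id="abc123" permalink="/r/de/comments/abc123/beispiel/">
<submission id="abc123" created_utc="0">
</submission>
<comment id="def456" link_id="t3_abc123">
</comment>
</text>"""


def thread_url(text_elem: ET.Element) -> str:
    """Prefix the text's permalink with the Reddit host."""
    return "https://www.reddit.com" + text_elem.get("permalink")


def comment_urls(text_elem: ET.Element) -> list:
    """Append each comment id to the thread URL.

    Assumption: a comment's permalink is the thread permalink
    followed by the comment id.
    """
    base = thread_url(text_elem)
    if not base.endswith("/"):
        base += "/"
    return [base + c.get("id") + "/" for c in text_elem.iter("comment")]


text = ET.fromstring(SAMPLE)
print(thread_url(text))
print(comment_urls(text))
```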

Did you re-build the corpus from scratch or did you access it through our CQPweb instance?

@datanizing
Author

Hi Philipp,

That XML file looks interesting and would contain all the information needed to retrieve the article. However, I can't find it in the repository; there is only a file called german-comment-ids.txt.gz, which contains just the id. You need either the link_id or the permalink to download a Reddit object.

We would like to rebuild the corpus, perform even more language detection and extend it.

Regards
Christian
