Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hashtag contains linked punctuation #1518

Closed
dmnyc opened this issue Aug 27, 2023 · 27 comments
Closed

Hashtag contains linked punctuation #1518

dmnyc opened this issue Aug 27, 2023 · 27 comments
Assignees
Labels

Comments

@dmnyc
Copy link

dmnyc commented Aug 27, 2023

At the end of the hashtag I included a … character which was not intended to be linked to the tag but Damus included it anyway.

note ID:
note1a8y42nfmeyz4gnvmjxta8k2l8dzl4sa9y6m726aa976ludlzq56sjde5dg

image

@alltheseas
Copy link
Collaborator

Did you add a space between your intended hashtag and the subsequent characters?

@dmnyc
Copy link
Author

dmnyc commented Aug 28, 2023 via email

@alltheseas
Copy link
Collaborator

I put other punctuation
at the end like periods and exclamation marks and they don’t cause this
issue.

On Damus? On Twitter?

@dmnyc
Copy link
Author

dmnyc commented Aug 28, 2023 via email

@alltheseas
Copy link
Collaborator

alltheseas commented Aug 28, 2023

Thanks, that helps.

what happens

Hash tag with one period post hashtag is not added to hashtag.

Hash tag with two periods post hashtag is not added to hashtag.

Hash tag with three periods post hashtag is added to hashtag.

Hash tag with four periods post hashtag: three are added to hashtag, one is not.

image

image

Cannot replicate with commas

image

@dmnyc
Copy link
Author

dmnyc commented Aug 28, 2023

Three periods as an ellipsis character … is what I believe caused the issue.

jonmarrs added a commit to jonmarrs/damus that referenced this issue Aug 30, 2023
…phanumeric characters such as punctuation marks.

Closes: damus-io#1518
@dmnyc
Copy link
Author

dmnyc commented Aug 30, 2023

note12u7dx9gcm3tapau7ljqk7ymkc2lz7zkh0u3w85ndsed7s3kpl8ys943vz7

@jonmarrs
Copy link
Contributor

Here are some examples of how hashtags are displayed with the patch I am proposing (jonmarrs@336215a).

image image image image

Let me know if you would like me to run any other tests.

@jonmarrs
Copy link
Contributor

note12u7dx9gcm3tapau7ljqk7ymkc2lz7zkh0u3w85ndsed7s3kpl8ys943vz7

Here is how this note looks with my proposed code change:

image

@dmnyc
Copy link
Author

dmnyc commented Aug 30, 2023 via email

@alltheseas
Copy link
Collaborator

Here are some examples of how hashtags are displayed with the patch I am proposing (jonmarrs@336215a).

image image image image
Let me know if you would like me to run any other tests.

Thanks @jonmarrs

Have you reviewed the contributing guidelines?

@dmnyc
Copy link
Author

dmnyc commented Aug 30, 2023

Looks perfect.

@jonmarrs
Copy link
Contributor

Here are some examples of how hashtags are displayed with the patch I am proposing (jonmarrs@336215a).
image image image image
Let me know if you would like me to run any other tests.

Thanks @jonmarrs

Have you reviewed the contributing guidelines?

Thanks @alltheseas

I read the contributing guidelines, and submitted the patch to [email protected] using git send-email.

Now I'm working on running some test cases.

@alltheseas alltheseas moved this to In Progress 🏗️ in Damus Roadmap 🛣️ Aug 31, 2023
@jonmarrs
Copy link
Contributor

Three periods as an ellipsis character … is what I believe caused the issue.

Just wanted to give a quick update. I've been looking at how "..." is encoded in Damus, and it is indeed being encoded as an ellipsis rather than three separate periods. Apparently the ellipsis is encoded in UTF-8 as 0xE2 0x80 0xA6 (https://www.compart.com/en/unicode/U+2026). So those three hex values (0xE2 0x80 0xA6) that encode the ellipsis are not being detected as punctuation.
image
Perhaps we should check for an ellipsis when parsing a hashtag.

@jb55
Copy link
Collaborator

jb55 commented Aug 31, 2023 via email

@jonmarrs
Copy link
Contributor

jonmarrs commented Sep 1, 2023

We could try to detect this block of unicode punctuation (https://www.compart.com/en/unicode/block/U+2000), which contains the ellipsis and other punctuation marks.
All these unicode punctuation marks start with 0xE2, and the middle hex value is either 0x80 or 0x81.

@jonmarrs
Copy link
Contributor

jonmarrs commented Sep 1, 2023

Here is my proposed solution for selectively filtering UTF-8 punctuation: master...jonmarrs:damus:2023-08-hashtag-linked-punctuation

@jonmarrs
Copy link
Contributor

jonmarrs commented Sep 1, 2023

My new code passes the hashtag test cases. UTF-8 characters are included in hashtags, but UTF-8 punctuation is removed.
image

@dmnyc
Copy link
Author

dmnyc commented Sep 1, 2023

Out of curiosity, is there any reason to include non-alphanumeric characters in hashtags?

@jonmarrs
Copy link
Contributor

jonmarrs commented Sep 1, 2023

We probably want to enable the broadest use of hashtags as possible, in as many languages as possible. Filtering out the few things we don't want in hashtags will be a more efficient strategy than checking for the many things we do want in hashtags. We want to support hashtags with alphanumeric characters in both ASCII and UTF-8. So I think we should only try to detect ASCII punctuation, UTF-8 punctuation, and whitespace. Basically everything else will be included. Although there is a slight inconsistency in handling currency symbols.
image

@jonmarrs
Copy link
Contributor

jonmarrs commented Sep 1, 2023

Looks like Twitter / X also leaves out currency symbols.
image

@jonmarrs
Copy link
Contributor

jonmarrs commented Sep 1, 2023

Should we filter out currency symbols from hashtags as well?

@jonmarrs
Copy link
Contributor

jonmarrs commented Sep 1, 2023

It seems like the overall strategy should be to decide which UTF-8 categories/blocks we want to filter out from hashtags.
Here is a list of unicode blocks: https://www.compart.com/en/unicode/block

Right now I am filtering out General Punctuation.
I could also filter out Currency Symbols.
Anything else I should filter out?

image

@jonmarrs
Copy link
Contributor

jonmarrs commented Sep 1, 2023

Here are some more test results.

Twitter: https://twitter.com/Martian_BTC/status/1697676619530506353?s=20
image

Damus (iPhone app): note1l6x97pcynhvrfzpfg90pnkxa9tl3dhtwek3dgdray2pqmhr6x3kqadea4z
image

My patch as it is now:
image

Clearly there are some inconsistencies. As far as I know, there is no ISO standard for hashtags. Should the goal be for Damus to replicate how Twitter/X handles hashtags as closely as possible?

jonmarrs added a commit to jonmarrs/damus that referenced this issue Sep 1, 2023
…phanumeric characters such as punctuation marks.

Closes: damus-io#1518
@jb55
Copy link
Collaborator

jb55 commented Sep 2, 2023 via email

@jonmarrs
Copy link
Contributor

Please review my pull request (#1546), which I believe closes this issue.

jonmarrs added a commit to jonmarrs/damus that referenced this issue Sep 13, 2023
Check for UTF-8 punctuation (such as ellipsis) in addition to regular punctuation in hashtags.

Closes: damus-io#1518
@jb55 jb55 closed this as completed in 617dee3 Sep 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants