None of the regexes match emoji, and only emoji #174
Comments
We were using emojibase-regex to match emoji within messages. However, the docs (https://emojibase.dev/docs/regex/) state that this regex matches both emoji and text presentation characters. This is not what we want, and will result in false positives for characters like '↔' that could turn into an emoji if paired with a variation selector. Unfortunately, none of the other regexes provided by Emojibase do what we want either (milesj/emojibase#174). In the meantime, browser support for the RGI_Emoji character sequence class has made it feasible to write an emoji regex by hand, so that's what I've done.
I'll be honest, it's been so long since I've worked on this emoji stuff that I've forgotten a lot of how it works. I always have to re-learn the codebase each time I update it, so I'm sure there are bugs everywhere. With that said, I am tinkering with the regexes here: #175
So after looking at this post and the code again, this assumption is correct in how it works. It's by design.
I also use regexgen (https://github.com/devongovett/regexgen) to generate the regex pattern, and it does not support negative lookaheads. I'm not aware of another library to handle this and I'm definitely not going to write it from scratch. There is a regex using Unicode properties, but I haven't tested it in years: https://emojibase.dev/docs/regex#unicode-property-support
Been thinking about this more, and I think we could solve this by using functions, like |
Re: the Unicode properties approach, I was happy to discover that the new RegExp v flag makes it possible to write this by hand:

/\p{RGI_Emoji}(?!\uFE0E)(?:(?<!\uFE0F)\uFE0F)?/v

All major browsers support it, though only as of late 2023. You can get a version that kinda sorta works while only using |
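For anyone who wants to poke at that pattern, here is a minimal sketch of how it could be exercised. The names `EMOJI_SOURCE`, `buildEmojiRegex` and `containsEmoji` are made up for illustration and are not part of Emojibase; only the pattern itself comes from the comment above. It is built with `new RegExp(..., 'v')` rather than a literal, on the assumption that you want engines without the `v` flag to fail with a catchable runtime error instead of a parse error.

```javascript
// Pattern quoted above. It needs the RegExp `v` flag, because
// \p{RGI_Emoji} is a property of strings rather than of single code points.
const EMOJI_SOURCE = String.raw`\p{RGI_Emoji}(?!\uFE0E)(?:(?<!\uFE0F)\uFE0F)?`;

// Hypothetical helper: returns the regex, or null on engines without `v` support.
function buildEmojiRegex() {
  try {
    return new RegExp(EMOJI_SOURCE, 'v');
  } catch {
    return null;
  }
}

const emojiRegex = buildEmojiRegex();

// Hypothetical helper: true if `text` contains at least one emoji-presentation sequence.
function containsEmoji(text) {
  if (emojiRegex === null) {
    throw new Error('This engine does not support the RegExp v flag');
  }
  return emojiRegex.test(text);
}

if (emojiRegex !== null) {
  console.log(containsEmoji('\u2194'));        // '↔' alone, text presentation: false
  console.log(containsEmoji('\u2194\uFE0F'));  // '↔' + U+FE0F, emoji presentation: true
  console.log(containsEmoji('\u2728'));        // '✨', emoji presentation by default: true
  console.log(containsEmoji('\u2728\uFE0E'));  // '✨' + U+FE0E, forced to text: false
}
```

The regex is deliberately not global, so repeated `test` calls do not carry `lastIndex` state between inputs.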
Nice, good to know! Been waiting years for all those to become available.
* Don't consider textual characters to be emoji

  We were using emojibase-regex to match emoji within messages. However, the docs (https://emojibase.dev/docs/regex/) state that this regex matches both emoji and text presentation characters. This is not what we want, and will result in false positives for characters like '↔' that could turn into an emoji if paired with a variation selector. Unfortunately, none of the other regexes provided by Emojibase do what we want either (milesj/emojibase#174). In the meantime, browser support for the RGI_Emoji character sequence class has made it feasible to write an emoji regex by hand, so that's what I've done.

* Add a fallback for BIGEMOJI_REGEX as well
A regex that matches emoji would be a really useful thing to have in the JS ecosystem! Unfortunately, between Emojibase and emoji-regex, I still haven't seen a package that actually does this. In the case of Emojibase:

* emojibase-regex matches some textual characters such as '↔'.
* emojibase-regex/emoji doesn't match emoji without U+FE0F, such as '✨'.
* emojibase-regex/emoji-loose matches some textual characters without U+FE0E, such as '↔'.

What's missing is a regex that matches exactly those character sequences that are presented to users as emoji. Some characters are defined in Unicode to default to emoji presentation (see the Emoji_Presentation section), while others require U+FE0F to change their presentation mode. A correct implementation would account for both of these facts, and use a negative lookahead to avoid matching characters with U+FE0E.
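To illustrate the two rules described above, here is a rough sketch restricted to single code points. It does not attempt ZWJ, flag, keycap, or skin-tone modifier sequences, so it is not a complete emoji matcher. The names `SINGLE_EMOJI` and `isSingleEmoji` are assumptions for illustration only. The sketch relies on the standard `\p{Emoji}` and `\p{Emoji_Presentation}` property escapes, which work with the `u` flag.

```javascript
// Sketch only: single code points, no multi-code-point sequences.
//  1. A character with default emoji presentation, unless U+FE0E
//     (text variation selector) follows it.
//  2. Any Emoji character explicitly followed by U+FE0F
//     (emoji variation selector).
// Caveat: \p{Emoji} also covers digits, '#' and '*', so '1\uFE0F' matches
// branch 2 even though it is only a fragment of a keycap sequence.
const SINGLE_EMOJI = /\p{Emoji_Presentation}(?!\uFE0E)|\p{Emoji}\uFE0F/u;

// Hypothetical helper wrapping the regex.
function isSingleEmoji(text) {
  return SINGLE_EMOJI.test(text);
}

console.log(isSingleEmoji('\u2194'));        // '↔' alone: false
console.log(isSingleEmoji('\u2194\uFE0F'));  // '↔' + U+FE0F: true
console.log(isSingleEmoji('\u2728'));        // '✨': true
console.log(isSingleEmoji('\u2728\uFE0E'));  // '✨' + U+FE0E: false
```

The negative lookahead sits only on the first branch, since the second branch already requires U+FE0F immediately after the character.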