Emoji data for Scribe apps #14
Comments
Note that the final version of this would be included in update_data.py so that changes to the scripts would be reflected in the apps. |
Adding some easter eggs into this feature could also be a nice touch. These might be best as just autosuggestions and not completions. Some ideas are:
|
Hey @andrewtavis - I'd love to help out with this! I presume that this is unblocked now with the completion of scribe-org/Scribe-iOS#194 and scribe-org/Scribe-iOS#188 - would that be correct? Looking into it, I had some initial thoughts (mostly regarding the source for emoji data):
With all that said, I'd love to continue discussions on this. Curious on further thoughts! P.S. I do like the idea of using nltk for working with word stems 👍 Clarification on some terminology above:
|
Hey @wkyoshida, would be great to get your help with this, and as always just with your research you already have! 😊 Thanks for your efforts already - especially with all the formatting that you do that makes it quite easy to follow 🙏
This is now unblocked, correct 😊
Would be something we'd need to investigate, but there could be a way to do this 👍
I agree with the positive point, and would say that for now let's not worry too much about the edge cases in the planning phase :) There could be a way to only use keywords for very popular emojis or subset in a logical way, as we don't necessarily need autosuggestions for all emojis. With Scribe the big thing is to keep the single edge cases that we're dealing with to a minimum. There might even be some metadata on emojis on Wikidata that we could use 💡🤔 The big thing is that keywords are going to be very necessary for some, as your example of
I'd say let's limit ourselves to emoji and NLTK for now so that we're not overcomplicating it all :) :) This would definitely be something to look into if multilingual support grows or when Scribe adds the sought after English keyboard. I'd also sadly suggest that we not use the translations here for anything too serious. The whole point of those now is to get the functionality in, as it's very much Where to from here? 😊 I'm away for the weekend, just FYI, but if you wanted to start exploring the generation of an |
Let me know what your preferred involvement level on this would be, @wkyoshida 😊 Am slamming another project atm, but would be happy to set up some base codes for this in the coming days :) |
Hey @andrewtavis! I meant to reply earlier, but wanted to focus on the keyboard variants discussion first as that was higher priority. I can definitely help with the development for this though! Resuming:
For some added clarification, my initial point, perhaps, was more so cause the data file that the emoji project makes use of appears to be limited to only 1 "short name" per human language (some emojis have 2 for English). Browsing the file, to me it felt a little limiting if just going off of emoji, since, as a user, some words that I'd expect would trigger an emoji, would not. On the other hand though, if we look to really just get an MVP out with this issue, then I think that it'd probably be fine. Trigger word improvements could happen over time.
👍 One reason why EmojiTerra piqued my interest was the good amount of data that it has, including for multiple languages. It also has more "keywords", which I thought could make it a little less limiting than only going off of emoji. With that said, one thought that I had was that Scribe could try reaching out to the EmojiTerra team even (their contact) and asking if they'd be interested in helping populate WikiData. No idea if they would, but just a thought.
Completely agree. My earlier note was added more as a potential path, but I too think that the Scribe translation data shouldn't be used for this. As you pointed out, it is still very much in
Sounds great! That'd be welcomed. And again, I can definitely take it from there and help with development 👍 🚀 |
Great to talk to you about all this, @wkyoshida 😊
Let's start down this path, and maybe we'll discover that there are ways of expanding it that are easier :) This tends to be how things work around here, as we planned to just do autocomplete, and then it snowballed into that with autosuggest and everything else that v2.0 had ☃️😅
The detail you put into all this is much appreciated 🤝 Looking into EmojiTerra makes sense :) Let's definitely consider this as the main source for the extra meanings, and reaching out to them about a collaboration or working with Wikidata would also be great! Is there an effective way that we can access the data on EmojiTerra, or would the hope be that a migration to Wikidata happens? I can reach out to the Wikidata community about emojis too as they might be willing to help with a WIP query :)
I'll work on the base file for you this weekend! Hope your week is going well so far 😊 |
😆 Yeah - alright, I think that sounds fair. Let's go with an MVP, and improvements can then happen over time.
That would be the gap, I think. I'm not sure if I saw an easier way to access the data or either the source code for EmojiTerra. I was hoping a dump to Wikidata could be done, so Scribe could maintain its "powered-by-Wikimedia" status throughout the data it uses. As a last resort, of course, Scribe could always just scrape EmojiTerra, though I don't think that would be the ideal path.
That would be awesome! 🙌
Oh for sure - thanks! Hope yours is as well ✌️ |
Hey @andrewtavis, given the discussions had in scribe-org/Scribe-iOS#89 (mainly referring to the idea of using a DB on the back-end), should development for this issue hold off a bit? If the data structure/storage changes, I'm thinking whatever scripts get created would have to be modified anyways (which could likely be true for all other data extraction scripts). As this issue isn't high priority though, I was thinking it could perhaps wait (at least until it's determined if the data structure will change). Curious what your thoughts are on this as well. However, I am also thinking that Scribe could still follow through with trying to contact the EmojiTerra team - to at least get a little ahead. It could happen that by the time Scribe is ready to go develop on this issue, the data is already in Wikidata! 🙌 One can dream! 😆 If you'd like to delegate it off, I could give it a shot to contact EmojiTerra. It could also make sense to have yourself do it though as the main Scribe dev and representative. Just let me know! |
I think it does make sense to hold off on this for now, but reaching out to them could still be beneficial in the short term :) Do you want to reach out to me at the email that's in my profile about that? We can draft an email and I can CC you on it? I think it would make sense that it comes from me, but I'd be happy to loop you in 😊 Btw: new feedback from one of my coworkers who tried Scribe today is that he'd like QWERTY for all keyboards, including German. Looks like it'd make sense to have that option available for all keyboards :) :) |
@wkyoshida, scribe-org/Scribe-iOS#241 is the issue opened by a coworker about the keyboard layouts :) Well done predicting the importance of this 👏😊 |
Sounds great, @andrewtavis! I sent you an email just now. Let's draft something up 👍
🚀 I've always just sucked it up with whatever layout there was 😆, but if Scribe is able to provide the flexibility that comes with customization, I think that absolutely helps with those working in multiple languages. One interesting thing though is how this might play out with Scribe-Desktop. Reason being that, with Scribe-iOS and Scribe-Android, changing the keyboard layout can more easily be done since the keyboards are virtual. Scribe-Desktop, likely in most cases, will be dealing with physical keyboards. The bindings can be changed of course, but the physical printed characters, not so much. I still think Scribe Edit: To clarify, I think the ability to customize Scribe-Desktop keyboards to all use the same layout would be beneficial though, as the coworker already exemplified this in scribe-org/Scribe-iOS#241:
I'm thinking this would be even more so necessary in Scribe-Desktop, because the physical layout can't change. The "need or demand" I referred to earlier was more to the ability to customize to have different layouts per Scribe-Desktop keyboard. Made an edit above as well to wording from "could" to "should", because I think Scribe would have to consider the physical constraints of Scribe-Desktop. |
Will write back later in the week, @wkyoshida 😊🚀
Glad that you're keeping Scribe-Desktop in mind! Yes it'll definitely be more about keyboard shortcuts, so this won't be as much of an issue at first, but then for that the speed of it all will be important as it's easy enough to have Google Translate open or go to a website, so we need to give those instantaneous results and keep people in their workflow 💪
You bring up a good point here. Some people are using their keyboards with keys that are not exactly mapped to their keyboards - specifically our users working in a foreign country 😊 Wouldn't we be reading in the characters that the user is typing rather than the keys that they're pressing though? Not sure 🤔 Your thoughts on this would be very welcome :) |
Sounds good! I also replied to the points made here about Scribe-Desktop over on an issue on that repo instead, this discussion. Wanted to make sure we didn't derail too much from the discussion here about the emoji data 😅 |
Checking in with you here as I said via email, @wkyoshida, as it makes sense to keep it in GitHub if we won't reach out to EmojiTerra. Looking into the Unicode and EmojiTerra's sources a bit more, I came across Unicode CLDR, and specifically their annotations and cldr-json files. That looks to be everything we'd want, but then we'll need to check licensing and figure out if there's a good endpoint for all this 😊 Makes sense that there's someplace in code that all this lives :) :) Let me know what your thoughts are on using the above! |
Ah-ha! Alright - great find, @andrewtavis 🙌 Eventually I was able to get there also 😆 but I found Unicode's stuff perhaps not as easy to navigate/browse through imo (not sure if you felt the same). Had to move around subdomains under Anyways - I think those make sense! There does appear to be the ICU, which has official C/C++ and Java libraries that could be used to work with the CLDR data. There is also a list of related wrapper projects for other languages that Unicode links to as well. We could look into those. |
Ya the Unicode stuff was very hard to get through, @wkyoshida 😅 I was at one point trying to put The ICU looks like a good path going forward :) We can maybe look first into pyicu to see if it can work. I see from your profile that you have C/C++ and Java experience, but maybe trying to keep it all under one language right now would be best? |
😆
Oh yes! I would agree. I didn't mean to sound like I was leaning for the C/C++ or Java ones, if it did 😆 Only mentioned them, since they appear to be the officially supported ones by Unicode. I was thinking that the Python wrapper would make sense as well. It does look like it is actively maintained, and according to this statement, I would hope pretty on-par with whatever is available in the official libraries? Keeping with Python would make sense as that's what the other scripts in Scribe-Data already use. |
So we have ourselves a decision on this! 😊 Fantastic. Let's read through the docs, and I'll make us a notebook to work from tomorrow :) |
Hey @andrewtavis, real quick just to clarify something for me. Also sharing some context from our quick email chat just so it's here for others to understand:
From the above, was it suggested that Scribe could use Unicode directly for the emoji data or that Scribe could use Unicode to fill in the gaps in Wikidata (so Scribe is then able to later use the data from Wikidata also)? Mostly asking, because I'm thinking that, with the former, implementing something now could still run into that earlier point that we discussed of potentially having to modify the data extraction scripts later. That would be due to a possible change in the data solution. The question then, I guess, could be: Is the emoji functionality a priority enough today that Scribe is okay with perhaps dealing with data extraction rework later? Or is there enough reason to first more concretely discuss the long-term strategy for the data solution's back-end? We can have discussions in scribe-org/Scribe-iOS#89 as well, wherever the topic makes the most sense. For transparency, I don't mean to sound definitively opposed to working on emoji data atm 😆 more so just trying to better understand in terms of priority and roadmap. |
The suggestion was that we just go directly from Unicode. This would be a mass upload, and this is something that the Wikidata community is very much against. We'd need to be willing to administer the data that was added at the very least, but even then it likely wouldn't get their blessing as the content to admin ratio would be skewed by this. I'd say we still do something now, and we can switch it over to Wikidata later when there's more support. Or maybe we just won't. There's no necessity that we have to use Wikimedia for everything :) With there being such a universal solution for working with Unicode data, it could be years until it's properly implemented in Wikidata. Thanks for checking on this! Will get to that script for us to explore a bit in the next few days 😊 |
Sounds good! Gotchu - that makes sense to me 👍
is on the Scribe data-hosting side - related to the discussion we were having on using GitHub vs Toolforge vs something else, etc.. I was more referring, I guess, to the idea of potentially leveraging a DB on the back-end side also as opposed to JSONs in the Scribe-Data repo as the centralized location where Scribe data is hosted. Are you thinking as well of already going forward with getting the emoji data before? I'm thinking that would be fine; just wanted to throw the following for consideration as well. For any future data extraction work - be it emoji data, new keyboard data, translation data, etc. - could it make sense to prioritize figuring out the centralized data-hosting first, since moving to a DB could impact how Scribe handles, stores, and provisions data? How I think I am understanding, prioritized Scribe work is:
After that, however, I see:
I might also have some of the work prioritization misunderstood; so please keep me honest! 😆 If getting in the emoji feature and/or new keyboards is priority over investigating a possible non-GitHub data-hosting option, let me know. |
Hey @wkyoshida 👋😊 I'm looking into this a bit more now that #21 is all done. Basic thing I can say is that pyicu doesn't seem to be what we want. We're looking for the CLDR data specifically, and that package doesn't have access by the looks of it 🤔 Looking at it, if we want the fastest/easiest way to do this, it might be to simply download the files we need from unicode-org/cldr-json and write the appropriate Python script to invert the JSONs into word -> emoji pairs. As of now it doesn't look like there's a good Python tool, but then I might be missing it. What do you think of this quicker solution? :) |
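The inversion idea above could look roughly like the following. This is only a sketch: the nested `annotations` -> `annotations` -> emoji -> `default` layout is an assumption about how the cldr-json annotation files are shaped, the sample entries are made up for illustration, and `invert_annotations` is a hypothetical helper name rather than anything that exists in Scribe-Data.

```python
import json
from collections import defaultdict

# Illustrative sample only. The nesting mirrors what a cldr-json
# annotations file is ASSUMED to look like; the entries are made up.
SAMPLE = """
{
  "annotations": {
    "annotations": {
      "😀": {"default": ["face", "grin"], "tts": ["grinning face"]},
      "😄": {"default": ["face", "smile"], "tts": ["grinning face with smiling eyes"]},
      "🌲": {"default": ["tree"], "tts": ["evergreen tree"]}
    }
  }
}
"""


def invert_annotations(raw_json):
    """Turn emoji -> keywords into word -> emojis (hypothetical helper)."""
    emoji_to_entry = json.loads(raw_json)["annotations"]["annotations"]
    word_to_emojis = defaultdict(list)
    for emoji, entry in emoji_to_entry.items():
        for keyword in entry.get("default", []):
            word = keyword.lower()
            # Keep first-seen order and avoid listing an emoji twice.
            if emoji not in word_to_emojis[word]:
                word_to_emojis[word].append(emoji)
    return dict(word_to_emojis)


word_to_emojis = invert_annotations(SAMPLE)
print(word_to_emojis["face"])
```

In practice the string would be read from a downloaded file instead of `SAMPLE`, and the result written back out as a per-language JSON.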
I'm thinking that a scenario that could call for Scribe to jump more aggressively into SQLite is if downloads actually get too large. As we've discussed, an idea could be that only the diff gets downloaded and Scribe leverages something like a
Yeah.. I took a look as well and wasn't really finding what Scribe needs either, which is unfortunate. I'm also thinking that perhaps just downloading the unicode-org/cldr-json files might be the path-forward here. I have a potential idea on how to do it; I'll open up a PR shortly for it. |
Would be 100% fine with this :) Big thing is that I think it makes sense that we never have a direct connection to Wikidata, but instead are keeping the prepared data somewhere and accessing it. Specifically what I mean is that we're giving people pre-prepared packs, not that one person has the most up to date version of the data whenever they do the update. This is how it works for apps like OpenStreetMap wrappers, so I think we should follow this example 😊
Great! Looking forward! 🚀🚀 Anything code based that allows us to update the JSONs we have locally within a workflow would be very very welcome 😊 |
Also, @wkyoshida, 3c111c9 added a roadmap to the readme (I also added it to iOS). I think it makes sense to include it :) Let me know if you have any feedback on it 😊 Main question I have as far as version IDs are concerned is whether adding the ability to choose the base language the user translates from is a major release? As of now I have it as v2.4.0, but then we'll doubtless be doing lots of other features once the menu's done, and even that by itself is maybe 3.0 due to all the refactoring 🤔 |
I agree! 🙌
That's a good thought - hmm. I'm thinking that it potentially could as well actually. It can also get tricky thinking what version number features will end up falling under imo 😆 If the organization As far as what is on the roadmap though - I think it makes sense 👍 |
We have projects for iOS, but as of now I haven't really found them useful... I think that as the team grows the organization wide ones would definitely be the way to go though 🚀 I'm also happy to implement them now if you'd find them helpful, and agree that referencing a project for a point on the roadmap would be more descriptive :)
I'll update it to v3.0.0 right now, but let's definitely keep in mind that this is a WIP presentation of what we want to work on 😊 Happy to expand on it later as we implement projects and other organization tools :) |
Sounds good, @andrewtavis 👍 🙌 Regarding the
I don't think there's a dire need for it - only thought of this because referencing an issues board, in my mind, seemed easier to reference and work with than a README section. |
This does make sense, @wkyoshida 😊 I haven't really looked into the features of the new projects, and referencing that would be easier than constantly needing to update the readme based on how things change :) At least writing it we have a better idea of how to structure the projects going forward 🤓 Will remove the projects from iOS as a first step, and then we can go from there. You'd suggest that all the projects be at the organization level, right? Makes sense to me that we have one place for organizing the various elements, as there doubtless are ways to filter the projects to see what you're interested in :) |
Yea ✌️ Also, since I'm thinking that related work across repos can be under the same groupings too - one can easily see they go together. Like with the below where there is both Scribe-iOS and Scribe-Data work:
|
Sounds great 😊 Give me a few days and I'll try to get them up and running 🚀 |
Bringing the discussion back in here, @wkyoshida :) What should we look to do now? It looks like we might be able to work on the basis of scribe-org/Scribe-iOS#51 right now, as the main focus here is getting the order down given popularity? Plan is to definitely include all of this in v2.2.0, btw 😊 Really looking forward to adding such a key feature! |
Yep! I'm definitely thinking that getting the popularity worked in will be enough to give Scribe a good starting base for the emoji feature. I'm planning on looking into that today, and hopefully we can get that this week. Working in nltk and some other enhancements can happen later, I think. |
Looking forward to the PR, @wkyoshida! I’ll upload the datasets to iOS after and then start working on adding the references to them. We definitely need to have the SQLite feature finished to release this as we can’t add more data without it at this point, but having this done will make v2.2.0 all the closer :) I think that referencing NLTK can definitely wait till we have something out that we can test, and then iterate from there 😊 Adding a popularity score into the output would be enough so that we can order it on the Swift side, as I figure ordering the results here would leave us open to it not being maintained when it’s added to the DB. |
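A rough sketch of the "popularity score in the output" idea, so that the consumer can do the ordering itself. Everything here is hypothetical: the `rank` field name, the `attach_rank` helper, the default rank, and the sample popularity numbers are made up for illustration and are not the actual Scribe-Data output format.

```python
import json

# Hypothetical inputs: a word -> emojis mapping and a made-up popularity
# ranking where 1 means most popular.
word_to_emojis = {"smile": ["😄", "🙂"], "tree": ["🌲", "🎄"]}
popularity_rank = {"🙂": 1, "😄": 2, "🌲": 3}


def attach_rank(word_to_emojis, popularity_rank, default_rank=10_000):
    """Store a rank next to each emoji instead of relying on list order."""
    return {
        word: [
            {"emoji": emoji, "rank": popularity_rank.get(emoji, default_rank)}
            for emoji in emojis
        ]
        for word, emojis in word_to_emojis.items()
    }


ranked = attach_rank(word_to_emojis, popularity_rank)
print(json.dumps(ranked, ensure_ascii=False, indent=2))

# The consumer (the Swift side in the thread) would sort by rank itself.
best_for_smile = min(ranked["smile"], key=lambda item: item["rank"])["emoji"]
```

Carrying the rank explicitly means the ordering survives even if the storage layer does not preserve list order.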
@wkyoshida, I believe that ef8b1e5 is the final touch of adding the output path to Scribe-iOS 😊 Will push the data to that repo now, and then we'll be set to start work on scribe-org/Scribe-iOS#51 🚀 Let me know if you can think of anything else that's needed for the MVP for this 🙃 Really great work as always! |
Some thoughts around this are coming to mind now. I'll add them in #26 though
I'm thinking of still adding some functionality here and there, e.g. #28 and nltk and etc, but I think we could be pretty good for the MVP! Or close (could depend on the additional thoughts I mentioned above)! |
Terms
Description
Now that Scribe-iOS is adding autosuggest in #194 and autocomplete in #188, other additions for these features could be considered including emojis. This feature would be to create python scripts that would create arrays of words and emojis that can represent them. When one of these words is entered, the emojis could then serve as autosuggestions or completions (depending on if the user has pressed space or not respectively).
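As a minimal sketch of the lookup described above, assuming a simple word -> emojis dictionary: a trailing space means the last word is finished, so its emojis are offered as autosuggestions; otherwise the partial word is prefix-matched for completions. `EMOJI_KEYWORDS` and `emoji_candidates` are hypothetical names, and the real keyboards would implement this in Swift rather than Python.

```python
# Hypothetical sample data: word -> emojis that can represent it.
EMOJI_KEYWORDS = {"smile": ["😄"], "smiley": ["😃"], "tree": ["🌲"]}


def emoji_candidates(text, keywords=EMOJI_KEYWORDS):
    """Return (mode, emojis) for the text typed so far."""
    words = text.split()
    if not words:
        return ("none", [])
    last = words[-1].lower()

    if text.endswith(" "):
        # Space pressed: the word is complete, so look it up exactly.
        return ("autosuggest", list(keywords.get(last, [])))

    # No space yet: treat the last token as a prefix of possible words.
    matches = []
    for word in sorted(keywords):
        if word.startswith(last):
            for emoji in keywords[word]:
                if emoji not in matches:
                    matches.append(emoji)
    return ("completion", matches)


print(emoji_candidates("I smile "))
print(emoji_candidates("I smi"))
```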
These scripts would make use of emoji for the representations of emojis as words in the given languages, and would also make use of nltk for stemming (not lemmatizing) words to derive "smile" from "smiling". Other language packages could be leveraged to derive adjective forms like "smiley", and convert words like types of trees (aspen, birch, etc) to "tree" so they can also be converted (likely another issue).
Contribution
I'd be happy to work with someone on this once the features in Scribe-iOS are finished :)
The corresponding Scribe-iOS issue for this is #51, which would need to be worked on in the foreseeable future for this issue to make sense.