Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert JSON data to SQLite #96

Closed
2 tasks done
andrewtavis opened this issue Jan 8, 2022 · 67 comments
Closed
2 tasks done

Convert JSON data to SQLite #96

andrewtavis opened this issue Jan 8, 2022 · 67 comments
Assignees
Labels
-next release- Included in the next release bug Something isn't working help wanted Extra attention is needed

Comments

@andrewtavis
Copy link
Member

andrewtavis commented Jan 8, 2022

Terms

Behavior

There’s currently a bug with v1.1.0 where the Russian keyboard will not translate words. The keyboard switches to English as it should, and words can be typed, but then on pressing return the keyboard crashes without anything being inserted.

Device type

iPhone 7

iOS version

iOS 15.1

@andrewtavis andrewtavis added bug Something isn't working help wanted Extra attention is needed labels Jan 8, 2022
@andrewtavis andrewtavis changed the title Russian translation not working Russian translation crashes keyboard Jan 8, 2022
@andrewtavis
Copy link
Member Author

andrewtavis commented Jan 8, 2022

Is not reproducible on an iPhone 7 in Simulator, but then the version is 15.2.

Other translation features work fine, so shouldn't be an issue with accessing the data as the size is relatively similar (although Russian has the largest translation file by 0.6MB).

@andrewtavis andrewtavis added -next release- Included in the next release and removed -next release- Included in the next release labels Jan 8, 2022
@andrewtavis andrewtavis changed the title Russian translation crashes keyboard iPhone 7 Russian translation crashes keyboard Jan 10, 2022
@andrewtavis andrewtavis changed the title iPhone 7 Russian translation crashes keyboard Russian translation crashes keyboard Sep 3, 2022
@japanese-goblinn
Copy link
Contributor

I may take a look on this issue

@andrewtavis
Copy link
Member Author

This would be absolutely amazing! 🚀🚀 I really haven't had the bandwidth to dedicate time to this, but it's the main bug right now that dramatically affects the functionality. Let me know if you have ideas on this!

@japanese-goblinn
Copy link
Contributor

japanese-goblinn commented Oct 5, 2022

Yep, assign me on this please and I'll let you know on my progress with this issue

@andrewtavis
Copy link
Member Author

Assigned! Thank you 😊

@andrewtavis
Copy link
Member Author

andrewtavis commented Oct 11, 2022

@japanese-goblinn, reporting that apparently v2.0 Made this even worse, as basically the Russian keyboard crashes from the start at this point 🤦‍♂️ This is not happening on the emulator, and sadly we don’t have device level testing happening at this point. The crashes that other keyboards were experiencing have actually gotten better.

I’m wondering if this is maybe an issue with the amount of data that Russian’s ingesting? All the other keyboards have similar amounts of data, but Russian has those almost 200k bound. Could be crashing because it can’t handle it?

@japanese-goblinn
Copy link
Contributor

@andrewtavis I've looked intro this crash and will soon describe here what I've found

@andrewtavis
Copy link
Member Author

Thanks so much, @japanese-goblinn! Really looking forward to what you’ve figured out :)

@andrewtavis
Copy link
Member Author

andrewtavis commented Oct 13, 2022

@japanese-goblinn, note that I just checked and the new "lack of" behavior is reproducible sometimes on the Simulator, I just didn't check Russian after all these changes. Am looking into this as well now :)

@japanese-goblinn
Copy link
Contributor

japanese-goblinn commented Oct 13, 2022

Well, I managed to get this error while debugging, but not really sure how to reproduce it (screenshots below). I want to take a deeper look into this bug but unfortunately I don't have time for now.

This is what I found so far:

You sad that file is about 0.6 MB when it's actually 15.1 MB. Here's code how I measured it

// func loadJSONToDict
let data = try Data(contentsOf: url)
let f = ByteCountFormatter()
f.allowedUnits = [.useMB]
f.countStyle = .memory  
print("[DEBUG]: file \(fileName) size \(f.string(fromByteCount: Int64(data.count)))")

It's not much but it's a measurement of raw data, so it's less that Dictionary will occupy (not sure how bad it is, you need to figure it out yourself).

I also tried to use profiler but that was not really helpful, it's only showed that a really slow disc read is happening but you can see it yourself - keyboard is taking 2-3 seconds to startup (at least in dev mode).

As a conclusion I'm not sure if slow app startup is causing sometimes to crash, because you can't be absolutely sure if system will kill your application due to slow start. What I'm absolutely sure about is that you need to rework your data structures and whole approach of working with data (read only batches not whole file, using DB, etc.) and I can sure your that application will become event better and hopefully you'll manage to fix this bug!

I hope you luck in this hard task and thank you for possibility to contribute!

IMAGE 2022-10-13 20:14:23
IMAGE 2022-10-13 20:14:26

@andrewtavis
Copy link
Member Author

andrewtavis commented Oct 14, 2022

@japanese-goblinn, wanted to thank you for your contributions! #231 is an initial step towards working on this in my opinion, as we really should be leveraging a library for JSON loading. Hopefully this will make the read only batches you suggested or another solution for the data process easier going forward 😊

All the best, and your contributions are welcome whenever you have time! :)

andrewtavis added a commit that referenced this issue Oct 16, 2022
#96 add SwiftyJSON for data and refactor code
@andrewtavis
Copy link
Member Author

We'll likely need to be using Core Data for this in the future, but a WIP step to help the situation would be to remove the indentation from all JSON files to make them smaller in size.

@andrewtavis
Copy link
Member Author

@SaurabhJamadagni, FYI for this issue, I have the JSON files formatted to remove spaces, which really does help as far as app size is concerned. Will push those with the QWERTY French keyboard for #229 :)

It sadly doesn’t seem to help to much with load times, as it’s likely the crazy number of lookups being done that’s problematic. I’m going to a Berlin Swift coding group on Thursday and will discuss with people how best to implement Core Data or another solution that will help us now directly load all these dictionaries into memory all at once 😊

@wkyoshida
Copy link
Member

Hey! 👋

I'd say this would be once Wikidata gets to a point where we can get translations from there, right?

Yeah, I'm thinking this could wait until we're able to do cross-translations - in whatever way this is accomplished in scribe-org/Scribe-Data#23. I was more so pointing out perhaps that the current translations tables with the word-translation rows do limit us to only the English-to-target language translation, which is fine for now of course, since that is the only translation data that we have.

Selective updating would also likely be easiest done via .sqlite, or would you suggest something else?

This is what I want to make sure to think through. I think there will be three steps that are more time-heavy or resource-heavy that happen when a data download is done:

  1. Scribe-Server generates the data pack
  2. Scribe-Server sends the pack to the client that requested the pack
  3. The client unpacks/processes the pack into the client

For these steps, the file format should facilitate an efficient download workflow. FWIW I am currently thinking .sqlite though, because, for the aforementioned three steps:

  1. Getting the data needed based off of last_updated I believe will be faster if Scribe-Server is already storing the data in a DB
  2. If using .json, this would mean having to send multiple .json per language. On the other hand, there is one .sqlite per language.
    • However, if multiple languages are getting downloaded, this would still mean multiple .sqlite
  3. If using .json, the client would have to retain the ability to read the data as .json to then write it to .sqlite

I'd say definitely as we extrapolate out we can keep the option for JSONs and .sqlite file exports blush Why not, if it's not too hard to maintain

I guess the reservation that I'm having is more so coming from putting myself in the perspective of a user of Scribe-Data (as a general-use package). Like, would I care enough about the .sqlite option? Would I use it? If I don't, would I be bothered with the extra bloat of SQLite functionality that comes packaged in?

Maybe it's fine actually. I'm not definitely opposed. I'm perhaps more trying to think from the mindset that, as a user of packages, I would like my packages to do what I need them to, be small and compact, and have a clear use/purpose - if that makes sense.

Apps should either be just getting a whole new updated .sqlite files from this point on, or update certain rows of it based on last_updated, but let me know if this sentence makes sense 🙏

I'm with you! 👍

@andrewtavis
Copy link
Member Author

I was more so pointing out perhaps that the current translations tables with the word-translation rows do limit us to only the English-to-target language translation, which is fine for now of course, since that is the only translation data that we have.

I'm definitely of the opinion that as soon as more options are available for translation that instead of translation as a column we'd have a language name which is what the word would be translated into. I think we're on the same page 😊 What'll need to happen is that the relationship between translations will need to be inverted based on what we have right now. Right now we have within the German files a file that links English to German, but what will be there instead will be a file where word is German words and each of the columns is what we've gotten as translations. Right now we are saving translation files with their target, but in the future it'd be those words that are typed on the keyboard that's switched to for during translation :)

Glad to hear that it sounds like .sqlite can be a solid foundation for the future! Let's keep discussing and readjusting, but for now we're sailing ahead ⛵➡️

Maybe it's fine actually. I'm not definitely opposed. I'm perhaps more trying to think from the mindset that, as a user of packages, I would like my packages to do what I need them to, be small and compact, and have a clear use/purpose - if that makes sense.

I think it also depends on who the people are that first want to use it for non-Scribe purposes, but definitely happy to remove the bloat and even have a lot of the multi-use stuff be in Server for a wider community where Scribe-Data is more just what we need :)

I'm with you! 👍

🚀🚀🚀😊

@andrewtavis
Copy link
Member Author

andrewtavis commented Apr 13, 2023

Very simple question as well, @wkyoshida and also @SaurabhJamadagni: does my idea of versioning for all this make sense? Scribe-iOS is going to v2.2.0 as what the user is experiencing is just some new features like the emoji suggestions, etc — i.e. nothing "breaking" per say. The functionality of Scribe-Data is being totally changed though to export something different with tons of breaking changes, hence that's going to v3.0.0.

@andrewtavis
Copy link
Member Author

andrewtavis commented Apr 13, 2023

Oki doki 😊 Soooo, querying database values 😉😉😉

Here are the functions I'm making and a simple example:

import GRDB

// get_iso_code is new and returns the two letter code for the language (ex: Deutsch -> de)

/// Makes a connection to the language database given the value for controllerLanguage.
func openDBQueue() -> DatabaseQueue {
  let dbName = "\(String(describing: get_iso_code(keyboardLanguage: controllerLanguage).uppercased()))LanguageData"
  let dbPath = Bundle.main.path(forResource: dbName, ofType: "sqlite")! // ex: dbName = DELanguageData
  let db = try! DatabaseQueue(
    path: dbPath
  )

  return db
}

/// Returns a value from the language database given a query and arguemtns.
///
/// - Parameters
///   - query: the query to run against the language database.
///   - args: arguments to pass to the query.
///   - outputCols: the columns from which the value should come.
func queryDB(query: String, args: [String], outputCols: [String]) -> [String] {
  var outputValues = [String]()
  do {
    try languageDB.read { db in
      if let row = try Row.fetchOne(db, sql: query, arguments: StatementArguments(args)) {
        for col in outputCols {
          outputValues.append(row[col])
        }
      }
    }
  } catch {}

  return outputValues
}

var languageDB = try! DatabaseQueue() // instantiated for assignment within CommandVariables.swift
languageDB = openDBQueue() // on firstKeyboardLoad == true within loadKeys()

// Example in another file:
let query = "SELECT * FROM emoji_keywords WHERE word = ?"
let args = ["gesicht"]
let outputCols = ["emoji_1"]

spaceBar = queryDB(query: query, args: args, outputCols: outputCols)[0]

Which then gives us 🥁🥁🥁🥁:

Will do a PR soonish with the above a bit more organized 😊

@andrewtavis
Copy link
Member Author

andrewtavis commented Apr 13, 2023

I think that this is a good way of going about this, but feedback would be welcome :) Specifically we'll need to keep track of the index of certain columns passed within outputCols to queryDB to assign strings from the query results. I'd say this is an ok trade off for being able to very simply get individual values and whole rows for things like autosuggestions as well as check for if no results were returned.

@andrewtavis
Copy link
Member Author

andrewtavis commented Apr 15, 2023

Progress so far for me locally: all commands have been switched over 😊 So Translate, Conjugate and Plural are all directly from the SQLite databases, and I've actually removed the translations and verbs variables all together 🚀 Suggestions based on pronouns has also been switched over as that was referencing verbs, and the emoji autosuggestions/completions are also SQLite based 😱😊😊

I'm going to work on switching over autosuggestions right now. A note is that the way we do autocompletions is the thing that will need to make the biggest shift, and will likely need a custom query (totally fine that it would 🙃). We need to get rid of the autocompleteLexicon variable completely in my opinion. As discussed in #286, the creation of this variable in the firstKeyboardLoad step of laodKeys() was what was slowing down the initial load step, which is now solved as this logic has been moved out of the view controller. With that being said, there still is a slowdown of the UI for Russian in particular as the user types because we have this massive array on which the autocomplete checks are being made. Switching this to SQLite will doubtless speed this all up :) :)

@SaurabhJamadagni
Copy link
Collaborator

does my idea of versioning for all this make sense? Scribe-iOS is going to v2.2.0 as what the user is experiencing is just some new features like the emoji suggestions, etc

Hey @andrewtavis! Yeah that makes sense. With all the internal changes going on, Scribe-iOS version also seems like a big jump to me but I can see how the numbering would make sense from a user's perspective. Sounds great 😊

@andrewtavis
Copy link
Member Author

andrewtavis commented Apr 15, 2023

Remaining issues after the commit I'm about to do 😊

  • Test keyboards to make sure everything is working properly
  • Close this issue 😱🥳🥳🥳
  • New Scribe-Data release 🙌

@andrewtavis
Copy link
Member Author

andrewtavis commented Apr 15, 2023

Alrighty then! 3be5a59 is massive 😱😊 It also includes two bug fixes that can be seen in the commit logs for the CHAGELOG :) There are also minor fixes throughout as this required me to go through a lot of the codebase.

What we've all been waiting for 🙃:

  • The GRDB based SQLite functions are now in Keyboards/KeyboardsBase/LoadData.swift
    • I felt like putting the SQL stuff in this file made more sense
  • queryDBRow gets us values (we get a list that's one row of specified columns and can index it for individual values for commands/features)
  • writeDBRow allows us to add rows for data that didn't come through in the ETL/ELT process
    • German compound prepositions and the word Scribe that are added in expandLanguageDataset, specifically
  • createAutocompleteLexicon creates a single column table from the "keys"/primary keys of most of the other tables that we then query from for this feature
  • queryAutocompletions gets us an ordered group of three words from the autocomplete lexicon
  • Fixes throughout the command files and main keyboard view controller to implement all of the above functions

I've been doing lots of testing and it all seems to be working great so far aside from one issue: for some reason I can't get it so that lower case versions of German nouns (that are capitalized) are not in the autocomplete lexicon... This was a problem a long time ago when we first added autocomplete/autosuggest. What I means by this is that there are two versions of most German nouns: Buch and buch, Auto and auto, etc. This is because we get lower case from the autosuggestion keys and capitalized from Wikidata. At this point in createAutocompleteLexicon I'm trying to check if the capitalized or upper case version of a word is also in the nouns, and if so to take the noun instance. I'm likely just doing a simple SQL mistake and can't see it right now 🤷‍♂️ We need to get it fixed on the SQL side, but it can also be implemented another way if someone has an idea.

Most important thing: I really think that this has worked 😊😊😊 We'll see, but before doing all this there was some serious lag still while using the Russian keyboard. In the initial tests since the changes it seems to be very responsive 🔥🚀🔥➡️🌔😅

If someone has suggestions for the nouns I'd be grateful! Glad this is coming to a close 😊

@andrewtavis
Copy link
Member Author

andrewtavis commented Apr 16, 2023

func createAutocompleteLexicon() {

The above is the function for the autocomplete creation. Just to list it, this is the logic that I'm trying to put into the autocomplete lexicon:

  • Create a base column of noun, preposition, autosuggestion and emoji keys.
  • Remove words that have capitalized or upper case versions in the nouns table to make sure that just these versions are in autocomplete.
  • Filter out words that are less than three characters, numbers and hyphenated words.
    • This step also needs to be looked at as the numbers aren't being removed 🤔

We could also edit the drop duplicates function that's called after the UILexicon words (names from Contacts that are unordered) are added:

let dropNonUniqueAutosuggestionsQuery = """

@andrewtavis
Copy link
Member Author

andrewtavis commented Apr 16, 2023

An idea after a night’s sleep 😴💡: a self join to remove the lower case versions of capitalized/upper case words might make sense.

@andrewtavis
Copy link
Member Author

  • Filter out words that are less than three characters, numbers and hyphenated words.
    • This step also needs to be looked at as the numbers aren't being removed 🤔

This has been fixed at this point :) Hyphenated words have been removed, and I'm actually keeping numbers as I think they help match the emoji suggestions for 10 on a clock, etc. We're selecting words that are longer than two characters for everything except emojis for this purpose. We can change it later based on live testing 🙃

@andrewtavis
Copy link
Member Author

af553b4 and d9e3cb0 fixed the last glaring issues in all this 😊 I'm going to test a bit after lunch, then will close and move on to #284 🚀 We can do minor patches to fix any issues that come up, but now everything seems to be working fine :)

@andrewtavis
Copy link
Member Author

d9e3cb0 solution wasn't what we wanted, but 91b4791 fixes it :) As expected it was something pretty standard. Now what's happening in the selection step of the autocomplete lexicon query is we're joining on the nouns table as before, but now we're doing it twice — once when the lexicon word is the same as the noun when it's upper case and once when it's the same as the noun when it's capitalized. From there we do the same CASE WHEN checks we were doing before on the corresponding columns from nouns to only select upper case and capitalized versions if they exist 😊

@andrewtavis
Copy link
Member Author

With 6add031 we removed the JSON data 😊 I've been playing around with it all and it seems to be ok. Sent along some commits today as well that fixed some minor bugs with the emoji implementations, but nothing major 🙃

I think we're good to go! Closing this and switching over to #284!

@github-project-automation github-project-automation bot moved this from In Progress to Done in Scribe Board Apr 16, 2023
@andrewtavis
Copy link
Member Author

andrewtavis commented Apr 16, 2023

Thanks all for your support and ideas on all this! Major milestone for Scribe finished 😊 Here's hoping it'll work once we get it onto devices 🤞

@SaurabhJamadagni
Copy link
Collaborator

Hey @andrewtavis, sorry coming to this a bit late. I wish there was a way to star an issue because I am pretty sure I am going to be coming back here a lot. Loving the process documentation. I hate I was no help but thank you so much for all the changes! This is going to be awesome!! 🚀

@andrewtavis
Copy link
Member Author

There’s a little bit left for this to be pushed later on today, @SaurabhJamadagni, but I’ll send it along soon. There’s so much new data (as mentioned on Matrix 😉) that I actually needed to do further optimizations to what we had before 😊 Will send it along with #284 later :)

@andrewtavis
Copy link
Member Author

andrewtavis commented Apr 19, 2023

A final note on this issue is that the Russian keyboard's translation feature — the original issue that lead to all this — is now functional in v2.2.0 on device 😊😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
-next release- Included in the next release bug Something isn't working help wanted Extra attention is needed
Projects
Archived in project
Development

No branches or pull requests

4 participants