Environment Variables #42
Comments
2100 mkdir demo; constants.py: import os; things.py: import demo.constants
@SantiagoTorres the options to specify the constants have been added to the parsers branch. How should the usage of each variable be explained to users, so that they don't have to look through every script? Should we write a doc, or is there already one?
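One possible way to keep the variables self-documenting is to store each default next to a one-line description in the constants module, so the same table can feed a README or a parser's help text. This is only a sketch: the variable names, defaults, and helper functions below are hypothetical and are not taken from the parsers branch.

```python
import os

# Hypothetical variable names and defaults, for illustration only.
# Each entry maps a name to (default value, description).
SETTINGS = {
    "DEMO_DB_PATH": ("packages.db", "Path to the SQLite database file."),
    "DEMO_DUMP_DIR": ("dumps", "Directory where package list dumps are stored."),
}


def get_setting(name, environ=None):
    """Return the value of a setting, falling back to its documented default."""
    environ = os.environ if environ is None else environ
    default, _description = SETTINGS[name]
    return environ.get(name, default)


def describe_settings():
    """Render one line per setting, suitable for a README or a help epilog."""
    return "\n".join(
        f"{name} (default: {default!r}): {description}"
        for name, (default, description) in SETTINGS.items()
    )
```

Scripts would then call get_setting("DEMO_DB_PATH") instead of reading os.environ directly, and describe_settings() gives users a single place to look.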
@SahithiKasim @absol27 When we test-run the publish scripts, it takes about 40 minutes to parse the publish packages for just one date. Most of that time is spent parsing the main dump file for that date. Since we have finished our other tasks, should we try to optimize the script's speed, or leave it as it is?
In this function, a commit is made to the db after each package is picked up. We read up on sqlite3, and it is said to be able to run 50000 statements per second. We wonder whether we should optimize the code by bundling statements and reducing the number of commits, or whether the current speed is good enough (16 hours for all 5 years of data, on 8GB RAM). We are not very proficient with databases, so professor @sbrunswi, could you take a quick look?
The main dump file is by far the biggest of the three files for a given date. I suspect that the problem is less the processing of the file than the downloading of the package list dumps. I've added a condition to retry until the download succeeds, but I noticed that it can stay stuck in the error loop for a while. Terminating and restarting the program solves the issue; I've discussed this with @VinhPham2106. Please let me know if I'm wrong about this. I believe the code could be optimized, but I would recommend first checking what the package list dumps are and how they are structured, to understand what each component means (for example, why 'parsed_package["provides"]' needs to be processed differently).
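If the retry loop is what gets stuck, one option is to cap the number of attempts and wait longer between them, so a bad download fails loudly instead of looping. The sketch below is not the project's code: download_with_retry and its parameters are hypothetical, and fetch stands for whatever function currently downloads a dump.

```python
import time


def download_with_retry(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_attempts is reached.

    fetch is any callable that returns the downloaded data or raises OSError
    (urllib.error.URLError is a subclass of OSError).
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except OSError as error:
            last_error = error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts") from last_error
```

Logging the url and the error on each failed attempt would also show whether the 40 minutes is spent downloading or parsing.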
Btw, committing multiple write queries in a single commit is definitely one optimization; there's no reason to have one write query per commit.
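A minimal sketch of that batching with Python's sqlite3 module is below. The packages table, its columns, and the insert_packages helper are assumptions made for illustration; the real schema in the publish scripts may differ.

```python
import sqlite3

# Hypothetical table and columns, for illustration only.
INSERT_SQL = "INSERT INTO packages (name, version) VALUES (?, ?)"


def insert_packages(connection, rows, batch_size=1000):
    """Insert (name, version) rows, committing once per batch instead of once per row."""
    cursor = connection.cursor()
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            cursor.executemany(INSERT_SQL, batch)
            connection.commit()
            batch.clear()
    # Flush whatever is left over after the last full batch.
    if batch:
        cursor.executemany(INSERT_SQL, batch)
        connection.commit()
```

A larger batch_size means fewer commits but more work lost if the script is interrupted mid-batch, so the value is a trade-off worth measuring on one date's dump before rerunning all five years.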