Environment Variables #42

Open · SahithiKasim opened this issue Sep 25, 2023 · 6 comments · May be fixed by #45
@SahithiKasim (Collaborator) commented Sep 25, 2023

  • Add required environment variables to handle the paths to the database.
@SahithiKasim (Collaborator, Author)
Shell session:

mkdir demo
cd demo/
vim constants.py
touch __init__.py
cd ..
python
vim demo/constants.py
vim things.py
python things.py
LOC="salut" python things.py
export LOC="hola mundo"
python things.py
unset LOC
python things.py

constants.py:

import os

# Use the LOC environment variable if it is set, otherwise fall back to a default.
if os.getenv("LOC"):
    LOC = os.getenv("LOC")
else:
    LOC = "hello world"

things.py:

import demo.constants
print(demo.constants.LOC)
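
As an aside, os.getenv also accepts a default value as its second argument, so the if/else in constants.py could be collapsed to a one-liner. Note it is not quite identical: the one-liner only falls back when LOC is unset, while the if/else above also falls back when LOC is set to an empty string.

import os

# Falls back only when LOC is unset (the if/else version also falls
# back when LOC is set but empty, since "" is falsy).
LOC = os.getenv("LOC", "hello world")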

@VinhPham2106 (Collaborator)
@SantiagoTorres the options to specify the constants have been added to the parsers branch. We wonder how the usage of each variable will be explained to users, other than by having to look into every script. Should we write a doc, or is there already one?

@JorgeH309 (Collaborator)
@SahithiKasim @absol27 When we test-run the publish scripts, it takes about 40 minutes to parse the published packages of just one date. Most of that time is spent parsing the main dump file for a given date. Since we have finished our other tasks, should we try to optimize the script's speed, or should we leave it as it is?

@VinhPham2106 (Collaborator) commented Sep 27, 2023

def parse_packagelist(date, ARCH, db_location, DFSG):
    counter = 0  # (unused)
    con = open_db(db_location)
    with open(f'./ingestion/parsers/Packagelist_DUMP/{date}-{ARCH}-{DFSG}_Packages.dump', 'r', encoding='utf-8') as rf:
        header = ""
        for line in rf:
            if line == "\n":
                # A blank line ends one package stanza: parse it and insert it.
                parsed_package = parser.parse_string(header).normalized_dict()
                parsed_package["added_at"] = date
                cur = con.cursor()
                parsed_package["architecture"] = ARCH
                parsed_package["provided_by"] = ""
                insert_package(cur, parsed_package, DFSG)
                provided_by = cur.lastrowid
                # Insert a stub row for each virtual package this one provides.
                for provided_package in parsed_package["provides"]:
                    parsed_package["package"] = provided_package
                    parsed_package["version"] = ""
                    parsed_package["size"] = ""
                    parsed_package["provided_by"] = provided_by
                    insert_package(cur, parsed_package, DFSG)
                con.commit()  # one commit per package
                header = ""
            else:
                header += line
    close_db(con)
    return

In this function we commit to the db after picking up each package. We did some reading up on sqlite3, and it is said to handle around 50,000 statements per second. We wonder whether we should optimize the code by bundling statements together to reduce the number of commits, or whether the current speed is good enough (16 hours for all 5 years of data, on 8 GB of RAM). We are not super proficient with databases, so professor @sbrunswi, could you take a quick look?
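
For illustration, here is a sketch of what the batching could look like, reusing the same open_db/insert_package/close_db/parser helpers as the function above. BATCH_SIZE and the function name are assumptions for the sketch, not code from the repo:

BATCH_SIZE = 1000  # assumed tuning knob: packages per commit

def parse_packagelist_batched(date, ARCH, db_location, DFSG):
    con = open_db(db_location)
    cur = con.cursor()
    pending = 0
    with open(f'./ingestion/parsers/Packagelist_DUMP/{date}-{ARCH}-{DFSG}_Packages.dump', 'r', encoding='utf-8') as rf:
        header = ""
        for line in rf:
            if line == "\n":
                parsed_package = parser.parse_string(header).normalized_dict()
                parsed_package["added_at"] = date
                parsed_package["architecture"] = ARCH
                parsed_package["provided_by"] = ""
                insert_package(cur, parsed_package, DFSG)
                provided_by = cur.lastrowid
                for provided_package in parsed_package["provides"]:
                    parsed_package["package"] = provided_package
                    parsed_package["version"] = ""
                    parsed_package["size"] = ""
                    parsed_package["provided_by"] = provided_by
                    insert_package(cur, parsed_package, DFSG)
                pending += 1
                if pending >= BATCH_SIZE:
                    con.commit()  # one commit per BATCH_SIZE packages, not per package
                    pending = 0
                header = ""
            else:
                header += line
    con.commit()  # flush the final partial batch
    close_db(con)

The trade-off is that a crash mid-batch loses up to BATCH_SIZE uncommitted packages, which may or may not matter for a re-runnable ingestion script.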

@absol27 (Collaborator) commented Sep 27, 2023

The main dump file is by far the biggest of the three files (for a given date). I suspect that, more than the processing of the file, the problem is the downloading of the package list dumps. I've added a condition to retry until the download succeeds, but I noticed that it can be stuck in the error loop for a while. Terminating and restarting the program solves the issue; I've discussed this with @VinhPham2106. I wonder if this is the issue more than the processing of the file, please let me know if I'm wrong.
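
For what it's worth, a minimal sketch of a bounded retry with exponential backoff, which avoids spinning in the error loop forever. The function name and the requests-based download are assumptions, since the actual download code isn't shown in this thread:

import time
import requests  # assumed; we don't know the actual download mechanism

def download_with_retry(url, dest, max_attempts=5):
    """Retry a download a bounded number of times, backing off between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(url, timeout=60)
            resp.raise_for_status()
            with open(dest, "wb") as f:
                f.write(resp.content)
            return True
        except requests.RequestException as exc:
            wait = 2 ** attempt  # back off: 2s, 4s, 8s, ...
            print(f"attempt {attempt}/{max_attempts} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
    return False  # give up instead of looping indefinitely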

I believe the code could be optimized, but I would recommend first checking out what the package list dumps are and how they are structured, to understand what each component means (e.g., why parsed_package["provides"] needs to be processed differently).

@absol27 (Collaborator) commented Sep 27, 2023

By the way, batching multiple write queries into a single commit is definitely one optimization; there's no reason to have one write query per commit.
