Environment Variables #42

Open · SahithiKasim opened this issue Sep 25, 2023 · 6 comments · May be fixed by #45
@SahithiKasim (Collaborator) commented Sep 25, 2023

  • Add required environment variables to handle the paths to the database.
@SahithiKasim (Collaborator, Author)
Shell session:

mkdir demo
cd demo/
vim constants.py
touch __init__.py
cd ..
python
vim demo/constants.py
vim things.py
python things.py
LOC="salut" python things.py
export LOC="hola mundo"
python things.py
unset LOC
python things.py

constants.py:

import os

# Use the LOC environment variable if it is set, otherwise fall back to a default.
if os.getenv("LOC"):
    LOC = os.getenv("LOC")
else:
    LOC = "hello world"

things.py:

import demo.constants
print(demo.constants.LOC)
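
As an aside, os.getenv also accepts a default value as its second argument, so the if/else in constants.py could be collapsed to a one-liner. Note it is not quite identical: the one-liner only falls back when LOC is unset, while the if/else above also falls back when LOC is set to an empty string.

import os

# Falls back only when LOC is unset (the if/else version also falls
# back when LOC is set but empty, since "" is falsy).
LOC = os.getenv("LOC", "hello world")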

@VinhPham2106 (Collaborator)
@SantiagoTorres the options to specify the constants have been added to the parsers branch. We wonder how the usage of each variable will be explained to users, other than by having to look into every script. Should we write a doc, or is there already one?

@JorgeH309 (Collaborator)
@SahithiKasim @absol27 When we test-run the publish scripts, it takes about 40 minutes to parse the published packages of just one date. Most of that time is spent parsing the main dump file for a given date. Since we have finished our other tasks, should we try to optimize the script's speed, or should we leave it as it is?

@VinhPham2106 (Collaborator) commented Sep 27, 2023

def parse_packagelist(date, ARCH, db_location, DFSG):
    counter = 0  # (unused)
    con = open_db(db_location)
    with open(f'./ingestion/parsers/Packagelist_DUMP/{date}-{ARCH}-{DFSG}_Packages.dump', 'r', encoding='utf-8') as rf:
        header = ""
        for line in rf:
            if line == "\n":
                # A blank line ends one package stanza: parse it and insert it.
                parsed_package = parser.parse_string(header).normalized_dict()
                parsed_package["added_at"] = date
                cur = con.cursor()
                parsed_package["architecture"] = ARCH
                parsed_package["provided_by"] = ""
                insert_package(cur, parsed_package, DFSG)
                provided_by = cur.lastrowid
                # Insert a stub row for each virtual package this one provides.
                for provided_package in parsed_package["provides"]:
                    parsed_package["package"] = provided_package
                    parsed_package["version"] = ""
                    parsed_package["size"] = ""
                    parsed_package["provided_by"] = provided_by
                    insert_package(cur, parsed_package, DFSG)
                con.commit()  # one commit per package
                header = ""
            else:
                header += line
    close_db(con)
    return

In this function we commit to the db after picking up each package. We did some reading up on sqlite3, and it is said to handle around 50,000 statements per second. We wonder whether we should optimize the code by bundling statements together to reduce the number of commits, or whether the current speed is good enough (16 hours for all 5 years of data, on 8 GB of RAM). We are not super proficient with databases, so professor @sbrunswi, could you take a quick look?
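
For illustration, here is a sketch of what the batching could look like, reusing the same open_db/insert_package/close_db/parser helpers as the function above. BATCH_SIZE and the function name are assumptions for the sketch, not code from the repo:

BATCH_SIZE = 1000  # assumed tuning knob: packages per commit

def parse_packagelist_batched(date, ARCH, db_location, DFSG):
    con = open_db(db_location)
    cur = con.cursor()
    pending = 0
    with open(f'./ingestion/parsers/Packagelist_DUMP/{date}-{ARCH}-{DFSG}_Packages.dump', 'r', encoding='utf-8') as rf:
        header = ""
        for line in rf:
            if line == "\n":
                parsed_package = parser.parse_string(header).normalized_dict()
                parsed_package["added_at"] = date
                parsed_package["architecture"] = ARCH
                parsed_package["provided_by"] = ""
                insert_package(cur, parsed_package, DFSG)
                provided_by = cur.lastrowid
                for provided_package in parsed_package["provides"]:
                    parsed_package["package"] = provided_package
                    parsed_package["version"] = ""
                    parsed_package["size"] = ""
                    parsed_package["provided_by"] = provided_by
                    insert_package(cur, parsed_package, DFSG)
                pending += 1
                if pending >= BATCH_SIZE:
                    con.commit()  # one commit per BATCH_SIZE packages, not per package
                    pending = 0
                header = ""
            else:
                header += line
    con.commit()  # flush the final partial batch
    close_db(con)

The trade-off is that a crash mid-batch loses up to BATCH_SIZE uncommitted packages, which may or may not matter for a re-runnable ingestion script.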

@absol27 (Collaborator) commented Sep 27, 2023

The main dump file is by far the biggest of the three files (for a given date). I suspect that, more than the processing of the file, the problem is the downloading of the package list dumps. I've added a condition to retry until the download succeeds, but I noticed that it can be stuck in the error loop for a while. Terminating and restarting the program solves the issue; I've discussed this with @VinhPham2106. I wonder if this is the issue more than the processing of the file, please let me know if I'm wrong.
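
For what it's worth, a minimal sketch of a bounded retry with exponential backoff, which avoids spinning in the error loop forever. The function name and the requests-based download are assumptions, since the actual download code isn't shown in this thread:

import time
import requests  # assumed; we don't know the actual download mechanism

def download_with_retry(url, dest, max_attempts=5):
    """Retry a download a bounded number of times, backing off between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(url, timeout=60)
            resp.raise_for_status()
            with open(dest, "wb") as f:
                f.write(resp.content)
            return True
        except requests.RequestException as exc:
            wait = 2 ** attempt  # back off: 2s, 4s, 8s, ...
            print(f"attempt {attempt}/{max_attempts} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
    return False  # give up instead of looping indefinitely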

I believe the code could be optimized, but I would recommend first checking out what the package list dumps are and how they are structured, to understand what each component means (e.g., why parsed_package["provides"] needs to be processed differently).

@absol27 (Collaborator) commented Sep 27, 2023

By the way, batching multiple write queries into a single commit is definitely one optimization; there's no reason to have one write query per commit.
