Consider more lightweight DB system #268

lyricnz · 2023-09-21T06:37:12Z

Following on from the work in #255

Ideas for reducing the size of the DB. Perhaps it's time to consider:

using something more lightweight than postgresql: maybe sqllite, compressed and readonly
maybe stored in filesystem rather than docker image
preprocessing the DB into a format more specific to the application (there's only 1 main query)

lyricnz · 2023-09-21T06:55:30Z

Notes (just experimenting):

gnaf-202308.dmp 2.1G
address_principals.csv.gz 306M
address_principals.csv 1.1GB
current DB image 3.01GB
test.db DB per below (no index) 1.1GB
test.db DB per below (with index) 1.4GB
gzip test.db with index 507M

sqlite3 test.db

CREATE TABLE address_principals
(
  gnaf_pid text NOT NULL,
  address text NOT NULL,
  locality_name text NOT NULL,
  postcode text NULL,
  state text NOT NULL,
  latitude numeric(10,8) NOT NULL,
  longitude numeric(11,8) NOT NULL
);

sqlite> .mode csv
sqlite> .import address_principals.csv address_principals

CREATE INDEX address_name_state ON address_principals(locality_name, state);

SELECT gnaf_pid, address, postcode, latitude, longitude FROM address_principals WHERE locality_name = "SOMERVILLE" AND state = "VIC"

lyricnz · 2023-09-22T02:26:11Z

Related idea - host the latest version of the DB as a static+queryable source.
Per https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/

lyricnz · 2023-09-25T01:38:43Z

Possibly a Pandas dataframe saved in parquet file? This is a compressed and queryable format

csv_file = "..../extra/db/address_principals.csv"
df = pd.read_csv(csv_file)
df.to_parquet("rows.parquet")  # 484MB

columns = ['gnaf_pid', 'address', 'postcode', 'latitude', 'longitude']
filters = [
    ('locality_name', '==', 'BLI BLI'),
    ('state', '==', 'QLD'),
]
df2 = pd.read_parquet("rows.parquet", columns=columns, filters=filters)
print(df2.head())
print(len(df2))

emits

2023-09-25 11:37:43,566 INFO     start
         gnaf_pid                       address  ...   latitude   longitude
0  GAQLD720868370              19 WILLOW STREET  ... -26.616855  153.031108
1  GAQLD720861910            21 HILLGROVE COURT  ... -26.618337  153.030099
2  GAQLD720864676               8 WILLOW STREET  ... -26.617568  153.030009
3  GAQLD719559927  UNIT 1, 134 KINGFISHER DRIVE  ... -26.601149  153.015579
4  GAQLD720857697             4 HILLGROVE COURT  ... -26.617756  153.030490

[5 rows x 5 columns]
4844
2023-09-25 11:37:45,287 INFO     elapsed: 0 days 00:00:01.718875

lyricnz · 2023-09-25T01:55:30Z

We can't use git-lfs on github, because it only provides 1GB of storage, and 1GB/mo of transfer. It's $5/mo for 50+50, but if GHA counts, that'd be used inside a day or two.

https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage

LukePrior added the enhancement New feature or request label Sep 21, 2023

lyricnz mentioned this issue Sep 21, 2023

Change postgres column types to use less space #269

Closed

lyricnz mentioned this issue Sep 25, 2023

Add support for using SQLite as a DB #271

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider more lightweight DB system #268

Consider more lightweight DB system #268

lyricnz commented Sep 21, 2023

lyricnz commented Sep 21, 2023

lyricnz commented Sep 22, 2023

lyricnz commented Sep 25, 2023 •

edited

Loading

lyricnz commented Sep 25, 2023

Consider more lightweight DB system #268

Consider more lightweight DB system #268

Comments

lyricnz commented Sep 21, 2023

lyricnz commented Sep 21, 2023

lyricnz commented Sep 22, 2023

lyricnz commented Sep 25, 2023 • edited Loading

lyricnz commented Sep 25, 2023

lyricnz commented Sep 25, 2023 •

edited

Loading