Replay Proto-X #32

Merged
merged 102 commits into from
May 30, 2024
Commits
5ea0984
copied replay_mythril.py over
wangpatrick57 Apr 15, 2024
1664890
added replay function
wangpatrick57 Apr 15, 2024
6dce17a
Baseilne -> Baseline
wangpatrick57 Apr 15, 2024
59c3d34
pqt -> query_timeout
wangpatrick57 Apr 15, 2024
ae99b21
repository -> tuning_steps
wangpatrick57 Apr 16, 2024
08829d8
removed logdir entirely
wangpatrick57 Apr 16, 2024
cd92d82
got rid of output_log_dir entirely
wangpatrick57 Apr 16, 2024
207097a
ray results now in dbgym workspace
wangpatrick57 Apr 16, 2024
93b0988
now linking hpo-ed params in symlinks
wangpatrick57 Apr 16, 2024
d2bb709
now linking tuning steps
wangpatrick57 Apr 16, 2024
7cd260f
replay main working
wangpatrick57 Apr 16, 2024
aa9b98f
wrote extract_from_task_run_fordpath
wangpatrick57 Apr 16, 2024
caf0d6c
now finding all replay dirs
wangpatrick57 Apr 17, 2024
64aaf15
added all configs to replay
wangpatrick57 Apr 17, 2024
6ece74d
added replayargs and deleted front of replay_step()
wangpatrick57 Apr 17, 2024
e341956
now copying params.json directly into data/
wangpatrick57 Apr 17, 2024
a531c2b
now copying params.json into tuning_steps
wangpatrick57 Apr 17, 2024
b5f1e8e
merged with integrate-boot
wangpatrick57 Apr 17, 2024
4471787
renamed boot_config_fpath to hpo_boot_config_fpath
wangpatrick57 Apr 17, 2024
315a69f
added hpo config fpath config to tune
wangpatrick57 Apr 17, 2024
4e9bde6
fixed bugs so that hpo runs
wangpatrick57 Apr 17, 2024
17dece3
fixed some comments
wangpatrick57 Apr 17, 2024
2f554e2
made it past first output.log loop
wangpatrick57 Apr 17, 2024
f035e01
now only reading folders in first loop
wangpatrick57 Apr 17, 2024
19310a3
fixed threshold limit
wangpatrick57 Apr 17, 2024
e084cc7
can now build PostgresEnv
wangpatrick57 Apr 17, 2024
0c3c146
now resetting and getting min reward
wangpatrick57 Apr 17, 2024
bac5238
single to double quotes
wangpatrick57 Apr 18, 2024
79cee72
maximal fixed
wangpatrick57 Apr 18, 2024
86acc80
num lines
wangpatrick57 Apr 18, 2024
a3038b5
initial fix to run_sample()
wangpatrick57 Apr 18, 2024
551fd67
fixed all parsing errors
wangpatrick57 Apr 18, 2024
0e2486b
run raw csv path fixed
wangpatrick57 Apr 18, 2024
55faf8e
maximal_only fixed
wangpatrick57 Apr 18, 2024
92020c3
now properly ignoring baseline
wangpatrick57 Apr 18, 2024
98d549b
now parsing action.json
wangpatrick57 Apr 18, 2024
41a4ac1
now reading prior_state.pkl correctly
wangpatrick57 Apr 18, 2024
132fb16
now outputting IndexAction instead of SQL string to action.txt
wangpatrick57 Apr 18, 2024
2646f73
done with combining index acts from action and previous
wangpatrick57 Apr 18, 2024
d49bdef
done with combining index acts from action and previous
wangpatrick57 Apr 18, 2024
0e3777f
done with creating index_modifaction_sqls
wangpatrick57 Apr 18, 2024
da8d078
done with shift_state
wangpatrick57 Apr 18, 2024
41a9059
run_sample running
wangpatrick57 Apr 18, 2024
1fcca5a
removed indexes from constraints
wangpatrick57 Apr 18, 2024
00b0c87
0.1 experiments
wangpatrick57 Apr 18, 2024
5cf5ef6
only stashing results for tune, and setting idx_name based on index c…
wangpatrick57 Apr 18, 2024
ca0bf85
added some comments about idx_name
wangpatrick57 Apr 18, 2024
d92707c
now always dumping page cache
wangpatrick57 Apr 18, 2024
82f9ea1
removed print statements
wangpatrick57 Apr 18, 2024
c2ad745
duration -> trial_duration
wangpatrick57 Apr 18, 2024
31f0521
added separate CLI arg for tune duration
wangpatrick57 Apr 18, 2024
9dcd36b
added print statements to investigate replay behavior
wangpatrick57 Apr 18, 2024
3175716
timeout -> workload_timeout
wangpatrick57 Apr 19, 2024
43ba9c3
got rid of modifying workload_timeout
wangpatrick57 Apr 19, 2024
019b4fc
added tuningmode enum
wangpatrick57 Apr 19, 2024
02049c4
is_hpo -> tuning_mode
wangpatrick57 Apr 19, 2024
e7012e1
replaced replay in pg_env with tuning_mode
wangpatrick57 Apr 19, 2024
243411c
changed HPO params to use enums instead of having different names
wangpatrick57 Apr 19, 2024
cbb87a7
hpo, tune, and replay all now not crashing
wangpatrick57 Apr 19, 2024
d88fd98
added workload timeout during replay param
wangpatrick57 Apr 19, 2024
bc66526
fixed race condition in multiple threads writing to pg.log
wangpatrick57 Apr 19, 2024
f09d38f
now linking to params.json for manual run_*/ traversal
wangpatrick57 Apr 19, 2024
867aa92
renamed reward in replay.py
wangpatrick57 Apr 19, 2024
74d70ff
more renaming
wangpatrick57 Apr 19, 2024
0e3aa51
comment changes
wangpatrick57 Apr 19, 2024
fbbee89
got rid of the 2 maximal params, 2 threshold params, and the 'samples…
wangpatrick57 Apr 19, 2024
106d4ea
got rid of extra row at bottom
wangpatrick57 Apr 19, 2024
7fc0bee
has_timeout -> did_any_query_timeout_in_original
wangpatrick57 Apr 19, 2024
b11d3f4
comment
wangpatrick57 Apr 19, 2024
84bcd79
refactored codebase so that all symlinks end with .link. full benchma…
wangpatrick57 Apr 19, 2024
8ee373d
now writing all holon action variations to action.pkl
wangpatrick57 Apr 22, 2024
d46c5a9
now checking equality with the index space
wangpatrick57 Apr 23, 2024
95be6fa
added comments describing why query timeout and workload timeout aren…
wangpatrick57 Apr 23, 2024
a909f1b
now reliably getting did_any_query_time_out_in_original
wangpatrick57 Apr 23, 2024
586b9a3
fixed did_workload_time_out_in_original and ignoring penalty in origi…
wangpatrick57 Apr 23, 2024
d91cc65
changes to scripts
wangpatrick57 Apr 23, 2024
aa1bca0
merge
wangpatrick57 Apr 23, 2024
cc11a8d
removing breaking after 10 iterations
wangpatrick57 Apr 23, 2024
c200e46
workload_time -> workload_runtime_accum
wangpatrick57 Apr 23, 2024
6beea71
workload_timeout -> this_execution_workload_timeout
wangpatrick57 Apr 23, 2024
006cc4a
removed time_left since it's redundant with workload_runtime_accum
wangpatrick57 Apr 23, 2024
a12348d
removed disable_pg_hint code
wangpatrick57 Apr 23, 2024
7320e06
removed noop index dead code
wangpatrick57 Apr 23, 2024
e1c3f07
removed dead var
wangpatrick57 Apr 23, 2024
9c45bf7
renamed BestQueryRun.timeout to timed_out
wangpatrick57 Apr 23, 2024
509f7dc
renamed stop_running to workload_timed_out
wangpatrick57 Apr 23, 2024
22617e0
refactored execute_workload() to separately return whether the worklo…
wangpatrick57 Apr 23, 2024
bf5fe73
replaced workload_runtime_accum with compute_total_workload_runtime()
wangpatrick57 Apr 23, 2024
6d237ec
now seeing whether workload or query timed out in replay
wangpatrick57 Apr 23, 2024
5bd43c6
now logging this_step_run_data before validity checks
wangpatrick57 Apr 23, 2024
c6b15dd
added replay_all_variations option
wangpatrick57 Apr 24, 2024
d0ed37f
added comments to _mutilate_action_with_metrics
wangpatrick57 Apr 24, 2024
b64fda2
added comment about best observed in replay.py
wangpatrick57 Apr 24, 2024
4fab4f2
changed bool of queries timed out to an actual num
wangpatrick57 Apr 24, 2024
a35a576
added info for num executed queries
wangpatrick57 Apr 25, 2024
6016334
reset now doesn't overwrite the results from step
wangpatrick57 Apr 25, 2024
4736315
wrote load_per_machine_envvars.sh
wangpatrick57 Apr 25, 2024
9849a99
added build_space_good_for_boot option
wangpatrick57 Apr 25, 2024
d2fb275
resolved some PR comments
wangpatrick57 Apr 28, 2024
af33bc7
added comment about tune
wangpatrick57 May 27, 2024
474d7ee
different tune trials during hpo now name their tuning_steps dir diff…
wangpatrick57 May 27, 2024
a6e00b9
now logging during HPO for both baseline and tuning steps
wangpatrick57 May 30, 2024
74 changes: 38 additions & 36 deletions benchmark/tpch/cli.py
@@ -4,7 +4,7 @@

import click

from misc.utils import DBGymConfig, get_scale_factor_string, workload_name_fn
from misc.utils import DBGymConfig, get_scale_factor_string, link_result, workload_name_fn
from util.shell import subprocess_run
from util.pg import *

@@ -56,68 +56,71 @@ def _get_queries_dname(seed: int, scale_factor: float) -> str:


def _clone(dbgym_cfg: DBGymConfig):
symlink_dir = dbgym_cfg.cur_symlinks_build_path("tpch-kit")
if symlink_dir.exists():
benchmark_tpch_logger.info(f"Skipping clone: {symlink_dir}")
expected_symlink_dpath = dbgym_cfg.cur_symlinks_build_path(mkdir=True) / "tpch-kit.link"
if expected_symlink_dpath.exists():
benchmark_tpch_logger.info(f"Skipping clone: {expected_symlink_dpath}")
return

benchmark_tpch_logger.info(f"Cloning: {symlink_dir}")
benchmark_tpch_logger.info(f"Cloning: {expected_symlink_dpath}")
real_build_path = dbgym_cfg.cur_task_runs_build_path()
subprocess_run(
f"./tpch_setup.sh {real_build_path}", cwd=dbgym_cfg.cur_source_path()
)
subprocess_run(
f"ln -s {real_build_path / 'tpch-kit'} {dbgym_cfg.cur_symlinks_build_path(mkdir=True)}"
)
benchmark_tpch_logger.info(f"Cloned: {symlink_dir}")
symlink_dpath = link_result(dbgym_cfg, real_build_path / "tpch-kit")
assert os.path.samefile(expected_symlink_dpath, symlink_dpath)
benchmark_tpch_logger.info(f"Cloned: {expected_symlink_dpath}")


def _generate_queries(dbgym_cfg: DBGymConfig, seed_start: int, seed_end: int, scale_factor: float):
build_path = dbgym_cfg.cur_symlinks_build_path()
assert build_path.exists()
def _get_tpch_kit_dpath(dbgym_cfg: DBGymConfig) -> Path:
tpch_kit_dpath = (dbgym_cfg.cur_symlinks_build_path() / "tpch-kit.link").resolve()
assert tpch_kit_dpath.exists() and tpch_kit_dpath.is_absolute() and not tpch_kit_dpath.is_symlink()
return tpch_kit_dpath


def _generate_queries(dbgym_cfg: DBGymConfig, seed_start: int, seed_end: int, scale_factor: float):
tpch_kit_dpath = _get_tpch_kit_dpath(dbgym_cfg)
data_path = dbgym_cfg.cur_symlinks_data_path(mkdir=True)
benchmark_tpch_logger.info(
f"Generating queries: {data_path} [{seed_start}, {seed_end}]"
)
for seed in range(seed_start, seed_end + 1):
symlinked_seed = data_path / _get_queries_dname(seed, scale_factor)
if symlinked_seed.exists():
expected_queries_symlink_dpath = data_path / (_get_queries_dname(seed, scale_factor) + ".link")
if expected_queries_symlink_dpath.exists():
continue

real_dir = dbgym_cfg.cur_task_runs_data_path(_get_queries_dname(seed, scale_factor), mkdir=True)
for i in range(1, 22 + 1):
target_sql = (real_dir / f"{i}.sql").resolve()
subprocess_run(
f"DSS_QUERY=./queries ./qgen {i} -r {seed} -s {scale_factor} > {target_sql}",
cwd=build_path / "tpch-kit" / "dbgen",
cwd=tpch_kit_dpath / "dbgen",
verbose=False,
)
subprocess_run(f"ln -s {real_dir} {data_path}", verbose=False)
queries_symlink_dpath = link_result(dbgym_cfg, real_dir)
assert os.path.samefile(queries_symlink_dpath, expected_queries_symlink_dpath)
benchmark_tpch_logger.info(
f"Generated queries: {data_path} [{seed_start}, {seed_end}]"
)


def _generate_data(dbgym_cfg: DBGymConfig, scale_factor: float):
build_path = dbgym_cfg.cur_symlinks_build_path()
assert build_path.exists()

tpch_kit_dpath = _get_tpch_kit_dpath(dbgym_cfg)
data_path = dbgym_cfg.cur_symlinks_data_path(mkdir=True)
symlink_dir = data_path / f"tables_sf{get_scale_factor_string(scale_factor)}"
if symlink_dir.exists():
benchmark_tpch_logger.info(f"Skipping generation: {symlink_dir}")
expected_tables_symlink_dpath = data_path / f"tables_sf{get_scale_factor_string(scale_factor)}.link"
if expected_tables_symlink_dpath.exists():
benchmark_tpch_logger.info(f"Skipping generation: {expected_tables_symlink_dpath}")
return

benchmark_tpch_logger.info(f"Generating: {symlink_dir}")
benchmark_tpch_logger.info(f"Generating: {expected_tables_symlink_dpath}")
subprocess_run(
f"./dbgen -vf -s {scale_factor}", cwd=build_path / "tpch-kit" / "dbgen"
f"./dbgen -vf -s {scale_factor}", cwd=tpch_kit_dpath / "dbgen"
)
real_dir = dbgym_cfg.cur_task_runs_data_path(f"tables_sf{get_scale_factor_string(scale_factor)}", mkdir=True)
subprocess_run(f"mv ./*.tbl {real_dir}", cwd=build_path / "tpch-kit" / "dbgen")
subprocess_run(f"mv ./*.tbl {real_dir}", cwd=tpch_kit_dpath / "dbgen")

subprocess_run(f"ln -s {real_dir} {data_path}")
benchmark_tpch_logger.info(f"Generated: {symlink_dir}")
tables_symlink_dpath = link_result(dbgym_cfg, real_dir)
assert os.path.samefile(tables_symlink_dpath, expected_tables_symlink_dpath)
benchmark_tpch_logger.info(f"Generated: {expected_tables_symlink_dpath}")
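The three functions above replace ad-hoc `ln -s` calls with `link_result(...)` and then check that the returned symlink is the expected `<name>.link` path. The implementation of `link_result` is not part of this diff, so the following is only a minimal sketch of the naming convention, using a hypothetical stand-in called `make_link` and assuming a POSIX filesystem where symlinks can be created freely.

```python
import os
import tempfile
from pathlib import Path


def make_link(symlinks_dpath: Path, real_path: Path) -> Path:
    """Hypothetical stand-in for link_result(): create `<name>.link` in symlinks_dpath."""
    assert real_path.is_absolute() and real_path.exists()
    symlinks_dpath.mkdir(parents=True, exist_ok=True)
    symlink_path = symlinks_dpath / (real_path.name + ".link")
    # Replace a stale link so that reruns are idempotent.
    if symlink_path.is_symlink():
        symlink_path.unlink()
    os.symlink(real_path, symlink_path, target_is_directory=real_path.is_dir())
    return symlink_path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    real_dir = root / "task_runs" / "tpch-kit"
    real_dir.mkdir(parents=True)

    expected = root / "symlinks" / "tpch-kit.link"
    link = make_link(root / "symlinks", real_dir)

    # Same check the diff performs after calling link_result().
    same_file = os.path.samefile(expected, link)
    resolved = link.resolve()
```

Because every symlink carries the `.link` suffix, a path can be recognized as a link by name alone, without touching the filesystem.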


def _generate_workload(
@@ -129,9 +132,9 @@ def _generate_workload(
):
symlink_data_dir = dbgym_cfg.cur_symlinks_data_path(mkdir=True)
workload_name = workload_name_fn(scale_factor, seed_start, seed_end, query_subset)
workload_symlink_path = symlink_data_dir / workload_name
expected_workload_symlink_dpath = symlink_data_dir / (workload_name + ".link")

benchmark_tpch_logger.info(f"Generating: {workload_symlink_path}")
benchmark_tpch_logger.info(f"Generating: {expected_workload_symlink_dpath}")
real_dir = dbgym_cfg.cur_task_runs_data_path(
workload_name, mkdir=True
)
@@ -147,13 +150,12 @@ def _generate_workload(
with open(real_dir / "order.txt", "w") as f:
for seed in range(seed_start, seed_end + 1):
for qnum in queries:
sqlfile = symlink_data_dir / _get_queries_dname(seed, scale_factor) / f"{qnum}.sql"
assert sqlfile.exists()
output = ",".join([f"S{seed}-Q{qnum}", str(sqlfile)])
sql_fpath = (symlink_data_dir / (_get_queries_dname(seed, scale_factor) + ".link")).resolve() / f"{qnum}.sql"
assert sql_fpath.exists() and not sql_fpath.is_symlink() and sql_fpath.is_absolute(), "We should only write existent real absolute paths to a file"
output = ",".join([f"S{seed}-Q{qnum}", str(sql_fpath)])
print(output, file=f)
# TODO(WAN): add option to deep-copy the workload.

if workload_symlink_path.exists():
os.remove(workload_symlink_path)
subprocess_run(f"ln -s {real_dir} {workload_symlink_path}")
benchmark_tpch_logger.info(f"Generated: {workload_symlink_path}")
workload_symlink_dpath = link_result(dbgym_cfg, real_dir)
assert workload_symlink_dpath == expected_workload_symlink_dpath
benchmark_tpch_logger.info(f"Generated: {expected_workload_symlink_dpath}")
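`_generate_workload` writes one line per query to `order.txt`, in the form `S{seed}-Q{qnum},<path to the .sql file>`. The sketch below shows that format being written and read back. `parse_order_line` is a hypothetical reader that does not appear in the PR, the seed and paths are illustrative, and `PurePosixPath` is used only to keep the example platform-independent.

```python
from pathlib import PurePosixPath
from typing import List, Tuple


def format_order_line(seed: int, qnum: int, sql_fpath: PurePosixPath) -> str:
    # Mirrors the diff: a query ID and the file path, joined by a comma.
    return ",".join([f"S{seed}-Q{qnum}", str(sql_fpath)])


def parse_order_line(line: str) -> Tuple[int, int, PurePosixPath]:
    """Hypothetical reader; assumes the query ID itself contains no comma."""
    query_id, path_str = line.rstrip("\n").split(",", 1)
    # Split on the last "-Q" so that a negative seed would still parse.
    seed_part, qnum_part = query_id.rsplit("-Q", 1)
    return int(seed_part[1:]), int(qnum_part), PurePosixPath(path_str)


lines: List[str] = [
    format_order_line(seed, qnum, PurePosixPath(f"/tmp/queries_{seed}/{qnum}.sql"))
    for seed in range(1, 2)
    for qnum in (1, 2)
]
parsed = [parse_order_line(line) for line in lines]
```

Splitting on the first comma only keeps the parser correct even if a file path happens to contain a comma.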
15 changes: 7 additions & 8 deletions benchmark/tpch/load_info.py
@@ -1,5 +1,5 @@
from dbms.load_info_base_class import LoadInfoBaseClass
from misc.utils import get_scale_factor_string
from misc.utils import DBGymConfig, get_scale_factor_string


TPCH_SCHEMA_FNAME = "tpch_schema.sql"
@@ -22,7 +22,7 @@ class TpchLoadInfo(LoadInfoBaseClass):
"lineitem",
]

def __init__(self, dbgym_cfg, scale_factor):
def __init__(self, dbgym_cfg: DBGymConfig, scale_factor: float):
# schema and constraints
schema_root_dpath = dbgym_cfg.dbgym_repo_path
for component in TpchLoadInfo.CODEBASE_PATH_COMPONENTS[
@@ -39,13 +39,12 @@ def __init__(self, dbgym_cfg, scale_factor):
), f"self._constraints_fpath ({self._constraints_fpath}) does not exist"

# tables
data_root_dpath = (
dbgym_cfg.dbgym_symlinks_path / TpchLoadInfo.CODEBASE_DNAME / "data"
)
tables_dpath = data_root_dpath / f"tables_sf{get_scale_factor_string(scale_factor)}"
data_root_dpath = dbgym_cfg.dbgym_symlinks_path / TpchLoadInfo.CODEBASE_DNAME / "data"
tables_symlink_dpath = data_root_dpath / f"tables_sf{get_scale_factor_string(scale_factor)}.link"
tables_dpath = tables_symlink_dpath.resolve()
assert (
tables_dpath.exists()
), f"tables_dpath ({tables_dpath}) does not exist. Make sure you have generated the TPC-H data"
tables_dpath.exists() and tables_dpath.is_absolute() and not tables_dpath.is_symlink()
), f"tables_dpath ({tables_dpath}) should be an existent real absolute path. Make sure you have generated the TPC-H data"
self._tables_and_fpaths = []
for table in TpchLoadInfo.TABLES:
table_fpath = tables_dpath / f"{table}.tbl"
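The change above resolves the `.link` symlink first and then asserts that the result is an existent, real, absolute path. A minimal sketch of that pattern follows. `resolve_link` is a hypothetical helper, not a function from the repository, and the directory names are illustrative. It also shows why the `exists()` check matters: `Path.resolve()` does not fail on a dangling symlink by default.

```python
import os
import tempfile
from pathlib import Path


def resolve_link(symlink_path: Path) -> Path:
    """Hypothetical helper: follow a `.link` symlink and validate the result."""
    assert symlink_path.name.endswith(".link"), f"{symlink_path} does not end with .link"
    real_path = symlink_path.resolve()
    assert (
        real_path.exists() and real_path.is_absolute() and not real_path.is_symlink()
    ), f"{real_path} should be an existent real absolute path"
    return real_path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()

    # A link whose target exists resolves to the real directory.
    tables_dpath = root / "tables"
    tables_dpath.mkdir()
    good_link = root / "tables.link"
    os.symlink(tables_dpath, good_link)
    resolved_ok = resolve_link(good_link) == tables_dpath

    # A dangling link resolves to a path that does not exist, so the assert fires.
    dangling_link = root / "missing.link"
    os.symlink(root / "missing", dangling_link)
    try:
        resolve_link(dangling_link)
        dangling_rejected = False
    except AssertionError:
        dangling_rejected = True
```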
47 changes: 24 additions & 23 deletions benchmark/tpch/tpch_constraints.sql
@@ -7,26 +7,27 @@ ALTER TABLE orders ADD CONSTRAINT orders_o_custkey_fkey FOREIGN KEY (o_custkey)
ALTER TABLE lineitem ADD CONSTRAINT lineitem_l_orderkey_fkey FOREIGN KEY (l_orderkey) REFERENCES orders (o_orderkey) ON DELETE CASCADE;
ALTER TABLE lineitem ADD CONSTRAINT lineitem_l_partkey_l_suppkey_fkey FOREIGN KEY (l_partkey, l_suppkey) REFERENCES partsupp (ps_partkey, ps_suppkey) ON DELETE CASCADE;

CREATE UNIQUE INDEX r_rk ON region (r_regionkey ASC);
CREATE UNIQUE INDEX n_nk ON nation (n_nationkey ASC);
CREATE INDEX n_rk ON nation (n_regionkey ASC);
CREATE UNIQUE INDEX p_pk ON part (p_partkey ASC);
CREATE UNIQUE INDEX s_sk ON supplier (s_suppkey ASC);
CREATE INDEX s_nk ON supplier (s_nationkey ASC);
CREATE INDEX ps_pk ON partsupp (ps_partkey ASC);
CREATE INDEX ps_sk ON partsupp (ps_suppkey ASC);
CREATE UNIQUE INDEX ps_pk_sk ON partsupp (ps_partkey ASC, ps_suppkey ASC);
CREATE UNIQUE INDEX ps_sk_pk ON partsupp (ps_suppkey ASC, ps_partkey ASC);
CREATE UNIQUE INDEX c_ck ON customer (c_custkey ASC);
CREATE INDEX c_nk ON customer (c_nationkey ASC);
CREATE UNIQUE INDEX o_ok ON orders (o_orderkey ASC);
CREATE INDEX o_ck ON orders (o_custkey ASC);
CREATE INDEX o_od ON orders (o_orderdate ASC);
CREATE INDEX l_ok ON lineitem (l_orderkey ASC);
CREATE INDEX l_pk ON lineitem (l_partkey ASC);
CREATE INDEX l_sk ON lineitem (l_suppkey ASC);
CREATE INDEX l_sd ON lineitem (l_shipdate ASC);
CREATE INDEX l_cd ON lineitem (l_commitdate ASC);
CREATE INDEX l_rd ON lineitem (l_receiptdate ASC);
CREATE INDEX l_pk_sk ON lineitem (l_partkey ASC, l_suppkey ASC);
CREATE INDEX l_sk_pk ON lineitem (l_suppkey ASC, l_partkey ASC);
-- We don't create any indexes so that there's a clean slate for tuning
-- CREATE UNIQUE INDEX r_rk ON region (r_regionkey ASC);
-- CREATE UNIQUE INDEX n_nk ON nation (n_nationkey ASC);
-- CREATE INDEX n_rk ON nation (n_regionkey ASC);
-- CREATE UNIQUE INDEX p_pk ON part (p_partkey ASC);
-- CREATE UNIQUE INDEX s_sk ON supplier (s_suppkey ASC);
-- CREATE INDEX s_nk ON supplier (s_nationkey ASC);
-- CREATE INDEX ps_pk ON partsupp (ps_partkey ASC);
-- CREATE INDEX ps_sk ON partsupp (ps_suppkey ASC);
-- CREATE UNIQUE INDEX ps_pk_sk ON partsupp (ps_partkey ASC, ps_suppkey ASC);
-- CREATE UNIQUE INDEX ps_sk_pk ON partsupp (ps_suppkey ASC, ps_partkey ASC);
-- CREATE UNIQUE INDEX c_ck ON customer (c_custkey ASC);
-- CREATE INDEX c_nk ON customer (c_nationkey ASC);
-- CREATE UNIQUE INDEX o_ok ON orders (o_orderkey ASC);
-- CREATE INDEX o_ck ON orders (o_custkey ASC);
-- CREATE INDEX o_od ON orders (o_orderdate ASC);
-- CREATE INDEX l_ok ON lineitem (l_orderkey ASC);
-- CREATE INDEX l_pk ON lineitem (l_partkey ASC);
-- CREATE INDEX l_sk ON lineitem (l_suppkey ASC);
-- CREATE INDEX l_sd ON lineitem (l_shipdate ASC);
-- CREATE INDEX l_cd ON lineitem (l_commitdate ASC);
-- CREATE INDEX l_rd ON lineitem (l_receiptdate ASC);
-- CREATE INDEX l_pk_sk ON lineitem (l_partkey ASC, l_suppkey ASC);
-- CREATE INDEX l_sk_pk ON lineitem (l_suppkey ASC, l_partkey ASC);
24 changes: 12 additions & 12 deletions dbms/postgres/cli.py
@@ -1,9 +1,9 @@
'''
"""
At a high level, this file's goal is to (1) install+build postgres and (2) create pgdata.
On the other hand, the goal of tune.protox.env.util.postgres is to provide helpers to manage
a Postgres instance during agent tuning.
util.pg provides helpers used by *both* of the above files (as well as other files).
'''
"""
import logging
import os
import shutil
@@ -84,11 +84,11 @@ def postgres_pgdata(dbgym_cfg: DBGymConfig, benchmark_name: str, scale_factor: f


def _get_pgbin_symlink_path(dbgym_cfg: DBGymConfig) -> Path:
return dbgym_cfg.cur_symlinks_build_path("repo", "boot", "build", "postgres", "bin")
return dbgym_cfg.cur_symlinks_build_path("repo.link", "boot", "build", "postgres", "bin")


def _get_repo_symlink_path(dbgym_cfg: DBGymConfig) -> Path:
return dbgym_cfg.cur_symlinks_build_path("repo")
return dbgym_cfg.cur_symlinks_build_path("repo.link")


def _build_repo(dbgym_cfg: DBGymConfig, rebuild):
@@ -143,7 +143,7 @@ def _create_pgdata(dbgym_cfg: DBGymConfig, benchmark_name: str, scale_factor: fl
# Create .tgz file.
# Note that you can't pass "[pgdata].tgz" as an arg to cur_task_runs_data_path() because that would create "[pgdata].tgz" as a dir.
pgdata_tgz_real_fpath = dbgym_cfg.cur_task_runs_data_path(
".", mkdir=True
mkdir=True
) / get_pgdata_tgz_name(benchmark_name, scale_factor)
# We need to cd into pgdata_dpath so that the tar file does not contain folders for the whole path of pgdata_dpath.
subprocess_run(f"tar -czf {pgdata_tgz_real_fpath} .", cwd=pgdata_dpath)
@@ -156,21 +156,21 @@ def _create_pgdata(dbgym_cfg: DBGymConfig, benchmark_name: str, scale_factor: fl

def _generic_pgdata_setup(dbgym_cfg: DBGymConfig):
# get necessary vars
pgbin_symlink_dpath = _get_pgbin_symlink_path(dbgym_cfg)
assert pgbin_symlink_dpath.exists()
pgbin_real_dpath = _get_pgbin_symlink_path(dbgym_cfg).resolve()
assert pgbin_real_dpath.exists()
dbgym_pguser = DBGYM_POSTGRES_USER
dbgym_pgpass = DBGYM_POSTGRES_PASS
pgport = DEFAULT_POSTGRES_PORT

# Create user
save_file(dbgym_cfg, pgbin_symlink_dpath / "psql")
save_file(dbgym_cfg, pgbin_real_dpath / "psql")
subprocess_run(
f"./psql -c \"create user {dbgym_pguser} with superuser password '{dbgym_pgpass}'\" {DEFAULT_POSTGRES_DBNAME} -p {pgport} -h localhost",
cwd=pgbin_symlink_dpath,
cwd=pgbin_real_dpath,
)
subprocess_run(
f'./psql -c "grant pg_monitor to {dbgym_pguser}" {DEFAULT_POSTGRES_DBNAME} -p {pgport} -h localhost',
cwd=pgbin_symlink_dpath,
cwd=pgbin_real_dpath,
)

# Load shared preload libraries
@@ -179,14 +179,14 @@ def _generic_pgdata_setup(dbgym_cfg: DBGymConfig):
# You have to use TO and you can't put single quotes around the libraries (https://postgrespro.com/list/thread-id/2580120)
# The method I wrote here works for both one library and multiple libraries
f"./psql -c \"ALTER SYSTEM SET shared_preload_libraries TO {SHARED_PRELOAD_LIBRARIES};\" {DEFAULT_POSTGRES_DBNAME} -p {pgport} -h localhost",
cwd=pgbin_symlink_dpath,
cwd=pgbin_real_dpath,
)

# Create the dbgym database. since one pgdata dir maps to one benchmark, all benchmarks will use the same database
# as opposed to using databases named after the benchmark
subprocess_run(
f"./psql -c \"create database {DBGYM_POSTGRES_DBNAME} with owner = '{dbgym_pguser}'\" {DEFAULT_POSTGRES_DBNAME} -p {pgport} -h localhost",
cwd=pgbin_symlink_dpath,
cwd=pgbin_real_dpath,
)
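The comment in `_create_pgdata` notes that `tar` is run from inside `pgdata_dpath` so that the archive does not contain the whole directory path. The same effect can be sketched with Python's standard `tarfile` module by passing `arcname="."`. This is an illustration only, not what the PR does (the PR shells out to `tar`), and the directory layout and file contents are made up.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    pgdata_dpath = root / "some" / "deep" / "pgdata"
    pgdata_dpath.mkdir(parents=True)
    (pgdata_dpath / "PG_VERSION").write_text("placeholder\n")

    tgz_fpath = root / "pgdata.tgz"
    with tarfile.open(tgz_fpath, "w:gz") as tar:
        # arcname="." plays the role of running `tar -czf ... .` with cwd=pgdata_dpath:
        # members are stored relative to pgdata_dpath, not under some/deep/pgdata.
        tar.add(pgdata_dpath, arcname=".")

    with tarfile.open(tgz_fpath, "r:gz") as tar:
        member_names = sorted(tar.getnames())
```

Extracting such an archive into any directory recreates the contents of pgdata directly in that directory, which is what makes the `.tgz` relocatable across machines.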


11 changes: 11 additions & 0 deletions experiments/load_per_machine_envvars.sh
@@ -0,0 +1,11 @@
#!/bin/bash
host=$(hostname)

if [ "$host" == "dev4" ]; then
export PGDATA_PARENT_DPATH=/mnt/nvme1n1/phw2/dbgym_tmp/
elif [ "$host" == "dev6" ]; then
export PGDATA_PARENT_DPATH=/mnt/nvme0n1/phw2/dbgym_tmp/
else
echo "Did not recognize host \"$host\""
exit 1
fi
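The script above maps the current host name to a machine-specific `PGDATA_PARENT_DPATH` and fails on unknown hosts. For comparison, the same lookup can be sketched in Python as a table. The function name `get_pgdata_parent_dpath` is hypothetical; the host names and paths are the ones from the script.

```python
import socket
from typing import Mapping, Optional

# Host names and paths copied from load_per_machine_envvars.sh.
PGDATA_PARENT_DPATH_BY_HOST = {
    "dev4": "/mnt/nvme1n1/phw2/dbgym_tmp/",
    "dev6": "/mnt/nvme0n1/phw2/dbgym_tmp/",
}


def get_pgdata_parent_dpath(
    host: Optional[str] = None,
    table: Mapping[str, str] = PGDATA_PARENT_DPATH_BY_HOST,
) -> str:
    """Hypothetical helper: look up the pgdata parent directory for a host."""
    host = socket.gethostname() if host is None else host
    try:
        return table[host]
    except KeyError:
        # Fail loudly, like the `exit 1` branch of the shell script.
        raise ValueError(f'Did not recognize host "{host}"') from None
```

Adding a machine then means adding one dictionary entry rather than another `elif` branch.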
33 changes: 33 additions & 0 deletions experiments/protox_tpch_sf0point1/main.sh
@@ -0,0 +1,33 @@
#!/bin/bash

set -euxo pipefail

SCALE_FACTOR=0.1
INTENDED_PGDATA_HARDWARE=ssd
. ./experiments/load_per_machine_envvars.sh
echo $PGDATA_PARENT_DPATH

# space for testing. uncomment this to run individual commands from the script (copy pasting is harder because there are envvars)
# python3 task.py --no-startup-check tune protox agent hpo tpch --scale-factor $SCALE_FACTOR --num-samples 4 --max-concurrent 4 --workload-timeout 100 --query-timeout 15 --tune-duration-during-hpo 0.1 --intended-pgdata-hardware $INTENDED_PGDATA_HARDWARE --pgdata-parent-dpath $PGDATA_PARENT_DPATH
python3 task.py --no-startup-check tune protox agent tune tpch --scale-factor $SCALE_FACTOR --tune-duration-during-tune 0.2
python3 task.py --no-startup-check tune protox agent replay tpch --scale-factor $SCALE_FACTOR
exit 0

# benchmark
python3 task.py --no-startup-check benchmark tpch data $SCALE_FACTOR
python3 task.py --no-startup-check benchmark tpch workload --scale-factor $SCALE_FACTOR

# postgres
python3 task.py --no-startup-check dbms postgres build
python3 task.py --no-startup-check dbms postgres pgdata tpch --scale-factor $SCALE_FACTOR --intended-pgdata-hardware $INTENDED_PGDATA_HARDWARE --pgdata-parent-dpath $PGDATA_PARENT_DPATH

exit 0

# embedding
python3 task.py --no-startup-check tune protox embedding datagen tpch --scale-factor $SCALE_FACTOR --override-sample-limits "lineitem,32768" --intended-pgdata-hardware $INTENDED_PGDATA_HARDWARE --pgdata-parent-dpath $PGDATA_PARENT_DPATH # long datagen so that train doesn't crash
python3 task.py --no-startup-check tune protox embedding train tpch --scale-factor $SCALE_FACTOR --iterations-per-epoch 1 --num-points-to-sample 1 --num-batches 1 --batch-size 64 --start-epoch 15 --num-samples 4 --train-max-concurrent 4 --num-curate 2

# agent
python3 task.py --no-startup-check tune protox agent hpo tpch --scale-factor $SCALE_FACTOR --num-samples 4 --max-concurrent 4 --workload-timeout 100 --query-timeout 15 --tune-duration-during-hpo 1 --intended-pgdata-hardware $INTENDED_PGDATA_HARDWARE --pgdata-parent-dpath $PGDATA_PARENT_DPATH --build-space-good-for-boot
python3 task.py --no-startup-check tune protox agent tune tpch --scale-factor $SCALE_FACTOR
python3 task.py --no-startup-check tune protox agent replay tpch --scale-factor $SCALE_FACTOR
10 changes: 7 additions & 3 deletions experiments/protox_tpch_sf10/main.sh
@@ -4,10 +4,14 @@ set -euxo pipefail

SCALE_FACTOR=10
INTENDED_PGDATA_HARDWARE=ssd
PGDATA_PARENT_DPATH=/mnt/nvme1n1/phw2/dbgym_tmp/
. ./experiments/load_per_machine_envvars.sh

# space for testing. uncomment this to run individual commands from the script (copy pasting is harder because there are envvars)
python3 task.py --no-startup-check tune protox agent tune tpch --scale-factor $SCALE_FACTOR --enable-boot-during-tune
python3 task.py --no-startup-check tune protox agent hpo tpch --scale-factor $SCALE_FACTOR --max-concurrent 4 --tune-duration-during-hpo 4 --intended-pgdata-hardware $INTENDED_PGDATA_HARDWARE --pgdata-parent-dpath $PGDATA_PARENT_DPATH --build-space-good-for-boot
# python3 task.py --no-startup-check tune protox agent tune tpch --scale-factor $SCALE_FACTOR --tune-duration-during-tune 4
# python3 task.py --no-startup-check tune protox agent tune tpch --scale-factor $SCALE_FACTOR --enable-boot-during-tune --tune-duration-during-tune 4
# python3 task.py --no-startup-check tune protox agent replay tpch --scale-factor $SCALE_FACTOR
# python3 task.py --no-startup-check tune protox agent replay tpch --scale-factor $SCALE_FACTOR --boot-enabled-during-tune
exit 0

# benchmark
@@ -23,5 +27,5 @@ python3 task.py --no-startup-check tune protox embedding datagen tpch --scale-fa
python3 task.py --no-startup-check tune protox embedding train tpch --scale-factor $SCALE_FACTOR --train-max-concurrent 10

# agent
python3 task.py --no-startup-check tune protox agent hpo tpch --scale-factor $SCALE_FACTOR --max-concurrent 4 --duration 4 --intended-pgdata-hardware $INTENDED_PGDATA_HARDWARE --pgdata-parent-dpath $PGDATA_PARENT_DPATH --enable-boot-during-hpo
python3 task.py --no-startup-check tune protox agent hpo tpch --scale-factor $SCALE_FACTOR --max-concurrent 4 --tune-duration-during-hpo 4 --intended-pgdata-hardware $INTENDED_PGDATA_HARDWARE --pgdata-parent-dpath $PGDATA_PARENT_DPATH --build-space-good-for-boot
python3 task.py --no-startup-check tune protox agent tune tpch --scale-factor $SCALE_FACTOR