Conveyor service initial implementation #3

Merged
merged 45 commits into from
Feb 7, 2024
Changes from 17 commits
Commits
2dda88b
dev/conveyor-1: data conveyor with generation of samples by alloy tem…
renbou Jan 28, 2024
95c4078
dev/conveyor-1: re-enable safe attribute usage on the server
renbou Feb 1, 2024
559976c
dev/conveyor-1: remove DataSet/DataFrame wrappers, add basic comments…
renbou Feb 1, 2024
d1e8feb
dev/conveyor-1: random_alloy_samples method for data conveyor, more e…
renbou Feb 2, 2024
dd44ee5
dev/conveyor-1: alter error message in remote.safe to match default r…
renbou Feb 2, 2024
d85f804
dev/conveyor-1: concat and split functionality implemented
renbou Feb 2, 2024
965258e
dev/conveyor-1: optimize rpyc bandwidth via inspect.getdoc monkey-patch
renbou Feb 2, 2024
448c21d
dev/conveyor-1: basic model conveyor with linear regression, finalize…
renbou Feb 2, 2024
910fd77
dev/conveyor-1: add ridge model, error calculation, and weight normal…
renbou Feb 3, 2024
2c9ce20
dev/conveyor-1: docker-compose with redis, initialize redis-based rep…
renbou Feb 3, 2024
ceb19b9
dev/conveyor-1: account creation and authentication implemented
renbou Feb 3, 2024
1aa8537
dev/conveyor-1: dataset persistence functionality implemented
renbou Feb 3, 2024
04ad44a
dev/conveyor-1: dockerize service
renbou Feb 3, 2024
b255fbd
dev/conveyor-1: add info logs to auth and write operations
renbou Feb 3, 2024
d944c22
dev/conveyor-1: model persistence functionality implemented
renbou Feb 4, 2024
a3fcb01
dev/conveyor-1: initialize model name and description during init
renbou Feb 4, 2024
d4e2eb1
dev/conveyor-1: allow various expire commands in redis so that loadin…
renbou Feb 4, 2024
88f3d4a
dev/conveyor-1: allow reading account_id, write exploit
renbou Feb 4, 2024
6e485c2
dev/conveyor-1: obfuscate flags in exploit
renbou Feb 4, 2024
362f826
dev/conveyor-1: remove unneeded _rpyc_getattr in Exploit class
renbou Feb 4, 2024
feebd9d
dev/conveyor-1: add dedcleaner container
renbou Feb 5, 2024
23df6c3
dev/conveyor-1: remove default data ttl from server
renbou Feb 5, 2024
e754316
Merge boilerplate update from master
renbou Feb 5, 2024
8f7ca9d
dev/conveyor-1: remove sploit dependency from service code
renbou Feb 5, 2024
0674bf6
dev/conveyor-1: add poetry project for sploit with proper deps
renbou Feb 6, 2024
705d431
dev/conveyor-1: empty checker, fix ttl in docker compose
renbou Feb 6, 2024
4f59389
dev/conveyor-1: apply validate suggestions and allow command option f…
renbou Feb 6, 2024
0b7031e
dev/conveyor-1: basic check with dataframe generation using random_al…
renbou Feb 6, 2024
4286ac5
dev/conveyor-1: support template and concat generation in checker
renbou Feb 7, 2024
27e3f94
dev/conveyor-1: checker: weight deviation validation, normalization v…
renbou Feb 7, 2024
8f193fd
dev/conveyor-1: checker: dataframe feature selection and split checking
renbou Feb 7, 2024
7129415
dev/conveyor-1: checker: check method implemented
renbou Feb 7, 2024
bd5d982
dev/conveyor-1: name and description length validation
renbou Feb 7, 2024
7e431a7
dev/conveyor-1: allow access to name and description of dataset in li…
renbou Feb 7, 2024
3847022
dev/conveyor-1: checker: basic put & get, dataset saving check
renbou Feb 7, 2024
0a83d5f
dev/conveyor-1: checker: model saving check
renbou Feb 7, 2024
28a5f40
dev/conveyor-1: increase redis timeout to 5m
renbou Feb 7, 2024
86275f8
dev/conveyor-1: add docker logging to check.py
renbou Feb 7, 2024
e25a382
dev/conveyor-1: revert "add docker logging to check.py"
renbou Feb 7, 2024
4326df4
dev/conveyor-1: lower cpus to 2 for service
renbou Feb 7, 2024
0a005d5
dev/conveyor-1: readme with example for service
renbou Feb 7, 2024
c4f43ec
dev/conveyor-1: fix typo in readme
renbou Feb 7, 2024
1f4fa4b
dev/conveyor-1: checker: increase timeout to 20
renbou Feb 7, 2024
acd0fe7
merge with master
renbou Feb 7, 2024
d32fbab
merge with master
renbou Feb 7, 2024
1 change: 1 addition & 0 deletions services/conveyor/.tool-versions
@@ -0,0 +1 @@
python 3.11.7
21 changes: 21 additions & 0 deletions services/conveyor/conveyor/__init__.py
@@ -0,0 +1,21 @@
__all__ = [
    "data",
    "model",
    "remote",
    "service",
    "storage",
    "AlloyComposition",
    "DataConveyor",
    "DataFrame",
    "PredefinedAlloys",
    "LinearRegression",
    "Model",
    "ModelConveyor",
    "RidgeRegression",
    "GoldConveyorService",
]

from . import data, model, remote, service, storage
from .data import AlloyComposition, DataConveyor, DataFrame, PredefinedAlloys
from .model import LinearRegression, Model, ModelConveyor, RidgeRegression
from .service import GoldConveyorService
5 changes: 5 additions & 0 deletions services/conveyor/conveyor/config.py
@@ -0,0 +1,5 @@
PRECISION = 1e-9
MAX_SAMPLES = 100
KARAT_DIGITS = 2
FINENESS_DIGITS = 3
ACCESS_KEY_BYTES = 32
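Reviewer note: the sketch below illustrates how three of these constants are used in data.py further down in this diff (the fraction-sum check in AlloyComposition and the karat/fineness rounding in sample generation). It is a standalone approximation, not the service code; the helper names `fractions_are_valid` and `purity` are hypothetical and exist only for this example.

```python
# Constants copied from services/conveyor/conveyor/config.py.
PRECISION = 1e-9
KARAT_DIGITS = 2
FINENESS_DIGITS = 3


def fractions_are_valid(fractions):
    """Hypothetical helper: each fraction lies in [0, 1] and they sum to 1 within PRECISION."""
    in_range = all(0 <= f <= 1 for f in fractions)
    return in_range and abs(sum(fractions) - 1.0) <= PRECISION


def purity(gold_fr):
    """Hypothetical helper: (karat, fineness) for a gold fraction, rounded as in data.py."""
    karat = round(gold_fr * 24, KARAT_DIGITS)
    fineness = round(gold_fr * 1000, FINENESS_DIGITS)
    return karat, fineness


# Yellow gold from PredefinedAlloys: 75% gold, 12.5% silver, 12.5% copper.
print(fractions_are_valid([0.75, 0.125, 0.125, 0.0]))
print(purity(0.75))
```

With these values, an 18-karat alloy corresponds to a fineness of 750, which matches the `karat` (0 to 24) and `fineness` (0 to 1000) bounds in the DataFrame schema.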
299 changes: 299 additions & 0 deletions services/conveyor/conveyor/data.py
@@ -0,0 +1,299 @@
from typing import Annotated, Callable

import numpy as np
import pandas as pd
import pandera as pa
import pandera.typing as pt
import pydantic
from sklearn.model_selection import train_test_split

from . import config, remote


class AlloyComposition(pydantic.BaseModel):
    gold_fr: Annotated[float, pydantic.Field(ge=0, le=1)]
    silver_fr: Annotated[float, pydantic.Field(ge=0, le=1)]
    copper_fr: Annotated[float, pydantic.Field(ge=0, le=1)]
    platinum_fr: Annotated[float, pydantic.Field(ge=0, le=1)]

    @pydantic.model_validator(mode="after")
    def check_fraction(self):
        fr = self.gold_fr + self.silver_fr + self.copper_fr + self.platinum_fr
        if abs(fr - 1.0) > config.PRECISION:
            raise ValueError("alloy composition fractions should add up to 1")
        return self

    @classmethod
    def localized(cls, remote: "AlloyComposition") -> "AlloyComposition":
        """
        Recreate AlloyComposition dataclass instance from non-trusted instance,
        revalidating it in the process.
        """

        return cls(
            gold_fr=remote.gold_fr,
            silver_fr=remote.silver_fr,
            copper_fr=remote.copper_fr,
            platinum_fr=remote.platinum_fr,
        )


class PredefinedAlloys:
    YELLOW_GOLD = AlloyComposition(
        gold_fr=0.75, silver_fr=0.125, copper_fr=0.125, platinum_fr=0
    )

    RED_GOLD = AlloyComposition(
        gold_fr=0.75, silver_fr=0, copper_fr=0.25, platinum_fr=0
    )

    ROSE_GOLD = AlloyComposition(
        gold_fr=0.75, silver_fr=0.025, copper_fr=0.225, platinum_fr=0
    )

    PINK_GOLD = AlloyComposition(
        gold_fr=0.75, silver_fr=0.05, copper_fr=0.2, platinum_fr=0
    )

    WHITE_GOLD = AlloyComposition(
        gold_fr=0.75, silver_fr=0, copper_fr=0, platinum_fr=0.25
    )


@remote.safe({"iloc", "head", "shape"})
class DataFrame(pd.DataFrame):
    """
    Specialized dataframe subclassing the usual pandas dataframe.
    """

    class Schema(pa.DataFrameModel):
        gold_ozt: pt.Series[float] = pa.Field(ge=0)
        silver_ozt: pt.Series[float] = pa.Field(ge=0)
        copper_ozt: pt.Series[float] = pa.Field(ge=0)
        platinum_ozt: pt.Series[float] = pa.Field(ge=0)
        troy_ounces: pt.Series[float] = pa.Field(ge=0)
        karat: pt.Series[float] = pa.Field(ge=0, le=24)
        fineness: pt.Series[float] = pa.Field(ge=0, le=1000)

    def __init__(self, *args, **kwargs):
        # Check to avoid creating NaN columns when casting existing pandas DataFrame to this one.
        if len(args) == 0:
            kwargs["columns"] = [
                "gold_ozt",
                "silver_ozt",
                "copper_ozt",
                "platinum_ozt",
                "troy_ounces",
                "karat",
                "fineness",
            ]

        super().__init__(*args, **kwargs)

    def validate(self):
        DataFrame.Schema.validate(self)


@remote.safe(
    {
        "template_alloy_samples",
        "random_alloy_samples",
        "normalize_sample_weights",
        "concat_samples",
        "split_samples",
    }
)
class DataConveyor:
    """
    Conveyor for working with samples of gold,
    preparing them for later use with models.
    """

    RANDOM_ALLOYS = np.array(
        [
            PredefinedAlloys.YELLOW_GOLD,
            PredefinedAlloys.RED_GOLD,
            PredefinedAlloys.ROSE_GOLD,
            PredefinedAlloys.PINK_GOLD,
            PredefinedAlloys.WHITE_GOLD,
        ]
    )

    def __init__(self, rng: np.random.RandomState):
        self.rng = rng

    def template_alloy_samples(
        self,
        template: AlloyComposition,
        weight_ozt: float,
        max_deviation: float,
        samples: int,
    ) -> DataFrame:
        """
        Selects a number of gold samples fitting the specified alloy template,
        with alloy composition and weight deviating no more than is requested.

        A pandas DataFrame is returned, containing the selected samples.
        """

        validated_template = AlloyComposition.localized(template)

        # TODO optimize generation using template alloy by replacing operations
        # on each sample with operations on an array of samples.
        # Unfortunately, this means that the sample generation process would be different for random_alloy_samples.
        return self.__generate_samples(
            weight_ozt,
            max_deviation,
            samples,
            lambda: self.__randomize_alloy(validated_template, max_deviation),
        )

    def random_alloy_samples(
        self, weight_ozt: float, max_deviation: float, samples: int
    ) -> DataFrame:
        """
        Selects a number of random gold samples with weight deviating no more than is requested.

        A pandas DataFrame is returned, containing the selected samples.
        """

        return self.__generate_samples(
            weight_ozt,
            max_deviation,
            samples,
            lambda: self.__randomize_alloy(
                self.rng.choice(DataConveyor.RANDOM_ALLOYS),
                max_deviation,
            ),
        )

    def normalize_sample_weights(self, df: pd.DataFrame) -> DataFrame:
        """
        Scale sample alloy composition to 1 troy ounce.
        This method works even when not all of the alloy components are present in the dataframe columns.
        """

        if "troy_ounces" not in df.columns:
            raise ValueError("troy_ounces column must be present for normalization")

        weights = df["troy_ounces"]
        components = ["gold_ozt", "silver_ozt", "copper_ozt", "platinum_ozt"]

        for component in components:
            if component in df.columns:
                df[component] /= weights

        df["troy_ounces"] = 1.0

        return DataFrame(df)

    def concat_samples(self, *dfs: pd.DataFrame) -> DataFrame:
        """
        Concatenates multiple sample DataFrames vertically.

        The total number of samples in the resulting DataFrame should not be more than is allowed.
        """

        if sum(map(len, dfs)) > config.MAX_SAMPLES:
            raise ValueError(
                f"total number of samples after concatenating dataframes should not be more than {config.MAX_SAMPLES}"
            )

        return DataFrame(
            pd.concat(dfs, ignore_index=True)
            .sample(frac=1, random_state=self.rng)
            .reset_index(drop=True)
        )

    def split_samples(
        self, *dfs: pd.DataFrame, proportion: float
    ) -> list[pd.DataFrame]:
        """
        Splits multiple sample DataFrames horizontally according to the specified proportion.

        The first part of each split contains the specified proportion, the other part contains 1-proportion.
        """

        if not (proportion >= 0 and proportion <= 1):
            raise ValueError("proportion should be in the range [0.0; 1.0]")

        return [
            DataFrame(df)
            for df in train_test_split(
                *dfs,
                train_size=proportion,
                random_state=self.rng,
            )
        ]

    def __generate_samples(
        self,
        weight_ozt: float,
        max_deviation: float,
        samples: int,
        generator: Callable[[], np.ndarray],
    ) -> DataFrame:
        if weight_ozt < 0:
            raise ValueError("sample weight should be non-negative")
        elif max_deviation < 0 or max_deviation > 1:
            raise ValueError("max deviation should be a fraction")
        elif samples < 0 or samples > config.MAX_SAMPLES:
            raise ValueError(
                f"a non-negative number of samples no more than {config.MAX_SAMPLES} should be specified"
            )

        # Array of generated weights deviating no more than max_deviation
        # from the desired weight in troy ounces.
        weights = weight_ozt * (
            1 - (2 * max_deviation * self.rng.random(samples)) + max_deviation
        )

        df = DataFrame()
        for i in range(samples):
            sample_alloy_fr = generator()
            sample_karat = round(sample_alloy_fr[0] * 24, config.KARAT_DIGITS)
            sample_fineness = round(sample_alloy_fr[0] * 1000, config.FINENESS_DIGITS)
            sample_weight = weights[i]
            sample_alloy_ozt = sample_alloy_fr * sample_weight

            df.loc[i] = [  # type: ignore # setitem typing is broken for loc
                sample_alloy_ozt[0],
                sample_alloy_ozt[1],
                sample_alloy_ozt[2],
                sample_alloy_ozt[3],
                sample_weight,
                sample_karat,
                sample_fineness,
            ]

        # Perform basic sanity check after dataframe construction.
        df.validate()

        return df

    def __randomize_alloy(
        self, template: AlloyComposition, max_deviation: float
    ) -> np.ndarray:
        fractions = np.array(
            [
                template.gold_fr,
                template.silver_fr,
                template.copper_fr,
                template.platinum_fr,
            ],
            dtype=np.float64,
        )

        # Since this private method accepts only validated compositions,
        # originally, the fractions sum to 1.
        # They are reduced by some amount so that each fraction differs no more than by max_deviation.
        fractions *= 1 - max_deviation * self.rng.random(len(fractions))

        # The resulting shortage must be then redistributed between the present alloy parts.
        shortage = 1 - np.sum(fractions)
        shortage_distribution = self.rng.random(len(fractions))
        shortage_distribution *= fractions > config.PRECISION
        shortage_distribution /= np.sum(shortage_distribution)
        fractions += shortage * shortage_distribution

        return fractions
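Reviewer note: the two randomization steps in this file (weight deviation in `__generate_samples` and shortage redistribution in `__randomize_alloy`) may be easier to follow in isolation. The sketch below reimplements them with the standard library `random` module instead of NumPy, so it is an approximation of the logic above, not the service code. The function names `deviated_weight` and `randomize_alloy` and the explicit `ValueError` guard are assumptions made for this example; the original has no such guard.

```python
import random

PRECISION = 1e-9  # same value as config.PRECISION


def deviated_weight(weight_ozt, max_deviation, rng):
    """Weight within weight_ozt * (1 - max_deviation) .. weight_ozt * (1 + max_deviation)."""
    return weight_ozt * (1 - 2 * max_deviation * rng.random() + max_deviation)


def randomize_alloy(fractions, max_deviation, rng):
    """Shrink each fraction a little, then hand the shortage back to the parts that are present."""
    # Each fraction is scaled by a factor in (1 - max_deviation, 1].
    reduced = [f * (1 - max_deviation * rng.random()) for f in fractions]

    # Whatever was removed has to be redistributed so the result sums to 1 again.
    shortage = 1 - sum(reduced)

    # Random shares, zeroed for absent components so they stay absent.
    shares = [rng.random() * (f > PRECISION) for f in reduced]
    total = sum(shares)
    if total == 0:
        # Guard added for this sketch only.
        raise ValueError("no alloy component is present to receive the shortage")

    return [f + shortage * s / total for f, s in zip(reduced, shares)]


rng = random.Random(0)
red_gold = [0.75, 0.0, 0.25, 0.0]  # gold, silver, copper, platinum
sample = randomize_alloy(red_gold, 0.1, rng)
print(sample, sum(sample))
print(deviated_weight(10.0, 0.1, rng))
```

Components absent from the template (silver and platinum for red gold) stay at exactly zero, because their share of the shortage is masked out before normalization.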