As of commit ffe30eb32b71241f6168b00e4c340818bde0b311, the program structure is as follows:
project/
├── base
│   ├── config.py
│   ├── __init__.py
│   └── rules.py
├── cli.py
├── __init__.py
├── models
│   ├── classic
│   │   ├── aims.py
│   │   ├── amos.py
│   │   └── __init__.py
│   ├── data
│   │   ├── aims.py
│   │   ├── amos.py
│   │   └── serializable.py
│   ├── declarative
│   │   ├── aims.py
│   │   ├── amos.py
│   │   ├── __init__.py
│   │   └── mixins.py
│   └── __init__.py
├── providers
│   ├── __init__.py
│   └── utils.py
└── scripts
    ├── db_utils.py
    ├── generate.py
    └── __init__.py
A full diagram of the implemented classes is available as an image autogenerated by pyreverse.
The project.base package contains the module for creating a Config object from a BaseConfig class. This object is passed to the AircraftGenerator to control how data will be generated.
project.base.rules is not being used at the moment.
The project.models package contains the models used by SQLAlchemy, a Python ORM (Object Relational Mapper) that translates between SQL and Python objects, handles transactions, and does everything else an ORM does.
These models are separated by schema (AMOS and AIMS) and by the type of mapping used: declarative (more abstract) or classical (more explicit). Only the declarative models are in use; the classical ones were the first iteration.
There are also the models from project.models.non_orm.serializable, which describe the data that is stored as CSV files, e.g. Manufacturer and Reporter.
project.cli controls the CLI (command line interface) implemented in the program.
project.providers implements an AirportProvider, an extension of the Faker.BaseProvider class. An instance of AirportProvider exposes several methods used to produce contextually valid random data, such as airport codes.
An instance of this provider can be imported from project.providers.airport.fake_airport.
Most of these methods accept a quality parameter, which can be one of "good", "noisy", or "bad":
>>> from acme_data_generation.providers.airport import fake_airport
>>> fake_airport.airport_code(quality="good") # valid random value
'TIV'
>>> fake_airport.airport_code(quality="noisy") # introduces noise
' tiV '
>>> fake_airport.airport_code(quality="bad") # non valid random value: airport code can't have numbers
'4ys'
By default, if a method does not implement its own bad value, the bad value is a string of length 5 containing a combination of letters, numbers, and non-alphanumeric characters.
>>> from acme_data_generation.providers.airport import fake_airport
>>> fake_airport.random_string()
'H\\<O}'
>>> fake_airport.random_string(10)
'SC^JrjjyY='
The AircraftGenerator class consumes both a Config and the AirportProvider objects, and assists in the generation and insertion of a collection of datasets.
Through the project.config.BaseConfig object, data is sampled from the different qualities with configurable probabilities. The defaults are 70% good, 20% noisy, and 10% bad.
BaseConfig.size controls how many rows are produced. The config also holds the database credentials that will be used by SQLAlchemy. Check the code for more info.
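The weighted sampling described above could look roughly like the following. This is a minimal sketch: QualityConfig, its field names, and sample_qualities are hypothetical and do not mirror the real BaseConfig; only the 70/20/10 defaults come from the text.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class QualityConfig:
    """Hypothetical stand-in for BaseConfig; field names are assumptions."""
    size: int = 1000        # number of rows to produce
    prob_good: float = 0.7
    prob_noisy: float = 0.2
    prob_bad: float = 0.1


def sample_qualities(config: QualityConfig, rng: random.Random) -> list:
    """Draw one quality label per row, weighted by the configured probabilities."""
    labels = ["good", "noisy", "bad"]
    weights = [config.prob_good, config.prob_noisy, config.prob_bad]
    return rng.choices(labels, weights=weights, k=config.size)


qualities = sample_qualities(QualityConfig(), random.Random(0))
print(Counter(qualities))  # roughly 700 good, 200 noisy, 100 bad
```

Each sampled label would then be passed as the quality argument of a provider method when the corresponding row is built.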
The generator implements three important methods:
- populate reads the configuration object and produces lists of random objects accordingly. They are stored as properties of the generator, e.g. generator.flight_slots.
- to_csv takes the elements generated by populate in the generator and saves them to CSV files.
- to_sql (should be called to_db, I know) dumps the random data into a database defined in the config.
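The populate and to_csv flow can be illustrated with a toy generator. Everything here is an assumption made for the example: TinyGenerator, FlightSlot, and its two columns are hypothetical, and the real AircraftGenerator works from the Config and AirportProvider objects instead.

```python
import csv
import io
import random
from dataclasses import asdict, dataclass, fields


@dataclass
class FlightSlot:
    """Hypothetical row type; the real models have their own columns."""
    slot_id: int
    airport_code: str


class TinyGenerator:
    """Illustrative generator, not the project's AircraftGenerator."""

    def __init__(self, size: int, seed: int = 0):
        self.size = size
        self._rng = random.Random(seed)
        self.flight_slots = []

    def populate(self) -> None:
        """Fill the generator's list with `size` random objects."""
        codes = ["TIV", "BCN", "LHR"]  # sample values for illustration
        self.flight_slots = [
            FlightSlot(i, self._rng.choice(codes)) for i in range(self.size)
        ]

    def to_csv(self, handle) -> None:
        """Write the populated rows to an open text file handle."""
        names = [f.name for f in fields(FlightSlot)]
        writer = csv.DictWriter(handle, fieldnames=names)
        writer.writeheader()
        writer.writerows(asdict(slot) for slot in self.flight_slots)


generator = TinyGenerator(size=3)
generator.populate()
buffer = io.StringIO()
generator.to_csv(buffer)
print(buffer.getvalue())
```

Keeping the generated objects on the generator between the two calls means the same rows can be written to CSV and to a database without being regenerated.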
Most methods in AirportProvider that produce random data expose a quality parameter. The method outputs a random value accordingly, using a thin dispatcher, like this:
import random

def _quality_dispatcher(self, mapping, quality):
    # The noisy value is always derived from the good one.
    mapping["noisy"] = self.make_noisy(mapping["good"])
    return mapping[quality]

def frequency_units_kind(self, quality: str = "good") -> str:
    mapping = {
        # Good: sampled from the known frequency unit kinds.
        "good": self.random_element(self._frequency_units_kinds),
        # Bad: 2 to 5 random characters.
        "bad": self.random_string(random.randint(2, 5)),
    }
    return self._quality_dispatcher(mapping, quality)
This is where good and bad values are defined for each specific field. In this case, a good value is sampled from the collection of frequency unit kinds, and a bad value is a string of random characters with a random length between 2 and 5.
>>> fake_airport.frequency_units_kind()
'Miles'
>>> fake_airport.frequency_units_kind(quality = "noisy")
' mIles '
>>> fake_airport.frequency_units_kind(quality = "bad")
'hWsz'
A noisy value is a modified version of a good value that is meant to be fixable by trimming whitespace and converting to some agreed standard. For frequency units, this means:
- trim whitespace
- convert to lowercase
- convert first char to uppercase
In Python, something like this:
>>> noisy_str
' mIles '
>>> noisy_str = noisy_str.strip()
>>> noisy_str
'mIles'
>>> noisy_str = noisy_str[0].upper() + noisy_str[1:].lower()
>>> noisy_str
'Miles'
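The cleaning steps can be wrapped in a function and checked against a noise function as a round trip. This is a sketch under assumptions: make_noisy here is a hypothetical reimplementation (random letter case plus padding whitespace), not the provider's own method, and clean_unit is an illustrative name.

```python
import random


def make_noisy(value: str, rng: random.Random) -> str:
    """Hypothetical noise: random letter case plus padding whitespace."""
    scrambled = "".join(
        ch.upper() if rng.random() < 0.5 else ch.lower() for ch in value
    )
    return " " * rng.randint(1, 3) + scrambled + " " * rng.randint(1, 3)


def clean_unit(noisy: str) -> str:
    """Trim whitespace, lowercase, then uppercase the first character."""
    return noisy.strip().capitalize()


rng = random.Random(42)
noisy = make_noisy("Miles", rng)
print(repr(noisy))
print(clean_unit(noisy))  # Miles
```

The round trip only holds for values whose standard form is a single capitalized word; a multi-word unit would need a different normalization rule.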
Some important considerations regarding implementation:
- A bad value should not conform to the rules and should be impossible (or at least very hard) to reconstruct into a good one.
- In general, the bad version implemented is arbitrary.
- There should be only one way to add noise to an object, determined by its type.
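One way to enforce a single noise rule per type is type-based dispatch. The sketch below uses functools.singledispatch from the standard library; the add_noise name and both noise rules are assumptions made for illustration, not the project's behaviour.

```python
from functools import singledispatch


@singledispatch
def add_noise(value):
    """Fallback: types without a registered rule are rejected."""
    raise NotImplementedError(f"no noise rule for {type(value).__name__}")


@add_noise.register
def _(value: str) -> str:
    # Assumed rule for strings: swap letter case and pad with whitespace.
    return f" {value.swapcase()} "


@add_noise.register
def _(value: int) -> str:
    # Assumed rule for integers: render as text with stray whitespace.
    return f" {value} "


print(repr(add_noise("Miles")))  # ' mILES '
print(repr(add_noise(42)))       # ' 42 '
```

Registering exactly one function per type keeps the noise rule unambiguous, and an unsupported type fails loudly instead of passing through unchanged.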