Merge branch 'main' into feat/pipeline/integration-fredo2
hlecuyer authored Jul 16, 2024
2 parents 53d95b0 + 16da257 commit bca26d5
Showing 45 changed files with 908 additions and 261 deletions.
7 changes: 5 additions & 2 deletions CONTRIBUTING.md
@@ -36,9 +36,12 @@ After a few seconds, the services should be available as follows:
| airflow UI | [http://localhost:8080](http://localhost:8080) | user: `airflow` pass: `airflow` |
| data.inclusion | [http://localhost:8000](http://localhost:8000/api/v0/docs) | token must be generated |

### `minio` client

Optional, but it allows you to interact with the datalake from the commandline.
### `minio` Client

This is optional but allows you to interact with the datalake locally from the command line.

See [DEPLOYMENT.md](DEPLOYMENT.md) if you also wish to interact with the staging and prod buckets.

See installation instructions [here](https://min.io/docs/minio/linux/reference/minio-mc.html).
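
A minimal sketch of what this looks like locally, assuming the default Docker Compose setup exposes MinIO on port 9000 and that you have already created an access key in the local MinIO console (the values in braces are placeholders):

```bash
# point an alias at the local MinIO instance, then list its buckets
mc alias set dev http://localhost:9000 {youraccesskey} {yoursecretkey}
mc ls dev
```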

53 changes: 52 additions & 1 deletion DEPLOYMENT.md
@@ -3,4 +3,55 @@
* The project is deployable on the Scalingo platform.
* Each service (pipeline, api, etc.) is deployed in its own application.
* It is made possible using the [`PROJECT_DIR`](https://doc.scalingo.com/platform/getting-started/common-deployment-errors#project-in-a-subdirectory) env variable defined in each app.
* Services are configured through the environment (see the sketch below).
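
As an illustration of this environment-based configuration, variables can be set per app from the Scalingo dashboard or CLI. A minimal sketch, assuming the staging API app name used elsewhere in this document and a `PROJECT_DIR` value that is only illustrative:

```bash
# tell Scalingo which subdirectory holds the service being deployed
scalingo -a data-inclusion-api-staging env-set PROJECT_DIR=api
```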



### Scaleway

If you need to interact with Scaleway, once your access has been set up with the right IAM configuration:

1. Install [Scaleway CLI](https://www.scaleway.com/en/docs/developer-tools/scaleway-cli/quickstart/#how-to-install-the-scaleway-cli-locally).
2. Generate an [SSH key](https://www.scaleway.com/en/docs/identity-and-access-management/organizations-and-projects/how-to/create-ssh-key/#how-to-upload-the-public-ssh-key-to-the-scaleway-interface) (if you don't already have one).
3. Upload it to [Scaleway](https://www.scaleway.com/en/docs/identity-and-access-management/organizations-and-projects/how-to/create-ssh-key/#how-to-upload-the-public-ssh-key-to-the-scaleway-interface).
4. Generate two API keys, one for the production bucket and one for the staging bucket.
5. You can then create the two Scaleway CLI profiles, one per environment, with the following command (shown for staging; the production profile follows the same pattern, as sketched after the block):
```bash
scw init -p staging \
access-key={youraccesskey} \
secret-key={yoursecretkey} \
organization-id={organization} \
project-id={projectid}
```
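
A sketch of the second profile, assuming you name it `prod` and use the API key generated for the production bucket:

```bash
# second profile, using the production API key
scw init -p prod \
  access-key={youraccesskey} \
  secret-key={yoursecretkey} \
  organization-id={organization} \
  project-id={projectid}
```

You can then target either environment on any command with the `-p` flag, e.g. `scw -p staging ...` or `scw -p prod ...`.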

### `minio` Client

This is optional but allows you to interact with the datalake from the command line (staging and prod).
It can be useful for debugging purposes.

See installation instructions [here](https://min.io/docs/minio/linux/reference/minio-mc.html).

You can then create aliases for Scaleway S3 staging and production, as well as one for your local MinIO server. For your local server, you first need to create an API key: after launching Docker Compose, go to the [console](http://localhost:9001), click on the `Access Keys` tab, and create an access key.

You can add aliases with the following command:
```bash
mc alias set dev http://localhost:9000 {youraccesskey} {yoursecretkey}
```

Do the same for staging and production (replace the access key and the secret key with the corresponding API keys you created in Scaleway):
```bash
mc alias set prod https://s3.fr-par.scw.cloud {youraccesskey} {yoursecretkey} --api S3v4
mc alias set staging https://s3.fr-par.scw.cloud {youraccesskey} {yoursecretkey} --api S3v4
```

You can test it out; you should see results that look like this:
```bash
$ mc ls prod
[2024-04-22 13:33:54 CEST] 0B data-inclusion-datalake-prod-grand-titmouse/
$ mc ls staging
[2024-04-10 19:45:43 CEST] 0B data-inclusion-datalake-staging-sincere-buzzard/
$ mc ls dev
[2024-06-11 10:08:06 CEST] 0B data-inclusion-lake/
```

You can now easily interact with all the buckets.
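For instance, two common operations (the bucket names are the ones shown above; the dates and prefixes are only illustrative):

```bash
# list a prefix on the staging datalake
mc ls staging/data-inclusion-datalake-staging-sincere-buzzard/data/marts/

# copy a day's data marts from staging to your local MinIO instance
mc cp --recursive staging/data-inclusion-datalake-staging-sincere-buzzard/data/marts/2024-06-12/ dev/data-inclusion-lake/data/marts/2024-06-12
```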
60 changes: 60 additions & 0 deletions api/CONTRIBUTING.md
@@ -30,6 +30,43 @@ alembic upgrade head
uvicorn data_inclusion.api.app:app --reload
```

## Initialize the Database with Data from Staging or Prod

### Prerequisites:
1. Launch Docker Compose.
2. Set up the MinIO aliases.

Check the [Deployment Guide](../DEPLOYMENT.md) for more details.

```bash
# Copy staging (or production) data mart to your local MinIO instance
mc cp --recursive staging/data-inclusion-datalake-staging-sincere-buzzard/data/marts/2024-06-12/ dev/data-inclusion-lake/data/marts/2024-06-12

# Activate the virtual environment (dependencies must already be installed)
source .venv/bin/activate

# Import the Admin Express database
python src/data_inclusion/api/cli.py import_admin_express

# Load the inclusion data
python src/data_inclusion/api/cli.py load_inclusion_data
```

## Initialize the Database with Data Computed Locally by Airflow

You can also run Airflow locally (potentially with fewer sources, or only the sources that interest you).
After running the main DAG:
```bash
# Activate the virtual environment (dependencies must already be installed)
source .venv/bin/activate

# Import the Admin Express database
python src/data_inclusion/api/cli.py import_admin_express

# Load the inclusion data
python src/data_inclusion/api/cli.py load_inclusion_data
```

## Running the test suite

```bash
@@ -54,3 +91,26 @@ make
```bash
make upgrade all
```

### Infrastructure

The app is deployed on Scalingo. Make sure you have access to the console.

Just like with Scaleway, it can be useful to install the [Scalingo CLI](https://doc.scalingo.com/platform/cli/start).

You also need to upload your [public key](https://www.scaleway.com/en/docs/dedibox-console/account/how-to/upload-an-ssh-key/) for the SSH connection. You can use the same key as for Scaleway.

Here are three useful commands (example for staging):

```bash
# Open psql
scalingo -a data-inclusion-api-staging pgsql-console

# Launch a one-off container
scalingo -a data-inclusion-api-staging run bash

# Open a tunnel
scalingo -a data-inclusion-api-staging db-tunnel SCALINGO_POSTGRESQL_URL
```

Once the tunnel is open, you need a database user to finish establishing the connection. You can create one from the database dashboard, in the users tab, then connect through the tunnel as sketched below.
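
As a sketch, once the tunnel is running (it listens on localhost, port 10000 by default; the CLI prints the actual address) you can connect with `psql`, where the user, password, and database name below are placeholders:

```bash
# connect to the staging database through the open tunnel
psql "postgresql://{youruser}:{yourpassword}@127.0.0.1:10000/{database_name}"
```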
48 changes: 43 additions & 5 deletions api/requirements/dev-requirements.txt
@@ -1,6 +1,7 @@
# This file was autogenerated by uv via the following command:
# uv pip compile setup.py --extra=dev --output-file=requirements/dev-requirements.txt
alembic==1.13.1
# via data-inclusion-api (setup.py)
annotated-types==0.6.0
# via pydantic
anyio==4.3.0
@@ -21,6 +22,7 @@ cachetools==5.3.3
# via tox
certifi==2024.2.2
# via
# data-inclusion-api (setup.py)
# fiona
# httpcore
# httpx
@@ -40,6 +42,7 @@ charset-normalizer==3.3.2
# via requests
click==8.1.7
# via
# data-inclusion-api (setup.py)
# click-plugins
# cligj
# fiona
@@ -51,8 +54,11 @@ cligj==0.7.2
colorama==0.4.6
# via tox
cryptography==42.0.5
# via python-jose
data-inclusion-schema==0.14.0
# via
# data-inclusion-api (setup.py)
# python-jose
data-inclusion-schema==0.15.0
# via data-inclusion-api (setup.py)
distlib==0.3.8
# via virtualenv
dnspython==2.6.1
@@ -62,25 +68,33 @@ ecdsa==0.19.0
email-validator==2.1.1
# via pydantic
faker==24.11.0
# via data-inclusion-api (setup.py)
fastapi==0.110.2
# via
# data-inclusion-api (setup.py)
# fastapi-debug-toolbar
# fastapi-pagination
# sentry-sdk
fastapi-debug-toolbar==0.6.2
# via data-inclusion-api (setup.py)
fastapi-pagination==0.12.23
# via data-inclusion-api (setup.py)
filelock==3.13.4
# via
# tox
# virtualenv
fiona==1.9.6
# via geopandas
furl==2.1.3
# via data-inclusion-api (setup.py)
geoalchemy2==0.14.7
# via data-inclusion-api (setup.py)
geopandas==0.14.3
# via data-inclusion-api (setup.py)
greenlet==3.0.3
# via sqlalchemy
gunicorn==22.0.0
# via data-inclusion-api (setup.py)
h11==0.14.0
# via
# httpcore
@@ -90,6 +104,7 @@ httpcore==1.0.5
httptools==0.6.1
# via uvicorn
httpx==0.27.0
# via data-inclusion-api (setup.py)
identify==2.5.36
# via pre-commit
idna==3.7
@@ -109,12 +124,14 @@ markupsafe==2.1.5
# jinja2
# mako
minio==7.2.5
# via data-inclusion-api (setup.py)
multivolumefile==0.2.3
# via py7zr
nodeenv==1.8.0
# via pre-commit
numpy==1.26.4
# via
# data-inclusion-api (setup.py)
# pandas
# pyarrow
# shapely
@@ -128,19 +145,25 @@ packaging==24.0
# pyproject-api
# tox
pandas==2.2.2
# via geopandas
# via
# data-inclusion-api (setup.py)
# geopandas
platformdirs==4.2.0
# via
# tox
# virtualenv
pluggy==1.5.0
# via tox
pre-commit==3.7.0
# via data-inclusion-api (setup.py)
psutil==5.9.8
# via py7zr
psycopg2==2.9.9
# via data-inclusion-api (setup.py)
py7zr==0.21.0
# via data-inclusion-api (setup.py)
pyarrow==16.0.0
# via data-inclusion-api (setup.py)
pyasn1==0.6.0
# via
# python-jose
@@ -155,6 +178,7 @@ pycryptodomex==3.20.0
# via py7zr
pydantic==2.7.0
# via
# data-inclusion-api (setup.py)
# data-inclusion-schema
# fastapi
# fastapi-debug-toolbar
@@ -166,7 +190,9 @@ pydantic-core==2.18.1
pydantic-extra-types==2.6.0
# via fastapi-debug-toolbar
pydantic-settings==2.2.1
# via fastapi-debug-toolbar
# via
# data-inclusion-api (setup.py)
# fastapi-debug-toolbar
pyinstrument==4.6.2
# via fastapi-debug-toolbar
pyppmd==1.1.0
@@ -181,22 +207,29 @@ python-dateutil==2.9.0.post0
# pandas
python-dotenv==1.0.1
# via
# data-inclusion-api (setup.py)
# pydantic-settings
# uvicorn
python-jose==3.3.0
# via data-inclusion-api (setup.py)
pytz==2024.1
# via pandas
# via
# data-inclusion-api (setup.py)
# pandas
pyyaml==6.0.1
# via
# pre-commit
# uvicorn
pyzstd==0.15.10
# via py7zr
requests==2.31.0
# via data-inclusion-api (setup.py)
rsa==4.9
# via python-jose
ruff==0.4.1
# via data-inclusion-api (setup.py)
sentry-sdk==1.45.0
# via data-inclusion-api (setup.py)
setuptools==69.5.1
# via nodeenv
shapely==2.0.4
@@ -214,6 +247,7 @@ sniffio==1.3.1
# httpx
sqlalchemy==2.0.29
# via
# data-inclusion-api (setup.py)
# alembic
# geoalchemy2
sqlparse==0.5.0
@@ -223,7 +257,9 @@ starlette==0.37.2
texttable==1.7.0
# via py7zr
tox==4.14.2
# via data-inclusion-api (setup.py)
tqdm==4.66.2
# via data-inclusion-api (setup.py)
typing-extensions==4.11.0
# via
# alembic
@@ -241,7 +277,9 @@ urllib3==2.2.1
# requests
# sentry-sdk
uv==0.1.35
# via data-inclusion-api (setup.py)
uvicorn==0.29.0
# via data-inclusion-api (setup.py)
uvloop==0.19.0
# via uvicorn
virtualenv==20.25.3