feat(pipeline): refactor geocoding orchestration #270

vmttn · 2024-08-12T08:26:27Z

Adds a plpython function that handles geocoding with dbt directly in the database. Creates a dedicated model to geocode incrementally. More details in the commit messages.

datawarehouse/.dockerignore

datawarehouse/Dockerfile

datawarehouse/processings/Makefile

datawarehouse/processings/pyproject.toml

vperron · 2024-08-21T14:36:48Z

pipeline/dags/main.py

+            "source:*",
+            "path:models/staging/sources/**/*.sql",
+            "path:models/intermediate/sources/**/*.sql",
+            "path:models/intermediate/*.sql",


I wonder whether now is the time to put them in a union folder?

(I mean the files at the root of the intermediate folder, since there is roughly the "union" family, which all more or less need to be generated together anyway, and plausible_emails, which has little to do with them.)

I'd quite like to simplify:

int__union_services => int__services

etc

and clarify enhanced, which is a catch-all

I agree, this may be the right PR for a third, very simple renaming commit, before delegating the orchestration to cosmos.

hmm, I was saying it's probably better to handle that in a separate PR. I need to think about how to split up the modelling steps (union/validation/geocoding/etc.). If I do it now, the result will most likely be unsatisfying.

pipeline/requirements/airflow/requirements.in

datawarehouse/processings/scripts/create_udfs.sh

pipeline/dbt/models/intermediate/sources/dora/_dora__models.yml

pipeline/dbt/models/intermediate/sources/dora/int_dora__errors.sql

vmttn · 2024-08-25T18:58:01Z

I rebased to take into account the changes you made to the geocoding @vperron
I moved the models so that they apply to all sources. Unlike the structures, services and adresses models, there is no requirement to do it separately for each source, and I'm not yet convinced it justifies introducing 2 × (number of sources) models with macros. To be discussed.

pipeline/dbt/models/intermediate/_models.yml

vperron

OK for the first commit, with a few minor remarks, and I'd really like to see a unit test on the incremental model.

Personally I would have added something to keep the ban_api test from running locally by default (especially on a public repo), but it's your call; we can probably ignore it for now.

And I would have loved to see the intermediate batch address-handling step go away (and so geocode source by source while each one is processed), but I suppose that will come in another PR or with cosmos.

datawarehouse/processings/pyproject.toml

datawarehouse/processings/scripts/create_udfs.sh

pipeline/dbt/models/intermediate/int__geocodages.sql

vmttn · 2024-09-16T11:58:49Z

I've addressed your feedback @vperron. It works in staging.

vperron

Nits out of curiosity. I especially love the cleanup in the second commit <3

vperron · 2024-09-16T13:48:18Z

pipeline/dbt/macros/create_udfs.sql

@@ -9,6 +9,10 @@ Another way would be to use the `on-run-start` hook, but it does not play nicely

 {% set sql %}

+CREATE SCHEMA IF NOT EXISTS processings;
+
+{{ udf__geocode() }}


a tiny nit, but why not keep the create_udf__xxx convention? Or, conversely, rename the following ones?
I know you rarely do anything at random, so ^^ just curious

yes, I'll clean up the following ones in the next few days. I was afraid of pulling a thread and losing sight of the PR's goal 😅

datawarehouse/processings/tests/integration/test_geocoding.py

vperron · 2024-09-16T13:56:42Z

pipeline/dbt/models/intermediate/_models.yml

+      - name: code_insee
+        data_tests:
+          - dbt_utils.not_empty_string
+      - name: latitude


this makes me want to introduce https://github.com/calogica/dbt-expectations#expect_column_values_to_be_between ^^ maybe in another PR? I open a ticket, you add a TODO?

happy to explore that in another PR

(I'm skipping the TODO, because there are really a lot of places where we could put one)

on the testing strategy, I had this interesting reference in mind: https://www.datafold.com/blog/7-dbt-testing-best-practices?#2-add-essential-dbt-tests

so far I've focused on simple tests: not_null, relationships, dbt_utils.not_constant, dbt_utils.not_empty_string, more rarely accepted_values and very rarely a check with an expression.

I find that this has been useful without being overloaded. Ideally, if we tighten things up, we should set a clear strategy, because I'm afraid we could go very far and end up with several ways of doing things in the codebase, which would eventually become unmanageable 😅

vperron · 2024-09-16T14:00:37Z

pipeline/dbt/models/intermediate/_models.yml

+        data_tests:
+          - not_null:
+              config:
+                severity: warn


same remark, for a value between 0 and 1 with dbt-expectations?

curiosity: so the score can actually be NULL sometimes? In which cases?

pipeline/dbt/models/intermediate/_models.yml

The trick is to leverage plpython to do the geocoding inside the database. This way, geocoding can be modelled as a dbt model and orchestrated as such.

The geocoding implementation has been moved to an actual package maintained next to the datawarehouse image. The plpython udf simply wraps the call.

Is it worth it?

* it heavily simplifies the flow and separates concerns more clearly between airflow and dbt: dbt does the transformation, airflow orchestrates it.
* it is less error-prone, since we do not have to pull data from the db and map it to python ourselves.
* we can even leverage dbt to do the geocoding incrementally, on inputs that have not been seen before. This will drastically reduce our carbon footprint...

There are a few enhancements we would probably want:

* obviously clean up
* define a macro with the geocoding model that can be used for all sources
* re-do geocodings for relatively old inputs
* do the geocoding per source?
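The incremental idea from the commit message can be illustrated in plain Python, outside of dbt and plpython. This is only a sketch under stated assumptions: `geocode_incrementally`, the `geocoder` callable, the `Result` tuple layout and the dict used as a store are hypothetical names chosen for illustration. The PR itself does this with a dbt incremental model calling a plpython udf, which is not reproduced here.

```python
from typing import Callable, Optional

# Hypothetical result layout: (latitude, longitude, score), or None when
# the geocoder found no match.
Result = Optional[tuple[float, float, float]]


def geocode_incrementally(
    addresses: list[str],
    already_geocoded: dict[str, Result],
    geocoder: Callable[[str], Result],
) -> dict[str, Result]:
    """Geocode only the addresses that have not been seen before."""
    results = dict(already_geocoded)  # copy, so the input store is left untouched
    for address in addresses:
        if address in results:
            continue  # seen before: skip the costly geocoder call
        results[address] = geocoder(address)
    return results
```

The point of the design is that the number of geocoder calls grows with the number of new inputs rather than with the size of the whole dataset.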

vmttn · 2024-09-17T10:37:43Z

I made a change to allow re-geocoding addresses that got a null result, and added a unit_test case
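As an illustration of that selection rule only (not the SQL of `int__geocodages.sql`), a minimal Python sketch could look like the following. The function name `addresses_to_geocode` and the dict of previous results are assumptions made for the example.

```python
from typing import Optional

# Hypothetical result layout: (latitude, longitude, score), or None when
# the previous geocoding attempt returned nothing.
Result = Optional[tuple[float, float, float]]


def addresses_to_geocode(
    addresses: list[str],
    previous: dict[str, Result],
) -> list[str]:
    """Select addresses that are new or whose previous result was null."""
    # dict.get returns None both for unseen keys and for stored null
    # results, so a single check covers the two cases.
    return [address for address in addresses if previous.get(address) is None]
```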

pipeline/dbt/models/intermediate/int__geocodages.sql

pipeline/dbt/models/intermediate/_models.yml

vperron · 2024-09-17T11:36:45Z

We may have overlooked one aspect: doing a full refresh of the geocoding once every ... 12 months, for example. But I'm sure we'll have touched it again within a year, so...

The idea is to count on the BAN becoming more complete / better at searching, and on its resolution potentially becoming more precise (possibly turning stored geocodings with a score of, say, 0.3 into results with a better score and better consistency).
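One possible way to express such a periodic refresh, sketched in Python purely for illustration: the function name `is_stale`, the `geocoded_at` timestamp and the 365-day default are assumptions, and nothing in this PR implements it.

```python
from datetime import datetime, timedelta


def is_stale(
    geocoded_at: datetime,
    now: datetime,
    max_age: timedelta = timedelta(days=365),  # assumed threshold, roughly 12 months
) -> bool:
    """Return True when a stored geocoding is old enough to be recomputed."""
    return now - geocoded_at > max_age
```

Entries flagged this way could be sent back to the geocoder, so that older, low-score results get a chance to improve as the BAN evolves.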

vmttn requested a review from vperron August 12, 2024 08:26

vmttn force-pushed the vmttn/chore/geocoding-as-dbt branch from 1b84641 to 619c250 Compare August 19, 2024 07:25

vperron reviewed Aug 21, 2024

vmttn force-pushed the vmttn/chore/geocoding-as-dbt branch from 619c250 to c6c9937 Compare August 25, 2024 10:15

vperron reviewed Sep 2, 2024

pipeline/dbt/models/intermediate/_models.yml (resolved)

vperron approved these changes Sep 9, 2024

datawarehouse/processings/pyproject.toml (outdated, resolved)

datawarehouse/processings/scripts/create_udfs.sh (outdated, resolved)

pipeline/dbt/models/intermediate/int__geocodages.sql (outdated, resolved)

vmttn force-pushed the vmttn/chore/geocoding-as-dbt branch 2 times, most recently from 4b0c4bc to c3847dd Compare September 11, 2024 14:10

This was referenced Sep 11, 2024

chore(pipeline): reuse schema validation in pipeline #290

Draft

chore(pipeline): generate main dag from dbt, using cosmos #289

Draft

vmttn force-pushed the vmttn/chore/geocoding-as-dbt branch 2 times, most recently from 6a17d79 to dc637ca Compare September 13, 2024 15:49

vmttn marked this pull request as ready for review September 13, 2024 16:05

vmttn force-pushed the vmttn/chore/geocoding-as-dbt branch from dc637ca to 84826d4 Compare September 13, 2024 16:06

vmttn temporarily deployed to staging September 13, 2024 16:11 — with GitHub Actions Inactive

vmttn had a problem deploying to prod September 13, 2024 16:11 — with GitHub Actions Failure

vmttn temporarily deployed to staging September 13, 2024 16:34 — with GitHub Actions Inactive

vmttn had a problem deploying to prod September 13, 2024 16:34 — with GitHub Actions Failure

vmttn force-pushed the vmttn/chore/geocoding-as-dbt branch from 19986d3 to 9b4c059 Compare September 13, 2024 19:57

vmttn temporarily deployed to staging September 13, 2024 20:02 — with GitHub Actions Inactive

vmttn had a problem deploying to prod September 13, 2024 20:02 — with GitHub Actions Failure

vmttn force-pushed the vmttn/chore/geocoding-as-dbt branch from 9b4c059 to b17d307 Compare September 16, 2024 07:49

vmttn temporarily deployed to staging September 16, 2024 07:54 — with GitHub Actions Inactive

vmttn had a problem deploying to prod September 16, 2024 07:54 — with GitHub Actions Failure

vmttn force-pushed the vmttn/chore/geocoding-as-dbt branch from b17d307 to c08ce6c Compare September 16, 2024 08:28

vmttn temporarily deployed to staging September 16, 2024 08:32 — with GitHub Actions Inactive

vmttn had a problem deploying to prod September 16, 2024 08:32 — with GitHub Actions Failure

vmttn force-pushed the vmttn/chore/geocoding-as-dbt branch from c08ce6c to f7d8e95 Compare September 16, 2024 09:05

vmttn temporarily deployed to staging September 16, 2024 09:08 — with GitHub Actions Inactive

vmttn had a problem deploying to prod September 16, 2024 09:08 — with GitHub Actions Failure

vmttn temporarily deployed to staging September 16, 2024 10:22 — with GitHub Actions Inactive

vmttn had a problem deploying to prod September 16, 2024 10:22 — with GitHub Actions Failure

vmttn force-pushed the vmttn/chore/geocoding-as-dbt branch from 48ce257 to 1e9f0f3 Compare September 16, 2024 10:51

vmttn temporarily deployed to staging September 16, 2024 10:55 — with GitHub Actions Inactive

vmttn had a problem deploying to prod September 16, 2024 10:55 — with GitHub Actions Failure

vperron approved these changes Sep 16, 2024

vmttn added 2 commits September 17, 2024 12:30

chore(pipeline): clean up old geocoding task

d8b3f5a

vmttn force-pushed the vmttn/chore/geocoding-as-dbt branch from 1e9f0f3 to d8b3f5a Compare September 17, 2024 10:36

vmttn temporarily deployed to staging September 17, 2024 10:40 — with GitHub Actions Inactive

vmttn had a problem deploying to prod September 17, 2024 10:40 — with GitHub Actions Failure

vperron approved these changes Sep 17, 2024

pipeline/dbt/models/intermediate/int__geocodages.sql (resolved)

pipeline/dbt/models/intermediate/_models.yml (resolved)

hlecuyer mentioned this pull request Sep 17, 2024

feat(mes-aides): integration aides permis velo #296

Merged

vmttn merged commit 34bcac0 into main Sep 18, 2024
8 of 9 checks passed

vmttn deleted the vmttn/chore/geocoding-as-dbt branch September 18, 2024 08:58
