feat(pipeline): Add a first model to monitor data quality #259

Merged: 6 commits, Aug 6, 2024

@vperron vperron commented Jul 22, 2024

First version of a "quality" model that reports on data quality.

Current output on the July 22 data:

data-inclusion=# SELECT * FROM public_intermediate.int_quality__stats ORDER BY source, stream;
        date         |        source         |         stream         | count_raw | count_stg | count_int | count_marts | count_api 
---------------------+-----------------------+------------------------+-----------+-----------+-----------+-------------+-----------
 2024-07-22 02:00:00 | action_logement       | services               |        26 |        23 |      2760 |        2760 |      2760
 2024-07-22 02:00:00 | action_logement       | structures             |       123 |       120 |       120 |         120 |       120
 2024-07-22 02:00:00 | agefiph               | services               |        31 |        31 |        27 |          27 |        27
 2024-07-22 02:00:00 | cd35                  | organisations          |      3545 |      3545 |      3545 |        3544 |      3540
 2024-07-22 02:00:00 | cd72                  | services               |       474 |       463 |       463 |         260 |         0
 2024-07-22 02:00:00 | cd72                  | structures             |       217 |       217 |       217 |         213 |       457
 2024-07-22 02:00:00 | data_inclusion        | services               |        47 |        44 |        44 |          44 |        44
 2024-07-22 02:00:00 | data_inclusion        | structures             |        22 |        19 |        19 |          19 |        19
 2024-07-22 02:00:00 | dora                  | services               |     17717 |     11707 |     11707 |       11036 |     11034
 2024-07-22 02:00:00 | dora                  | structures             |      8554 |      8554 |      8554 |        8545 |      8538
 2024-07-22 02:00:00 | emplois_de_linclusion | organisations          |      8589 |      8589 |     15824 |       15821 |     15821
 2024-07-22 02:00:00 | emplois_de_linclusion | siaes                  |      7235 |      7235 |     15824 |       15821 |     15821
 2024-07-22 02:00:00 | france_travail        | agences                |       888 |       888 |       888 |         888 |       886
 2024-07-22 02:00:00 | france_travail        | services               |        28 |        25 |     22200 |       22200 |     22150
 2024-07-22 02:00:00 | fredo                 | structures             |      1080 |      1080 |        -1 |           0 |         0
 2024-07-22 02:00:00 | mediation_numerique   | services               |     20798 |     19445 |     19445 |       19424 |     19417
 2024-07-22 02:00:00 | mediation_numerique   | structures             |     20798 |     19445 |     19445 |       19424 |     19417
 2024-07-22 02:00:00 | mes_aides             | aides                  |       690 |       690 |       960 |         947 |       916
 2024-07-22 02:00:00 | mes_aides             | garages                |       908 |       908 |       870 |         863 |       847
 2024-07-22 02:00:00 | monenfant             | creches                |     84597 |     13481 |     13481 |       13479 |     13466
 2024-07-22 02:00:00 | odspep                | DD009_RES_PARTENARIALE |     28262 |        -1 |      6438 |        6428 |      6329
 2024-07-22 02:00:00 | reseau_alpha          | formations             |       388 |       388 |       388 |         388 |       297
 2024-07-22 02:00:00 | reseau_alpha          | structures             |       757 |       757 |       757 |         752 |       752
 2024-07-22 02:00:00 | soliguide             | lieux                  |     22028 |     22028 |     22028 |       22028 |     21980
 2024-07-22 02:00:00 | un_jeune_une_solution | benefits               |       974 |       974 |        -1 |           0 |         0
 2024-07-22 02:00:00 | un_jeune_une_solution | institutions           |       713 |       713 |        -1 |           0 |         0
(26 rows)

There are not many oddities left, but a small issue remains with the CD72 API data, since the fix is part of this PR ^^
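
The remaining oddities in the output above can be spotted mechanically. Below is a minimal Python sketch, assuming rows shaped like the int_quality__stats output; the helper name `flag_oddities` and its three rules (a negative count, an empty API next to a non-empty mart, an API larger than its mart) are illustrative assumptions and not part of this PR.

```python
from typing import NamedTuple

# Stage columns, in pipeline order, as they appear in int_quality__stats.
STAGES = ("count_raw", "count_stg", "count_int", "count_marts", "count_api")


class StatsRow(NamedTuple):
    source: str
    stream: str
    count_raw: int
    count_stg: int
    count_int: int
    count_marts: int
    count_api: int


def flag_oddities(row: StatsRow) -> list[str]:
    """Return human-readable warnings for one stats row (hypothetical rules)."""
    issues = []
    for stage in STAGES:
        # A negative value cannot be a real row count.
        if getattr(row, stage) < 0:
            issues.append(f"{stage} is negative")
    if row.count_marts > 0 and row.count_api == 0:
        issues.append("API is empty although marts has rows")
    elif row.count_api > row.count_marts >= 0:
        issues.append("API has more rows than marts")
    return issues


# A few rows copied from the July 22 output above.
rows = [
    StatsRow("cd72", "services", 474, 463, 463, 260, 0),
    StatsRow("cd72", "structures", 217, 217, 217, 213, 457),
    StatsRow("fredo", "structures", 1080, 1080, -1, 0, 0),
    StatsRow("dora", "structures", 8554, 8554, 8554, 8545, 8538),
]

for row in rows:
    for issue in flag_oddities(row):
        print(f"{row.source}/{row.stream}: {issue}")
```

With these rules, both CD72 rows and the fredo row are reported, while the dora row passes.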

As usual, it is much better to review in order, commit by commit; in theory they are all atomic and independent. The first 4 are mostly utility/cleanup commits.

The int_quality__stats model is "intermediate" because:

  • it is not final data in the sense that "consumers will use it"
  • models in marts are required to honor a data contract

If this looks right in broad strokes, I propose the following next steps:

  • add extra columns counting the important data (contacts, address, etc.)
  • historization (doable in a few minutes via a snapshot on the date column)
  • add it as a final step of the main DAG (or of the import_data_inclusion_api DAG, so that the API data is that of the current day rather than D-1?)

@vperron vperron force-pushed the vperron/quality-dashboard branch from 51cb444 to eb27546 Compare July 25, 2024 10:44
@vperron vperron force-pushed the vperron/quality-dashboard branch from eb27546 to e81d717 Compare July 26, 2024 10:57
@vperron vperron changed the base branch from main to vperron/chores July 26, 2024 10:57
@vperron vperron force-pushed the vperron/quality-dashboard branch 2 times, most recently from 508c602 to dd46e04 Compare July 26, 2024 13:40
Base automatically changed from vperron/chores to main July 31, 2024 12:12
@vperron vperron force-pushed the vperron/quality-dashboard branch 2 times, most recently from c6792cc to fae1755 Compare August 2, 2024 16:22

@vmttn vmttn left a comment

A few small nits; I'll leave them to your judgment.

vperron added 6 commits August 5, 2024 16:08
A refactoring commit 7 months ago deactivated those by mistake.
Let's re-enable them.
The 'services' table has an implicit dependency on the 'structures' mart,
as its constraint enforces a check of its structure_id key against that
table.

Make sure dbt knows about it so it generates the structures first.

Referenced by dbt-labs/dbt-core#8062,
might be fixed in dbt-labs/dbt-common#163
last week but:
- not released
- not documented
- not sure the commit will actually help when I read it, needs more
  changes I suppose
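
Why the explicit declaration matters can be illustrated outside dbt. The following is a minimal Python sketch, not dbt's own code: a build order is derived only from declared dependencies, so a foreign-key relation that is never declared cannot influence it. The node names `marts_structures` and `marts_services` are hypothetical.

```python
from graphlib import TopologicalSorter


def build_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which each node comes after its declared dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


# Without a declaration, nothing ties the two marts together:
# either one may legitimately be built first.
undeclared = {"marts_services": set(), "marts_structures": set()}

# With the dependency declared, structures must come before services.
declared = {"marts_services": {"marts_structures"}, "marts_structures": set()}

order = build_order(declared)
print(order)
```
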
Those will be useful for our further analysis, or even if we wanted to
base our Metabase requests on the actual API contents instead of the
marts.
A prior investigation showed that for two different sources, we could
find trailing spaces in quite a few lines.

This is now cleaned, but would also be nice to have a test.
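
Such a test could take the following shape. This is a minimal Python sketch of the check itself, assuming rows are available as dictionaries; the function name `untrimmed_values` and the sample rows are hypothetical, it also flags leading whitespace (a choice made here, beyond the trailing spaces found), and in the pipeline it would more likely be written as a dbt test.

```python
def untrimmed_values(rows: list[dict], columns: list[str]) -> list[tuple[int, str]]:
    """Return (row index, column) pairs whose text value has leading or trailing whitespace."""
    found = []
    for index, row in enumerate(rows):
        for column in columns:
            value = row.get(column)
            # Only text values can carry stray whitespace; skip NULLs and numbers.
            if isinstance(value, str) and value != value.strip():
                found.append((index, column))
    return found


# Made-up sample rows, for illustration only.
sample = [
    {"nom": "Mairie ", "commune": "Rennes"},
    {"nom": "CCAS", "commune": " Le Mans"},
    {"nom": "Maison", "commune": None},
]

print(untrimmed_values(sample, ["nom", "commune"]))
```
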
It makes more sense this way and keeps the rest of the sources
consistent.
First working version before we attempt going further (snapshotting,
improvements on the ODSPEP or other sources, values per column counts,
etc.)

data-inclusion=# SELECT * FROM public_intermediate.int_quality__stats ORDER BY source, stream;
  date_day  |        source         |         stream         | count_raw | count_stg | count_int | count_marts | count_api | count_contacts | count_addresses
------------+-----------------------+------------------------+-----------+-----------+-----------+-------------+-----------+----------------+-----------------
 2024-08-02 | action_logement       | services               |        26 |        23 |      2760 |        2760 |      2760 |              0 |            2760
 2024-08-02 | action_logement       | structures             |       123 |       120 |       120 |         120 |       120 |              0 |             120
 2024-08-02 | agefiph               | services               |        31 |        31 |        27 |          27 |        27 |              0 |              27
 2024-08-02 | cd35                  | organisations          |      3545 |      3545 |      3545 |        3544 |      3540 |           2594 |            3422
 2024-08-02 | cd72                  | services               |       474 |       463 |       463 |         260 |         0 |              0 |               0
 2024-08-02 | cd72                  | structures             |       217 |       217 |       217 |         213 |       457 |             41 |             394
 2024-08-02 | data_inclusion        | services               |        47 |        44 |        44 |          44 |        44 |             17 |              21
 2024-08-02 | data_inclusion        | structures             |        22 |        19 |        19 |          19 |        19 |              0 |              19
 2024-08-02 | dora                  | services               |     17717 |     11707 |     11707 |       11036 |     11034 |           8430 |           10160
 2024-08-02 | dora                  | structures             |      8554 |      8554 |      8554 |        8545 |      8538 |           3260 |            8342
 2024-08-02 | emplois_de_linclusion | organisations          |      8589 |      8589 |     15824 |       15821 |     15821 |           7041 |           15712
 2024-08-02 | emplois_de_linclusion | siaes                  |      7235 |      7235 |     15824 |       15821 |     15821 |           7041 |           15712
 2024-08-02 | france_travail        | agences                |       888 |       888 |       888 |         888 |       886 |              0 |             886
 2024-08-02 | france_travail        | services               |        28 |        25 |     22200 |       22200 |     22150 |              0 |           22150
 2024-08-02 | mediation_numerique   | services               |     20798 |     19445 |     19445 |       19424 |     19417 |           8158 |           19417
 2024-08-02 | mediation_numerique   | structures             |     20798 |     19445 |     19445 |       19424 |     19417 |           8158 |           19417
 2024-08-02 | mes_aides             | aides                  |       690 |       690 |       960 |         947 |       916 |            217 |             914
 2024-08-02 | mes_aides             | garages                |       908 |       908 |       870 |         863 |       847 |            163 |             845
 2024-08-02 | monenfant             | creches                |     84597 |     13481 |     13481 |       13479 |     13466 |          11098 |           13466
 2024-08-02 | odspep                | DD009_RES_PARTENARIALE |     28262 |      6438 |      9001 |        8717 |      8618 |              0 |            6591
 2024-08-02 | reseau_alpha          | formations             |       388 |       388 |       388 |         388 |       297 |            239 |             269
 2024-08-02 | reseau_alpha          | structures             |       757 |       757 |       757 |         752 |       752 |            538 |             667
 2024-08-02 | soliguide             | lieux                  |     22028 |     22028 |     22028 |       22028 |     21980 |              0 |           21980
@vperron vperron force-pushed the vperron/quality-dashboard branch from b9df406 to 3600b92 Compare August 5, 2024 14:36

@vmttn vmttn left a comment

Great.

@vperron vperron merged commit e9f889e into main Aug 6, 2024
4 of 8 checks passed
@vperron vperron deleted the vperron/quality-dashboard branch August 6, 2024 07:47