Implement structure to save Inspection digitalization efficiency #147

Open
2 tasks done
Tracked by #120
Francois-Werbrouck opened this issue Sep 5, 2024 · 3 comments · May be fixed by #160
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

@Francois-Werbrouck
Contributor

Francois-Werbrouck commented Sep 5, 2024

Context

With Fertiscan being an AI-powered solution, we need to quantify the efficiency of the models. To do so, it is necessary to numerically compare the original_dataset with the user-verified data. We will evaluate each inspection with multiple Levenshtein distance scores.

TODO

  • Create a column in inspection_factual to save the efficiency
  • Create a trigger to calculate the efficiency and save the data (see the SQL sketch after the diagram below)

Doc

```mermaid
erDiagram
    inspection_factual {
        uuid inspection_id PK
        uuid inspector_id
        uuid label_info_id
        uuid time_id FK
        uuid sample_id
        uuid company_id
        uuid manufacturer_id
        uuid picture_set_id
        timestamp inspection_date
        json original_dataset
        uuid verification_id
    }

    verification_dimension {
        uuid id PK
        int score
        int label_info_lev_total
        int label_name_lev
        int label_reg_num_lev
        int label_lot_num_lev
        int metrics_lists_modif
        int metrics_lev
        int manufacturer_field_edited
        int manufacturer_lev_total
        int company_field_edited
        int company_lev_total
        int instructions_en_lists_modif
        int instructions_fr_lists_modif
        int instructions_en_lev
        int instructions_fr_lev
        int cautions_en_lists_modif
        int cautions_fr_lists_modif
        int cautions_en_lev
        int cautions_fr_lev
        int guaranteeds_en_lists_modif
        int guaranteeds_fr_lists_modif
        int guaranteeds_en_lev
        int guaranteeds_fr_lev
    }

    inspection_factual ||--|| verification_dimension : "evaluate"
```
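As a minimal sketch of what the TODO could look like in SQL: it assumes the fuzzystrmatch extension for levenshtein(), and the JSON paths inside original_dataset, the abbreviated column list, and the function/trigger names are placeholders rather than the final design.

```sql
-- Minimal sketch only: abbreviated column list, placeholder JSON paths and names.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;  -- provides levenshtein()

CREATE TABLE IF NOT EXISTS verification_dimension (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- gen_random_uuid() needs PostgreSQL 13+ (or pgcrypto)
    score int,
    label_name_lev int
    -- ... remaining *_lev / *_lists_modif columns from the diagram above
);

CREATE OR REPLACE FUNCTION compute_verification_scores() RETURNS trigger AS $$
DECLARE
    v_id uuid := gen_random_uuid();
BEGIN
    -- Compare the original digitalized value with the user-verified value.
    -- Both JSON paths are illustrative placeholders, not the real structure.
    INSERT INTO verification_dimension (id, label_name_lev)
    VALUES (
        v_id,
        levenshtein(
            COALESCE(NEW.original_dataset -> 'original' ->> 'name', ''),
            COALESCE(NEW.original_dataset -> 'verified' ->> 'name', '')
        )
    );
    NEW.verification_id := v_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER verification_scores_trg
    BEFORE INSERT OR UPDATE ON inspection_factual
    FOR EACH ROW EXECUTE FUNCTION compute_verification_scores();
```

Note that the 255-character limitation discussed below would still apply to levenshtein() in this sketch.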
@Francois-Werbrouck
Contributor Author

Current known 'issues' that still need to be resolved:

  • Text over 255 characters can't be compared and throws errors
  • We should find a way to make the edited boolean usable
  • Score calculation is still not implemented

Francois-Werbrouck added a commit that referenced this issue Oct 2, 2024
@Francois-Werbrouck
Contributor Author

Francois-Werbrouck commented Oct 4, 2024

> Text over 255 characters can't be compared and throws errors

An avenue was found here if we don't want to partition our data. I'm also facing role/permission issues; I've opened a ticket with the Database Server Admins.
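One possible stopgap (not necessarily the avenue linked above, and lossy by design) is a small wrapper that truncates both inputs, since fuzzystrmatch's levenshtein() rejects arguments longer than 255 characters:

```sql
-- Hypothetical helper: levenshtein() from fuzzystrmatch errors out when either
-- input exceeds 255 characters, so truncate both sides first. This is a lossy
-- approximation; chunking the text or switching to pg_trgm are alternatives.
CREATE OR REPLACE FUNCTION safe_levenshtein(a text, b text) RETURNS int AS $$
    SELECT levenshtein(left(COALESCE(a, ''), 255), left(COALESCE(b, ''), 255));
$$ LANGUAGE sql IMMUTABLE;
```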

> We should find a way to make the edited boolean usable

I've started experimenting with pg_trgm. I still need to find a relevant threshold and figure out how to handle new additions to the arrays.
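A rough sketch of the pg_trgm idea; the cutoff is an arbitrary starting point for experimentation, not a validated threshold:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical "edited" check: consider a field unchanged when the trigram
-- similarity between the original and verified text stays above a threshold.
-- The 0.9 cutoff is a placeholder to experiment with, not a validated value.
SELECT similarity('Guaranteed minimum analysis', 'Guaranteed minimal analysis') >= 0.9 AS unchanged;
```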

@ChromaticPanic

ChromaticPanic commented Oct 31, 2024

As a data analyst, I'm not sure of the utility of storing this data. I think this might be an unnecessary increase in database complexity. There are multiple ways to look at data, and if we encode this in the database it becomes too cumbersome to try out different evaluation metrics. A lot of the algorithms are already implemented in Pandas or R. It's much easier to pull the data and run the needed analytics in Jupyter notebooks: it would be much faster to iterate on algorithm changes, and much easier to build dashboards to look at trends.

So this issue specifically is just about Levenshtein distances. I think this is something easy enough to calculate when we want to see the information; run a Jupyter notebook once a week if we need to look at trends. We also do not want to unnecessarily increase our storage footprint: if recent metrics are what matter (current model performance), then storing all the extra old data is just wasted space.
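For illustration, the on-demand computation could be as simple as a query run from a notebook; the JSON paths below are placeholders and would need to match the actual structure of original_dataset and the verified data:

```sql
-- Hypothetical ad-hoc query: compute the distance at read time instead of
-- persisting it. The JSON paths stand in for original vs. verified values.
SELECT
    inspection_id,
    levenshtein(
        left(COALESCE(original_dataset -> 'original' ->> 'name', ''), 255),
        left(COALESCE(original_dataset -> 'verified' ->> 'name', ''), 255)
    ) AS label_name_lev
FROM inspection_factual
WHERE inspection_date >= now() - interval '7 days';
```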

There are other metrics we could evaluate. For example, we could have a metric that detects whether fields are being swapped.

So one trade-off here is storage vs. runtime compute. This calculation is cheap enough, even on bulk data, that it doesn't make sense to pre-calculate it and store it in the database.

The other trade-off is flexibility, in terms of having multiple metrics and making changes to them. What happens when we change the schema? Then all the previously precalculated metrics become incomparable to the new ones.

Scaling is another trade-off. Compute is easy to scale vertically and horizontally, while our DB instances can scale vertically but are not set up for horizontal scaling and are much harder to scale that way.
