diff --git a/README.md b/README.md index 71c8ae18..43fd2972 100644 --- a/README.md +++ b/README.md @@ -16,11 +16,11 @@ This repository contains a curated list of awesome open source libraries that wi | [โš” Adversarial Robustness](#adversarial-robustness) | [๐Ÿ”ด Anomaly Detection](#anomaly-detection) | [๐Ÿค– AutoML](#automl) | | [๐Ÿ—บ๏ธ Computation Load Distribution](#computation-load-distribution) | [๐Ÿท๏ธ Data Labelling & Synthesis](#data-labelling-and-synthesis) | [๐Ÿงต Data Pipeline](#data-pipeline) | | [๐Ÿ““ Data Science Notebook](#ds-notebook) | [๐Ÿ’พ Data Storage Optimisation](#data-storage-optimisation) | [๐Ÿ’ธ Data Stream Processing](#data-stream-processing) | -| [๐Ÿ“ˆ Evaluation & Monitoring](#evaluation-and-monitoring) | [๐Ÿ” Explainability & Interpretability](#explainability-and-interpretability) | [๐ŸŽ Feature Store](#feature-store) | -| [๐Ÿ‘๏ธ Industry-strength Computer Vision](#industry-strength-cv) | [๐Ÿ”  Industry-strength Natural Language Processing](#industry-strength-nlp) | [๐Ÿ™Œ Industry-strength Recommender System](#industry-strength-recsys) | +| [๐Ÿ” Explainability & Interpretability](#explainability-and-interpretability) | [๐ŸŽ Feature Store](#feature-store) | [๐Ÿ‘๏ธ Industry-strength Computer Vision](#industry-strength-cv) | +| [๐Ÿ“ˆ Industry-strength Evaluation](#industry-strength-evaluation) | [๐Ÿ”  Industry-strength Natural Language Processing](#industry-strength-nlp) | [๐Ÿ™Œ Industry-strength Recommender System](#industry-strength-recsys) | | [๐Ÿ• Industry-strength Reinforcement Learning](#industry-strength-rl) | [๐Ÿ“Š Industry-strength Visualisation](#industry-strength-visualisation) | [๐Ÿ“… Metadata Management](#metadata-management) | | [๐Ÿ“œ Model, Data & Experiment Tracking](#model-data-and-experiment-tracking) | [๐Ÿ”ฉ Model Compilation, Compression & Optimization](#model-compilation-compression-and-optimization) | [๐Ÿ“ฅ Model Serialisation](#model-serialisation) | -| [๐Ÿ’ช Model Serving](#model-serving) | [๐Ÿ Model Training Orchestration](#model-training-orchestration) | [๐Ÿ”ฅ Neural Search](#neural-search) | +| [๐Ÿ’ช Model Serving & Monitoring](#model-serving-and-monitoring) | [๐Ÿ Model Training Orchestration](#model-training-orchestration) | [๐Ÿ”ฅ Neural Search](#neural-search) | | [๐Ÿงฎ Optimized Computation](#optimized-computation) | [๐Ÿ” Privacy & Security](#privacy-security) | [๐Ÿ’ฐ Commercial Platform](#commercial-platform) | ## Contributing to the list @@ -265,58 +265,6 @@ Please review our [CONTRIBUTING.md](https://github.com/EthicalML/awesome-product * [RobustBench](https://github.com/RobustBench/robustbench) ![](https://img.shields.io/github/stars/RobustBench/robustbench.svg?style=social) - another robustness resource maintained by some of the leading names in adversarial ML. They specifically focus on defenses, and onesa standardized adversarial robustness benchmark. -## Evaluation and Monitoring -* [AlpacaEval](https://github.com/tatsu-lab/alpaca_eval) ![](https://img.shields.io/github/stars/tatsu-lab/alpaca_eval.svg?style=social) - AlpacaEval is an automatic evaluator for instruction-following language models. -* [ARES](https://github.com/openml/automlbenchmark) ![](https://img.shields.io/github/stars/openml/automlbenchmark.svg?style=social) - ARES is a framework for automatically evaluating Retrieval-Augmented Generation (RAG) models. 
-* [AutoML Benchmark](https://github.com/openml/automlbenchmark) ![](https://img.shields.io/github/stars/openml/automlbenchmark.svg?style=social) - AutoML Benchmark is a framework for evaluating and comparing open-source AutoML systems. -* [Banana-lyzer](https://github.com/reworkd/bananalyzer) ![](https://img.shields.io/github/stars/reworkd/bananalyzer.svg?style=social) - Banana-lyzer is an open source AI Agent evaluation framework and dataset for web tasks with Playwright. -* [Code Generation LM Evaluation Harness](https://github.com/bigcode-project/bigcode-evaluation-harness) ![](https://img.shields.io/github/stars/bigcode-project/bigcode-evaluation-harness.svg?style=social) - Code Generation LM Evaluation Harness is a framework for the evaluation of code generation models. -* [Deepchecks](https://github.com/deepchecks/deepchecks) ![](https://img.shields.io/github/stars/deepchecks/deepchecks.svg?style=social) - Deepchecks is a holistic open-source solution for all of your AI & ML validation needs, enabling you to thoroughly test your data and models from research to production. -* [DeepEval](https://github.com/confident-ai/deepeval) ![](https://img.shields.io/github/stars/confident-ai/deepeval.svg?style=social) - DeepEval is a simple-to-use, open-source evaluation framework for LLM applications. -* [EvalAI](https://github.com/Cloud-CV/EvalAI) ![](https://img.shields.io/github/stars/Cloud-CV/EvalAI.svg?style=social) - EvalAI is an open-source platform for evaluating and comparing AI algorithms at scale. -* [Evals](https://github.com/openai/evals) ![](https://img.shields.io/github/stars/openai/evals.svg?style=social) - Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks. -* [EvalScope](https://github.com/modelscope/evalscope) ![](https://img.shields.io/github/stars/modelscope/evalscope.svg?style=social) - EvalScope is a streamlined and customizable framework for efficient large model evaluation and performance benchmarking. -* [EvalPlus](https://github.com/evalplus/evalplus) ![](https://img.shields.io/github/stars/evalplus/evalplus.svg?style=social) - EvalPlus is a rigorous evaluation framework for LLM4Code. -* [Evaluate](https://github.com/huggingface/evaluate) ![](https://img.shields.io/github/stars/huggingface/evaluate.svg?style=social) - Evaluate is a library that makes evaluating and comparing models and reporting their performance easier and more standardized. -* [Evalverse](https://github.com/UpstageAI/evalverse) ![](https://img.shields.io/github/stars/UpstageAI/evalverse.svg?style=social) - Evalverse is a framework to effortlessly evaluate and report LLMs with no-code requests and comprehensive reports. -* [Evidently](https://github.com/evidentlyai/evidently) ![](https://img.shields.io/github/stars/evidentlyai/evidently.svg?style=social) - Evidently is an open-source framework to evaluate, test and monitor ML and LLM-powered systems. -* [FlagEval](https://github.com/FlagOpen/FlagEval) ![](https://img.shields.io/github/stars/FlagOpen/FlagEval.svg?style=social) - FlagEval is an open-source evaluation toolkit as well as an open platform for evaluation of large models. -* [FMBench](https://github.com/aws-samples/foundation-model-benchmarking-tool) ![](https://img.shields.io/github/stars/aws-samples/foundation-model-benchmarking-tool.svg?style=social) - FMBench is a tool for running performance benchmarks for any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. 
-* [Giskard](https://github.com/Giskard-AI/giskard)![](https://img.shields.io/github/stars/Giskard-AI/giskard.svg?style=social) - Giskard is an evaluation & testing framework for LLMs & ML models. -* [HarmBench](https://github.com/centerforaisafety/HarmBench) ![](https://img.shields.io/github/stars/centerforaisafety/HarmBench.svg?style=social) - HarmBench is a fast and scalable framework for evaluating automated red teaming methods and LLM attacks/defenses. -* [Helicone](https://github.com/Helicone/helicone) ![](https://img.shields.io/github/stars/Helicone/helicone.svg?style=social) - Helicone is an observability platform for LLMs. -* [HELM (Holistic Evaluation of Language Models)](https://github.com/stanford-crfm/helm) ![](https://img.shields.io/github/stars/stanford-crfm/helm.svg?style=social) - crfm-helm provides tools for the holistic evaluation of language models, including standardized datasets, a unified API for various models, diverse metrics, robustness, and fairness perturbations, a prompt construction framework, and a proxy server for unified model access. -* [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai) ![](https://img.shields.io/github/stars/UKGovernmentBEIS/inspect_ai.svg?style=social) - Inspect is a framework for large language model evaluations. -* [InterCode](https://github.com/princeton-nlp/intercode) ![](https://img.shields.io/github/stars/princeton-nlp/intercode.svg?style=social) - InterCode is a lightweight, flexible, and easy-to-use framework for designing interactive code environments to evaluate language agents that can code. -* [Langfuse](https://github.com/langfuse/langfuse) ![](https://img.shields.io/github/stars/langfuse/langfuse.svg?style=social) - Langfuse is an observability & analytics solution for LLM-based applications. -* [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) ![](https://img.shields.io/github/stars/EleutherAI/lm-evaluation-harness.svg?style=social) - Language Model Evaluation Harness is a framework to test generative language models on a large number of different evaluation tasks. -* [LightEval](https://github.com/huggingface/lighteval) ![](https://img.shields.io/github/stars/huggingface/lighteval.svg?style=social) - LightEval is a lightweight LLM evaluation suite. -* [LLMonitor](https://github.com/lunary-ai/lunary) ![](https://img.shields.io/github/stars/lunary-ai/lunary.svg?style=social) - LLMonitor is an observability & analytics for AI apps and agents. -* [LLMPerf](https://github.com/ray-project/llmperf) ![](https://img.shields.io/github/stars/ray-project/llmperf.svg?style=social) - LLMPerf is a tool for evaulation the performance of LLM APIs. -* [LLM AutoEval](https://github.com/mlabonne/llm-autoeval) ![](https://img.shields.io/github/stars/mlabonne/llm-autoeval.svg?style=social) - LLM AutoEval simplifies the process of evaluating LLMs using a convenient Colab notebook. You just need to specify the name of your model, a benchmark, a GPU, and press run! -* [mltrace](https://github.com/loglabs/mltrace) ![](https://img.shields.io/github/stars/loglabs/mltrace.svg?style=social) - mltrace is a lightweight, open-source Python tool to get "bolt-on" observability in ML pipelines. -* [MTEB](https://github.com/embeddings-benchmark/mteb) ![](https://img.shields.io/github/stars/embeddings-benchmark/mteb.svg?style=social) - Massive Text Embedding Benchmark (MTEB) is a comprehensive benchmark of text embeddings. 
-* [NannyML](https://github.com/NannyML/nannyml) ![](https://img.shields.io/github/stars/nannyml/nannyml.svg?style=social) - NannyML is a library that allows you to estimate post-deployment model performance (without access to targets), detect data drift, and intelligently link data drift alerts back to changes in model performance. -* [OLMo-Eval](https://github.com/allenai/OLMo-Eval) ![](https://img.shields.io/github/stars/allenai/OLMo-Eval.svg?style=social) - OLMo-Eval is a framework for evaluating open language models. -* [OpenCompass](https://github.com/open-compass/OpenCompass) ![](https://img.shields.io/github/stars/open-compass/OpenCompass.svg?style=social) - OpenCompass is an LLM evaluation platform, supporting a wide range of models (LLaMA, LLaMa2, ChatGLM2, ChatGPT, Claude, etc) over 50+ datasets. -* [Optimum-Benchmark](https://github.com/huggingface/optimum-benchmark) ![](https://img.shields.io/github/stars/huggingface/optimum-benchmark.svg?style=social) - A unified multi-backend utility for benchmarking Transformers and Diffusers with support for Optimum's arsenal of hardware optimizations/quantization schemes. -* [Overcooked-AI](https://github.com/HumanCompatibleAI/overcooked_ai) ![](https://img.shields.io/github/stars/HumanCompatibleAI/overcooked_ai.svg?style=social) - Overcooked-AI is a benchmark environment for fully cooperative human-AI task performance, based on the wildly popular video game Overcooked. -* [PhaseLLM](https://github.com/wgryc/phasellm) ![](https://img.shields.io/github/stars/wgryc/phasellm.svg?style=social) - PhaseLLM is a large language model evaluation and workflow framework. -* [Phoenix](https://github.com/Arize-ai/phoenix) ![](https://img.shields.io/github/stars/arize-ai/phoenix?style=social) - Phoenix is an ML observability in a notebook to validate, monitor, and fine-tune your generative LLM, CV, and tabular models. -* [PromptBench](https://github.com/microsoft/promptbench) ![](https://img.shields.io/github/stars/microsoft/promptbench.svg?style=social) - PromptBench is a unified evaluation framework for large language models -* [Prometheus-Eval](https://github.com/prometheus-eval/prometheus-eval) ![](https://img.shields.io/github/stars/prometheus-eval/prometheus-eval.svg?style=social) - Prometheus-Eval is a collection of tools for training, evaluating, and using language models specialized in evaluating other language models. -* [Ragas](https://github.com/explodinggradients/ragas) ![](https://img.shields.io/github/stars/explodinggradients/ragas.svg?style=social) - Ragas is a framework to evaluate RAG pipelines. -* [Rageval](https://github.com/gomate-community/rageval) ![](https://img.shields.io/github/stars/gomate-community/rageval.svg?style=social) - Rageval is a tool to evaluate RAG system. -* [RewardBench](https://github.com/allenai/reward-bench) ![](https://img.shields.io/github/stars/allenai/reward-bench.svg?style=social) - RewardBench is a benchmark designed to evaluate the capabilities and safety of reward models. -* [RobotPerf](https://github.com/robotperf/benchmarks) ![](https://img.shields.io/github/stars/robotperf/benchmarks.svg?style=social) - RobotPerf is an open reference benchmarking suite that is used to evaluate robotics computing performance fairly with ROS 2 as its common baseline so that robotic architects can make informed decisions about the hardware and software components of their robotic systems. 
-* [TensorFlow Model Analysis](https://github.com/tensorflow/model-analysis) ![](https://img.shields.io/github/stars/tensorflow/model-analysis.svg?style=social) - TensorFlow Model Analysis (TFMA) is a library for evaluating TensorFlow models on large amounts of data in a distributed manner, using the same metrics defined in their trainer. -* [TruLens](https://github.com/truera/trulens) ![](https://img.shields.io/github/stars/truera/trulens.svg?style=social) - TruLens provides a set of tools for evaluating and tracking LLM experiments. -* [TrustLLM](https://github.com/HowieHwong/TrustLLM) ![](https://img.shields.io/github/stars/HowieHwong/TrustLLM.svg?style=social) - TrustLLM is a comprehensive framework to evaluate the trustworthiness of large language models, which includes principles, surveys, and benchmarks. -* [UpTrain](https://github.com/uptrain-ai/uptrain) ![](https://img.shields.io/github/stars/uptrain-ai/uptrain.svg?style=social) - UpTrain is an open-source tool to evaluate LLM applications. -* [VBench](https://github.com/Vchitect/VBench) ![](https://img.shields.io/github/stars/Vchitect/VBench.svg?style=social) - VBench is a comprehensive benchmark suite for video generative models. -* [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) ![](https://img.shields.io/github/stars/open-compass/VLMEvalKit.svg?style=social) - VLMEvalKit is an open-source evaluation toolkit of large vision-language models (LVLMs). - - ## Explainability and Interpretability * [Aequitas](https://github.com/dssg/aequitas) ![](https://img.shields.io/github/stars/dssg/aequitas.svg?style=social) - An open-source bias audit toolkit for data scientists, machine learning researchers, and policymakers to audit machine learning models for discrimination and bias, and to make informed and equitable decisions around developing and deploying predictive risk-assessment tools. * [AI Explainability 360](https://github.com/Trusted-AI/AIX360) ![](https://img.shields.io/github/stars/Trusted-AI/AIX360.svg?style=social) - Interpretability and explainability of data and machine learning models including a comprehensive set of algorithms that cover different dimensions of explanations along with proxy explainability metrics. @@ -351,7 +299,6 @@ Please review our [CONTRIBUTING.md](https://github.com/EthicalML/awesome-product * [TreeInterpreter](https://github.com/andosa/treeinterpreter) ![](https://img.shields.io/github/stars/andosa/treeinterpreter.svg?style=social) - Package for interpreting scikit-learn's decision tree and random forest predictions. Allows decomposing each prediction into bias and feature contribution components as described [here](http://blog.datadive.net/interpreting-random-forests). * [WhatIf](https://github.com/pair-code/what-if-tool) ![](https://img.shields.io/github/stars/pair-code/what-if-tool.svg?style=social) - An easy-to-use interface for expanding understanding of a black-box classification or regression ML model. * [woe](https://github.com/boredbird/woe) ![](https://img.shields.io/github/stars/boredbird/woe.svg?style=social) - Tools for WoE Transformation mostly used in ScoreCard Model for credit rating -* [XAI - eXplainableAI](https://github.com/EthicalML/xai) ![](https://img.shields.io/github/stars/EthicalML/XAI.svg?style=social) - An eXplainability toolbox for machine learning. 
## Feature Store
@@ -371,10 +318,54 @@ Please review our [CONTRIBUTING.md](https://github.com/EthicalML/awesome-product
* [MMDetection](https://github.com/open-mmlab/mmdetection) ![](https://img.shields.io/github/stars/open-mmlab/mmdetection.svg?style=social) - MMDetection is an open source object detection toolbox based on PyTorch.
* [SCEPTER](https://github.com/modelscope/scepter) ![](https://img.shields.io/github/stars/modelscope/scepter.svg?style=social) - SCEPTER is an open-source code repository dedicated to generative training, fine-tuning, and inference, encompassing a suite of downstream tasks such as image generation, transfer, editing.
* [SuperGradients](https://github.com/Deci-AI/super-gradients) ![](https://img.shields.io/github/stars/Deci-AI/super-gradients.svg?style=social) - SuperGradients is an open-source library for training PyTorch-based computer vision models.
-* [Supervision](https://github.com/roboflow/supervision) ![](https://img.shields.io/github/stars/roboflow/supervision.svg?style=social) - Supervision is a Python library designed for efficient computer vision pipeline management, providing tools for annotation, visualization, and monitoring of models.
* [VISSL](https://github.com/facebookresearch/vissl) ![](https://img.shields.io/github/stars/facebookresearch/vissl.svg?style=social) - VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
+
+## Industry Strength Evaluation
+* [AlpacaEval](https://github.com/tatsu-lab/alpaca_eval) ![](https://img.shields.io/github/stars/tatsu-lab/alpaca_eval.svg?style=social) - AlpacaEval is an automatic evaluator for instruction-following language models.
+* [ARES](https://github.com/stanford-futuredata/ARES) ![](https://img.shields.io/github/stars/stanford-futuredata/ARES.svg?style=social) - ARES is a framework for automatically evaluating Retrieval-Augmented Generation (RAG) models.
+* [AutoML Benchmark](https://github.com/openml/automlbenchmark) ![](https://img.shields.io/github/stars/openml/automlbenchmark.svg?style=social) - AutoML Benchmark is a framework for evaluating and comparing open-source AutoML systems.
+* [Banana-lyzer](https://github.com/reworkd/bananalyzer) ![](https://img.shields.io/github/stars/reworkd/bananalyzer.svg?style=social) - Banana-lyzer is an open source AI Agent evaluation framework and dataset for web tasks with Playwright.
+* [Code Generation LM Evaluation Harness](https://github.com/bigcode-project/bigcode-evaluation-harness) ![](https://img.shields.io/github/stars/bigcode-project/bigcode-evaluation-harness.svg?style=social) - Code Generation LM Evaluation Harness is a framework for the evaluation of code generation models.
+* [DeepEval](https://github.com/confident-ai/deepeval) ![](https://img.shields.io/github/stars/confident-ai/deepeval.svg?style=social) - DeepEval is a simple-to-use, open-source evaluation framework for LLM applications.
+* [EvalAI](https://github.com/Cloud-CV/EvalAI) ![](https://img.shields.io/github/stars/Cloud-CV/EvalAI.svg?style=social) - EvalAI is an open-source platform for evaluating and comparing AI algorithms at scale.
+* [Evals](https://github.com/openai/evals) ![](https://img.shields.io/github/stars/openai/evals.svg?style=social) - Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.
+* [EvalScope](https://github.com/modelscope/evalscope) ![](https://img.shields.io/github/stars/modelscope/evalscope.svg?style=social) - EvalScope is a streamlined and customizable framework for efficient large model evaluation and performance benchmarking. +* [EvalPlus](https://github.com/evalplus/evalplus) ![](https://img.shields.io/github/stars/evalplus/evalplus.svg?style=social) - EvalPlus is a rigorous evaluation framework for LLM4Code. +* [Evaluate](https://github.com/huggingface/evaluate) ![](https://img.shields.io/github/stars/huggingface/evaluate.svg?style=social) - Evaluate is a library that makes evaluating and comparing models and reporting their performance easier and more standardized. +* [Evalverse](https://github.com/UpstageAI/evalverse) ![](https://img.shields.io/github/stars/UpstageAI/evalverse.svg?style=social) - Evalverse is a framework to effortlessly evaluate and report LLMs with no-code requests and comprehensive reports. +* [Evidently](https://github.com/evidentlyai/evidently) ![](https://img.shields.io/github/stars/evidentlyai/evidently.svg?style=social) - Evidently is an open-source framework to evaluate, test and monitor ML and LLM-powered systems. +* [FlagEval](https://github.com/FlagOpen/FlagEval) ![](https://img.shields.io/github/stars/FlagOpen/FlagEval.svg?style=social) - FlagEval is an open-source evaluation toolkit as well as an open platform for evaluation of large models. +* [FMBench](https://github.com/aws-samples/foundation-model-benchmarking-tool) ![](https://img.shields.io/github/stars/aws-samples/foundation-model-benchmarking-tool.svg?style=social) - FMBench is a tool for running performance benchmarks for any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. +* [HarmBench](https://github.com/centerforaisafety/HarmBench) ![](https://img.shields.io/github/stars/centerforaisafety/HarmBench.svg?style=social) - HarmBench is a fast and scalable framework for evaluating automated red teaming methods and LLM attacks/defenses. +* [HELM (Holistic Evaluation of Language Models)](https://github.com/stanford-crfm/helm) ![](https://img.shields.io/github/stars/stanford-crfm/helm.svg?style=social) - crfm-helm provides tools for the holistic evaluation of language models, including standardized datasets, a unified API for various models, diverse metrics, robustness, and fairness perturbations, a prompt construction framework, and a proxy server for unified model access. +* [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai) ![](https://img.shields.io/github/stars/UKGovernmentBEIS/inspect_ai.svg?style=social) - Inspect is a framework for large language model evaluations. +* [InterCode](https://github.com/princeton-nlp/intercode) ![](https://img.shields.io/github/stars/princeton-nlp/intercode.svg?style=social) - InterCode is a lightweight, flexible, and easy-to-use framework for designing interactive code environments to evaluate language agents that can code. +* [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) ![](https://img.shields.io/github/stars/EleutherAI/lm-evaluation-harness.svg?style=social) - Language Model Evaluation Harness is a framework to test generative language models on a large number of different evaluation tasks. +* [LightEval](https://github.com/huggingface/lighteval) ![](https://img.shields.io/github/stars/huggingface/lighteval.svg?style=social) - LightEval is a lightweight LLM evaluation suite. 
+* [LLMPerf](https://github.com/ray-project/llmperf) ![](https://img.shields.io/github/stars/ray-project/llmperf.svg?style=social) - LLMPerf is a tool for evaluating the performance of LLM APIs.
+* [LLM AutoEval](https://github.com/mlabonne/llm-autoeval) ![](https://img.shields.io/github/stars/mlabonne/llm-autoeval.svg?style=social) - LLM AutoEval simplifies the process of evaluating LLMs using a convenient Colab notebook. You just need to specify the name of your model, a benchmark, a GPU, and press run!
+* [MTEB](https://github.com/embeddings-benchmark/mteb) ![](https://img.shields.io/github/stars/embeddings-benchmark/mteb.svg?style=social) - Massive Text Embedding Benchmark (MTEB) is a comprehensive benchmark of text embeddings.
+* [OLMo-Eval](https://github.com/allenai/OLMo-Eval) ![](https://img.shields.io/github/stars/allenai/OLMo-Eval.svg?style=social) - OLMo-Eval is a framework for evaluating open language models.
+* [OpenCompass](https://github.com/open-compass/OpenCompass) ![](https://img.shields.io/github/stars/open-compass/OpenCompass.svg?style=social) - OpenCompass is an LLM evaluation platform, supporting a wide range of models (LLaMA, LLaMA2, ChatGLM2, ChatGPT, Claude, etc.) over 50+ datasets.
+* [Optimum-Benchmark](https://github.com/huggingface/optimum-benchmark) ![](https://img.shields.io/github/stars/huggingface/optimum-benchmark.svg?style=social) - A unified multi-backend utility for benchmarking Transformers and Diffusers with support for Optimum's arsenal of hardware optimizations/quantization schemes.
+* [Overcooked-AI](https://github.com/HumanCompatibleAI/overcooked_ai) ![](https://img.shields.io/github/stars/HumanCompatibleAI/overcooked_ai.svg?style=social) - Overcooked-AI is a benchmark environment for fully cooperative human-AI task performance, based on the wildly popular video game Overcooked.
+* [PhaseLLM](https://github.com/wgryc/phasellm) ![](https://img.shields.io/github/stars/wgryc/phasellm.svg?style=social) - PhaseLLM is a large language model evaluation and workflow framework.
+* [PromptBench](https://github.com/microsoft/promptbench) ![](https://img.shields.io/github/stars/microsoft/promptbench.svg?style=social) - PromptBench is a unified evaluation framework for large language models.
+* [Prometheus-Eval](https://github.com/prometheus-eval/prometheus-eval) ![](https://img.shields.io/github/stars/prometheus-eval/prometheus-eval.svg?style=social) - Prometheus-Eval is a collection of tools for training, evaluating, and using language models specialized in evaluating other language models.
+* [Ragas](https://github.com/explodinggradients/ragas) ![](https://img.shields.io/github/stars/explodinggradients/ragas.svg?style=social) - Ragas is a framework to evaluate RAG pipelines.
+* [Rageval](https://github.com/gomate-community/rageval) ![](https://img.shields.io/github/stars/gomate-community/rageval.svg?style=social) - Rageval is a tool to evaluate RAG systems.
+* [RewardBench](https://github.com/allenai/reward-bench) ![](https://img.shields.io/github/stars/allenai/reward-bench.svg?style=social) - RewardBench is a benchmark designed to evaluate the capabilities and safety of reward models.
+* [RobotPerf](https://github.com/robotperf/benchmarks) ![](https://img.shields.io/github/stars/robotperf/benchmarks.svg?style=social) - RobotPerf is an open reference benchmarking suite that is used to evaluate robotics computing performance fairly with ROS 2 as its common baseline so that robotic architects can make informed decisions about the hardware and software components of their robotic systems.
+* [supervision](https://github.com/roboflow/supervision) ![](https://img.shields.io/github/stars/roboflow/supervision.svg?style=social) - supervision is a library of reusable computer vision tools for loading datasets, drawing detections on images or video, and counting how many detections fall in a zone.
+* [TensorFlow Model Analysis](https://github.com/tensorflow/model-analysis) ![](https://img.shields.io/github/stars/tensorflow/model-analysis.svg?style=social) - TensorFlow Model Analysis (TFMA) is a library for evaluating TensorFlow models on large amounts of data in a distributed manner, using the same metrics defined in their trainer.
+* [TrustLLM](https://github.com/HowieHwong/TrustLLM) ![](https://img.shields.io/github/stars/HowieHwong/TrustLLM.svg?style=social) - TrustLLM is a comprehensive framework to evaluate the trustworthiness of large language models, which includes principles, surveys, and benchmarks.
+* [UpTrain](https://github.com/uptrain-ai/uptrain) ![](https://img.shields.io/github/stars/uptrain-ai/uptrain.svg?style=social) - UpTrain is an open-source tool to evaluate LLM applications.
+* [VBench](https://github.com/Vchitect/VBench) ![](https://img.shields.io/github/stars/Vchitect/VBench.svg?style=social) - VBench is a comprehensive benchmark suite for video generative models.
+* [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) ![](https://img.shields.io/github/stars/open-compass/VLMEvalKit.svg?style=social) - VLMEvalKit is an open-source evaluation toolkit of large vision-language models (LVLMs).
+* [XAI - eXplainableAI](https://github.com/EthicalML/xai) ![](https://img.shields.io/github/stars/EthicalML/XAI.svg?style=social) - An eXplainability toolbox for machine learning.
+
+
## Industry Strength NLP
* [Blackstone](https://github.com/ICLRandD/Blackstone) ![](https://img.shields.io/github/stars/ICLRandD/Blackstone.svg?style=social) - Blackstone is a spaCy model and library for processing long-form, unstructured legal text. Blackstone is an experimental research project from the Incorporated Council of Law Reporting for England and Wales' research lab, ICLR&D.
* [Coqui STT](https://github.com/coqui-ai/STT) ![](https://img.shields.io/github/stars/coqui-ai/STT.svg?style=social) - Coqui STT is a fast, open-source, multi-platform, deep-learning toolkit for training and deploying speech-to-text models.
@@ -490,7 +481,6 @@ Please review our [CONTRIBUTING.md](https://github.com/EthicalML/awesome-product
* [TensorBoard](https://github.com/tensorflow/tensorboard) ![](https://img.shields.io/github/stars/tensorflow/tensorboard.svg?style=social) - TensorBoard is a visualization toolkit for machine learning experimentation that makes it easy to host, track, and share ML experiments.
* [Transformer Explainer](https://github.com/poloclub/transformer-explainer) ![](https://img.shields.io/github/stars/poloclub/transformer-explainer.svg?style=social) - Transformer Explainer is an interactive visualization tool designed to help anyone learn how Transformer-based models like GPT work.
* [Vega-Altair](https://github.com/vega/altair) ![](https://img.shields.io/github/stars/vega/altair.svg?style=social) - Vega-Altair is a declarative statistical visualization library for Python.
-* [ydata-profiling](https://github.com/ydataai/ydata-profiling) ![](https://img.shields.io/github/stars/ydataai/ydata-profiling.svg?style=social) - ydata-profiling provides a one-line Exploratory Data Analysis (EDA) experience in a consistent and fast solution.
## Metadata Management
@@ -548,12 +538,15 @@ Please review our [CONTRIBUTING.md](https://github.com/EthicalML/awesome-product
* [TensorStore](https://github.com/google/tensorstore) ![](https://img.shields.io/github/stars/google/tensorstore.svg?style=social) - TensorStore is an open-source C++ and Python software library designed for storage and manipulation of large multi-dimensional arrays.
-## Model Serving
+## Model Serving and Monitoring
* [Backprop](https://github.com/backprop-ai/backprop) ![](https://img.shields.io/github/stars/backprop-ai/backprop.svg?style=social) - Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
* [BentoML](https://github.com/bentoml/BentoML) ![](https://img.shields.io/github/stars/bentoml/bentoml.svg?style=social) - BentoML is an open source framework for high performance ML model serving.
* [Cortex](https://github.com/cortexlabs/cortex) ![](https://img.shields.io/github/stars/cortexlabs/cortex.svg?style=social) - Cortex is an open source platform for deploying machine learning modelsโ€”trained with any frameworkโ€”as production web services. No DevOps required.
+* [Deepchecks](https://github.com/deepchecks/deepchecks) ![](https://img.shields.io/github/stars/deepchecks/deepchecks.svg?style=social) - Deepchecks is an open source package for comprehensively validating your machine learning models and data with minimal effort during development, deployment or in production.
* [DeepDetect](https://github.com/jolibrain/deepdetect) ![](https://img.shields.io/github/stars/jolibrain/deepdetect.svg?style=social) - Machine Learning production server for TensorFlow, XGBoost and Caffe models written in C++ and maintained by Jolibrain.
* [exo](https://github.com/exo-explore/exo) ![](https://img.shields.io/github/stars/exo-explore/exo.svg?style=social) - exo helps you run your AI cluster at home with everyday devices.
+* [Giskard](https://github.com/Giskard-AI/giskard) ![](https://img.shields.io/github/stars/Giskard-AI/giskard.svg?style=social) - Giskard is an open-source quality assurance platform for AI models that helps organizations increase the efficiency of their AI development workflow, eliminate risks of AI bias, and ensure robust, reliable & ethical AI models.
+* [Helicone](https://github.com/Helicone/helicone) ![](https://img.shields.io/github/stars/Helicone/helicone.svg?style=social) - Helicone is an observability platform for LLMs.
* [Hydrosphere ML Lambda](https://github.com/Hydrospheredata/hydro-serving) ![](https://img.shields.io/github/stars/Hydrospheredata/hydro-serving.svg?style=social) - Open source model management cluster for deploying, serving and monitoring machine learning models and ad-hoc algorithms with a FaaS architecture.
* [Intelยฎ Extension for Transformers](https://github.com/intel/intel-extension-for-transformers) ![](https://img.shields.io/github/stars/intel/intel-extension-for-transformers.svg?style=social) - An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere.
* [Inference](https://github.com/roboflow/inference) ![](https://img.shields.io/github/stars/roboflow/inference.svg?style=social) - A fast, production-ready inference server for computer vision supporting deployment of many popular model architectures and fine-tuned models. With Inference, you can deploy models such as YOLOv5, YOLOv8, CLIP, SAM, and CogVLM on your own hardware using Docker.
@@ -561,17 +554,22 @@ Please review our [CONTRIBUTING.md](https://github.com/EthicalML/awesome-product
* [Jina](https://github.com/jina-ai/jina) ![](https://img.shields.io/github/stars/jina-ai/jina.svg?style=social) - Cloud native search framework that supports the use of deep learning/state of the art AI models for search.
* [KsanaLLM](https://github.com/pcg-mlp/KsanaLLM) ![](https://img.shields.io/github/stars/pcg-mlp/KsanaLLM.svg?style=social) - KsanaLLM is a high performance and easy-to-use engine for LLM inference and serving.
* [KServe](https://github.com/kserve/kserve) ![](https://img.shields.io/github/stars/kserve/kserve.svg?style=social) - Serverless framework to deploy and monitor machine learning models in Kubernetes - [(Video)](https://www.youtube.com/watch?v=hGIvlFADMhU).
+* [Langfuse](https://github.com/langfuse/langfuse) ![](https://img.shields.io/github/stars/langfuse/langfuse.svg?style=social) - Langfuse is an open source observability & analytics solution for LLM-based applications.
* [Lepton AI](https://github.com/leptonai/leptonai) ![](https://img.shields.io/github/stars/leptonai/leptonai.svg?style=social) - LeptonAI Python library allows you to build an AI service from Python code with ease.
* [LightLLM](https://github.com/ModelTC/lightllm) ![](https://img.shields.io/github/stars/ModelTC/lightllm.svg?style=social) - LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
+* [LLMonitor](https://github.com/lunary-ai/lunary) ![](https://img.shields.io/github/stars/lunary-ai/lunary.svg?style=social) - LLMonitor provides observability & analytics for AI apps and agents.
* [LocalAI](https://github.com/mudler/LocalAI) ![](https://img.shields.io/github/stars/mudler/LocalAI.svg?style=social) - LocalAI is a drop-in replacement REST API that's compatible with OpenAI API specifications for local inferencing.
* [m2cgen](https://github.com/BayesWitnesses/m2cgen) ![](https://img.shields.io/github/stars/BayesWitnesses/m2cgen.svg?style=social) - A lightweight library which allows you to transpile trained classic machine learning models into native code in C, Java, Go, R, PHP, Dart, Haskell, Rust and many other programming languages.
* [MLRun](https://github.com/mlrun/mlrun) ![](https://img.shields.io/github/stars/mlrun/mlrun.svg?style=social) - MLRun is an open MLOps framework for quickly building and managing continuous ML and generative AI applications across their lifecycle.
* [MLServer](https://github.com/SeldonIO/mlserver) ![](https://img.shields.io/github/stars/SeldonIO/mlserver.svg?style=social) - An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more.
+* [mltrace](https://github.com/loglabs/mltrace) ![](https://img.shields.io/github/stars/loglabs/mltrace.svg?style=social) - mltrace is a lightweight, open-source Python tool to get "bolt-on" observability in ML pipelines.
* [Mosec](https://github.com/mosecorg/mosec) ![](https://img.shields.io/github/stars/mosecorg/mosec.svg?style=social) - A Rust-powered and multi-stage pipelined model server which offers dynamic batching and more. Super easy to implement and deploy as micro-services.
+* [NannyML](https://github.com/NannyML/nannyml) ![](https://img.shields.io/github/stars/nannyml/nannyml.svg?style=social) - NannyML is an open source library to estimate post-deployment model performance (without access to targets), capable of fully capturing the impact of data drift on performance.
* [Nuclio](https://github.com/nuclio/nuclio) ![](https://img.shields.io/github/stars/nuclio/nuclio.svg?style=social) - A high-performance "serverless" framework focused on data, I/O, and compute-intensive workloads. It is well integrated with popular data science tools, such as Jupyter and Kubeflow; supports a variety of data and streaming sources; and supports execution over CPUs and GPUs.
* [OpenDiT](https://github.com/NUS-HPC-AI-Lab/OpenDiT) ![](https://img.shields.io/github/stars/NUS-HPC-AI-Lab/OpenDiT.svg?style=social) - OpenDiT is an open-source project that provides a high-performance implementation of Diffusion Transformer (DiT), specifically designed to enhance the efficiency of training and inference for DiT applications, including text-to-video generation and text-to-image generation.
* [OpenScoring](https://github.com/openscoring/openscoring) ![](https://img.shields.io/github/stars/openscoring/openscoring.svg?style=social) - REST web service for the true real-time scoring (< 1 ms) of Scikit-Learn, R and Apache Spark models.
* [OpenVINO](https://github.com/openvinotoolkit/openvino) ![](https://img.shields.io/github/stars/openvinotoolkit/openvino.svg?style=social) - OpenVINO is an open-source toolkit for optimizing and deploying AI inference.
+* [Phoenix](https://github.com/Arize-ai/phoenix) ![](https://img.shields.io/github/stars/arize-ai/phoenix?style=social) - Phoenix is an open source ML observability library that runs in a notebook, letting you validate, monitor, and fine-tune your generative LLM, CV, and tabular models.
* [PowerInfer](https://github.com/SJTU-IPADS/PowerInfer) ![](https://img.shields.io/github/stars/SJTU-IPADS/PowerInfer?style=social) - PowerInfer is a CPU/GPU LLM inference engine leveraging activation locality for your device.
* [PredictionIO](https://github.com/apache/predictionio) ![](https://img.shields.io/github/stars/apache/predictionio.svg?style=social) - An open source Machine Learning Server built on top of a state-of-the-art open source stack for developers and data scientists to create predictive engines for any machine learning task.
* [Prompt2Model](https://github.com/neulab/prompt2model) ![](https://img.shields.io/github/stars/neulab/prompt2model.svg?style=social) - Prompt2Model is a system that takes a natural language task description (like the prompts used for LLMs such as ChatGPT) to train a small special-purpose model that is conducive to deployment.
@@ -585,8 +583,10 @@ Please review our [CONTRIBUTING.md](https://github.com/EthicalML/awesome-product
* [text-generation-inference](https://github.com/huggingface/text-generation-inference) ![](https://img.shields.io/github/stars/huggingface/text-generation-inference.svg?style=social) - Large Language Model Text Generation Inference.
* [TorchServe](https://github.com/pytorch/serve) ![](https://img.shields.io/github/stars/pytorch/serve.svg?style=social) - TorchServe is a flexible and easy to use tool for serving PyTorch models.
* [Triton Inference Server](https://github.com/triton-inference-server/server) ![](https://img.shields.io/github/stars/triton-inference-server/server.svg?style=social) - Triton is a high performance open source serving software to deploy AI models from any framework on GPU & CPU while maximizing utilization. +* [TruLens](https://github.com/truera/trulens) ![](https://img.shields.io/github/stars/truera/trulens.svg?style=social) - TruLens provides a set of tools for developing and monitoring neural nets, including large language models. * [UnionML](https://github.com/unionai-oss/unionml) ![](https://img.shields.io/github/stars/unionai-oss/unionml.svg?style=social) - UnionML is an open source MLOps framework that aims to reduce the boilerplate and friction that comes with building models and deploying them to production. * [vLLM](https://github.com/vllm-project/vllm) ![](https://img.shields.io/github/stars/vllm-project/vllm.svg?style=social) - vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. +* [ydata-profiling](https://github.com/ydataai/ydata-profiling) ![](https://img.shields.io/github/stars/ydataai/ydata-profiling.svg?style=social) - ydata-profiling creates HTML profiling reports from pandas DataFrame objects. It extends the pandas DataFrame with df.profile_report() for quick data analysis. ## Model Training Orchestration