- About the Solution
- Web Framework for ML Services
- Machine Learning Model Serving Platform
- Vector Database - Vector Search
- Strategies for Improving Image Retrieval
- Development Environment
- Testing and Results
- References
This project implements an image search engine for Shopee using Qdrant as the vector database for efficient similarity search. Qdrant was chosen over Faiss so the service can start on a small CPU-only server and scale out when traffic increases significantly.
-
Overview
FastAPI is a modern, high-performance web framework for building APIs with Python 3.7+ based on standard Python type hints. It is built on top of Starlette (an ASGI framework) and provides many features that simplify web application development, including automatic data validation, error handling, and interactive API documentation.
-
Key Features
- High Performance: FastAPI stands out as one of the fastest Python frameworks available. This speed comes from its asynchronous (ASGI) design, built on Starlette and typically served by Uvicorn.
- Automatic Data Validation: By utilizing Python type hints (via Pydantic models), FastAPI automatically validates both incoming requests and outgoing responses, which improves robustness and prevents errors (see the sketch after this list).
- Automatic Documentation Generation: FastAPI effortlessly generates interactive API documentation based on your code, simplifying comprehension and usage for other developers.
- Ease of Use: FastAPI boasts an intuitive and straightforward syntax, making it relatively easy to learn and work with. Additionally, a wealth of documentation and tutorials is available to support developers.
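To make the validation and documentation features concrete, here is a minimal sketch of a FastAPI endpoint; the /items route and the Item model are illustrative examples, not part of this project.

```python
# Minimal FastAPI sketch: type hints drive validation and the auto-generated docs.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Item(BaseModel):
    name: str
    price: float

@app.post("/items")
async def create_item(item: Item) -> Item:
    # The request body is validated against Item automatically;
    # invalid payloads receive a 422 response without extra code.
    return item
```

Running this with Uvicorn exposes interactive documentation at /docs with no additional code.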
-
FastAPI for ML Services
- While FastAPI is primarily designed for IO-intensive applications like web applications, it can be employed for building ML services.
- The framework's asynchronous requests contribute to improved speed compared to Flask's synchronous requests.
- It is important to note that FastAPI does not offer built-in support for Machine Learning features such as micro-batching and asynchronous inference requests.
- When scaling workers, each worker process loads its own copy of the model, which increases memory usage (RAM on CPU or VRAM on GPU); see the command after this list.
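For reference, scaling Uvicorn workers (as done in the tests later in this document) looks like the following; each worker process loads its own copy of the model:
uvicorn app:app --port 7000 --workers 4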
-
Conclusion
-
FastAPI represents a potent and versatile web framework suitable for building ML services. Its reputation for high performance, user-friendliness, and extensive documentation is well-deserved.
-
However, it is essential to recognize that FastAPI lacks built-in support for all the typical features required for ML model serving, such as model loading, versioning, and performance monitoring. To address these needs, third-party libraries may be necessary.
-
In summary, FastAPI is a compelling choice for ML service development, especially if flexibility and performance are paramount. Nevertheless, developers should be prepared to integrate third-party solutions to fulfill specific requirements.
-
-
References
-
Overview
-
Triton Inference Server is an open-source software platform designed for deploying and managing AI models at scale. It supports a wide range of deep learning and machine learning frameworks, including TensorFlow, PyTorch, ONNX, and OpenVINO. Triton can be deployed on various hardware platforms, including CPUs, GPUs, and other accelerators.
-
Triton is engineered for high performance and scalability, employing techniques such as batching, concurrency, and model optimization to achieve low latency and high throughput. It can be scaled to handle a large volume of inference requests.
-
Triton offers flexibility and ease of use, supporting deployment options like on-premises, cloud, and edge. It can also be integrated with existing infrastructure, such as Kubernetes and cloud orchestration platforms.
-
-
Key Features
- High Performance: Triton is optimized for high throughput and low latency inference.
- Scalability: It can handle a large number of inference requests, making it suitable for real-time applications like video processing and natural language processing.
- Flexibility: Triton supports various deployment options and can be integrated with existing infrastructure.
- Ease of Use: Triton provides tools for deploying, managing, and monitoring AI models in production.
-
FastAPI with Triton Inference Server
FastAPI - Triton Inference Server Architecture (diagram)
Using FastAPI with Triton Server is a more flexible approach to deploying and serving AI models with Python. FastAPI is a modern, high-performance web framework that is easy to use and extend. Triton Server is high-performance inference-serving software that can deploy and serve AI models on CPUs, GPUs, and other accelerators.
-
PyTriton
PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments. It provides a simple and intuitive way to deploy and serve AI models with Triton Server.
```python
# Minimal PyTriton sketch; the inference function below is a placeholder, not the project's real model.
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

@batch
def infer_fn(data):
    # Placeholder "linear regression"; replace with a real model call.
    prediction = data * 2.0 + 1.0
    return {"prediction": prediction.astype(np.float32)}

with Triton() as triton:
    # Bind the inference function to a model name and its input/output signature.
    triton.bind(
        model_name="linear_regression",
        infer_func=infer_fn,
        inputs=[Tensor(name="data", dtype=np.float32, shape=(1,))],
        outputs=[Tensor(name="prediction", dtype=np.float32, shape=(1,))],
        config=ModelConfig(max_batch_size=8),
    )
    # Start the Triton Inference Server and serve requests until interrupted.
    triton.serve()
```
Once you have started the PyTriton application, you can access the model through Triton's HTTP/gRPC API. For example, to make a prediction, you can send an HTTP POST request to the model's inference endpoint (/v2/models/linear_regression/infer) with the input data in the request body.
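For illustration, such a request could look like the following sketch; the host, port, and payload shape are assumptions based on the linear_regression example above and Triton's default HTTP settings.

```python
# Hypothetical client call against the KServe v2 HTTP endpoint exposed by Triton.
import requests

payload = {
    "inputs": [
        {
            "name": "data",
            "shape": [1, 1],
            "datatype": "FP32",
            "data": [2.0],
        }
    ]
}

# Triton's HTTP service listens on port 8000 by default.
resp = requests.post(
    "http://localhost:8000/v2/models/linear_regression/infer",
    json=payload,
)
print(resp.json()["outputs"][0]["data"])
```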
-
FastAPI with Triton Server vs. PyTriton
-
Both FastAPI with Triton Server and PyTriton are good options for deploying and serving AI models with Python. However, there are some key differences between the two approaches:
- Flexibility: FastAPI with Triton Server is a more flexible approach. You can use FastAPI to create custom APIs that expose your models in a variety of ways. PyTriton is less flexible, but it is easier to use.
- Performance: FastAPI with Triton Server can be more performant than PyTriton, especially if you are using a complex model or a large dataset.
- Ease of use: PyTriton is easier to use than FastAPI with Triton Server. PyTriton provides a simple and intuitive interface for deploying and serving models.
-
-
Conclusion
- Triton Inference Server is a powerful and flexible platform for deploying AI models at scale. It is a popular choice for companies and organizations needing to deploy AI models in production.
-
References
-
Overview
- BentoML is an open-source platform for deploying and managing machine learning models in production. It provides a number of features that make it easy to get started, such as model packaging, deployment, and monitoring.
-
Key Features
-
Model packaging: BentoML provides a simple way to package machine learning models for production. Bentos can be created from a variety of popular machine learning frameworks, including TensorFlow, PyTorch, and scikit-learn.
-
Model deployment: BentoML provides a variety of options for deploying machine learning models in production. Bentos can be deployed to on-premises servers, cloud platforms, and edge devices.
-
Model monitoring: BentoML provides a number of tools for monitoring the performance and health of machine learning models in production. BentoML can track metrics such as latency, accuracy, and throughput.
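To show how these pieces fit together, below is a minimal BentoML 1.x-style service sketch; the model tag efficientnet_b3 and the embed API are assumptions for illustration, not artifacts that exist in this repository.

```python
# Hypothetical BentoML service wrapping a previously saved PyTorch model as an HTTP API.
# Assumes the model was saved beforehand, e.g. bentoml.pytorch.save_model("efficientnet_b3", model).
import bentoml
from bentoml.io import NumpyNdarray

runner = bentoml.pytorch.get("efficientnet_b3:latest").to_runner()

svc = bentoml.Service("image_embedder", runners=[runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def embed(batch):
    # The runner executes inference and can be scaled independently of the API workers.
    return await runner.async_run(batch)
```

Such a service would typically be started with bentoml serve service:svc and then packaged into a Bento for deployment.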
-
-
Conclusion
- BentoML is a powerful and flexible platform for deploying and managing machine learning models in production. It is a popular choice for companies and organizations that need to deploy machine learning models at scale.
-
References
-
Overview
- Faiss is a powerful library designed for efficient similarity search and clustering of dense vectors. It supports searching in sets of vectors of any size, even those that may not fit in RAM. Faiss is written in C++ with complete Python/numpy wrappers and is highly versatile, finding applications in recommendation systems, image retrieval, and more.
-
Key Features
- Distance Metrics: Faiss supports a range of distance metrics, including L2 (Euclidean) distance, dot products, and cosine similarity, for comparing vectors (see the sketch after this list).
- GPU Support:
- GPU Implementation: Faiss provides GPU support for efficient vector search.
- Compatibility: Faiss can seamlessly handle input from both CPU and GPU memory.
- Performance: Faiss can significantly enhance performance when both input and output remain resident on the GPU.
- Multi-GPU Usage: Faiss supports both single and multi-GPU usage.
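A minimal sketch of building and querying a Faiss index; the dimensionality and random vectors below are made up for illustration.

```python
# Minimal Faiss sketch: exact (brute-force) L2 search over random vectors.
import faiss
import numpy as np

dim = 512                                               # embedding dimensionality (example value)
xb = np.random.random((10_000, dim)).astype("float32")  # database vectors
xq = np.random.random((5, dim)).astype("float32")       # query vectors

index = faiss.IndexFlatL2(dim)   # exact L2 index; swap in an IVF or HNSW index for ANN search
index.add(xb)                    # add vectors to the index

distances, ids = index.search(xq, 5)   # top-5 nearest neighbors per query
print(ids.shape)                       # (5, 5)
```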
-
Disadvantages
- Faiss does not support real-time updates. To implement real-time updates, you will need to create a custom wrapper around it or use a framework that supports CRUD operations, high availability, horizontal scalability, concurrent access, and more.
-
Reference
-
Overview
- Elasticsearch is a widely-used search engine that developers commonly employ to implement search functionality in their applications. With vector search, Elasticsearch enables developers to search documents by semantic similarity, going beyond simple textual relevance. This feature has diverse use cases, including semantic search, recommendation systems, image search, and question answering.
-
Advantages
- It is highly scalable, capable of handling large-scale data and high user concurrency.
- Supports asynchronous client operations.
-
Disadvantages
- Does not support GPU directly but can be extended with third-party plugins.
- Elasticsearch tends to be slower than its competitors across various datasets and metrics.
- Does not natively support batch vector search.
-
Pricing
- Storing and searching embeddings are free.
- Embedding models and the built-in semantic search models come at a cost: the machine learning tier is priced at $125 per month.
-
Examples
- Create the index
```
PUT my-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      },
      "my_text": {
        "type": "text"
      }
    }
  }
}
```
- Index a document
```
PUT my-index/_doc/1
{
  "my_vector": [0.5, 10, 6, ...],
  "my_text": "Jamaica's tropical climate brings warmth all year round"
}
```
- Run vector search with deployed text embedding model
```
POST my-index/_search
{
  "knn": {
    "field": "my_vector",
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__all-minilm-l6-v2",
        "model_text": "How is the weather in Jamaica?"
      }
    },
    "k": 10,
    "num_candidates": 100
  }
}
```
- Run vector search directly
```
POST my-index/_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.3, 0.1, 1.2, ...],
    "k": 10,
    "num_candidates": 100
  }
}
```
-
References
-
Overview
- Qdrant is a vector similarity search engine and vector database that offers a production-ready service with a user-friendly API for storing, searching, and managing vectors along with additional payload data. Qdrant is specifically tailored to support extended filtering, making it suitable for various applications such as neural-network or semantic-based matching, faceted search, and more. It is written in Rust for speed and reliability, even under high loads.
-
Advantages
- Implements a custom modification of the HNSW algorithm for Approximate Nearest Neighbor Search, combining state-of-the-art speed with advanced filtering capabilities.
- Supports both CPU and GPU-based computing, providing flexibility for different hardware configurations.
- Highly scalable, capable of handling large-scale data and high user concurrency.
- Supports batch vector search.
- Offers asynchronous client operations.
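To illustrate the workflow these features enable, here is a minimal sketch using the qdrant-client Python package; the collection name, vector size, and payload are illustrative and not taken from this project's code.

```python
# Minimal qdrant-client sketch: create a collection, upsert points, run a search.
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(host="localhost", port=6333)   # default Qdrant HTTP port

client.recreate_collection(
    collection_name="images",
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
)

vectors = np.random.random((100, 512)).astype("float32")
client.upsert(
    collection_name="images",
    points=[
        PointStruct(id=i, vector=vectors[i].tolist(), payload={"product_id": i})
        for i in range(len(vectors))
    ],
)

hits = client.search(
    collection_name="images",
    query_vector=vectors[0].tolist(),
    limit=5,
)
print([(hit.id, hit.score) for hit in hits])
```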
-
Benchmark
-
References
-
Leveraging Deep Learning Architectures
Deep learning architectures, such as Convolutional Neural Networks (CNNs), can play a pivotal role in feature extraction. Utilize pre-trained models like Vision Transformers (ViTs), ResNet, VGG, or EfficientNet to extract meaningful image features.
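As a minimal sketch (assuming torchvision and an EfficientNet-B3 backbone, which matches the model benchmarked later in this document), features can be extracted by dropping the classification head:

```python
# Sketch: extract embeddings from a pre-trained EfficientNet-B3 by removing its classifier.
import torch
from PIL import Image
from torchvision import models, transforms

model = models.efficientnet_b3(weights=models.EfficientNet_B3_Weights.DEFAULT)
model.classifier = torch.nn.Identity()   # keep only the feature extractor
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((300, 300)),        # EfficientNet-B3 input resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")          # hypothetical input image
with torch.no_grad():
    embedding = model(preprocess(image).unsqueeze(0))     # shape: (1, 1536)
print(embedding.shape)
```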
-
Utilizing Embeddings
Consider employing techniques like Siamese networks or triplet loss to generate embeddings for images. These embeddings help create a feature space conducive to similarity-based searches, a core aspect of image retrieval.
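A minimal sketch of the triplet-loss idea; the tiny embedding network and margin below are illustrative only.

```python
# Sketch: triplet loss pulls anchor/positive embeddings together and pushes negatives apart.
import torch
import torch.nn.functional as F

def embed(net, x):
    # L2-normalize embeddings so distances are comparable across samples.
    return F.normalize(net(x), dim=1)

net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
triplet_loss = torch.nn.TripletMarginLoss(margin=0.2)

# Anchor/positive are images of the same product, negative is a different product (dummy tensors here).
anchor = torch.randn(8, 3, 64, 64)
positive = torch.randn(8, 3, 64, 64)
negative = torch.randn(8, 3, 64, 64)

loss = triplet_loss(embed(net, anchor), embed(net, positive), embed(net, negative))
loss.backward()
```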
-
Fine-tuning Models
Fine-tune pre-trained models on your dataset. Transfer learning enables you to fine-tune models to your specific task, capitalizing on the knowledge embedded in pre-trained weights.
-
Cross-Modal Retrieval
Leverage both image and text information if available in your dataset. Cross-modal retrieval techniques enable you to utilize textual descriptions or tags for enhanced retrieval.
-
Feedback Loops
Incorporate feedback loops where user interactions with the retrieval system inform and improve future recommendations.
-
Create Environment and Install Packages
conda create -n image-search python=3.9
conda activate image-search
pip install -r requirements.txt
-
Run the Application
uvicorn app:app --port 7000
- Ubuntu: 20.04
- CUDA Version: 12.2
- EC2: g4dn.xlarge
- CPU: Intel Xeon Family 4-vCPUs
- RAM: 16GB
- GPU: NVIDIA T4 Tensor Core
- VRAM: 16GB
- Uvicorn Workers: 4
FastAPI, Qdrant, Faiss, Triton, and Locust are executed on the same device.
-
Overview
Locust is an open-source, Python-based load testing tool that allows you to test the performance and scalability of your web applications or services. It is designed to be user-friendly, highly customizable, and easy to extend.
-
How to run
pip install locust
cd locust
locust
Then open http://localhost:8089 in a browser to use the Locust web UI.
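A minimal locustfile sketch; the /search endpoint and the multipart payload are assumptions about this service's API, used only for illustration.

```python
# locustfile.py - hypothetical load test for the image search endpoint.
from locust import HttpUser, task, between

class SearchUser(HttpUser):
    wait_time = between(0.5, 2)   # simulated think time between requests

    @task
    def search_image(self):
        # Assumed endpoint and payload; adjust to the actual FastAPI routes.
        with open("sample.jpg", "rb") as f:
            self.client.post("/search", files={"file": f})
```

Run it with locust -f locustfile.py --host http://localhost:7000.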
-
References
-
Ingest Data Time
Creating the Faiss index takes about 5 seconds.
-
Search Time CPU
About 39ms
-
Ingest Data Time
Creating the collection and adding 100,000 points takes about 6 minutes.
Search Time CPU
About 3ms
-
Report
- Original efficientnet_b3 model.pt
- Uvicorn workers = 1
- User spawn rate = 1
| Id | Request Concurrency | Device | p95 Latency (ms) | RPS | Max GPU Memory Usage (MB) |
|----|---------------------|--------|------------------|-----|---------------------------|
| 1 | 1 | CPU | 130 | 9 | 0 |
| 2 | 1 | GPU | 21 | 48 | 525 |
| 3 | 4 | CPU | 460 | 9 | 0 |
| 4 | 4 | GPU | 85 | 49 | 525 |
| 5 | 8 | CPU | 920 | 9 | 0 |
| 6 | 8 | GPU | 170 | 49 | 525 |
| 7 | 16 | CPU | 1800 | 9 | 0 |
| 8 | 16 | GPU | 340 | 49 | 525 |
| 9 | 32 | CPU | 3600 | 9 | 0 |
| 10 | 32 | GPU | 650 | 50 | 525 |
-
Folder layout
```
model_repository/
├── efficientnet_b3
│   ├── 1
│   │   └── model.pt
│   └── config.pbtxt
├── efficientnet_b3_onnx
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
└── efficientnet_b3_trt
    ├── 1
    │   └── model.plan
    └── config.pbtxt
```
-
Dynamic Batching
Dynamic batching, in the context of the Triton Inference Server, refers to the functionality that combines one or more inference requests into a single batch (created dynamically) in order to maximize throughput.
Dynamic batching can be enabled and configured on a per-model basis by specifying options in the model's config.pbtxt. Dynamic batching can be enabled with its default settings by adding the following to the config.pbtxt file:
dynamic_batching { }
While Triton batches incoming requests without any delay by default, users can allocate a limited delay for the scheduler so that it can collect more inference requests for the dynamic batcher.
dynamic_batching { max_queue_delay_microseconds: 100 }
-
Concurrent Model Execution
The Triton Inference Server can spin up multiple instances of the same model, which can process queries in parallel. Triton can spawn instances on the same device (GPU) or on a different device on the same node, as specified by the user. This customizability is especially useful for ensembles whose models have different throughputs: multiple copies of heavier models can be spawned on a separate GPU to allow more parallel processing. This is enabled via the instance_group option in a model's configuration, as sketched below.
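A hedged example of what a model configuration combining both features might look like; the values below are illustrative and not the exact configuration used in this repository.

```
# Hypothetical config.pbtxt combining dynamic batching with two GPU instances.
name: "efficientnet_b3_trt"
platform: "tensorrt_plan"
max_batch_size: 8

dynamic_batching {
  max_queue_delay_microseconds: 100
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```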
-
Measuring Triton Performance with Performance Analyzer
Having improved the model's serving capabilities by enabling dynamic batching and the use of multiple model instances, the next step is to measure the impact of these features. To that end, the Triton Inference Server comes packaged with the Performance Analyzer, a tool specifically designed to measure performance for Triton Inference Servers.
On a first terminal, run Triton Container
docker compose --profile prod up triton_server
-
On a second terminal, run Triton SDK Container
docker pull nvcr.io/nvidia/tritonserver:23.01-py3-sdk
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:23.01-py3-sdk bash
-
On a third terminal, it is advisable to monitor the GPU Utilization to see if the deployment is saturating GPU resources.
watch -n0.1 nvidia-smi
To measure the performance gain, let's run performance analyzer on the following configurations:
perf_analyzer -m <model name> -b <batch size> --shape <input layer>:<input shape> --concurrency-range <lower number of request>:<higher number of request>:<step>
-
- No Dynamic Batching, single model instance: This configuration is the baseline measurement. To set up the Triton Server in this configuration, do not add instance_group or dynamic_batching in config.pbtxt, and make sure to include --gpus=1 in the docker run command that starts the server.

```
# Query
perf_analyzer -m efficientnet_b3_trt -b 2 --shape input:3,300,300 --concurrency-range 2:16:2 --percentile=95

# Summarized Inference Result
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 2, throughput: 579.911 infer/sec, latency 7159 usec
Concurrency: 4, throughput: 613.242 infer/sec, latency 17064 usec
Concurrency: 6, throughput: 647.959 infer/sec, latency 26696 usec
Concurrency: 8, throughput: 660.103 infer/sec, latency 30016 usec
Concurrency: 10, throughput: 640.003 infer/sec, latency 39883 usec
Concurrency: 12, throughput: 664.315 infer/sec, latency 46405 usec
Concurrency: 14, throughput: 591.206 infer/sec, latency 57214 usec
Concurrency: 16, throughput: 585.562 infer/sec, latency 65926 usec

# Perf for 16 concurrent requests
Request concurrency: 16
  Client:
    Request count: 5273
    Throughput: 585.562 infer/sec
    p50 latency: 58041 usec
    p90 latency: 64255 usec
    p95 latency: 65926 usec
    p99 latency: 69645 usec
    Avg HTTP time: 54594 usec (send/recv 2792 usec + response wait 51802 usec)
  Server:
    Inference count: 10544
    Execution count: 1200
    Successful request count: 5272
    Avg request latency: 38647 usec (overhead 120 usec + queue 9176 usec + compute input 4285 usec + compute infer 18019 usec + compute output 7047 usec)
```
-
- Just Dynamic Batching: To set up the Triton Server in this configuration, add dynamic_batching in config.pbtxt.

```
# Query
perf_analyzer -m efficientnet_b3_trt -b 2 --shape input:3,300,300 --concurrency-range 2:16:2 --percentile=95

# Inference Result
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 2, throughput: 575.35 infer/sec, latency 7228 usec
Concurrency: 4, throughput: 591.983 infer/sec, latency 16285 usec
Concurrency: 6, throughput: 655.545 infer/sec, latency 21486 usec
Concurrency: 8, throughput: 648.158 infer/sec, latency 30080 usec
Concurrency: 10, throughput: 658.363 infer/sec, latency 35557 usec
Concurrency: 12, throughput: 678.189 infer/sec, latency 37481 usec
Concurrency: 14, throughput: 677.299 infer/sec, latency 47840 usec
Concurrency: 16, throughput: 676.165 infer/sec, latency 49272 usec

# Perf for 16 concurrent requests
Request concurrency: 16
  Client:
    Request count: 6087
    Throughput: 676.165 infer/sec
    p50 latency: 47349 usec
    p90 latency: 48491 usec
    p95 latency: 49272 usec
    p99 latency: 50612 usec
    Avg HTTP time: 47298 usec (send/recv 2369 usec + response wait 44929 usec)
  Server:
    Inference count: 12176
    Execution count: 1522
    Successful request count: 6088
    Avg request latency: 39544 usec (overhead 72 usec + queue 16273 usec + compute input 2 usec + compute infer 11828 usec + compute output 11369 usec)
```
As each request had a batch size of 2 while the model's maximum batch size was 8, dynamically batching these requests resulted in considerably improved throughput. Another consequence is a reduction in latency, which can be primarily attributed to less time spent waiting in the queue: because the requests are batched together, multiple requests are processed in parallel.
-
- Dynamic Batching with multiple model instances: To set up the Triton Server in this configuration, add instance_group with count: 2 in config.pbtxt, make sure to include --gpus=1 in the docker run command that starts the server, and include dynamic_batching in the model configuration per the instructions of the previous section.

```
# Query
perf_analyzer -m efficientnet_b3_trt -b 2 --shape input:3,300,300 --concurrency-range 2:16:2 --percentile=95

# Inference Result
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 2, throughput: 432.983 infer/sec, latency 10019 usec
Concurrency: 4, throughput: 509.38 infer/sec, latency 21272 usec
Concurrency: 6, throughput: 621.19 infer/sec, latency 26593 usec
Concurrency: 8, throughput: 642.087 infer/sec, latency 32329 usec
Concurrency: 10, throughput: 653.192 infer/sec, latency 39368 usec
Concurrency: 12, throughput: 659.101 infer/sec, latency 45811 usec
Concurrency: 14, throughput: 665.102 infer/sec, latency 51499 usec
Concurrency: 16, throughput: 668.254 infer/sec, latency 58111 usec

# Perf for 16 concurrent requests
Request concurrency: 16
  Client:
    Request count: 6019
    Throughput: 668.254 infer/sec
    p50 latency: 50748 usec
    p90 latency: 56465 usec
    p95 latency: 58111 usec
    p99 latency: 61444 usec
    Avg HTTP time: 47858 usec (send/recv 2190 usec + response wait 45668 usec)
  Server:
    Inference count: 12030
    Execution count: 2045
    Successful request count: 6015
    Avg request latency: 38563 usec (overhead 83 usec + queue 7360 usec + compute input 218 usec + compute infer 18151 usec + compute output 12750 usec)
```
-
-
PyTorch Model Testing Report using Locust
- Test different configurations
- Only the Triton inference step is measured
- Original efficientnet_b3 model.pt
- Uvicorn workers = 1
- User spawn rate = 1
| Id | Request Concurrency | Max Batch Size | Dynamic Batching | Instance Count | p95 Latency (ms) | RPS | Max GPU Memory Usage (MB) | Average GPU Utilization (%) |
|----|---------------------|----------------|------------------|----------------|------------------|-----|---------------------------|-----------------------------|
| 1 | 1 | 1 | Disabled | 1:CPU | 140 | 8 | 160 | 0 |
| 2 | 1 | 1 | Disabled | 1:GPU | 14 | 74 | 313 | 35 |
| 3 | 8 | 1 | Disabled | 1:GPU | 83 | 120 | 313 | 55 |
| 4 | 8 | 8 | Disabled | 1:GPU | 85 | 113 | 313 | 55 |
| 5 | 8 | 8 | Disabled | 2:GPU | 69 | 140 | 439 | 66 |
| 6 | 8 | 8 | Enabled | 2:GPU | 68 | 150 | 1019 | 69 |
| 7 | 16 | 1 | Disabled | 1:GPU | 170 | 111 | 313 | 58 |
| 8 | 16 | 16 | Disabled | 1:GPU | 170 | 116 | 313 | 55 |
| 9 | 16 | 16 | Disabled | 2:GPU | 130 | 143 | 439 | 66 |
| 10 | 16 | 16 | Enabled | 2:GPU | 130 | 155 | 1667 | 75 |
| 11 | 32 | 1 | Disabled | 1:GPU | 330 | 112 | 313 | 55 |
| 12 | 32 | 32 | Disabled | 1:GPU | 310 | 116 | 313 | 58 |
| 13 | 32 | 32 | Disabled | 3:GPU | 290 | 137 | 547 | 66 |
| 14 | 32 | 32 | Enabled | 3:GPU | 270 | 150 | 7395 | 73 |
| 15 | 64 | 1 | Disabled | 1:GPU | 610 | 118 | 313 | 55 |
| 16 | 64 | 64 | Disabled | 1:GPU | 570 | 117 | 313 | 55 |
| 17 | 64 | 64 | Disabled | 3:GPU | 610 | 132 | 547 | 65 |
| 18 | 64 | 64 | Enabled | 3:GPU | 640 | 143 | 14023 | 75 |
| 19 | 64 | 64 | Enabled | 5:GPU | 670 | 137 | 11183 | 70 |
| 20 | 64 | 64 | Enabled | 1:CPU | 13000 | 5 | 160 | 0 |
PyTorch model - 32 Request Concurrency - Max Batch Size 32 - Dynamic Batching - 3:GPU
ONNX Model Testing Report using Locust
- Test different configurations
- Only the Triton inference step is measured
- ONNX model efficientnet_b3_onnx model.onnx
- FP32 convert (float32)
- Uvicorn workers = 1
- User spawn rate = 1
ONNX model - 32 Request Concurrency - Max Batch Size 32 - Disabled Dynamic Batching - 2:GPU
TensorRT Model Testing Report using Locust
- Test different configurations
- Only the Triton inference step is measured
- TensorRT model efficientnet_b3_trt model.plan
- FP16 convert (float16)
- Uvicorn workers = 1
- User spawn rate = 1
| Id | Request Concurrency | Max Batch Size | Dynamic Batching | Instance Count | p95 Latency (ms) | RPS | Max GPU Memory Usage (MB) | Average GPU Utilization (%) |
|----|---------------------|----------------|------------------|----------------|------------------|-----|---------------------------|-----------------------------|
| 2 | 1 | 1 | Disabled | 1:GPU | 8 | 127 | 531 | 35 |
| 1 | 16 | 16 | Enabled | 1:CPU | 100 | 188 | 531 | 35 |
| 1 | 16 | 16 | Enabled | 2:CPU | 100 | 186 | 893 | 30 |
| 2 | 32 | 32 | Enabled | 1:GPU | 222 | 194 | 547 | 35 |
| 3 | 32 | 32 | Enabled | 2:GPU | 222 | 193 | 922 | 55 |
TensorRT model - 32 Request Concurrency - Max Batch Size 32 - Dynamic Batching - 2:GPU
References