By Eduardo Venegas, Moises Chávez, Leonardo Galindo, and Alberto Cortés.
DEMO: https://youtu.be/mwhK7TP2GxQ
Technology is a great way to help those in need, and as it continues to develop it presents new possibilities. One of them is human vision aided and complemented by computer vision.
This project consists of a web and mobile app that generates captions from images captured by the device's camera. The goal of the app is to help people with limited vision perceive the world around them: through an easy-to-use UI, users can take a picture, upload it to the app, and be presented with an audio caption, in any of multiple languages, describing what they are seeing.
The app can run in a web browser or be compiled to an Android or iOS device using the Ionic Capacitor tool.
- Docker
- Google Cloud
- Ionic Angular
- TensorFlow
- GitHub Actions
- Flask
- Python
- TypeScript
The architecture of the app is composed of a client app that captures images and a server app that processes them with a Deep Learning model and returns the generated captions.
- The client app is an Ionic Angular app that takes a picture, encodes the captured image as a base64 string, and sends it to the server in a POST request.
- The server app is a Flask container running in Docker that hosts a trained Captioning Neural Network: a Convolutional Neural Network extracts features from the image, and a Recurrent Neural Network based on a Long Short-Term Memory (LSTM) model generates captions from those features (see the sketch after this list).
- Once an image is passed as input to the Captioning Neural Network, it produces a text caption that is sent back to the client app.
- Finally, the client app reads the caption aloud using text-to-speech, translated into any of the 11 available languages; once the audio finishes, the app returns to the main activity.
- The client app can be used in a web browser or compiled to an Android or iOS device using Ionic.
- Typical response time is under 1.5 seconds.
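To make the architecture concrete, below is a minimal sketch of the server side. The endpoint path, JSON field names, model file, vocabulary file, and decoder input format are illustrative assumptions, not the project's exact code:

```python
import base64
import io
import json

import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)

# Pretrained CNN used as a feature extractor (classifier head removed).
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling="avg")
# Trained LSTM decoder and its vocabulary; file names are hypothetical.
decoder = tf.keras.models.load_model("caption_decoder.h5")
with open("vocab.json") as f:          # hypothetical vocab file: word -> id
    word_index = json.load(f)
index_word = {i: w for w, i in word_index.items()}

def generate_caption(features, max_len=30):
    """Greedy decoding: feed the LSTM its own previous words until <end>."""
    seq = [word_index["<start>"]]
    for _ in range(max_len):
        padded = tf.keras.preprocessing.sequence.pad_sequences([seq], maxlen=max_len)
        probs = decoder.predict([features, padded], verbose=0)
        next_id = int(np.argmax(probs))
        if index_word[next_id] == "<end>":
            break
        seq.append(next_id)
    return " ".join(index_word[i] for i in seq[1:])

@app.route("/caption", methods=["POST"])
def caption():
    # The client sends the captured picture as a base64 string.
    img_b64 = request.get_json()["image"]
    img = Image.open(io.BytesIO(base64.b64decode(img_b64))).convert("RGB")
    x = np.asarray(img.resize((299, 299)), dtype=np.float32)[np.newaxis]
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    features = cnn.predict(x, verbose=0)   # (1, 2048) feature vector
    return jsonify({"caption": generate_caption(features)})
```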
Users can select the language of the audio output and press the button that launches a camera intent.
Currently 11 languages are supported, with translation and pronunciation handled for each.
The user can capture an image and then confirm or reject it.
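To try the flow without the UI, the API can be exercised directly. Here is a small sketch, reusing the hypothetical /caption endpoint and image field from the server sketch above, that also times the round trip against the 1.5-second target:

```python
import base64
import time

import requests

# Encode a locally saved photo the same way the client app does.
with open("photo.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("ascii")

start = time.perf_counter()
resp = requests.post("http://localhost:5000/caption", json={"image": img_b64})
elapsed = time.perf_counter() - start

print(resp.json()["caption"])
print(f"Round trip: {elapsed:.2f}s")
```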
This project has a full Continuous Integration and Delivery system.
- All code is tested the moment a pull request is created, by building it in GitHub Actions.
- You can merge into main when all tests pass.
- When Continuous Delivery is triggered, GitHub Actions builds the API image and pushes it to the GitHub Package Registry.
- It then SSHs into a Google Cloud instance, pulls the new images, stops the current Docker Compose stack, and runs it again (sketched after this list).
- GitHub Actions also connects with Firebase to deliver automatic client deployments.
- As an extra, an Android and iOS app can be compiled from the same source code.
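For reference, the redeploy step that runs over SSH amounts to something like the following (exact commands and paths vary by setup):

$ docker-compose pull
$ docker-compose down
$ docker-compose up -d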
Make sure you have python3 and pip installed.
Create and activate a virtual environment:
$ python -m venv python3-virtualenv
$ source python3-virtualenv/bin/activate
Use the package manager pip to install all dependencies:
$ pip install -r requirements.txt
Install the node modules:
$ npm i
Make sure you have the Ionic CLI installed:
$ npm install -g @ionic/cli
Run the server:
$ flask run
In a separate terminal, serve the client:
$ ionic serve
Contributions are welcome! Please refer to the guidelines.