Character-level machine translation on named entities, using FastAPI, spaCy, a PyTorch sequence-to-sequence model and Docker
Use this project to easily set up a machine translation API for authenticated users. Get 'normalized' entities from raw text (emails, files, chatbot conversations):
- Add your own translations to the database
- Train your models using your translation pairs
- Convert raw entities into actionable features
- Install Docker and Docker-Compose
- Clone the repo
git clone https://github.com/vincentporte/machine_translation_fastapi_pytorch_docker.git
- Build the docker image
docker-compose build backend
- Set up keys and credentials in .env
POSTGRES_USER=db_user
POSTGRES_PASSWORD=db_pass
POSTGRES_DB=db_name
SECRET_KEY=secret_key_for_users_management
DATABASE_URL=postgres://db_user:db_pass@db:5432/db_name
- Add your own NER model, see the spaCy docs
- Run your containers
docker-compose up -d;docker-compose logs -f
- Initialize your database
docker-compose exec backend aerich init-db
- Add your dataset files and train your own seq2seq model
docker-compose exec backend python app/services/training.py
- Run tests
docker-compose exec backend pytest
- Access the DB command line
docker exec -it mt_db psql -U db_user -h 127.0.0.1 -W db_name
- Grant superuser rights
UPDATE usermodel SET is_superuser = 't', is_verified = 't' WHERE email = '[email protected]';
- Generate migration file
docker-compose exec backend aerich migrate
- Apply upgrade to DB
docker-compose exec backend aerich upgrade
- Replace /config/nginx/nginx.conf with nginx.conf.live and update server_name refs
server_name subdomain.domain.com;
- Add a CAA record to your DNS
subdomain.domain.com. CAA 0 issue letsencrypt.org
- Update refs in init-letsencrypt.sh
domains=(subdomain.domain.com)
email="[email protected]"
- Run the init-letsencrypt.sh script to set up certificates
sudo ./init-letsencrypt.sh
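The .env file from the setup steps repeats the database user, password and name inside DATABASE_URL, so a typo in one place can silently break the connection. The following is a minimal sketch of a consistency check, using only the Python standard library. The helper names parse_env and database_url_mismatches are hypothetical and are not part of this project.

```python
from urllib.parse import urlsplit


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def database_url_mismatches(env: dict[str, str]) -> list[str]:
    """Return the POSTGRES_* variable names that disagree with DATABASE_URL."""
    url = urlsplit(env["DATABASE_URL"])
    found = {
        "POSTGRES_USER": url.username,
        "POSTGRES_PASSWORD": url.password,
        "POSTGRES_DB": url.path.lstrip("/"),
    }
    return [name for name, value in found.items() if env.get(name) != value]


# Placeholder values copied from the .env example in this README.
SAMPLE = """\
POSTGRES_USER=db_user
POSTGRES_PASSWORD=db_pass
POSTGRES_DB=db_name
SECRET_KEY=secret_key_for_users_management
DATABASE_URL=postgres://db_user:db_pass@db:5432/db_name
"""

print(database_url_mismatches(parse_env(SAMPLE)))
```

An empty list means the three POSTGRES_* values match the URL. To check a real file, pass the contents of .env to parse_env instead of SAMPLE.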
Using FastAPIUsers
curl \
-H "Content-Type: application/json" \
-X POST \
-d "{\"email\": \"[email protected]\",\"password\": \"strongpassword\"}" \
http://localhost/auth/register
Returns:
{
"id":"800e9564-6804-4ab5-bc59-a088182227be",
"email":"[email protected]",
"is_active":true,
"is_superuser":false,
"is_verified":false
}
curl \
-H "Content-Type: multipart/form-data" \
-X POST \
-F "[email protected]" \
-F "password=strongpassword" \
http://localhost:8000/auth/login
Returns:
{
"access_token":"eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjoiODAwZTk1NjQtNjgwNC00YWI1LWJjNTktYTA4ODE4MjIyN2JlIiwiYXVkIjpbImZhc3RhcGktdXNlcnM6YXV0aCJdLCJleHAiOjE2MzAxMzk5OTJ9.w-ZWpm51fyybFivmKjun3qbXuqwXCgYyxGbPD1yhIr4",
"token_type":"bearer"
}
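The access_token returned by the login route is a JWT, and its middle segment is base64url-encoded JSON. The sketch below decodes that segment to inspect the claims (the sample token above carries user_id, aud and exp) so a client can decide when to log in again. It deliberately does NOT verify the signature, so it must not be used for authorization decisions. The function names are hypothetical.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated segments")
    payload = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def is_expired(claims: dict, now: float) -> bool:
    """Compare the exp claim (seconds since the epoch) with a given time."""
    return now >= claims["exp"]
```

A client would typically call is_expired(decode_jwt_payload(token), time.time()) before reusing a stored token.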
curl -X 'POST' \
'http://localhost/ner' \
-H 'accept: application/json' \
-H 'Authorization: Bearer token' \
-H 'Content-Type: application/json' \
-d '{
"sentence": "un devis pour 500 flyers en quadri r/v, format a4 pour demain svp"
}'
Returns:
{
"entities": [
{
"text": "500",
"entity": "EXEMPLAIRES",
"pos": 0,
"start": 14,
"end": 17
},
{
"text": "flyers",
"entity": "PRODUCT",
"pos": 1,
"start": 18,
"end": 24
},
{
"text": "quadri r/v",
"entity": "IMPRESSION",
"pos": 2,
"start": 28,
"end": 38
},
{
"text": "format a4",
"entity": "FORMAT",
"pos": 3,
"start": 40,
"end": 49
}
],
"ner": "imprimeur_4.3.20210312124255"
}
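In the response above, start and end look like character offsets into the submitted sentence, with end exclusive. The sketch below checks that reading against the example, using the sentence and entities copied from the request and response. check_offsets is a hypothetical helper, not part of the API.

```python
def check_offsets(sentence: str, entities: list[dict]) -> list[dict]:
    """Return the entities whose text differs from sentence[start:end]."""
    return [
        ent for ent in entities
        if sentence[ent["start"]:ent["end"]] != ent["text"]
    ]


sentence = "un devis pour 500 flyers en quadri r/v, format a4 pour demain svp"
entities = [
    {"text": "500", "entity": "EXEMPLAIRES", "pos": 0, "start": 14, "end": 17},
    {"text": "flyers", "entity": "PRODUCT", "pos": 1, "start": 18, "end": 24},
    {"text": "quadri r/v", "entity": "IMPRESSION", "pos": 2, "start": 28, "end": 38},
    {"text": "format a4", "entity": "FORMAT", "pos": 3, "start": 40, "end": 49},
]

# An empty result means every entity matches its slice of the sentence.
print(check_offsets(sentence, entities))
```

Keeping the offsets alongside the text lets a client highlight or replace the matched spans in the original message.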
curl -X 'POST' \
'http://localhost/products' \
-H 'accept: application/json' \
-H 'Authorization: Bearer token' \
-H 'Content-Type: application/json' \
-d '{
"entity_type": "FO",
"source": "fo a4",
"translation": "format ouvert : 210.0 x 297.0 mm"
}'
Returns:
{
"id": 2,
"entity_type": "FO",
"source": "fo a4",
"translation": "format ouvert : 210.0 x 297.0 mm"
}
curl -X 'POST' \
'http://localhost/products/extract' \
-H 'accept: application/json' \
-H 'Authorization: Bearer token' \
-d ''
Returns:
{
"msg": "extracting"
}
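Once translation pairs are extracted, a character-level seq2seq model needs each string turned into a sequence of integer ids. The sketch below shows one common way to do this, under the assumption that every distinct character gets its own id and that start and end markers are added. It is an illustration only and does not reproduce the project's training code in app/services/training.py. All names are hypothetical.

```python
SOS, EOS = "<sos>", "<eos>"


def build_vocab(texts) -> dict[str, int]:
    """Assign an id to each character; ids 0 and 1 are reserved for markers."""
    vocab = {SOS: 0, EOS: 1}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Turn a string into ids, wrapped in start and end markers."""
    return [vocab[SOS]] + [vocab[ch] for ch in text] + [vocab[EOS]]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Invert encode(), dropping the start and end markers."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in (SOS, EOS))


# The pair from the /products example above.
pairs = [("fo a4", "format ouvert : 210.0 x 297.0 mm")]
src_vocab = build_vocab(src for src, _ in pairs)
tgt_vocab = build_vocab(tgt for _, tgt in pairs)

print(encode("fo a4", src_vocab))
```

Separate source and target vocabularies keep the encoder and decoder embeddings small. A real dataset would also need a policy for characters unseen at training time.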
curl -X 'POST' \
'http://localhost/translate' \
-H 'accept: application/json' \
-H 'Authorization: Bearer token' \
-H 'Content-Type: application/json' \
-d '{
"entities": [
{
"text": "500",
"entity": "EXEMPLAIRES",
"pos": 0,
"start": 14,
"end": 17
},
{
"text": "flyers",
"entity": "PRODUCT",
"pos": 1,
"start": 18,
"end": 24
},
{
"text": "quadri r/v",
"entity": "IMPRESSION",
"pos": 2,
"start": 28,
"end": 38
},
{
"text": "format a4",
"entity": "FORMAT",
"pos": 3,
"start": 40,
"end": 49
}
],
"model": "imprimeur"
}'
Returns:
{
"entities": [
{
"text": "500",
"entity": "EXEMPLAIRES",
"pos": 0,
"start": 14,
"end": 17
},
{
"text": "flyers",
"entity": "PRODUCT",
"pos": 1,
"start": 18,
"end": 24
},
{
"text": "recto : quadri, verso : quadri",
"entity": "IMPRESSION",
"pos": 2,
"start": 28,
"end": 38
},
{
"text": "format fini : 210.0 x 297.0 mm",
"entity": "FORMAT",
"pos": 3,
"start": 40,
"end": 49
}
]
}
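One stated goal of the project is to convert raw entities into actionable features. A minimal sketch of that last step follows, assuming the /translate response shape shown above. It groups the translated texts by entity label, in the order given by pos. to_features is a hypothetical helper, not part of the API.

```python
def to_features(entities: list[dict]) -> dict[str, list[str]]:
    """Group entity texts by label, ordered by their pos field."""
    features: dict[str, list[str]] = {}
    for ent in sorted(entities, key=lambda e: e["pos"]):
        features.setdefault(ent["entity"], []).append(ent["text"])
    return features


# Entities copied from the /translate response above.
translated = [
    {"text": "500", "entity": "EXEMPLAIRES", "pos": 0, "start": 14, "end": 17},
    {"text": "flyers", "entity": "PRODUCT", "pos": 1, "start": 18, "end": 24},
    {"text": "recto : quadri, verso : quadri", "entity": "IMPRESSION",
     "pos": 2, "start": 28, "end": 38},
    {"text": "format fini : 210.0 x 297.0 mm", "entity": "FORMAT",
     "pos": 3, "start": 40, "end": 49},
]

print(to_features(translated))
```

Lists are used as values because a message may contain several entities with the same label.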
For more examples, please refer to the Documentation
- Train translation model with users' dataset
  - add translation pairs in DB
  - extract dataset and train PyTorch model
- User verification by email
  - set up Mailgun API key
- Named entity recognition to extract the parts of text to translate
  - set up spaCy
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the GPL-3.0 License. See LICENSE.txt for more information.
Vincent PORTE - [email protected]
Project Link: https://github.com/vincentporte/machine_translation_fastapi_pytorch_docker