Releases: leaf-ai/studio-go-runner
0.14.3-main-aaaagseyvek
Merge branch 'main' of github.com:leaf-ai/studio-go-runner
0.14.3-main-aaaagsdvzdw
Upgraded cards; introduced a new guCount value as well
0.14.3-main-aaaagsbfdfr
Fix the template file so it gets rewound on each pass of the job gene…
0.14.2
0.14.1
IMPROVEMENTS:
- The queue-status tool has been renamed queue-scaler to reflect its extended functionality
- cosign support for image verification on Docker Hub and AWS ECR
FIXES:
- Provisioning hosts with the queue-scaler tool could cause overly powerful machines to be allocated
- The Docker Hub release images for this version have been signed. Please review the instructions in the "A note concerning security and privacy" section of README.md.
0.14.0
IMPROVEMENTS:
- Upgrades to the AWS CLI and Prometheus common libraries
- Introduced queue-status, a tool for job-dispatching deployments that use autoscaling
- Migrated from Ubuntu 18.04 to Ubuntu 20.04
- TensorFlow 1.x support removed; supported versions are now 2.3 through 2.5
- Python support extended to include 3.9; 3.8.10 is the default
- gRPC and protobuf upgrades
- Go 1.16.4 support
- CUDA 11.2 migration
FIXES:
- GPU memory requirements could incorrectly result in two cards being allocated, one to satisfy memory and one for compute
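The corrected behavior can be pictured with a minimal Go sketch in which a request is served by one card that covers its memory need, rather than one card per resource. The `GPU` type, `pickGPU`, and the best-fit policy are hypothetical illustrations, not the runner's actual allocator.

```go
package main

import (
	"errors"
	"fmt"
)

// GPU is a hypothetical inventory record; the runner's real types differ.
type GPU struct {
	ID      string
	FreeMem uint64 // free device memory in bytes
	InUse   bool
}

var errNoCard = errors.New("no single GPU satisfies the request")

// pickGPU returns one free card whose memory covers memNeeded, so memory and
// compute come from the same device. Among the cards that fit, it picks the
// smallest (best fit), leaving larger cards for larger requests.
func pickGPU(gpus []GPU, memNeeded uint64) (string, error) {
	best := -1
	for i, g := range gpus {
		if g.InUse || g.FreeMem < memNeeded {
			continue
		}
		if best == -1 || g.FreeMem < gpus[best].FreeMem {
			best = i
		}
	}
	if best == -1 {
		return "", errNoCard
	}
	return gpus[best].ID, nil
}

func main() {
	gpus := []GPU{
		{ID: "0", FreeMem: 8 << 30},
		{ID: "1", FreeMem: 16 << 30},
		{ID: "2", FreeMem: 40 << 30, InUse: true},
	}
	id, err := pickGPU(gpus, 12<<30)
	fmt.Println(id, err) // 1 <nil>
}
```

If no single card fits, the sketch fails the request instead of splitting it across devices, which is the split the fix above removes.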
As a reminder, the Go modules feature now in use authenticates modules by verifying their checksums against a checksum database hosted by Google. Please review the privacy notice for this feature at https://proxy.golang.org/privacy. A vendor directory is provided so that, if you wish to run in an air-gapped configuration, you can avoid relying on Go module proxies for integrity checking.
0.13.2
IMPROVEMENTS:
- Storage limits are now enforced when downloading artifacts, based on the disk space requested by the StudioML client
- Idle time limits added through the new options -limit-idle-duration and -limit-interval, which take duration strings such as 10m for 10 minutes
- Completed-jobs limit option added, -limit-tasks
- Documented autoscaling down to zero for the EKS use case in docs/aws_k8s.md
- Go 1.16.3 support
- A100 support: non-MIG mode only on AWS; mixed and single MIG modes on on-premises Kubernetes
- Upgrades to the RabbitMQ Rabbit Hole client and many other dependencies
FIXES:
- Security fixes to prevent files escaping the destination directory when unpacking artifact archives
- When using multiple GPUs, CUDA_VISIBLE_DEVICES was overwritten each time a new GPU device was added
KNOWN BUGS:
- Mixed and single MIG mode support on AWS A100 (p4d.24xlarge) instances is awaiting fixes from AWS
0.13.1
IMPROVEMENTS:
- Go 1.16 support
- Dockerfile for the stack introduced to improve build times
- AWS MQ support for RabbitMQ; specific instructions can be found in docs/aws_k8s.md
FIXES:
- The TestTFXCfgGenerator timeout was too short, making the test flaky
- Prevent releases from overwriting older versions
- Fix CWE-22 (path traversal) issues in the handling of symbolic links in tar files, https://cwe.mitre.org/data/definitions/22.html
- Upgrades of packages affected by CVEs
0.13.0
IMPROVEMENTS:
- Code base pkg components used by multiple projects have been refactored into a new repository, github.com/leaf-ai/go-service
- Go 1.15.8 support with modules
- Removed the deprecated proprietary Google Cloud Storage API; S3 mode is now used to interact with the Google Cloud Storage offering
- S3 credentials migrated to being per artifact; environment variables are no longer used unless --allow-env-secrets is specified