Feature/142 144 145 multi gpu data center (#160)
* Begin work on and assigned release

* Dry run k8s based release

* Fix YAML comments, YAML, RPG for a new generation

* k8s build and release test

* k8s build and release test

* k8s build and release test

* k8s build and release test

* comment amended

* Fixed for cards not running ECC memory

* Print warning on startup related to GPU features that are ignored or outright fail so they can be seen within cluster logs etc

* Shadowing issues

* Test case for CUDA needs fixing

* Removing some debug statements from the docker file

* Moving to 2 gpu tests with verification using k8s annotations

* Moving to 2 gpu tests with verification using k8s annotations

* Moving to 2 gpu tests with verification using k8s annotations

* Moving to 2 gpu tests with verification using k8s annotations

* Moving to 2 gpu tests with verification using k8s annotations

* Moving to 2 gpu tests with verification using k8s annotations

* Added minio archive uploads for the test experiment

* Added RMQ support for sending messages which is specific to test cases for running live experiments in test

* Bump pre-release version

* Bump pre-release version

* Bump pre-release version

* Relocate code that the cmd test relies on out of the internal packages and into the non-test code

* rmq reference needs env variables from the kubernetes pod

* Cleanup minio bucket empty procedure

* Bump pre-release version

* Use the main minio singleton server for tests to erase buckets

* Tested failure conditions in builds stop the image push

* Missed error condition

* testing RMQ send

* Publish is blocking; try using confirmation to exit the publish step

* Publish is blocking; try using confirmation to exit the publish step

* Publish is blocking; try using confirmation to exit the publish step

* Enabled rabbit MQ downloading for the CLI tooling

* Correct build issues

* Correct build issues

* Correct build issues

* Intentional syntax error

* Add binding to try and get the msg delivery done

* Add debug to the test build flags

* Put in the base tensorflow and other python packages, no specific attempt to do CUDA libraries for now

* Expand installed software to full CUDA support

* Moving to tests that require the entire CUDA runtime

* Add new msgs to logging so we can iron out counter implementations and test

* Moving to tests that require the entire CUDA runtime

* Moving to tests that require the entire CUDA runtime

* Fix debugging messages

* Try to fix label cardinality

* Missed the experiment field

* Reduced parameters passed into the work callback function, still needs to be converted to method to tighten things up properly

* Run small prometheus code blocks to experiment with their data structures

* Run small prometheus code blocks to experiment with their data structures

* Don't time out explicitly inside the test; allow the golang test framework to deal with that

* Debugging within metric extraction on prometheus

* Debugging within metric extraction on prometheus

* Debugging within metric extraction on prometheus

* Debugging within metric extraction on prometheus

* Debugging within metric extraction on prometheus

* Fix metric naming issues

* Open vs Create fix for a new file

* Locate the output file from tar unpacking

* Locate the output file from tar unpacking

* Locate the output file from tar unpacking

* Locate the output file from tar unpacking

* Fix issue with an incorrect file name

* Added regex based extraction and testing of validation and training loss and accuracy values

* Better logging when test loss and accuracy extraction gives answers that are out of range

* Better logging when test loss and accuracy extraction gives answers that are out of range

* Better logging when test loss and accuracy extraction gives answers that are out of range

* Better logging when test loss and accuracy extraction gives answers that are out of range

* Fix index issue

* Fix index issue

* Upgrade to Go 1.11.1

* Adding unit tests for the relocation to a directory function

* Test refactoring in prep for a wider variety of TF and pytorch experiments

* Test refactoring in prep for a wider variety of TF and pytorch experiments

* add mgpu skeleton for pytorch, some refactoring

* add mgpu skeleton for pytorch, some refactoring

* fix experiment id in test data, groom some logging

* fix experiment id in test data, groom some logging

* Fixed issue with the wrong py entry point file name in the template

* Improve logging

* Add validation skeleton

* Add validation skeleton

* Make matched metrics log meaningful

* Use 4 slots to get 2 GPU cards, logging

* Debugging GPU discovery

* Debugging GPU discovery

* Migrate resource free reporting to the prometheus endpoint

* Add resource monitoring to prometheus

* Add resource monitoring to prometheus

* Add resource monitoring to prometheus

* Neglected to register resource free gauges

* Improve the cuda reporting and make better use of channels for status messages coming from other packages

* Mid point for GPU allocation unit tests

* generate enabled for the project with the new duat

* duat 0.9.1 upgrade

* Unit tests for allocation now working

* template multi gpu driven test attempt

* template multi gpu driven test attempt

* template multi gpu driven test attempt

* fix json data type issue

* Supply some default unit values that make sense for multi gpu

* Add negative case

* Lint issues work to advance the report card scores

* Get rid of debugging prints

* Added slot documentation

* Bump the release version

* Change log added prior to release tagging
karlmutch authored Oct 19, 2018
1 parent 2071b61 commit c8ff770