Feature/142 144 145 multi gpu data center (#160)
* Begin work on and assigned release
* Dry run k8s based release
* Fix YAML comments, YAML, RPG for a new generation
* k8s build and release test
* Comment amended
* Fixed for cards not running ECC memory
* Print warning on startup related to GPU features that are ignored or outright fail so they can be seen within cluster logs etc
* Shadowing issues
* Test case for CUDA needs fixing
* Removing some debug statements from the docker file
* Moving to 2 gpu tests with verification using k8s annotations
* Added minio archive uploads for the test experiment
* Added RMQ support for sending messages which is specific to test cases for running live experiments in test
* Bump pre-release version
* Relocate code the cmd test relies on being in the non test code from the internal packages
* rmq reference needs env variables from the kubernetes pod
* Cleanup minio bucket empty procedure
* Bump pre-release version
* Use the main minio singleton server for tests to erase buckets
* Tested failure conditions in builds stop the image push
* Missed error condition
* Testing RMQ send
* Publish is blocking, try using confirmation to exit the publish step
* Enabled rabbit MQ downloading for the CLI tooling
* Correct build issues
* Intentional syntax error
* Add binding to try and get the msg delivery done
* Add debug to the test build flags
* Put in the base tensorflow and other python packages, no specific attempt to do CUDA libraries for now
* Expand installed software to full CUDA support
* Moving to tests that require the entire CUDA runtime
* Add new msgs to logging so we can iron out counter implementations and test
* Moving to tests that require the entire CUDA runtime
* Fix debugging messages
* Try to fix label cardinality
* Missed the experiment field
* Reduced parameters passed into the work callback function, still needs to be converted to method to tighten things up properly
* Run small prometheus code blocks to experiment with their data structures
* Don't timeout explicitly inside the test, allow the golang test framework to deal with that
* Debugging within metric extraction on prometheus
* Fix metric naming issues
* Open vs Create fix for a new file
* Locate the output file from tar unpacking
* Fix issue with an incorrect file name
* Added regex based extraction and testing of validation and training loss and accuracy values
* Better logging when test loss and accuracy extraction gives answers that are out of range
* Fix index issue
* Upgrade to Go 1.11.1
* Adding unit tests for the relocation to a directory function
* Test refactoring in pre for a wider variety of TF and pytorch experiments
* Add mgpu skeleton for pytorch, some refactoring
* Fix experiment id in test data, groom some logging
* Fixed issue with the wrong py entry point file name in the template
* Improve logging
* Add validation skeleton
* Make matched metrics log meaningful
* Use 4 slots to get 2 GPU cards, logging
* Debugging GPU discovery
* Migrate resource free reporting to the prometheus endpoint
* Add resource monitoring to prometheus
* Neglected to register resource free gauges
* Improve the cuda reporting and make better use of channels for status messages coming from other packages
* Mid point for GPU allocation unit tests
* generate enabled for the project with the new duat
* duat 0.9.1 upgrade
* Unit tests for allocation now working
* Template multi gpu driven test attempt
* Fix json data type issue
* Supply some default unit values that make sense for multi gpu
* Add negative case
* Lint issues work to advance the report card scores
* Get rid of debugging prints
* Added slot documentation
* Bump the release version
* Change log added prior to release tagging