Skip to content

Latest commit

 

History

History
92 lines (63 loc) · 2.44 KB

README.md

File metadata and controls

92 lines (63 loc) · 2.44 KB

Arcane

                                               
_____ _______   ____ _____    ____   ____      
\__  \\_  __ \_/ ___\\__  \  /    \_/ __ \     
 / __ \|  | \/\  \___ / __ \|   |  \  ___/     
(____  /__|    \___  >____  /___|  /\___  > /\ 
     \/            \/     \/     \/     \/  \/ 

Arcane is a powerful CLI tool designed to simplify and streamline the process of distributed training for machine learning models. It allows you to efficiently manage and monitor training tasks across multiple machines or GPUs.

Features

  • Distributed Training: Seamlessly distribute training tasks across multiple nodes.
  • Easy Configuration: Simple setup with configuration files and environment variables.
  • Resource Monitoring: Track resource usage and performance metrics.
  • Scalability: Easily scale your training tasks to handle large datasets and complex models.

Installation

To install Arcane, clone the repository and use the following command:

git clone https://github.com/yourusername/arcane.git
cd arcane
pip install .

Usage

Arcane provides a command-line interface for managing distributed training tasks. Here are some common commands:

  • Start Training: Begin a distributed training session.

    arcane train --config path/to/config.yaml
  • Monitor Progress: Check the status of your training tasks.

    arcane status
  • Stop Training: Terminate a running training session.

    arcane stop

Configuration

Arcane uses a YAML configuration file to specify training parameters and machine roles. Here is an example configuration:

master:
  host: master-node
  port: 12345

workers:
  - host: worker-node-1
    port: 12346
  - host: worker-node-2
    port: 12347

training:
  model: resnet50
  dataset: /path/to/dataset
  epochs: 10
  batch_size: 32

Testing

To test Arcane across multiple machines:

  1. Ensure all machines are on the same network and have SSH access.
  2. Set up the environment and dependencies on each machine.
  3. Use the sample configuration file to start a distributed training task.
  4. Monitor logs and resource usage to ensure everything is functioning correctly.

Contributing

We welcome contributions! Please read our contributing guidelines for more details.

PS: This is still under construction

License

This project is licensed under the MIT License - see the LICENSE file for details.