Arcane

                                               
_____ _______   ____ _____    ____   ____      
\__  \\_  __ \_/ ___\\__  \  /    \_/ __ \     
 / __ \|  | \/\  \___ / __ \|   |  \  ___/     
(____  /__|    \___  >____  /___|  /\___  > /\ 
     \/            \/     \/     \/     \/  \/ 

Arcane is a powerful CLI tool designed to simplify and streamline the process of distributed training for machine learning models. It allows you to efficiently manage and monitor training tasks across multiple machines or GPUs.

Features

  • Distributed Training: Seamlessly distribute training tasks across multiple nodes.
  • Easy Configuration: Simple setup with configuration files and environment variables.
  • Resource Monitoring: Track resource usage and performance metrics.
  • Scalability: Easily scale your training tasks to handle large datasets and complex models.

Installation

To install Arcane, clone the repository and install it with pip:

git clone https://github.com/devadigapratham/arcane.git
cd arcane
pip install .
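
If the install registers an arcane console script on your PATH and the CLI supports the conventional --help flag (both are assumptions, not confirmed by this README), you can verify the setup with:

arcane --help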

Usage

Arcane provides a command-line interface for managing distributed training tasks. Here are some common commands:

  • Start Training: Begin a distributed training session.

    arcane train --config path/to/config.yaml
  • Monitor Progress: Check the status of your training tasks.

    arcane status
  • Stop Training: Terminate a running training session.

    arcane stop
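
Putting these together, a typical session might look like the following sketch. The configuration path is illustrative, and no flags beyond those shown above are assumed:

# start a run with your configuration file
arcane train --config path/to/config.yaml

# from another terminal, check on the run
arcane status

# end the run when you are done (or if something goes wrong)
arcane stop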

Configuration

Arcane uses a YAML configuration file to specify training parameters and machine roles. Here is an example configuration:

master:
  host: master-node
  port: 12345

workers:
  - host: worker-node-1
    port: 12346
  - host: worker-node-2
    port: 12347

training:
  model: resnet50
  dataset: /path/to/dataset
  epochs: 10
  batch_size: 32
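
Before launching a run, it can help to confirm that the file is valid YAML. Assuming PyYAML is available in your environment (it is not guaranteed to be installed by Arcane), a quick syntax check is:

python -c "import yaml; yaml.safe_load(open('path/to/config.yaml')); print('config parses')"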

Testing

To test Arcane across multiple machines:

  1. Ensure all machines are on the same network and can reach one another over SSH (a quick check for this and the next step is sketched after this list).
  2. Set up the environment and dependencies on each machine.
  3. Use the sample configuration file to start a distributed training task.
  4. Monitor logs and resource usage to ensure everything is functioning correctly.
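
For steps 1 and 2, a rough pass over the hosts from the example configuration might look like the sketch below. The hostnames come from the sample config, passwordless SSH is assumed, and these are ordinary shell commands rather than Arcane features:

# check SSH reachability, then clone and install Arcane on each node
for node in master-node worker-node-1 worker-node-2; do
  ssh -o BatchMode=yes "$node" 'echo "reachable: $(hostname)"' || { echo "cannot reach $node"; continue; }
  ssh "$node" 'git clone https://github.com/devadigapratham/arcane.git && cd arcane && pip install .'
done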

Contributing

We welcome contributions! Please read our contributing guidelines for more details.

PS: This is still under construction

License

This project is licensed under the MIT License - see the LICENSE file for details.
