GraphStorm is a high-performance distributed graph analytics framework designed to run on HPC clusters. The project implements and optimizes several graph algorithms, such as PageRank and community detection, across multiple nodes in a cluster. It shows how graph processing can scale from a single machine to a full HPC cluster using free and open-source tools.
- Features
- Installation
- Usage
- Algorithms Implemented
- Benchmarking
- File Structure
- Documentation
- Contributing
- License
- Contact
- Distributed Processing: Utilize tools like Dask and Ray to distribute graph algorithms across multiple nodes.
- Scalable: Efficiently process large-scale graph datasets on HPC clusters.
- Real-World Applications: Implement graph algorithms used in social network analysis, recommendation systems, fraud detection, and more.
- Optimized Partitioning: Advanced graph partitioning and load balancing techniques to maximize performance.
- Free & Open-Source: Built entirely using free resources and open-source software.
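As a rough illustration of the partitioning and load-balancing idea, the sketch below assigns nodes to partitions greedily, heaviest node first, and then counts how many edges cross partition boundaries. It is not GraphStorm's partitioner; the function names and the toy edge list are hypothetical, and the heuristic balances load only, without trying to minimize the edge cut as dedicated partitioners do.

```python
import heapq
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def balanced_partition(edges: List[Edge], k: int) -> Dict[str, int]:
    """Assign nodes to k partitions, highest-degree nodes first.

    Each node goes to the partition whose total degree is currently
    smallest, a greedy heuristic for evening out per-worker load.
    """
    degree: Dict[str, int] = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Min-heap of (current load, partition id).
    heap = [(0, part) for part in range(k)]
    heapq.heapify(heap)

    assignment: Dict[str, int] = {}
    for node in sorted(degree, key=lambda n: (-degree[n], n)):
        load, part = heapq.heappop(heap)
        assignment[node] = part
        heapq.heappush(heap, (load + degree[node], part))
    return assignment


def edge_cut(edges: List[Edge], assignment: Dict[str, int]) -> int:
    """Count edges whose endpoints land in different partitions."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


def partition_loads(edges: List[Edge], assignment: Dict[str, int], k: int) -> List[int]:
    """Total degree held by each partition."""
    loads = [0] * k
    for u, v in edges:
        loads[assignment[u]] += 1
        loads[assignment[v]] += 1
    return loads


# Hypothetical toy graph, not taken from the project's sample data.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e"), ("e", "f")]
parts = balanced_partition(toy_edges, k=2)
print("assignment:", parts)
print("loads:", partition_loads(toy_edges, parts, k=2))
print("edge cut:", edge_cut(toy_edges, parts))
```

The edge cut matters because every cut edge implies communication between workers, so a real partitioner has to trade load balance against cut size.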
- Clone the repository:
    git clone https://github.com/gitchrisqueen/graphstorm.git
    cd graphstorm
- Create a virtual environment and install dependencies:
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    pip install -r requirements.txt
- Download sample graph data:
  Sample data can be found in the graphstorm/data/sample_graph_data directory. You can add more datasets as needed. See data/README.md for more information.
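This README does not describe the format of the sample data, so the following is only a sketch under an assumed format: a plain-text edge list with one whitespace-separated `source target` pair per line and `#` comments. The `load_edge_list` helper is hypothetical and is not the project's `graph_loader.py`; check data/README.md for the real format before adding datasets.

```python
import io
from typing import Dict, List, TextIO


def load_edge_list(handle: TextIO) -> Dict[str, List[str]]:
    """Read a directed edge list into an adjacency-list dictionary.

    Assumed format (not confirmed by the project): one edge per line as
    "source target", blank lines ignored, '#' starts a comment.
    """
    graph: Dict[str, List[str]] = {}
    for line_number, raw in enumerate(handle, start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split()
        if len(parts) < 2:
            raise ValueError(f"line {line_number}: expected 'source target'")
        source, target = parts[0], parts[1]
        graph.setdefault(source, []).append(target)
        # Make sure nodes with no outgoing edges still appear as keys.
        graph.setdefault(target, [])
    return graph


# In-memory stand-in for a data file; the contents are made up.
sample_text = "# toy graph\na b\na c\n\nb c\nc a\n"
graph = load_edge_list(io.StringIO(sample_text))
print(graph)
```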
To run graph algorithms on a single machine:
- Navigate to the graphstorm/examples/ directory.
- Execute one of the example scripts:
    python pagerank_example.py
This will run the PageRank algorithm on a sample graph and output the results.
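For readers who want to see what such a run computes, here is a minimal single-machine sketch of PageRank by power iteration. It is not the code in `pagerank_example.py`; the `pagerank` function, its parameters (damping factor 0.85, tolerance, iteration cap), and the toy graph are assumptions chosen for illustration.

```python
from typing import Dict, Hashable, List


def pagerank(
    graph: Dict[Hashable, List[Hashable]],
    damping: float = 0.85,
    tol: float = 1e-10,
    max_iter: int = 200,
) -> Dict[Hashable, float]:
    """Power-iteration PageRank on an adjacency-list graph.

    `graph` maps each node to the list of nodes it links to. Every node
    must appear as a key, even if it has no outgoing edges.
    """
    nodes = list(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(max_iter):
        # Rank held by dangling nodes (no out-links) is spread evenly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}

        # Each node passes an equal share of its rank along every out-link.
        for v in nodes:
            if graph[v]:
                share = damping * rank[v] / len(graph[v])
                for w in graph[v]:
                    new_rank[w] += share

        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical four-node graph, not taken from the project's sample data.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.4f}")
```

In this toy graph, node `c` receives links from every other node and ends up with the highest score, while `d`, which nothing links to, keeps only the baseline share.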
To run the project on a distributed cluster:
- Ensure that Dask or Ray is installed on all cluster nodes.
- Modify the scripts in graphstorm/src/distributed_processing/ to match your cluster's configuration (IP addresses, ports, etc.).
- Start the Dask or Ray cluster by running the matching setup script from graphstorm/src/distributed_processing/:
    python dask_cluster_setup.py  # For Dask
    python ray_cluster_setup.py   # For Ray
- Run the desired algorithm from graphstorm/examples/:
    python pagerank_example.py
The script will automatically distribute the workload across the available nodes.
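How the workload is split is not spelled out here, so the sketch below shows one common pattern rather than GraphStorm's actual mechanism: each worker computes the rank contributions for its own chunk of nodes (map), and the driver merges the partial results (reduce). A local `ThreadPoolExecutor` stands in for the cluster so the example runs anywhere; with Dask, `Client.submit` and future `.result()` calls play a similar role, and with Ray, `@ray.remote` functions and `ray.get`. All function names and the toy graph are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence


def partial_contributions(
    chunk: Sequence[str],
    graph: Dict[str, List[str]],
    rank: Dict[str, float],
) -> Dict[str, float]:
    """Map step: rank that the nodes in `chunk` send along their out-links."""
    out: Dict[str, float] = defaultdict(float)
    for v in chunk:
        for w in graph[v]:
            out[w] += rank[v] / len(graph[v])
    return dict(out)


def distributed_step(graph, rank, chunks, executor, damping=0.85):
    """One PageRank iteration with the map step fanned out over workers.

    Simplifying assumption: every node has at least one out-link.
    """
    n = len(graph)
    futures = [
        executor.submit(partial_contributions, chunk, graph, rank)
        for chunk in chunks
    ]
    new_rank = {v: (1.0 - damping) / n for v in graph}
    # Reduce step: merge the partial results from every worker.
    for future in futures:
        for node, value in future.result().items():
            new_rank[node] += damping * value
    return new_rank


# Hypothetical toy graph and a hand-picked split into two chunks.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
chunks = [["a", "b"], ["c", "d"]]  # one chunk per simulated worker
rank = {v: 1.0 / len(graph) for v in graph}

with ThreadPoolExecutor(max_workers=2) as pool:
    for _ in range(100):
        rank = distributed_step(graph, rank, chunks, pool)

print({node: round(score, 4) for node, score in rank.items()})
```

On a real cluster the full `rank` dictionary would not be shipped to every worker on each iteration; only the values a partition needs would be exchanged, which is where partition quality starts to affect performance.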
- PageRank: Calculates the importance of nodes within a graph.
- Community Detection: Identifies clusters of nodes that are more densely connected to each other than to other nodes.
- Graph Partitioning: Divides the graph into partitions for efficient distributed processing.
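The README does not say which community detection method is implemented, so the sketch below uses label propagation as one simple, well-known option, together with the modularity score that quantifies "more densely connected to each other than to other nodes". The deterministic visiting order and tie-break are simplifications made here for reproducibility (the usual formulation randomizes both, and the outcome can depend on those choices); all names and the toy graph are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def build_adjacency(edges: List[Edge]) -> Dict[str, Set[str]]:
    """Undirected adjacency sets from a list of edges."""
    adj: Dict[str, Set[str]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def label_propagation(adj: Dict[str, Set[str]], max_iter: int = 100) -> Dict[str, str]:
    """Each node repeatedly adopts the most common label among its neighbors."""
    labels = {node: node for node in adj}  # every node starts in its own community
    for _ in range(max_iter):
        changed = False
        for node in sorted(adj):  # fixed order instead of a random shuffle
            counts = Counter(labels[nbr] for nbr in adj[node])
            best = max(counts.values())
            # Deterministic tie-break: take the largest label among the leaders.
            new_label = max(lab for lab, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


def modularity(edges: List[Edge], labels: Dict[str, str]) -> float:
    """Newman modularity of a labeling (undirected, no duplicate edges)."""
    m = len(edges)
    internal: Counter = Counter()
    degree_sum: Counter = Counter()
    for u, v in edges:
        degree_sum[labels[u]] += 1
        degree_sum[labels[v]] += 1
        if labels[u] == labels[v]:
            internal[labels[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Hypothetical graph: two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
labels = label_propagation(build_adjacency(edges))
print("labels:", labels)
print("modularity:", round(modularity(edges, labels), 4))
```

A modularity clearly above zero indicates that the detected groups have more internal edges than a random graph with the same degrees would be expected to have.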
Performance results and benchmarks comparing single-node versus distributed processing can be found in the graphstorm/benchmarks/ directory. This directory includes:
- Performance Results: Detailed performance metrics for each algorithm.
- Scalability Tests: Analysis of how well the algorithms scale across multiple nodes.
- Local vs. Distributed: Comparisons between local execution and distributed execution.
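Comparisons of this kind usually come down to two numbers: speedup (single-node time divided by distributed time) and parallel efficiency (speedup divided by the number of workers). The sketch below shows a small timing helper and that arithmetic. It is not the project's benchmark code; the helper names are hypothetical, and the timings in the last lines are placeholders to show the calculation, not measured results.

```python
import time
from typing import Any, Callable, Tuple


def time_call(fn: Callable[..., Any], *args: Any, repeats: int = 3, **kwargs: Any) -> Tuple[float, Any]:
    """Run fn several times and return (best wall-clock seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


def speedup(t_single: float, t_distributed: float) -> float:
    """How many times faster the distributed run is than the single-node run."""
    if t_distributed <= 0:
        raise ValueError("t_distributed must be positive")
    return t_single / t_distributed


def efficiency(t_single: float, t_distributed: float, workers: int) -> float:
    """Speedup per worker; 1.0 would mean perfect linear scaling."""
    return speedup(t_single, t_distributed) / workers


def workload(n: int) -> int:
    """Stand-in for a graph algorithm so the timer has something to measure."""
    return sum(i * i for i in range(n))


elapsed, total = time_call(workload, 100_000)
print(f"best of 3 runs: {elapsed:.6f} s")

# Placeholder timings to show the arithmetic only; not measured results.
t_single, t_cluster, workers = 12.0, 4.0, 4
print("speedup:", speedup(t_single, t_cluster))
print("efficiency:", efficiency(t_single, t_cluster, workers))
```

Taking the best of several runs reduces noise from caching and scheduling, which is especially useful when the single-node and distributed timings are close.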
graphstorm/
├── README.md
├── LICENSE
├── requirements.txt
├── .gitignore
├── data/
│   ├── sample_graph_data/
│   └── README.md
├── src/
│   ├── graph_partitioning/
│   │   ├── partitioner.py
│   │   └── load_balancer.py
│   ├── algorithms/
│   │   ├── pagerank.py
│   │   └── community_detection.py
│   ├── distributed_processing/
│   │   ├── dask_cluster_setup.py
│   │   └── ray_cluster_setup.py
│   └── utils/
│       ├── graph_loader.py
│       └── graph_saver.py
├── benchmarks/
│   ├── performance_results.md
│   ├── local_vs_distributed.md
│   └── scalability_tests.md
├── docs/
│   ├── design_decisions.md
│   ├── optimization_strategies.md
│   ├── cluster_setup_guide.md
│   └── api_documentation.md
└── examples/
    ├── pagerank_example.py
    ├── community_detection_example.py
    └── graph_partitioning_example.py
For detailed design decisions, optimization strategies, and performance results, please refer to the graphstorm/docs/ directory. This directory includes:
- Design Decisions: Rationale behind architectural choices.
- Optimization Strategies: Techniques used to improve performance.
- Cluster Setup Guide: Instructions on configuring your cluster.
- API Documentation: Comprehensive API reference for all modules.
Contributions are welcome! Please fork the repository and submit a pull request with your changes. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License; see the graphstorm/LICENSE file for details.
For any inquiries, please contact Christopher Queen.