
Commit

Merge branch 'sc2021' of https://github.com/exanauts/ExaTron.jl into sc2021
kibaekkim committed Jun 5, 2021
2 parents 1b80bd1 + fbb4d7c commit b5742f4
Showing 16 changed files with 1,357 additions and 110 deletions.
625 changes: 592 additions & 33 deletions Manifest.toml

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions Project.toml
@@ -10,4 +10,5 @@ DelimitedFiles = "8bb1440f-4735-579b-a4ab-409b98df4dab"
Libdl = "8f399da3-3557-5675-b5ff-fb832c97cbdb"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
MPI = "da04e1cc-30fd-572f-bb4f-1f8673147195"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
174 changes: 142 additions & 32 deletions README.md
@@ -1,9 +1,9 @@
# ExaTron.jl

This is a TRON solver implementation in Julia.
The intention is to make it work on GPUs as well.
Currently, we translated the Fortran implementation of [TRON](https://www.mcs.anl.gov/~more/tron)
into Julia.
ExaTron.jl implements a trust-region Newton algorithm for bound constrained batch nonlinear
programming on GPUs.
Its algorithm is based on [Lin and More](https://epubs.siam.org/doi/10.1137/S1052623498345075)
and [TRON](https://www.mcs.anl.gov/~more/tron).

## Installation

@@ -12,34 +12,144 @@ This package can be installed by cloning this repository:
] add https://github.com/exanauts/ExaTron.jl
```

## Performance of ExaTron with ADMM on GPUs

With `@inbounds` attached to every array access and the use of instruction
parallelism instead of `for` loops, timings have been reduced significantly.
The most recent results are as follows:

| Data | # active branches | Objective | Primal feasibility | Dual feasibility | Time (secs) | rho_pq | rho_va |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| case1354pegase | 1991 | 7.400441e+04 | 1.186926e-05 | 9.799325e-03 | 24.78 | 10.0 | 1000.0 |
| case2869pegase | 4582 | 1.338728e+05 | 1.831719e-04 | 3.570605e-02 | 42.86 | 10.0 | 1000.0 |
| case9241pegase | 16049 | 3.139228e+05 | 2.526600e-03 | 8.328549e+00 | 98.88 | 50.0 | 5000.0 |
| case13659pegase | 20467 | 3.841941e+05 | 5.315441e-03 | 9.915973e+00 | 116.84 | 50.0 | 5000.0 |
| case19402_goc | 34704 | 1.950577e+06 | 3.210911e-03 | 4.706196e+00 | 239.45 | 500.0 | 5000.0 |

For better accuracy, angle variables with constraints `\theta_i - \theta_j = \atan2(wI_{ij}, wR_{ij})`
were added.
This leads to a more accurate solution because, whenever the network contains a cycle, these constraints
force the sum of the angle differences around the cycle to be zero.
With the new variables and constraints, the experimental results are shown below.
We note that the objective values have increased in most cases and are now closer to the values obtained
from Ipopt.

| Data | # active branches | Objective | Primal feasibility | Dual feasibility | Time (secs) | rho_pq | rho_va | # Iterations |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| case1354pegase | 1,991 | 7.406087e+04 | 3.188928e-05 | 1.200796e-02 | 20.28 | 10.0 | 1000.0 | 5,000 |
| case2869pegase | 4,582 | 1.339846e+05 | 2.123712e-04 | 2.228853e-01 | 35.74 | 10.0 | 1000.0 | 5,000 |
| case9241pegase | 16,049 | 3.158906e+05 | 6.464865e-03 | 5.607324e+00 | 139.41 | 50.0 | 5000.0 | 6,000 |
| case13659pegase | 20,467 | 3.861735e+05 | 5.794895e-03 | 8.512909e+00 | 187.97 | 50.0 | 5000.0 | 7,000 |
## How to run

We note that the following is for illustration purposes only.
If you want to run it on an HPC cluster, you may want to follow the instructions specific to that cluster's software.

### Using a single GPU

```bash
$ julia --project ./src/admm_standalone.jl ./data/casename pq_val va_val iterlim true
```
where `casename` is the filename of a power network, `pq_val` is the initial penalty value
for power variables, `va_val` is the initial penalty value for voltage variables, `iterlim` is the
maximum iteration limit, and the last argument (`true` or `false`) specifies whether to run on the GPU or the CPU.
Power network files are provided in the `data` directory.

The following table shows the parameter values to use for each case:

| casename | pq_val | va_val | iterlim |
| -------: | -----: | -----: | ------: |
| case2868rte | 10.0 | 1000.0 | 6,000 |
| case6515rte | 20.0 | 2000.0 | 15,000 |
| case9241pegase | 50.0 | 5000.0 | 35,000 |
| case13659pegase | 50.0 | 5000.0 | 45,000 |
| case19402_goc | 500.0 | 50000.0 | 30,000 |

For example, if you want to solve `case19402_goc` using a single GPU, you need to run
```bash
$ julia --project ./src/admm_standalone.jl ./data/case19402_goc 500 50000 30000 true
```

### Using multiple GPUs

To use `N` GPUs, launch `N` MPI processes that execute `launch_mpi.jl`:

```bash
$ mpirun -np N julia --project ./src/launch_mpi.jl ./data/casename pq_val va_val iterlim true
```

We assume that every MPI process can see all `N` GPUs; otherwise, an error is generated.
The parameter values are the same as in the single-GPU case, except that we use the following iteration
limits for each case; an example command follows the table.
If you check the logs, the total number of iterations is the same as in the single-GPU case.

| casename | iterlim |
| -------: | ------: |
| case2868rte | 5648 |
| case6515rte | 13651 |
| case9241pegase | 30927 |
| case13659pegase | 41126 |
| case19402_goc | 28358 |
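
For example, to solve `case19402_goc` with 6 GPUs, combine the penalty values from the single-GPU table with the iteration limit above (the choice of 6 GPUs here is only for illustration):
```bash
$ mpirun -np 6 julia --project ./src/launch_mpi.jl ./data/case19402_goc 500 50000 28358 true
```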

## Reproducing experiments

We describe how to reproduce the experiments in Section 6 of our manuscript.
For each figure or table, we provide a corresponding script that reproduces the results.
The following table shows the correspondence between each casename and its batch size.

| casename | batch size |
| -------: | ---------: |
| case2868rte | 3.8K |
| case6515rte | 9K |
| case9241pegase | 16K |
| case13659pegase | 20K |
| case19402_goc | 34K |

### Figure 5

```bash
$ ./figure5.sh
```

It will generate an `output_gpu1_casename.txt` file for each `casename`. Near the end of the file, you will see
the timing results: `Branch/iter = %.2f (millisecs)` is the relevant line.
For example, to obtain the timing results for `case19402_goc`, we read the following line near the end of
the file:
```bash
Branch/iter = 3.94 (millisecs)
```
Here `3.94` milliseconds is the input for the `34K` batch size in Figure 5.
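
One way to pull this line out of all single-GPU output files at once (a simple `grep` sketch over the `output_gpu1_*.txt` files produced by `figure5.sh`) is:
```bash
$ grep "Branch/iter" output_gpu1_*.txt
```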

### Figure 6

```bash
$ ./figure6.sh
```
It will generate an `output_gpu${j}_casename.txt` file for each `casename`, where `j` is the number of GPUs
used. Near the end of the file, you will see the timing results: `[0] (Br+MPI)/iter = %.2f (millisecs)` is the relevant line,
where `[0]` is the rank of the process (the root in this case).
For example, to obtain the timing results for `case19402_goc` with 6 GPUs, we read the following line near the end of the file
`output_gpu6_case19402_goc.txt`:
```bash
[0] (Br+MPI)/iter = 0.79 (millisecs)
```
The speedup is `3.94/0.79 = 4.98` in this case. In this way, you can reproduce Figure 6.
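
Similarly, a `grep` sketch over the multi-GPU output files (assuming the file names above) collects the root-rank timing for every GPU count; dividing the single-GPU `Branch/iter` value by each of these gives the speedups in Figure 6:
```bash
$ grep "\[0\] (Br+MPI)/iter" output_gpu*_case19402_goc.txt
```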

### Table 5

```bash
$ ./table5.sh branch_time_file
```
where `branch_time_file` is the file containing the branch computation time of each GPU.
It is generated by `figure6.sh`. For example, `figure6.sh` will generate the following files:
```bash
br_time_gpu6_case2868rte.txt
br_time_gpu6_case6515rte.txt
br_time_gpu6_case9241pegase.txt
br_time_gpu6_case13659pegase.txt
br_time_gpu6_case19402_goc.txt
```
The following command will give you the load imbalance statistics for `case13659pegase`:
```bash
$ ./table5.sh br_time_gpu6_case13659pegase.txt
```
Similarly, you can reproduce load imbalance statistics for other case files.
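
To collect the load imbalance statistics for all five cases at once (assuming the file names listed above), a small loop suffices:
```bash
$ for f in br_time_gpu6_*.txt; do echo "== $f"; ./table5.sh "$f"; done
```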

### Figure 7

```bash
$ ./figure7.sh branch_time_file
```

The usage is the same as for `table5.sh`; we use the same branch computation time file.
To reproduce Figure 7, use the file for `case13659pegase`:
```bash
$ ./figure7.sh br_time_gpu6_case13659pegase.txt
```
It will generate `br_time_gpu6_case13659pegase.pdf`. The file should be similar to Figure 7.

### Figure 8

```bash
$ ./figure8.sh
```

This script runs ExaTron.jl using 40 CPU cores. It generates output files named `output_cpu40_casename.txt`,
each containing the timing results for one case. For example, to read the timing results for `case19402_goc`,
we look for the following line near the end of the file:
```bash
[0] (Br+MPI)/iter = 30.03 (millisecs)
```
Here `30.03` will be the input for `case19402_goc` for CPUs in Figure 8. For 6 GPUs, we use the results from `figure6.sh`.
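
As a sketch (assuming the output file names produced by `figure8.sh` and `figure6.sh`), the 40-core CPU and 6-GPU per-iteration timings can be printed side by side for every case:
```bash
# For each case, print the root-rank (Br+MPI)/iter time for 40 CPU cores and 6 GPUs.
for c in case2868rte case6515rte case9241pegase case13659pegase case19402_goc; do
  cpu=$(grep "\[0\] (Br+MPI)/iter" output_cpu40_${c}.txt | tail -n 1 | awk '{print $4}')
  gpu=$(grep "\[0\] (Br+MPI)/iter" output_gpu6_${c}.txt | tail -n 1 | awk '{print $4}')
  echo "${c}: 40 CPU cores = ${cpu} ms, 6 GPUs = ${gpu} ms"
done
```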

## Citing this package

32 changes: 32 additions & 0 deletions figure5.sh
@@ -0,0 +1,32 @@
#!/bin/bash
#
# This script describes how to reproduce the results of Figure 5.
# This is just an example for illustration purposes. Different platforms
# (such as Summit cluster) may require different setups.
#
# For each run of admm_standalone.jl, it will generate iteration logs
# and timing results. The relevant timing results for Figure 5 are printed
# at the end of its run and will be the following:
#
# Branch/iter = %.2f (millisecs)
#
# The above timing results were used for Figure 5.
#
# Prerequisite:
# - CUDA library files should be accessible before executing this script,
# e.g., module load cuda/10.2.89.
# - CUDA aware MPI should be available.

export JULIA_CUDA_VERBOSE=1
export JULIA_MPI_BINARY="system"

DATA=("case2868rte" "case6515rte" "case9241pegase" "case13659pegase" "case19402_goc")
PQ=(10 20 50 50 500)
VA=(1000 2000 5000 5000 50000)
ITER=(6000 15000 35000 45000 30000)

for i in ${!DATA[@]}; do
echo "Solving ${DATA[$i]} . . ."
julia --project ./src/admm_standalone.jl "./data/${DATA[$i]}" ${PQ[$i]} ${VA[$i]} ${ITER[$i]} true > output_gpu1_${DATA[$i]}.txt 2>&1
done

36 changes: 36 additions & 0 deletions figure6.sh
@@ -0,0 +1,36 @@
#!/bin/bash
#
# This script describes how to reproduce the results of Figure 6.
# This is just an example for illustration purposes. Different platforms
# (such as Summit cluster) may require different setups.
#
# For each run of launch_mpi.jl, it will generate iteration logs
# and timing results. The relevant timing results for Figure 6 are printed
# at the end of its run and will be the following:
#
# (Br+MPI)/iter = %.2f (millisecs)
#
# We divide the above timing results by the timing results obtained when
# we use a single GPU.
#
# Prerequisite:
# - CUDA library files should be accessible before executing this script,
# e.g., module load cuda/10.2.89.
# - CUDA aware MPI should be available.

export JULIA_CUDA_VERBOSE=1
export JULIA_MPI_BINARY="system"

DATA=("case2868rte" "case6515rte" "case9241pegase" "case13659pegase" "case19402_goc")
PQ=(10 20 50 50 500)
VA=(1000 2000 5000 5000 50000)
ITER=(5648 13651 30927 41126 28358)
NGPU=(2 3 4 5 6)

for j in ${!NGPU[@]}; do
for i in ${!DATA[@]}; do
echo "Solving ${DATA[$i]} using ${NGPU[$j]} GPUs . . ."
mpirun -np ${NGPU[$j]} julia --project ./src/launch_mpi.jl "./data/${DATA[$i]}" ${PQ[$i]} ${VA[$i]} ${ITER[$i]} true > output_gpu${NGPU[$j]}_${DATA[$i]}.txt 2>&1
mv br_time_gpu.txt br_time_gpu${NGPU[$j]}_${DATA[$i]}.txt
done
done
21 changes: 21 additions & 0 deletions figure7.sh
@@ -0,0 +1,21 @@
#!/bin/bash
#
# This script describes how to reproduce the results of Figure 7.
# This is just an example for illustration purposes. Different platforms
# (such as Summit cluster) may require different setups.
#
# We need the br_time_gpu6_case13659pegase.txt file, which is generated when we
# run with 6 GPUs on the case13659pegase example. The file can be obtained by
# running figure6.sh.

function usage() {
echo "Usage: ./figure7.sh case"
echo " case: the case file containing branch computation time of each GPU"
}

if [[ $# != 1 ]]; then
usage
exit
fi

julia --project ./src/heatmap.jl $1
32 changes: 32 additions & 0 deletions figure8.sh
@@ -0,0 +1,32 @@
#!/bin/bash
#
# This script describes how to reproduce the results of Figure 8.
# This is just an example for illustration purposes. Different platforms
# (such as Summit cluster) may require different setups.
#
# For each run of launch_mpi.jl, it will generate iteration logs
# and timing results. The relevant timing results for Figure 8 are printed
# at the end of its run and will be the following:
#
# (Br+MPI)/iter = %.2f (millisecs)
#
# We use these numbers for the timings of 40 CPU cores and use the timings
# from Figure 6 for 6 GPUs.
#
# Prerequisite:
# - CUDA library files should be accessible before executing this script,
# e.g., module load cuda/10.2.89.
# - CUDA aware MPI should be available.

export JULIA_CUDA_VERBOSE=1
export JULIA_MPI_BINARY="system"

DATA=("case2868rte" "case6515rte" "case9241pegase" "case13659pegase" "case19402_goc")
PQ=(10 20 50 50 500)
VA=(1000 2000 5000 5000 50000)
ITER=(5718 13640 30932 41140 28358)

for i in ${!DATA[@]}; do
echo "Solving ${DATA[$i]} using 40 CPU cores . . ."
mpirun -np 40 julia --project ./src/launch_mpi.jl "./data/${DATA[$i]}" ${PQ[$i]} ${VA[$i]} ${ITER[$i]} false > output_cpu40_${DATA[$i]}.txt 2>&1
done
19 changes: 0 additions & 19 deletions launch_mpi.jl

This file was deleted.

