
Commit

Merge branch 'sc2021' of https://github.com/exanauts/ExaTron.jl into sc2021
kibaekkim committed Jun 5, 2021
2 parents 1b80bd1 + fbb4d7c commit b5742f4
Showing 16 changed files with 1,357 additions and 110 deletions.
625 changes: 592 additions & 33 deletions Manifest.toml

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions Project.toml
@@ -10,4 +10,5 @@ DelimitedFiles = "8bb1440f-4735-579b-a4ab-409b98df4dab"
Libdl = "8f399da3-3557-5675-b5ff-fb832c97cbdb"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
MPI = "da04e1cc-30fd-572f-bb4f-1f8673147195"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
174 changes: 142 additions & 32 deletions README.md
@@ -1,9 +1,9 @@
# ExaTron.jl

This is a TRON solver implementation in Julia.
The intention is to make it work on GPUs as well.
Currently, we translated the Fortran implementation of [TRON](https://www.mcs.anl.gov/~more/tron)
into Julia.
ExaTron.jl implements a trust-region Newton algorithm for bound constrained batch nonlinear
programming on GPUs.
Its algorithm is based on [Lin and More](https://epubs.siam.org/doi/10.1137/S1052623498345075)
and [TRON](https://www.mcs.anl.gov/~more/tron).

## Installation

@@ -12,34 +12,144 @@ This package can be installed by cloning this repository:
] add https://github.com/exanauts/ExaTron.jl
```

## Performance of ExaTron with ADMM on GPUs

With `@inbounds` attached to every array access and the use of instruction
parallelism instead of `for` loops, timings have been reduced significantly.
The most recent results are as follows:

| Data | # active branches | Objective | Primal feasibility | Dual feasibility | Time (secs) | rho_pq | rho_va |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| case1354pegase | 1991 | 7.400441e+04 | 1.186926e-05 | 9.799325e-03 | 24.78 | 10.0 | 1000.0 |
| case2869pegase | 4582 | 1.338728e+05 | 1.831719e-04 | 3.570605e-02 | 42.86 | 10.0 | 1000.0 |
| case9241pegase | 16049 | 3.139228e+05 | 2.526600e-03 | 8.328549e+00 | 98.88 | 50.0 | 5000.0 |
| case13659pegase | 20467 | 3.841941e+05 | 5.315441e-03 | 9.915973e+00 | 116.84 | 50.0 | 5000.0 |
| case19402_goc | 34704 | 1.950577e+06 | 3.210911e-03 | 4.706196e+00 | 239.45 | 500.0 | 5000.0 |

For better accuracy, angle variables with constraints `\theta_i - \theta_j = \atan2(wI_{ij}, wR_{ij})`
were added.
This leads to a more accurate solution because, whenever the network contains a cycle, these constraints
force the sum of the angle differences around the cycle to be zero.
With the new variables and constraints, the experimental results are shown below.
We note that the objective values have increased in most cases and are now closer to the values obtained
from Ipopt.

| Data | # active branches | Objective | Primal feasibility | Dual feasibility | Time (secs) | rho_pq | rho_va | # Iterations |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| case1354pegase | 1,991 | 7.406087e+04 | 3.188928e-05 | 1.200796e-02 | 20.28 | 10.0 | 1000.0 | 5,000 |
| case2869pegase | 4,582 | 1.339846e+05 | 2.123712e-04 | 2.228853e-01 | 35.74 | 10.0 | 1000.0 | 5,000 |
| case9241pegase | 16,049 | 3.158906e+05 | 6.464865e-03 | 5.607324e+00 | 139.41 | 50.0 | 5000.0 | 6,000 |
| case13659pegase | 20,467 | 3.861735e+05 | 5.794895e-03 | 8.512909e+00 | 187.97 | 50.0 | 5000.0 | 7,000 |
## How to run

We note that the following is for illustration purposes only.
If you want to run it on an HPC cluster, you may want to follow the instructions specific to that cluster's software.

### Using a single GPU

```bash
$ julia --project ./src/admm_standalone.jl ./data/casename pq_val va_val iterlim true
```
where `casename` is the filename of a power network, `pq_val` is the initial penalty value
for power variables, `va_val` is the initial penalty value for voltage variables, `iterlim` is the
maximum iteration limit, and the last argument (`true` or `false`) specifies whether to run on the GPU or the CPU.
Power network files are provided in the `data` directory.

The following table shows the parameter values to use for each case:

| casename | pq_val | va_val | iterlim |
| -------: | -----: | -----: | ------: |
| case2868rte | 10.0 | 1000.0 | 6,000 |
| case6515rte | 20.0 | 2000.0 | 15,000 |
| case9241pegase | 50.0 | 5000.0 | 35,000 |
| case13659pegase | 50.0 | 5000.0 | 45,000 |
| case19402_goc | 500.0 | 50000.0 | 30,000 |

For example, if you want to solve `case19402_goc` using a single GPU, you need to run
```bash
$ julia --project ./src/admm_standalone.jl ./data/case19402_goc 500 50000 30000 true
```

### Using multiple GPUs

To use `N` GPUs, launch `N` MPI processes that execute `launch_mpi.jl`:

```bash
$ mpirun -np N julia --project ./src/launch_mpi.jl ./data/casename pq_val va_val iterlim true
```

We assume that every MPI process can see all `N` GPUs; otherwise, an error is generated.
The parameter values are the same as in the single-GPU case, except that we use the following iteration
limits for each case; an example command follows the table.
If you check the logs, the total number of iterations is the same as in the single-GPU case.

| casename | iterlim |
| -------: | ------: |
| case2868rte | 5648 |
| case6515rte | 13651 |
| case9241pegase | 30927 |
| case13659pegase | 41126 |
| case19402_goc | 28358 |
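
For example, to solve `case19402_goc` with 6 GPUs, combine the penalty values from the single-GPU table with the iteration limit above (the choice of 6 GPUs here is only for illustration):
```bash
$ mpirun -np 6 julia --project ./src/launch_mpi.jl ./data/case19402_goc 500 50000 28358 true
```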

## Reproducing experiments

We describe how to reproduce the experiments in Section 6 of our manuscript.
For each figure or table, we provide a corresponding script that reproduces the results.
The following table shows the correspondence between each casename and its batch size.

| casename | batch size |
| -------: | ---------: |
| case2868rte | 3.8K |
| case6515rte | 9K |
| case9241pegase | 16K |
| case13659pegase | 20K |
| case19402_goc | 34K |

### Figure 5

```bash
$ ./figure5.sh
```

It will generate an `output_gpu1_casename.txt` file for each `casename`. Near the end of the file, you will see
the timing results: `Branch/iter = %.2f (millisecs)` is the relevant line.
For example, to obtain the timing results for `case19402_goc`, we read the following line near the end of
the file:
```bash
Branch/iter = 3.94 (millisecs)
```
Here `3.94` milliseconds is the input for the `34K` batch size in Figure 5.
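
One way to pull this line out of all single-GPU output files at once (a simple `grep` sketch over the `output_gpu1_*.txt` files produced by `figure5.sh`) is:
```bash
$ grep "Branch/iter" output_gpu1_*.txt
```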

### Figure 6

```bash
$ ./figure6.sh
```
It will generate an `output_gpu${j}_casename.txt` file for each `casename`, where `j` is the number of GPUs
used. Near the end of the file, you will see the timing results: `[0] (Br+MPI)/iter = %.2f (millisecs)` is the relevant line,
where `[0]` is the rank of the process (the root in this case).
For example, to obtain the timing results for `case19402_goc` with 6 GPUs, we read the following line near the end of the file
`output_gpu6_case19402_goc.txt`:
```bash
[0] (Br+MPI)/iter = 0.79 (millisecs)
```
The speedup is `3.94/0.79 = 4.98` in this case. In this way, you can reproduce Figure 6.
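
Similarly, a `grep` sketch over the multi-GPU output files (assuming the file names above) collects the root-rank timing for every GPU count; dividing the single-GPU `Branch/iter` value by each of these gives the speedups in Figure 6:
```bash
$ grep "\[0\] (Br+MPI)/iter" output_gpu*_case19402_goc.txt
```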

### Table 5

```bash
$ ./table5.sh branch_time_file
```
where `branch_time_file` is the file containing the branch computation time of each GPU.
It is generated by `figure6.sh`. For example, `figure6.sh` will generate the following files:
```bash
br_time_gpu6_case2868rte.txt
br_time_gpu6_case6515rte.txt
br_time_gpu6_case9241pegase.txt
br_time_gpu6_case13659pegase.txt
br_time_gpu6_case19402_goc.txt
```
The following command will give you the load imbalance statistics for `case13659pegase`:
```bash
$ ./table5.sh br_time_gpu6_case13659pegase.txt
```
Similarly, you can reproduce load imbalance statistics for other case files.
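
To collect the load imbalance statistics for all five cases at once (assuming the file names listed above), a small loop suffices:
```bash
$ for f in br_time_gpu6_*.txt; do echo "== $f"; ./table5.sh "$f"; done
```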

### Figure 7

```bash
$ ./figure7.sh branch_time_file
```

The usage is the same as for `table5.sh`; we use the same branch computation time file.
To reproduce Figure 7, use the file for `case13659pegase`:
```bash
$ ./figure7.sh br_time_gpu6_case13659pegase.txt
```
It will generate `br_time_gpu6_case13659pegase.pdf`. The file should be similar to Figure 7.

### Figure 8

```bash
$ ./figure8.sh
```

This script runs ExaTron.jl using 40 CPU cores. It generates output files named `output_cpu40_casename.txt`,
each containing the timing results for one case. For example, to read the timing results for `case19402_goc`,
we look for the following line near the end of the file:
```bash
[0] (Br+MPI)/iter = 30.03 (millisecs)
```
Here `30.03` will be the input for `case19402_goc` for CPUs in Figure 8. For 6 GPUs, we use the results from `figure6.sh`.
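
As a sketch (assuming the output file names produced by `figure8.sh` and `figure6.sh`), the 40-core CPU and 6-GPU per-iteration timings can be printed side by side for every case:
```bash
# For each case, print the root-rank (Br+MPI)/iter time for 40 CPU cores and 6 GPUs.
for c in case2868rte case6515rte case9241pegase case13659pegase case19402_goc; do
  cpu=$(grep "\[0\] (Br+MPI)/iter" output_cpu40_${c}.txt | tail -n 1 | awk '{print $4}')
  gpu=$(grep "\[0\] (Br+MPI)/iter" output_gpu6_${c}.txt | tail -n 1 | awk '{print $4}')
  echo "${c}: 40 CPU cores = ${cpu} ms, 6 GPUs = ${gpu} ms"
done
```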

## Citing this package

32 changes: 32 additions & 0 deletions figure5.sh
@@ -0,0 +1,32 @@
#!/bin/bash
#
# This script describes how to reproduce the results of Figure 5.
# This is just an example for illustration purposes. Different platforms
# (such as Summit cluster) may require different setups.
#
# For each run of admm_standalone.jl, it will generate iteration logs
# and timing results. The relevant timing results for Figure 5 are printed
# at the end of its run and will be the following:
#
# Branch/iter = %.2f (millisecs)
#
# The above timing results were used for Figure 5.
#
# Prerequisite:
# - CUDA library files should be accessible before executing this script,
# e.g., module load cuda/10.2.89.
# - CUDA aware MPI should be available.

export JULIA_CUDA_VERBOSE=1
export JULIA_MPI_BINARY="system"

DATA=("case2868rte" "case6515rte" "case9241pegase" "case13659pegase" "case19402_goc")
PQ=(10 20 50 50 500)
VA=(1000 2000 5000 5000 50000)
ITER=(6000 15000 35000 45000 30000)

for i in ${!DATA[@]}; do
echo "Solving ${DATA[$i]} . . ."
julia --project ./src/admm_standalone.jl "./data/${DATA[$i]}" ${PQ[$i]} ${VA[$i]} ${ITER[$i]} true > output_gpu1_${DATA[$i]}.txt 2>&1
done

36 changes: 36 additions & 0 deletions figure6.sh
@@ -0,0 +1,36 @@
#!/bin/bash
#
# This script describes how to reproduce the results of Figure 6.
# This is just an example for illustration purposes. Different platforms
# (such as Summit cluster) may require different setups.
#
# For each run of launch_mpi.jl, it will generate iteration logs
# and timing results. The relevant timing results for Figure 6 are printed
# at the end of its run and will be the following:
#
# (Br+MPI)/iter = %.2f (millisecs)
#
# We divide the above timing results by the timing results obtained when
# we use a single GPU.
#
# Prerequisite:
# - CUDA library files should be accessible before executing this script,
# e.g., module load cuda/10.2.89.
# - CUDA aware MPI should be available.

export JULIA_CUDA_VERBOSE=1
export JULIA_MPI_BINARY="system"

DATA=("case2868rte" "case6515rte" "case9241pegase" "case13659pegase" "case19402_goc")
PQ=(10 20 50 50 500)
VA=(1000 2000 5000 5000 50000)
ITER=(5648 13651 30927 41126 28358)
NGPU=(2 3 4 5 6)

for j in ${!NGPU[@]}; do
for i in ${!DATA[@]}; do
echo "Solving ${DATA[$i]} using ${NGPU[$j]} GPUs . . ."
mpirun -np ${NGPU[$j]} julia --project ./src/launch_mpi.jl "./data/${DATA[$i]}" ${PQ[$i]} ${VA[$i]} ${ITER[$i]} true > output_gpu${NGPU[$j]}_${DATA[$i]}.txt 2>&1
mv br_time_gpu.txt br_time_gpu${NGPU[$j]}_${DATA[$i]}.txt
done
done
21 changes: 21 additions & 0 deletions figure7.sh
@@ -0,0 +1,21 @@
#!/bin/bash
#
# This script describes how to reproduce the results of Figure 7.
# This is just an example for illustration purposes. Different platforms
# (such as Summit cluster) may require different setups.
#
# We need the br_time_gpu6_case13659pegase.txt file, which is generated when we
# run with 6 GPUs on the case13659pegase example. The file can be obtained by
# running figure6.sh.

function usage() {
echo "Usage: ./figure7.sh case"
echo " case: the case file containing branch computation time of each GPU"
}

if [[ $# != 1 ]]; then
usage
exit
fi

julia --project ./src/heatmap.jl $1
32 changes: 32 additions & 0 deletions figure8.sh
@@ -0,0 +1,32 @@
#!/bin/bash
#
# This script describes how to reproduce the results of Figure 8.
# This is just an example for illustration purposes. Different platforms
# (such as Summit cluster) may require different setups.
#
# For each run of launch_mpi.jl, it will generate iteration logs
# and timing results. The relevant timing results for Figure 8 are printed
# at the end of its run and will be the following:
#
# (Br+MPI)/iter = %.2f (millisecs)
#
# We use these numbers for the timings of 40 CPU cores and use the timings
# from Figure 6 for 6 GPUs.
#
# Prerequisite:
# - CUDA library files should be accessible before executing this script,
# e.g., module load cuda/10.2.89.
# - CUDA aware MPI should be available.

export JULIA_CUDA_VERBOSE=1
export JULIA_MPI_BINARY="system"

DATA=("case2868rte" "case6515rte" "case9241pegase" "case13659pegase" "case19402_goc")
PQ=(10 20 50 50 500)
VA=(1000 2000 5000 5000 50000)
ITER=(5718 13640 30932 41140 28358)

for i in ${!DATA[@]}; do
echo "Solving ${DATA[$i]} using 40 CPU cores . . ."
mpirun -np 40 julia --project ./src/launch_mpi.jl "./data/${DATA[$i]}" ${PQ[$i]} ${VA[$i]} ${ITER[$i]} false > output_cpu40_${DATA[$i]}.txt 2>&1
done
19 changes: 0 additions & 19 deletions launch_mpi.jl

This file was deleted.

