Introduction

Ninja (Noise INJection Agent) is a tool for reproducing subtle and unintended message races

Quick start

1. Building NINJA

From Spack

$ git clone https://github.com/LLNL/spack
$ ./spack/bin/spack install pruners-ninja

From git repository

$ git clone [email protected]:PRUNERS/NINJA.git
$ cd NINJA
$ ./autogen.sh
$ configure --prefix=<path to installation directory>
$ make
$ make install

From tarball

$ tar zxvf ./ninja_xxxxx.tar.bz
$ cd <NINJA directory>
$ configure --prefix=<path to installation directory>
$ make
$ make install

2. Run examples

$ cd test

Running without NINJA

This example code (ninja_test_matching_race) is a synthetic benchmark embracing a message-race bug.

$ ./ninja_test_matching_race
Usage: ./matching_race <type: 0=SR, 1=SSR> <matching safe: 0=unsafe 1=safe> <# of loops> <# of patterns per loop> <interval(usec)>

<type: 0=SR, 1=SSR>:
- 0: TEST CASE 0 (Same as Case 1 in Reference)
- 1: TEST CASE 1 (Same as Case 2 in Reference)
<matching safe: 0=unsafe 1=safe>
- 0: Run under tag-unsafe condition (i.e., enable message races)
- 1: Run under tag-safe condition (i.e., disable message races)
<# of loops>: the number of iterations
<# of patterns per loop>: the number of communication routines per iteration
<interval(usec)>: interval time (usec) between communication routines

The manifestation of this bug is non-deterministic. Even when enabling message races (i.e. <matching safe>=0), this message-race bug may not manifest and run all the way to loop 1000. (NOTE: this bug is non-deterministic. Therefore, you will not see the same manifestations as well as outputs)

$ srun -n (OR mpirun -np) 16 ./ninja_test_matching_race 1 0 1000 2 0
***************************
Matching Type     : 1
Is Matching safe ?: 0
# of Loops        : 1000
# of Patterns     : 2
Interval(usec)    : 0
***************************
NIN(test):  0: loop: 1 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 2 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 3 (ninja_test_matching_race.c:225)
 ...
NIN(test):  0: loop: 1000 (ninja_test_matching_race.c:225)
NIN(test):  0: Time: 0.054226 (ninja_test_matching_race.c:314)

Running with NINJA (System-centric mode)

If the bug does not manifest, NINJA's system-centric mode NIN_MODE=0 may be able to manifest the bug.

$ LD_PRELOAD=<path to installation directory>/lib/libninja.so NIN_MODE=0 srun -n (OR mpirun -np) 16 ./ninja_test_matching_race 1 0 1000 2 0 
===========================================
 NIN_LOCAL_NOISE: 0
 NIN_PATTERN: 2
 NIN_MODE: 0
 NIN_DIR: ./.ninja
===========================================
***************************
Matching Type     : 1
Is Matching safe ?: 0
# of Loops        : 1000
# of Patterns     : 2
Interval(usec)    : 0
***************************
NIN(test):  0: loop: 1 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 2 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 3 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 4 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 5 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 6 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 7 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 8 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 9 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 10 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 11 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 12 (ninja_test_matching_race.c:225)
NIN(test): 10: Wrong matching detected: recv_val: 23, send_val: 22 at loop 11 (ninja_test_matching_race.c:247)
application called MPI_Abort(, 0) - process 10

NINJA's system-centric mode simply emulates noisy enviroments to induce message races. Therefore, if two unsafe communication routines are significantly separated. NINJA's system-centric mode may not be able to manifest message-race bugs. Let's increase the interval time between unsafe communication routines from 0 to 1000 usec.

$ LD_PRELOAD=<path to installation directory>/lib/libninja.so NIN_MODE=0 srun -n (OR mpirun -np) 16 ./ninja_test_matching_race 1 0 1000 2 1000
***************************
Matching Type     : 1
Is Matching safe ?: 0
# of Loops        : 1000
# of Patterns     : 2
Interval(usec)    : 1000
***************************
NIN(test):  0: loop: 1 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 2 (ninja_test_matching_race.c:225)
NIN(test):  0: loop: 3 (ninja_test_matching_race.c:225)
 ...
NIN(test):  0: loop: 1000 (ninja_test_matching_race.c:225)
NIN(test):  0: Time: 2.433113 (ninja_test_matching_race.c:314)
NIN:0: Learning file written to ./.ninja directory (ninj_fc.cpp:566)

During NINJA's system-centric mode, NINJA profiles intervals of each unsafe communication routine. At the end of the execution (on MPI_Finalize()) under NINJA's system-centric mode, NINJA outputs the profile for NINJA's application-centric mode.

Running with NINJA (Application-centric mode)

NINJA's application-centric mode (NIN_MODE=1) reads this profile and then injects an adequate amount of noise in order to manifest message-race bugs.

$ LD_PRELOAD=<path to installation directory>/lib/libninja.so NIN_MODE=1 srun -n (OR mpirun -np) 16 ./ninja_test_matching_race 1 0 1000 2 1000
===========================================
 NIN_LOCAL_NOISE: 0
 NIN_PATTERN: 2
 NIN_MODE: 1
 NIN_DIR: ./.ninja
===========================================
***************************
Matching Type     : 1
Is Matching safe ?: 0
# of Loops        : 1000
# of Patterns     : 2
Interval(usec)    : 1000
***************************
NIN(test):  0: loop: 1 (ninja_test_matching_race.c:225)
NIN(test): 13: Wrong matching detected: recv_val: 1, send_val: 0 at loop 0 (ninja_test_matching_race.c:247)
application called MPI_Abort(, 0) - process 13

Environment variables

NIN_PATTERN:
- 0: Network noise free
- 1: Random network noise
  - NIN_RAND_RATIO: NIN_RAND_RATIO % of MPI sends are delayed
  - NIN_RAND_DELAY: Selected messages are delayed by usleep(NIN_RAND_DELAY)
- 2 (default): Smart network noise injection
  - NIN_MODE
    - 0 (default): System-centric mode
    - 1: Application-centric mode
  - NIN_DIR: Directory for send pattern learning files
    - Default is NIN_DIR=./.ninja
NIN_LOCAL_NOISE:
- 0 (default): Local noise free
- 1: Constant local noise
  - NIN_LOCAL_NOISE_AMOUNT: Run CPU intensive work for <NIN_LOCAL_NOISE_AMOUN> usec after send/recv/matching function

Reference

Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz and Christopher M. Chambreau, “Noise Injection Techniques for Reproducing Subtle and Unintended Message Races”, Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP17), Austin, USA, Feb, 2017.

Name		Name	Last commit message	Last commit date
Latest commit History 74 Commits
files		files
m4		m4
src		src
test		test
AUTHORS		AUTHORS
ChangeLog		ChangeLog
LICENSE.TXT		LICENSE.TXT
Makefile.am		Makefile.am
NEWS		NEWS
README		README
README.md		README.md
autoclean.sh		autoclean.sh
autogen.sh		autogen.sh
configure.ac		configure.ac

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

Quick start

1. Building NINJA

From Spack

From git repository

From tarball

2. Run examples

Running without NINJA

Running with NINJA (System-centric mode)

Running with NINJA (Application-centric mode)

Environment variables

Reference

About

Releases 3

Packages

Contributors 4

Languages

License

PRUNERS/NINJA

Folders and files

Latest commit

History

Repository files navigation

Introduction

Quick start

1. Building NINJA

From Spack

From git repository

From tarball

2. Run examples

Running without NINJA

Running with NINJA (System-centric mode)

Running with NINJA (Application-centric mode)

Environment variables

Reference

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 3

Packages 0

Contributors 4

Languages

Packages