Release 1.0.0 (#23)
* Simplified the in-place native methods' signatures
* Added a check whether JNI "get critical" methods returned a copy
* Added a benchmark JAR task
* Simplified the native methods' signatures
* Removed F64FlatArray.transpose
* Introduced unroll mechanics for F64Array
* Serialization uses isFlattenable instead of isDense
* Changed native signature of logAddExp to dst-src syntax
* Simplified logRescale signature
* Simplified cumSum signature
* logAddExp now deals with positive infinities and NaNs
* Corrected guessShape implementation
* Correct JNI copy processing
* Changed Viewer safety mode to publication
* Added extensive Markdown documentation
* Reorganized benchmarks
* Removed weightedSum / Mean, since they were never used
* Fixed argMin / argMax logic
* Added benchmarking data and description thereof to documentation
* Fixed Travis builds
* Bumped version to 1.0.0
dievsky authored Nov 19, 2019
2 parents 7a24c87 + 313ba9e commit 1efb244
Showing 79 changed files with 2,611 additions and 1,532 deletions.
2 changes: 2 additions & 0 deletions .travis.yml
@@ -24,3 +24,5 @@ cache:
directories:
- $HOME/.gradle/caches/
- $HOME/.gradle/wrapper/

dist: trusty
6 changes: 6 additions & 0 deletions README.md
@@ -85,3 +85,9 @@ Publishing
Publishing to [Bintray][bintray] is currently done via a dedicated
build configuration of an internal TeamCity server. This allows us
to deploy a cross-platform version.
Documentation
----
Visit [viktor Documentation](./docs/docs.md) for an extensive feature overview,
instructive code examples and benchmarking data.
120 changes: 120 additions & 0 deletions docs/benchmark.md
@@ -0,0 +1,120 @@
# viktor Benchmarks

We designed a series of microbenchmarks to compare `viktor`'s
native SIMD optimization efficiency to that of a simple Kotlin/Java loop.

The microbenchmark code can be found in the `src/jmh` folder. To run the benchmarks, run
the following commands from `viktor`'s root folder:
```bash
$ ./gradlew clean assemble benchmarkJar
$ java -jar ./build/libs/viktor-benchmark.jar
```
You can add the usual [JMH](https://openjdk.java.net/projects/code-tools/jmh/)
command line arguments to the latter command, e.g. `-o` to specify
the output file or `-t` to control the number of threads.

## Benchmark Environment

We ran the benchmarks on two machines:
a laptop with an Intel Core i7-6820HQ CPU at 2.70 GHz
and a server with an Intel Xeon E5-2690 CPU at 2.90 GHz.
The following table summarizes the main features of both:

machine | laptop | server
--------|--------|-------
CPU | Intel Core i7-6820HQ | Intel Xeon E5-2690
frequency | 2.70 GHz | 2.90 GHz
cores[^cores] | 8 | 32
architecture | `amd64` | `amd64`
highest SIMD extension[^simd] | `AVX` | `SSE2`
OS | Ubuntu 18.04.3 LTS | CentOS Linux 7 (Core)

[^cores]: The number of cores shouldn't matter since all benchmarks ran in a single thread.

[^simd]: The most advanced extension that was used by `viktor`. In reality, the laptop
had `AVX2` and the server had `SSE4.2` as the highest extension.

## Benchmark Results

The following data is provided for informational purposes only.
In each benchmark, we considered arrays of size `1000`, `100_000` and `10_000_000`.
We report two metrics:
* `Array ops/s` is the number of times the operation was performed on the entire array
per second. Shown on a logarithmic scale.
* `FLOPS` is the number of equivalent scalar operations performed per second. It is equal
to `Array ops/s` multiplied by `Array size`. Shown on a linear scale.
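
The relationship between the two metrics can be sketched in a few lines of Java. This is only an illustration of the arithmetic described above; the class and method names are hypothetical and are not part of `viktor` or its benchmark suite.

```java
// Hypothetical helper, not part of viktor: converts the "Array ops/s"
// metric into the "FLOPS" metric as defined in the list above.
final class ThroughputMetrics {
    private ThroughputMetrics() {}

    /** Equivalent scalar operations per second: array ops/s times array size. */
    static double flops(double arrayOpsPerSecond, long arraySize) {
        return arrayOpsPerSecond * arraySize;
    }
}
```

For instance, an operation that (hypothetically) completes 2,500 times per second on an array of size `1000` corresponds to `ThroughputMetrics.flops(2500.0, 1000L)`, i.e. 2.5 million equivalent scalar operations per second.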

### Math Benchmarks

The task here was to calculate the exponent (or, respectively, the logarithm, `expm1`
or `log1p`) of every element of a double array. It was done either with a simple loop
or with a dedicated `viktor` array method (e.g. `expInPlace`). We also evaluated the
efficiency of
[`FastMath`](https://commons.apache.org/proper/commons-math/javadocs/api-3.3/org/apache/commons/math3/util/FastMath.html)
methods compared to Java's built-in `Math`.
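
For orientation, the "simple loop" baseline might look roughly like the sketch below. It is an assumption about the general shape of such a loop, not a copy of the benchmark code in `src/jmh`; the class name is hypothetical, and the method name merely mirrors `expInPlace`.

```java
// Hypothetical scalar baseline: applies Math.exp to every element in place.
// The actual benchmark code lives in src/jmh and may differ.
final class ScalarExpBaseline {
    private ScalarExpBaseline() {}

    static void expInPlace(double[] values) {
        for (int i = 0; i < values.length; i++) {
            values[i] = Math.exp(values[i]);
        }
    }
}
```

A `FastMath` variant of this loop would differ only in the function called inside the loop body.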

We also measured the performance of the `logAddExp` method for adding
two logarithmically stored arrays. It was compared with the scalar `logAddExp`
defined in `viktor`.
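
To make the operation concrete: for values stored as logarithms `a = log(x)` and `b = log(y)`, `logAddExp(a, b)` returns `log(x + y)` without leaving log space. The sketch below shows one common numerically stable formulation; it is not `viktor`'s implementation, and the class name and the handling of special values are assumptions made for illustration.

```java
// Hypothetical scalar sketch of log-space addition; not viktor's code.
final class LogSpace {
    private LogSpace() {}

    /** Returns log(exp(a) + exp(b)), shifting by the larger argument for stability. */
    static double logAddExp(double a, double b) {
        if (Double.isNaN(a) || Double.isNaN(b)) {
            return Double.NaN;
        }
        double max = Math.max(a, b);
        double min = Math.min(a, b);
        if (Double.isInfinite(max)) {
            // Both arguments are -inf (result -inf), or at least one is +inf (result +inf).
            return max;
        }
        return max + Math.log1p(Math.exp(min - max));
    }

    /** Element-wise variant: dst[i] becomes logAddExp(dst[i], src[i]). */
    static void logAddExp(double[] dst, double[] src) {
        if (dst.length != src.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        for (int i = 0; i < dst.length; i++) {
            dst[i] = logAddExp(dst[i], src[i]);
        }
    }
}
```

Subtracting the larger argument before exponentiating keeps `Math.exp` from overflowing, which is the reason this form is usually preferred over the naive `Math.log(Math.exp(a) + Math.exp(b))`.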

#### Laptop

![exp laptop array ops/s](./figures/ExpBenchmark_arrayopss_workstation.png)
![exp laptop flops](./figures/ExpBenchmark_flops_workstation.png)
![expm1 laptop array ops/s](./figures/Expm1Benchmark_arrayopss_workstation.png)
![expm1 laptop flops](./figures/Expm1Benchmark_flops_workstation.png)
![log laptop array ops/s](./figures/LogBenchmark_arrayopss_workstation.png)
![log laptop flops](./figures/LogBenchmark_flops_workstation.png)
![log1p laptop array ops/s](./figures/Log1pBenchmark_arrayopss_workstation.png)
![log1p laptop flops](./figures/Log1pBenchmark_flops_workstation.png)
![logAddExp laptop array ops/s](./figures/LogAddExpBenchmark_arrayopss_workstation.png)
![logAddExp laptop flops](./figures/LogAddExpBenchmark_flops_workstation.png)

#### Server

![exp server array ops/s](./figures/ExpBenchmark_arrayopss_server.png)
![exp server flops](./figures/ExpBenchmark_flops_server.png)
![expm1 server array ops/s](./figures/Expm1Benchmark_arrayopss_server.png)
![expm1 server flops](./figures/Expm1Benchmark_flops_server.png)
![log server array ops/s](./figures/LogBenchmark_arrayopss_server.png)
![log server flops](./figures/LogBenchmark_flops_server.png)
![log1p server array ops/s](./figures/Log1pBenchmark_arrayopss_server.png)
![log1p server flops](./figures/Log1pBenchmark_flops_server.png)
![logAddExp server array ops/s](./figures/LogAddExpBenchmark_arrayopss_server.png)
![logAddExp server flops](./figures/LogAddExpBenchmark_flops_server.png)

### Statistics Benchmarks

We tested the `sum()`, `sd()` and `logSumExp()` methods here. We also measured
the throughput of a dot product of two arrays (`dot()` method). All these benchmarks
(except for `logSumExp`) don't have a loop-based `FastMath` version since they only
use arithmetic operations in the loop.
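
As a rough picture of what the scalar baselines compute, here is a hedged Java sketch of `sum`, `dot` and `logSumExp` written as plain loops. The class name is hypothetical and the code is not taken from the benchmark suite; `sd` is left out because the text above does not say which variance estimator it uses.

```java
// Hypothetical scalar baselines for the statistics benchmarks; not viktor's code.
final class ScalarStats {
    private ScalarStats() {}

    static double sum(double[] values) {
        double acc = 0.0;
        for (double v : values) {
            acc += v;
        }
        return acc;
    }

    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        double acc = 0.0;
        for (int i = 0; i < a.length; i++) {
            acc += a[i] * b[i];
        }
        return acc;
    }

    /** Returns log(sum(exp(values[i]))), shifting by the maximum for stability. */
    static double logSumExp(double[] values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            max = Math.max(max, v);
        }
        if (Double.isInfinite(max)) {
            // Empty input or all -inf gives -inf; any +inf gives +inf.
            return max;
        }
        double acc = 0.0;
        for (double v : values) {
            acc += Math.exp(v - max);
        }
        return max + Math.log(acc);
    }
}
```

Only `logSumExp` calls a transcendental function inside the loop, which is consistent with it being the one benchmark here that has a `FastMath` counterpart.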

#### Laptop

![sum laptop array ops/s](./figures/SumBenchmark_arrayopss_workstation.png)
![sum laptop flops](./figures/SumBenchmark_flops_workstation.png)
![sd laptop array ops/s](./figures/SDBenchmark_arrayopss_workstation.png)
![sd laptop flops](./figures/SDBenchmark_flops_workstation.png)
![logSumExp laptop array ops/s](./figures/LogSumExpBenchmark_arrayopss_workstation.png)
![logSumExp laptop flops](./figures/LogSumExpBenchmark_flops_workstation.png)
![dot laptop array ops/s](./figures/DotBenchmark_arrayopss_workstation.png)
![dot laptop flops](./figures/DotBenchmark_flops_workstation.png)

#### Server

![sum server array ops/s](./figures/SumBenchmark_arrayopss_server.png)
![sum server flops](./figures/SumBenchmark_flops_server.png)
![sd server array ops/s](./figures/SDBenchmark_arrayopss_server.png)
![sd server flops](./figures/SDBenchmark_flops_server.png)
![logSumExp server array ops/s](./figures/LogSumExpBenchmark_arrayopss_server.png)
![logSumExp server flops](./figures/LogSumExpBenchmark_flops_server.png)
![dot server array ops/s](./figures/DotBenchmark_arrayopss_server.png)
![dot server flops](./figures/DotBenchmark_flops_server.png)

## Cautious Conclusions

`viktor` seems to perform up to three times better than the
regular scalar computation approach. The only exception seems to be
`FastMath.exp()`, which is on par with `viktor`'s `exp()` method on `AVX`
(and faster than it on `SSE2`).

1 comment on commit 1efb244

@dievsky

I'll probably never know why GitHub decided to do a merge commit even though I clicked "Squash and Rebase".
