Commit ed1945c

sslotin committed Dec 11, 2022
2 parents a7ade20 + 7c29e17

Showing 10 changed files with 16 additions and 13 deletions.

content/english/hpc/compilation/flags.md (1 addition, 1 deletion)

@@ -12,7 +12,7 @@ There are 4 *and a half* main levels of optimization for speed in GCC:

- `-O0` is the default one that does no optimizations (although, in a sense, it does optimize: for compilation time).
- `-O1` (also aliased as `-O`) does a few "low-hanging fruit" optimizations, almost not affecting the compilation time.
-- `-O2` enables all optimizations that are known to have little to no negative side effects and take reasonable time to complete (this is what most projects use for production builds).
+- `-O2` enables all optimizations that are known to have little to no negative side effects and take a reasonable time to complete (this is what most projects use for production builds).
- `-O3` does very aggressive optimization, enabling almost all *correct* optimizations implemented in GCC.
- `-Ofast` does everything in `-O3`, plus a few more optimization flags that may break strict standard compliance, but not in a way that would be critical for most applications (e.g., floating-point operations may be rearranged so that the result is off by a few bits in the mantissa).
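
For reference, this is how these levels are passed to the compiler — a minimal illustration; the file names are placeholders:

```
g++ -O0 source.cc -o binary    # the default: fastest compilation, no optimization
g++ -O2 source.cc -o binary    # the usual choice for production builds
g++ -O3 source.cc -o binary    # aggressive optimization
g++ -Ofast source.cc -o binary # -O3 plus non-strict-compliance optimizations
```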


content/english/hpc/compilation/situational.md (1 addition, 1 deletion)

@@ -96,7 +96,7 @@ The whole process is automated by modern compilers. For example, the `-fprofile-
g++ -fprofile-generate [other flags] source.cc -o binary
```

-After we run the program — preferably on input that is as representative of real use case as possible — it will create a bunch of `*.gcda` files that contain log data for the test run, after which we can rebuild the program, but now adding the `-fprofile-use` flag:
+After we run the program — preferably on input that is as representative of the real use case as possible — it will create a bunch of `*.gcda` files that contain log data for the test run, after which we can rebuild the program, but now adding the `-fprofile-use` flag:

```
g++ -fprofile-use [other flags] source.cc -o binary
```
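
Putting the whole profile-guided cycle together (a sketch; the input file is a placeholder for any representative workload):

```
g++ -fprofile-generate [other flags] source.cc -o binary
./binary < representative_input.txt    # writes the *.gcda profile data
g++ -fprofile-use [other flags] source.cc -o binary
```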

content/english/hpc/data-structures/binary-search.md (2 additions, 1 deletion)

@@ -1,6 +1,7 @@
---
title: Binary Search
weight: 1
+published: true
---

<!-- mention interpolation search and radix trees? -->
@@ -184,7 +185,7 @@ int lower_bound(int x) {
Note that this loop is not always equivalent to the standard binary search. Since it always rounds *up* the size of the search interval, it accesses slightly different elements and may perform one comparison more than needed. Apart from simplifying computations on each iteration, it also makes the number of iterations constant if the array size is constant, removing branch mispredictions completely.

-As typical for predication, this trick is very fragile to compiler optimizations — depending on the compiler and how the funciton is invoked, it may still leave a branch or generate suboptimal code. It works fine on Clang 10, yielding a 2.5-3x improvement on small arrays:
+As typical for predication, this trick is very fragile to compiler optimizations — depending on the compiler and how the function is invoked, it may still leave a branch or generate suboptimal code. It works fine on Clang 10, yielding a 2.5-3x improvement on small arrays:

<!-- todo: update numbers -->
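
For reference, here is a sketch of the branch-free loop this passage describes — assuming, as elsewhere in the article, a global sorted array `t` of size `n`; the multiplication by the result of the comparison is what the compiler should turn into a conditional move:

```c++
const int n = (1 << 20);
int t[n]; // sorted

int lower_bound(int x) {
    int *base = t, len = n;
    while (len > 1) {
        int half = len / 2;
        base += (base[half - 1] < x) * half; // advance the base or not — ideally a cmov
        len -= half;                         // the interval size is rounded *up*
    }
    return *base;
}
```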

content/english/hpc/external-memory/_index.md (2 additions, 2 deletions)

@@ -19,15 +19,15 @@ When you fetch anything from memory, the request goes through an incredibly comp
-->

-When you fetch anything from memory, there is always some latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through a complex system of address translation units and caching layers designed to both help in memory management and reduce the latency.
+When you fetch anything from memory, there is always some latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through a complex system of address translation units and caching layers designed to both help in memory management and reduce latency.

Therefore, the only correct answer to this question is "it depends" — primarily on where the operands are stored:

- If the data is stored in the main memory (RAM), it will take around ~100ns, or about 200 cycles, to fetch it, and then another 200 cycles to write it back.
- If it was accessed recently, it is probably *cached* and will take less than that to fetch, depending on how long ago it was accessed — it could be ~50 cycles for the slowest layer of cache and around 4-5 cycles for the fastest.
- But it could also be stored on some type of *external memory* such as a hard drive, and in this case, it will take around 5ms, or roughly $10^7$ cycles (!) to access it.

-Such high variance of memory performance is caused by the fact that memory hardware doesn't follow the same [laws of silicon scaling](/hpc/complexity/hardware) as CPU chips do. Memory is still improving through other means, but if 50 years ago memory timings were roughly on the same scale with the instruction latencies, nowadays they lag far behind.
+Such a high variance of memory performance is caused by the fact that memory hardware doesn't follow the same [laws of silicon scaling](/hpc/complexity/hardware) as CPU chips do. Memory is still improving through other means, but if 50 years ago memory timings were roughly on the same scale with the instruction latencies, nowadays they lag far behind.

![](img/memory-vs-compute.png)
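
To make these numbers tangible, here is a minimal sketch (illustrative, not part of the diff) of how such latencies can be measured: chase pointers along a single random cycle, so that every load depends on the previous one, and vary the size of the working set:

```c++
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    for (int n : {1 << 12, 1 << 24}) { // ~16KB (cache-resident) vs. ~64MB (RAM-bound)
        std::vector<int> order(n), p(n);
        std::iota(order.begin(), order.end(), 0);
        std::mt19937 rng(42);
        std::shuffle(order.begin(), order.end(), rng);
        for (int i = 0; i < n; i++)
            p[order[i]] = order[(i + 1) % n]; // link the permutation into one big cycle
        volatile int k = 0;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < n; i++)
            k = p[k]; // each load has to wait for the previous one to complete
        auto end = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(end - start).count();
        printf("n = %8d: %.1f ns per access\n", n, ns / n);
    }
}
```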


content/english/hpc/external-memory/hierarchy.md (1 addition, 1 deletion)

@@ -58,7 +58,7 @@ There are other caches inside CPUs that are used for something other than data.

### Non-Volatile Memory

-While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data to persist for prolonged periods of time without power but comes at the cost of performance and durability — because when you have more electrons, you also have more opportunities for them colliding with silicon atoms.
+While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data persist for prolonged periods of time without power but comes at the cost of performance and durability — because when you have more electrons, you also have more opportunities for them to collide with silicon atoms.

<!-- error correction -->


content/english/hpc/external-memory/model.md (1 addition, 1 deletion)

@@ -18,7 +18,7 @@ Similar in spirit, in the *external memory model*, we simply ignore every operat

In this model, we measure the performance of an algorithm in terms of its high-level *I/O operations*, or *IOPS* — that is, the total number of blocks read or written to external memory during execution.

-We will mostly focus on the case where the internal memory is RAM and external memory is SSD or HDD, although the underlying analysis techniques that we will develop are applicable to any layer in the cache hierarchy. Under these settings, reasonable block size $B$ is about 1MB, internal memory size $M$ is usually a few gigabytes, and $N$ is up to a few terabytes.
+We will mostly focus on the case where the internal memory is RAM and the external memory is SSD or HDD, although the underlying analysis techniques that we will develop are applicable to any layer in the cache hierarchy. Under these settings, reasonable block size $B$ is about 1MB, internal memory size $M$ is usually a few gigabytes, and $N$ is up to a few terabytes.
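
To make the unit of accounting concrete (an illustrative calculation, not part of the diff): under these settings, scanning a 4 GB array stored on disk costs about $\frac{4 \cdot 10^9}{10^6} = 4000$ block reads, regardless of how much work is done on the elements once they are in internal memory.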

### Array Scan


content/english/hpc/number-theory/modular.md (1 addition, 1 deletion)

@@ -100,7 +100,7 @@ $$
$$
\begin{aligned}
a^p &= (\underbrace{1+1+\ldots+1+1}_\text{$a$ times})^p &
-\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by defenition)}
+\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by definition)}
\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} \frac{p!}{x_1! x_2! \ldots x_a!} & \text{(which terms will not be divisible by $p$?)}
\\\ &\equiv P(p, 0, \ldots, 0) + \ldots + P(0, 0, \ldots, p) & \text{(everything else will be canceled)}
\\\ &= a
\end{aligned}
$$
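
As a quick sanity check of this argument, take $a = 2$ and $p = 3$: the expansion gives $2^3 = P(3, 0) + P(2, 1) + P(1, 2) + P(0, 3) = 1 + 3 + 3 + 1$; the two middle terms are divisible by $3$, and what survives is $1 + 1 = 2 = a$.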

content/english/hpc/number-theory/montgomery.md (2 additions, 1 deletion)

@@ -1,6 +1,7 @@
---
title: Montgomery Multiplication
weight: 4
+published: true
---

Unsurprisingly, a large fraction of computation in [modular arithmetic](../modular) is often spent on calculating the modulo operation, which is as slow as [general integer division](/hpc/arithmetic/division/) and typically takes 15-20 cycles, depending on the operand size.
@@ -287,6 +288,6 @@ int inverse(int _a) {
}
```

-While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158s we omit `transform` and `reduce` (a reasonable use case is for `inverse` to be used as a subprocedure in a bigger modular computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.
+While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158ns we omit `transform` and `reduce` (a reasonable use case is for `inverse` to be used as a subprocedure in a bigger modular computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.

**Exercise.** Implement efficient *modular* [matrix multiplication](/hpc/algorithms/matmul).
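
For a self-contained picture of what `reduce` refers to, here is a minimal 32-bit Montgomery reduction sketch — an illustration under assumed constants, not the article's exact code:

```c++
#include <cstdint>

typedef uint32_t u32;
typedef uint64_t u64;

const u32 n = 1000000007; // the modulus; it must be odd for r = 2^32 to work

// n^{-1} mod 2^32 via Newton's iteration: each step doubles the number of correct bits
constexpr u32 ninv(u32 n) {
    u32 x = 1;
    for (int i = 0; i < 5; i++)
        x *= 2 - n * x;
    return x;
}
const u32 nr = -ninv(n); // -n^{-1} mod 2^32

// given x < n * 2^32, computes x / 2^32 mod n
u32 reduce(u64 x) {
    u32 q = (u32) x * nr;            // q ≡ -x * n^{-1} (mod 2^32)
    u64 m = (x + (u64) q * n) >> 32; // x + q * n is divisible by 2^32 by construction
    return m >= n ? m - n : m;       // the shifted value lies in [0, 2n)
}

// multiplies two numbers that are already in Montgomery form (a' = a * 2^32 mod n)
u32 mul(u32 a, u32 b) {
    return reduce((u64) a * b);
}
```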

content/english/hpc/simd/reduction.md (1 addition, 1 deletion)

@@ -46,7 +46,7 @@ int sum_simd(v8si *a, int n) {
}
```

-You can use this approach for for other reductions, such as for finding the minimum or the xor-sum of an array.
+You can use this approach for other reductions, such as for finding the minimum or the xor-sum of an array.
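
As an example of one such reduction, here is a sketch of computing the minimum with the same pattern (assuming the `v8si` typedef and the same size restrictions as `sum_simd` above):

```c++
#include <algorithm>

typedef int v8si __attribute__ (( vector_size(32) ));

int min_simd(v8si *a, int n) {
    v8si m = a[0];
    for (int i = 1; i < n / 8; i++) {
        v8si mask = (a[i] < m);          // per-lane comparison: -1 where a[i] is smaller
        m = (a[i] & mask) | (m & ~mask); // per-lane blend of the two vectors
    }
    int res = m[0];
    for (int i = 1; i < 8; i++) // reduce the 8 lanes scalarly
        res = std::min(res, m[i]);
    return res;
}
```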

### Instruction-Level Parallelism


content/russian/cs/string-structures/aho-corasick.md (4 additions, 3 deletions)

@@ -1,10 +1,11 @@
---
title: Aho-Corasick Algorithm
authors:
-- Сергей Слотин
+  - Сергей Слотин
weight: 2
prerequisites:
-- trie
+  - trie
+published: true
---

Imagine that we work as journalists in some authoritarian state that controls the media and that, from time to time, passes laws forbidding the mention of certain political events or the use of certain words. How do we implement such censorship efficiently in software?
@@ -36,7 +37,7 @@ prerequisites:

**Definition.** The *suffix link* $l(v)$ leads to the vertex $u \neq v$ that corresponds to the longest suffix of $v$ accepted by the trie.

-**Definition.** The *automaton transition* $\delta(v, c)$ leads to the vertex corresponding to the shortest suffix of the string $v + c$ accepted by the trie.
+**Definition.** The *automaton transition* $\delta(v, c)$ leads to the vertex corresponding to the longest suffix of the string $v + c$ accepted by the trie.

**Observation.** If the transition already exists in the trie (we will call such a transition *direct*), then the automaton transition leads to the same place.
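
A minimal sketch of the construction these definitions describe (illustrative code, not from the commit; a lowercase Latin alphabet is assumed) — the suffix links and automaton transitions are computed together by a breadth-first traversal of the trie:

```c++
#include <queue>
#include <string>
#include <vector>

const int k = 26; // alphabet size (assumed)

struct Vertex {
    int next[k] = {}; // trie transitions; after build(), the automaton transitions delta(v, c)
    int link = 0;     // the suffix link l(v)
};

std::vector<Vertex> trie(1); // vertex 0 is the root

void add_string(const std::string &s) {
    int v = 0;
    for (char ch : s) {
        int c = ch - 'a';
        if (!trie[v].next[c]) {
            trie[v].next[c] = (int) trie.size();
            trie.emplace_back();
        }
        v = trie[v].next[c];
    }
}

void build() {
    std::queue<int> q;
    for (int c = 0; c < k; c++)
        if (trie[0].next[c])
            q.push(trie[0].next[c]); // depth-1 vertices keep link = 0 (the root)
    while (!q.empty()) {
        int v = q.front();
        q.pop();
        for (int c = 0; c < k; c++) {
            int u = trie[v].next[c];
            if (u) {
                // a direct transition: delta(v, c) leads to the same place,
                // and l(u) is exactly delta(l(v), c)
                trie[u].link = trie[trie[v].link].next[c];
                q.push(u);
            } else {
                // no direct transition: delta(v, c) = delta(l(v), c)
                trie[v].next[c] = trie[trie[v].link].next[c];
            }
        }
    }
}
```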
