diff --git a/content/english/hpc/compilation/flags.md b/content/english/hpc/compilation/flags.md
index ceae9e87..74383237 100644
--- a/content/english/hpc/compilation/flags.md
+++ b/content/english/hpc/compilation/flags.md
@@ -12,7 +12,7 @@ There are 4 *and a half* main levels of optimization for speed in GCC:
 - `-O0` is the default one that does no optimizations (although, in a sense, it does optimize: for compilation time).
 - `-O1` (also aliased as `-O`) does a few "low-hanging fruit" optimizations, almost not affecting the compilation time.
-- `-O2` enables all optimizations that are known to have little to no negative side effects and take reasonable time to complete (this is what most projects use for production builds).
+- `-O2` enables all optimizations that are known to have little to no negative side effects and take a reasonable time to complete (this is what most projects use for production builds).
 - `-O3` does very aggressive optimization, enabling almost all *correct* optimizations implemented in GCC.
 - `-Ofast` does everything in `-O3`, plus a few more optimizations flags that may break strict standard compliance, but not in a way that would be critical for most applications (e.g., floating-point operations may be rearranged so that the result is off by a few bits in the mantissa).
diff --git a/content/english/hpc/compilation/situational.md b/content/english/hpc/compilation/situational.md
index bec2a255..41620c70 100644
--- a/content/english/hpc/compilation/situational.md
+++ b/content/english/hpc/compilation/situational.md
@@ -96,7 +96,7 @@ The whole process is automated by modern compilers. For example, the `-fprofile-
 g++ -fprofile-generate [other flags] source.cc -o binary
 ```
-After we run the program — preferably on input that is as representative of real use case as possible — it will create a bunch of `*.gcda` files that contain log data for the test run, after which we can rebuild the program, but now adding the `-fprofile-use` flag:
+After we run the program — preferably on input that is as representative of the real use case as possible — it will create a bunch of `*.gcda` files that contain log data for the test run, after which we can rebuild the program, but now adding the `-fprofile-use` flag:
 ```
 g++ -fprofile-use [other flags] source.cc -o binary
diff --git a/content/english/hpc/data-structures/binary-search.md b/content/english/hpc/data-structures/binary-search.md
index 85f9ef52..6426ddde 100644
--- a/content/english/hpc/data-structures/binary-search.md
+++ b/content/english/hpc/data-structures/binary-search.md
@@ -1,6 +1,7 @@
 ---
 title: Binary Search
 weight: 1
+published: true
 ---
@@ -184,7 +185,7 @@ int lower_bound(int x) {
 Note that this loop is not always equivalent to the standard binary search. Since it always rounds *up* the size of the search interval, it accesses slightly different elements and may perform one comparison more than needed. Apart from simplifying computations on each iteration, it also makes the number of iterations constant if the array size is constant, removing branch mispredictions completely.
-As typical for predication, this trick is very fragile to compiler optimizations — depending on the compiler and how the funciton is invoked, it may still leave a branch or generate suboptimal code. It works fine on Clang 10, yielding a 2.5-3x improvement on small arrays:
+As typical for predication, this trick is very fragile to compiler optimizations — depending on the compiler and how the function is invoked, it may still leave a branch or generate suboptimal code. It works fine on Clang 10, yielding a 2.5-3x improvement on small arrays:
diff --git a/content/english/hpc/external-memory/_index.md b/content/english/hpc/external-memory/_index.md
index fe53c83a..0af587b3 100644
--- a/content/english/hpc/external-memory/_index.md
+++ b/content/english/hpc/external-memory/_index.md
@@ -19,7 +19,7 @@ When you fetch anything from memory, the request goes through an incredibly comp
 -->
-When you fetch anything from memory, there is always some latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through a complex system of address translation units and caching layers designed to both help in memory management and reduce the latency.
+When you fetch anything from memory, there is always some latency before the data arrives. Moreover, the request doesn't go directly to its ultimate storage location, but it first goes through a complex system of address translation units and caching layers designed to both help in memory management and reduce latency.
 Therefore, the only correct answer to this question is "it depends" — primarily on where the operands are stored:
@@ -27,7 +27,7 @@ Therefore, the only correct answer to this question is "it depends" — primaril
 - If it was accessed recently, it is probably *cached* and will take less than that to fetch, depending on how long ago it was accessed — it could be ~50 cycles for the slowest layer of cache and around 4-5 cycles for the fastest.
 - But it could also be stored on some type of *external memory* such as a hard drive, and in this case, it will take around 5ms, or roughly $10^7$ cycles (!) to access it.
-Such high variance of memory performance is caused by the fact that memory hardware doesn't follow the same [laws of silicon scaling](/hpc/complexity/hardware) as CPU chips do. Memory is still improving through other means, but if 50 years ago memory timings were roughly on the same scale with the instruction latencies, nowadays they lag far behind.
+Such a high variance of memory performance is caused by the fact that memory hardware doesn't follow the same [laws of silicon scaling](/hpc/complexity/hardware) as CPU chips do. Memory is still improving through other means, but if 50 years ago memory timings were roughly on the same scale with the instruction latencies, nowadays they lag far behind.
 ![](img/memory-vs-compute.png)
diff --git a/content/english/hpc/external-memory/hierarchy.md b/content/english/hpc/external-memory/hierarchy.md
index da1f5bb6..26dfc144 100644
--- a/content/english/hpc/external-memory/hierarchy.md
+++ b/content/english/hpc/external-memory/hierarchy.md
@@ -58,7 +58,7 @@ There are other caches inside CPUs that are used for something other than data.
 ### Non-Volatile Memory
-While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data to persist for prolonged periods of time without power but comes at the cost of performance and durability — because when you have more electrons, you also have more opportunities for them colliding with silicon atoms.
+While the data cells in CPU caches and the RAM only gently store just a few electrons (that periodically leak and need to be periodically refreshed), the data cells in *non-volatile memory* types store hundreds of them. This lets the data persist for prolonged periods of time without power but comes at the cost of performance and durability — because when you have more electrons, you also have more opportunities for them to collide with silicon atoms.
diff --git a/content/english/hpc/external-memory/model.md b/content/english/hpc/external-memory/model.md
index 35cba4ea..9ab86eba 100644
--- a/content/english/hpc/external-memory/model.md
+++ b/content/english/hpc/external-memory/model.md
@@ -18,7 +18,7 @@ Similar in spirit, in the *external memory model*, we simply ignore every operat
 In this model, we measure the performance of an algorithm in terms of its high-level *I/O operations*, or *IOPS* — that is, the total number of blocks read or written to external memory during execution.
-We will mostly focus on the case where the internal memory is RAM and external memory is SSD or HDD, although the underlying analysis techniques that we will develop are applicable to any layer in the cache hierarchy. Under these settings, reasonable block size $B$ is about 1MB, internal memory size $M$ is usually a few gigabytes, and $N$ is up to a few terabytes.
+We will mostly focus on the case where the internal memory is RAM and the external memory is SSD or HDD, although the underlying analysis techniques that we will develop are applicable to any layer in the cache hierarchy. Under these settings, reasonable block size $B$ is about 1MB, internal memory size $M$ is usually a few gigabytes, and $N$ is up to a few terabytes.
 ### Array Scan
diff --git a/content/english/hpc/number-theory/modular.md b/content/english/hpc/number-theory/modular.md
index 47310780..3d05e2f9 100644
--- a/content/english/hpc/number-theory/modular.md
+++ b/content/english/hpc/number-theory/modular.md
@@ -100,7 +100,7 @@ $$
 $$
 \begin{aligned}
 a^p &= (\underbrace{1+1+\ldots+1+1}_\text{$a$ times})^p &
-\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by defenition)}
+\\\ &= \sum_{x_1+x_2+\ldots+x_a = p} P(x_1, x_2, \ldots, x_a) & \text{(by definition)}
 \\\ &= \sum_{x_1+x_2+\ldots+x_a = p} \frac{p!}{x_1! x_2! \ldots x_a!} & \text{(which terms will not be divisible by $p$?)}
 \\\ &\equiv P(p, 0, \ldots, 0) + \ldots + P(0, 0, \ldots, p) & \text{(everything else will be canceled)}
 \\\ &= a
diff --git a/content/english/hpc/number-theory/montgomery.md b/content/english/hpc/number-theory/montgomery.md
index 669e39ba..0eeef0b0 100644
--- a/content/english/hpc/number-theory/montgomery.md
+++ b/content/english/hpc/number-theory/montgomery.md
@@ -1,6 +1,7 @@
 ---
 title: Montgomery Multiplication
 weight: 4
+published: true
 ---
 Unsurprisingly, a large fraction of computation in [modular arithmetic](../modular) is often spent on calculating the modulo operation, which is as slow as [general integer division](/hpc/arithmetic/division/) and typically takes 15-20 cycles, depending on the operand size.
@@ -287,6 +288,6 @@ int inverse(int _a) {
 }
 ```
-While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158s we omit `transform` and `reduce` (a reasonable use case is for `inverse` to be used as a subprocedure in a bigger modular computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.
+While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158ns we omit `transform` and `reduce` (a reasonable use case is for `inverse` to be used as a subprocedure in a bigger modular computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.
 **Exercise.** Implement efficient *modular* [matix multiplication](/hpc/algorithms/matmul).
diff --git a/content/english/hpc/simd/reduction.md b/content/english/hpc/simd/reduction.md
index b74ef3b8..89678103 100644
--- a/content/english/hpc/simd/reduction.md
+++ b/content/english/hpc/simd/reduction.md
@@ -46,7 +46,7 @@ int sum_simd(v8si *a, int n) {
 }
 ```
-You can use this approach for for other reductions, such as for finding the minimum or the xor-sum of an array.
+You can use this approach for other reductions, such as for finding the minimum or the xor-sum of an array.
 ### Instruction-Level Parallelism
diff --git a/content/russian/cs/string-structures/aho-corasick.md b/content/russian/cs/string-structures/aho-corasick.md
index 369f5171..2ca1da65 100644
--- a/content/russian/cs/string-structures/aho-corasick.md
+++ b/content/russian/cs/string-structures/aho-corasick.md
@@ -1,10 +1,11 @@
 ---
 title: Алгоритм Ахо-Корасик
 authors:
-- Сергей Слотин
+ - Сергей Слотин
 weight: 2
 prerequisites:
-- trie
+ - trie
+published: true
 ---
 Представим, что мы работаем журналистами в некотором авторитарном государстве, контролирующем СМИ, и в котором время от времени издаются законы, запрещающие упоминать определенные политические события или использовать определенные слова. Как эффективно реализовать подобную цензуру программно?
@@ -36,7 +37,7 @@ prerequisites:
 **Определение.** *Суффиксная ссылка* $l(v)$ ведёт в вершину $u \neq v$, которая соответствует наидлиннейшему принимаемому бором суффиксу $v$.
-**Определение.** *Автоматный переход* $\delta(v, c)$ ведёт в вершину, соответствующую минимальному принимаемому бором суффиксу строки $v + c$.
+**Определение.** *Автоматный переход* $\delta(v, c)$ ведёт в вершину, соответствующую максимальному принимаемому бором суффиксу строки $v + c$.
 **Наблюдение.** Если переход и так существует в боре (будем называть такой переход *прямым*), то автоматный переход будет вести туда же.
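
Note on the last hunk (Aho-Corasick): the change from "minimal" to "maximal" in the definition of the automaton transition $\delta(v, c)$ can be checked with a small sketch. The shortest suffix of $v + c$ accepted by the trie is always the empty string (the root), so only the longest one gives a useful transition. The code below is a minimal sketch, not the article's implementation: the names `AhoCorasick`, `add_word`, `build`, `go`, and the `depth` field are hypothetical, and a lowercase Latin alphabet is assumed.

```cpp
#include <array>
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Hypothetical names throughout; none of them come from the patched article.
struct AhoCorasick {
    static constexpr int K = 26;  // assumed alphabet: 'a'..'z'

    struct Node {
        std::array<int, K> next;  // trie edges, completed to delta(v, c) by build()
        int link = 0;             // suffix link l(v)
        int depth = 0;            // length of the string spelled by this node
        Node() { next.fill(-1); }
    };

    std::vector<Node> t = std::vector<Node>(1);  // node 0 is the root (empty string)

    // Inserts a word into the trie and returns the node that spells it.
    int add_word(const std::string& s) {
        int v = 0;
        for (char ch : s) {
            assert('a' <= ch && ch <= 'z');
            int c = ch - 'a';
            if (t[v].next[c] == -1) {
                Node node;
                node.depth = t[v].depth + 1;
                t[v].next[c] = (int) t.size();
                t.push_back(node);
            }
            v = t[v].next[c];
        }
        return v;
    }

    // Computes suffix links and automaton transitions in BFS order,
    // so that l(v) is fully processed before v is.
    void build() {
        std::queue<int> q;
        for (int c = 0; c < K; c++) {
            int u = t[0].next[c];
            if (u == -1) {
                t[0].next[c] = 0;  // no edge from the root: stay in the root
            } else {
                t[u].link = 0;
                q.push(u);
            }
        }
        while (!q.empty()) {
            int v = q.front();
            q.pop();
            for (int c = 0; c < K; c++) {
                int u = t[v].next[c];
                int fallback = t[t[v].link].next[c];  // delta(l(v), c)
                if (u == -1) {
                    t[v].next[c] = fallback;  // no direct edge: reuse the shorter suffix
                } else {
                    t[u].link = fallback;     // direct edge: delta(v, c) = u
                    q.push(u);
                }
            }
        }
    }

    // Automaton transition delta(v, ch); valid only after build().
    int go(int v, char ch) const { return t[v].next[ch - 'a']; }
};
```

With the words `abc`, `bcd`, and `c`, reading `d` from the node for `abc` leads to the node for `bcd`, which is the longest suffix of `abcd` present in the trie; the shortest one would be the root.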