-
Notifications
You must be signed in to change notification settings - Fork 0
/
notes.txt
338 lines (169 loc) · 11.1 KB
/
notes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
---
Back in January, I saw this Java coding challange posted on Hacker News.
(read quote)
---
The task is to take a 1 billion row input file of weather station temperature measurements, and aggregate them into minimum, mean and maximum by station.
You're not allowed to use any external dependencies.
I thought it would be fun to try doing this in Ruby instead of Java, and see how far we can push Ruby.
This talk is about my adventure trying to optimise Ruby code for this problem, and some tool and techiniques I used along the way.
---
What sort of sizes are we talking about here in terms of bytes?
The "create measurements" script is from the original 1brc repo, and it generates an input file with the specified number of lines.
So we run that with 1 billion, and we get a 13 gigabyte input file.
That's big enough that we should be careful how we work with it - we probably want to stream it line-by-line, rather than loading the whole file into RAM before we do anything.
But it's not "big data" - it's easily small enough for a single machine to handle entirely in RAM.
Importantly, it's small enough that the OS will cache the entire file in RAM after it's been loaded for the first time, so the performance is really a question of how quickly we can process those bytes on the CPU, not how quickly we can load from disk.
---
I started with a basic Ruby implementation.
(open 001_baseline.rb, talk it through)
---
So we have a Ruby program - how do we measure how fast it is?
The simplest tool is one you already have - it's built into your shell. You can prepend `time` to any command.
---
I'm running this with an input file with only 1 million lines for the sake of time.
If I run it multiple times, you'll see that the time is different on each run - there's some noise in the measurements.
That can be a bit of a trap when you're trying to determine if a change you just made improved the performance - it's quite easy to trick yourself into thinking something has worked. Ideally you want to do multiple runs and average them.
There's tool I found called hyperfine that automates this, and it can also compare different commands and tell you the relative performance difference. I'll give a demo of that next.
---
I ran this implementation with the 10 million line file and multiplied the result by 100.
1099 seconds - that's about 18 minutes!
For comparison, I also benchmarked two different Java implementations - they're in the 1BRC repo if you want to check them out.
The 45s one looked like a "middle-of-the-road" Java implementation. I'm not super familiar with Java, but it seemed like an idomatic Java implementation without any crazy optimisation tricks, other than working on chunks of the input in parallel.
And the second one was close to the top of the leaderboard.
There's a lot of work to do if we're going to beat Java.
---
Let's get the easy stuff out of the way first.
Will YJIT save us all from ever having to think about Ruby performance?
I'm using Ruby 3.3.0, which allows you to enable YJIT without using the command-line option or environment variable.
---
---
...for just enabling a flag. That's pretty cool - I'm very excited about Ruby's future with YJIT.
So we've got our basic implementation, we got a boost by enabling YJIT, now what?
How do we decide what to work on next? We could stumble around just changing things and see if they help, but it would be much better if we knew what the slowest parts were.
Enter:
---
A profiler is a tool that helps you analyse the performance of a running program.
These are both sampling profilers: they work by periodically (hundreds of times a second) observing the current callstack of the process and recording which methods are currently executing.
stackprof:
* records GC time
* can profile blocks of code, rather than the entire process
rbspy:
* has option to automatically record subprocesses
* works on any Ruby process, without installing a gem
---
What are we looking at here?
The width of the boxes is the percentage of time
The y-axis is the callstack
(compare with 002 yjit code)
* The slowest method we're using is String#split (37% of time)
* We're spending 11% of time in GC
Let's see what we can do about both of those. I'm going start by looking at the garbage.
---
A quick refresher on garbage collection
(talk through slide)
Your program stops while the garbage collector is running. The best way to reduce the amount of time spend in GC is
to reduce the amount of garbage by allocating fewer objects.
---
You can use GC stat total allocated objects to get the total number of objects that have been allocated so far, and you can compare it before and after to measure a particular block of code.
I measured this line, and there are 4 allocations. Anyone got any guesses what they might be?
array, 2x string, unfrozen string literal
(open 003_gc_golf.rb)
This is my attempt at trying to minimise the number of allocations per input line.
Reduced the number of allocations to 1 (excluding the line String itself)
---
---
With 6 cores, we should be able to get at least a 6x performance gain
What are options for parallel execution in Ruby?
Because we're spending approximately all of our time executing Ruby code and not waiting on IO, only a single thread will be able to execute at any time.
---
We need a strategy for dividing the work. There a few options, each with different challanges.
The first one is dividing by line - we could go through and farm out each line one-by-one to the parallel workers.
That's going to be super high overhead. Communicating a piece of work to a worker process isn't free, and if we have to do it 1 billion times, it's going to add up.
---
So what if we could do chunks of lines instead?
We'd have to scan through the entire file to locate the line endings first - not really an efficient option either.
---
---
(open up 005_parallel_processes.rb)
I'm using the parallel gem here for the sake of readability, but internally that's calling Process.fork and using IO.pipe to communicate with the child processes.
---
Ruby is a bit over twice as slow as the middle-of-the-road Java implementation. That's honestly better than I was expecting - this sort of CPU-intensive work isn't something Ruby is great at.
So what about Ractors - they should be even faster because everything happens within the one Ruby process.
---
So very much experimental.
---
Let's run a profiler again on the processes version, and see what the bottleneck is now.
I've switched to using rbspy this time, because it has better support for recording multiple subprocesses.
The slowest thing is now #each_line, but what can we do instead?
---
I want to put in a quick mention of one of my favourite tools - strace. You can use it to see how a process (any process - not just Ruby) is interacting with the operating system.
You'll get a whole lot of output showing every syscall the process is making.
You can see that our script is mostly making read calls.
---
(read through slide)
So we got a copy from the OS cache into the buffer, and then another copy for every line.
That's a lot of copying!
---
There's an alternative operating system API for accessing files called mmap (memory map), which we can use via IO::Buffer in Ruby.
Mmap tells the OS to use virtual memory to make the file appear within Ruby's memory, without any copying.
This works really well here because the input fits in memory, and we know it's already in the cache.
The IO::Buffer is a bit like having a giant string with the whole file in it.
We can get a byte at a particular index with #get_value - these are returned as Integers, so there's no String allocation.
For example - a semicolon will have ASCII value 59.
And we can also slice out strings with #get_string.
This removes the overhead of calling read for every 8 KiB, and we no longer have to allocate a String object for every line.
(open io buffer rb)
---
---
And that would make the code a whole lot more readable, and hopefully faster.
So I wrote a tiny C extension that adds it.
---
The Java 1BRC competition rules say you're not allowed to use external libraries. But whatever - those rules aren't real - we can go wherever we want on this Ruby adventure.
(open 007 buffer ext)
(open c extension)
---
So close...
---
---
(read slide)
And that string neeeds to be garbage collected.
We have access to individual bytes via IO::Buffer, so what if we could just parse them directly into a number, rather than going via a String?
The decimals also always have a single fractional digit - we don't even need floats - we can use integers storing tenths of a degree.
So what does that look like?
(open 008 custom parser)
---
Pretty amazing that writing this in Ruby is faster.
I'm not sure if it's saying something good about YJIT, or something bad about the String#to_f implementation.
---
We're now spending 34% of self time in the loop - there's lots of Ruby code for manging the current offset, and we're back to doing #get_value calls.
---
What would be an even nicer API, is if there was an IO::Buffer::Reader class that managed of offset automatically.
(open 009_io_buffer_reader.rb)
(quickly open C extension)
---
It might be more C than Ruby, but Ruby beat Java!
---
What we have with the IO::Buffer::Reader implementation feels pretty minimal. How is the optimized Java implementation still 18x faster?
We're working byte-by-byte - the hyper optimized implementations are doing bit pattern tricks to search multiple bytes in one operation. A CPU is just as fast at operating on an 8 byte number as it is with a single byte, so we're leaving a lot of performance on the table by working byte-by-byte.
There's an interesting blog post that goes into more detail about how these tricks work.
---
Can we just `gem install` a highly-optimised solution to this exact problem? Yes, yes we can.
open 010_polars.rb
---
This is the pragmatic "software engineering" solution. It's what you'd do at work.
---
One last interesting thing.
It's important to keep in mind that different computers have different performance characteristics. An optimisation that's faster on your development machine might be slower in production.
Even in production, machines from different generations will behave differently.
It's something you should be conscious of.
---
Here's an interesting example that kind of illustrates this.
I've been taking on these measurements on a desktop at home. If I compare how it performs to this laptop, you can see that initially the laptop is about 20% faster, but if I keep continously running the benchmark, it gets slower and slower.
Any ideas why?
---
Optimisation is all about this loop of measurement, profiling and coming up with ideas for improvement, and comparison.
There's real readability/maintainability vs performance trade-off:
You can make performance gains by hard-coding assumptions like the value always having a single decimal place, or doing unreadable bit pattern tricks, but they have a cost.
I know we could have just added the polars gem and gone fast immediately, but the journey of exploring a problem like this is lot of fun, and I learnt lots!
Can you make it go faster?
---