Low-memory terminal state implementation #1584
@andrewrk Thanks! I didn’t mark this issue as closing your original memory issue because there are still a lot of areas that need work. But the terminal state at this point should no longer be an issue. GTK, for example, has huge memory issues that macOS doesn’t have and that still need to be addressed. |
got you a nice integer overflow crash too:
to reproduce:
|
@andrewrk Thank you, good bug, fixed with test cases in last commit. |
I got a stack trace for the
I'm still unsure how exactly this is happening but I have a reliable repro on my Mac now:
|
Hit an assertion
to reproduce:
While trying to reproduce this with a floating window in sway and resizing the window in all unreasonable ways I got the following stack trace:
|
Subjectively high memory usage with several open instances is what keeps me from daily driving Ghostty, so I am really interested in helping improve this, at least by providing some data. Here are some results on my Linux machine with i3:
Current `main` (4b1958b)
Current `paged-terminal` (6fdc985)
For comparison, some other terminal emulators:
All tests were performed with reasonably similar steps:
So my current observations are:
|
Ghostty does this already, but by default only when it detects it was launched from a desktop environment. You can force this behavior using So your experience may be due to launching Ghostty in a way that its default detection is detecting it is CLI-launched. You will indeed get significant memory (and startup time) savings by using single-instance mode. The reason we have this detection at all is because single instance mode shares a base set of configuration. For CLI-launched Ghostty instances, I want to allow users to use CLI flags to override settings like font sizes, colors, etc. By launching a separate instance we can do that.
Otherwise, thank you for your data points! This is helpful. I want to reiterate that this PR addresses terminal state memory usage, but there are many other components in Ghostty whose memory usage can be optimized. I don't think st/xfce necessarily use less memory due to their feature choices; I think Ghostty simply still has improvements to make 😄 |
I made an issue #1591 which tracks a handful of panics/asserts I ran into on this PR (making this comment because I forgot to mention the PR in the issue, so it didn't get listed here) |
c2bcd34 to e9de444
Status update on this PR: I'm now able to run
What this says about the stability of this PR: crashes due to terminal sequence state changes are in an acceptably rare state. I'm sure fuzzing will find more but for the reality of daily usage we're probably doing really well.
What this doesn't say about the stability of this PR: this doesn't address how stable UI interactions are such as resizing, keybindings such as screen clears, prompt jumping, selection, etc. That can't be so naively fuzzed with random bytes. Of all the aforementioned features, resizing is the most important thing to test.
If anyone got a crash previously, please test with the latest builds and report back (or open a new issue) if you can get one. Thank you! ❤️ |
I think you've fixed my "easily reproducible" crash with the latest two commits (at least, I can't just generally muck around in the terminal and cause the crash anymore).
$ /Applications/GhosttyBeta.app/Contents/MacOS/ghostty +version
Ghostty 0.1.0-paged-terminal+a4d3af65
Build Config
- Zig version: 0.12.0-dev.3342+f88a971e4
- build mode : builtin.OptimizeMode.ReleaseSafe
- app runtime: apprt.Runtime.none
- font engine: font.main.Backend.coretext
- renderer : renderer.Metal
- libxev : main.Backend.kqueue |
25bcbcd to bffb7b5
This reverts commit 5904866.
As of the last few commits, all outstanding todos are complete with the exception of a CSI performance regression fix which is coming shortly. At the time of writing this, there are no known bugs hit by testers. I believe we're very close to being able to merge this PR. |
terminal: reset alt screen kitty keyboard state on full reset
Clearing these rows is necessary to avoid memory corruption, but the calls to `clearCells` in the first loop were redundant, since the rows in question are included in the second loop as well.
Fix scroll region performance regressions
I'm preparing to merge this, so here are some last-minute benchmarks...
ASCII Stream
UTF8 Stream
Random Bytes Stream
|
This PR changes the data structures and memory layout of the core terminal state subsystem. This is the subsystem necessary for representing the grid (rows, columns, cells) of a terminal and the operations on that grid.
The major goal of this work was to address bloated memory usage (#254) and enable requested features and improvements more efficiently (#189, unlimited scrollback, scrollback on terminal restore, scrollback paged to disk, etc.). Improving runtime performance was not a goal, but as you’ll see, we ended up doing that too (since memory access is closely tied to CPU throughput nowadays).
TODO:
Page.cloneRowFrom needs to clear existing graphemes and styles on destination before copy
Re-run vtebench since opening this PR due to bug fixes (Fix scroll region performance regressions #1612)
Delete src/terminal-old. I want to wait to do this until we’re ready to merge so I can continue to benchmark between the two and compare behaviors. After deleting this, we’ll need to check out an old commit to verify old behaviors.
I don’t really recommend this for daily use, but I’d love more people to test this branch if possible. I’m quite confident there are some gnarly bugs lurking somewhere. Please build with ReleaseSafe so that assertions can be triggered.
macOS users, this PR will produce a signed and notarized macOS app. URL: https://pr.files.ghostty.dev/5717/ghostty-macos-universal.zip
Background: The Terminal Grid
As a point of background, the terminal grid can be thought of as a simple 2D (rows x cols) grid of monospace cells. And a cell is the combination of state required to render a cell: the content, background color, foreground color, styles (underline, bold, italic, etc.), and other attributes.
In addition to the visible grid, a terminal supports scrollback. Scrollback can be thought of as additional rows of the same width that simply aren’t in view until the viewport moves.
A terminal supports many operations via control sequences such as moving the cursor, scrolling a region, erasing lines, erasing cells, scrolling the screen, etc. All of these operations are parameterized on cells. Thus, the concept of a fixed-size grid is effectively baked into the terminal API and we must build around that fundamental design.
Background: Ghostty’s Previous Terminal Grid Memory Layout
Previously, Ghostty represented the terminal grid in a typical way: a circular buffer of cells. By moving some set of cells back from the current write pointer of the circular buffer, you can define the “active” area (the bottommost part of the screen that terminal APIs operate on). And by moving a viewport pointer, you could “scroll” the screen.
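As a rough illustration of the layout described above, here is a minimal Python sketch. It is not Ghostty's code: `RingScreen` and all of its methods are hypothetical names, and it stores whole rows in a bounded `deque` rather than raw cells to keep the example short. The active area is the last `rows` entries behind the write position, and scrolling only moves a viewport offset.

```python
from collections import deque


class RingScreen:
    """Hypothetical model of a circular-buffer screen (illustration only)."""

    def __init__(self, rows, cols, max_scrollback):
        self.rows = rows
        self.cols = cols
        # A bounded deque drops the oldest row once it is full,
        # which mimics a circular buffer overwriting old scrollback.
        self.buf = deque(maxlen=rows + max_scrollback)
        for _ in range(rows):
            self.buf.append([" "] * cols)
        self.viewport_offset = 0  # rows scrolled back from the bottom

    def active(self):
        # The active area is always the bottommost `rows` rows.
        return list(self.buf)[-self.rows:]

    def viewport(self):
        # The viewport is a window of `rows` rows ending `viewport_offset`
        # rows above the bottom.
        end = len(self.buf) - self.viewport_offset
        return list(self.buf)[end - self.rows:end]

    def put(self, y, x, ch):
        # Terminal writes always target the active area.
        self.active()[y][x] = ch

    def newline(self):
        # Advancing the write position pushes the top active row
        # into scrollback.
        self.buf.append([" "] * self.cols)

    def scroll_back(self, n):
        max_offset = len(self.buf) - self.rows
        self.viewport_offset = min(self.viewport_offset + n, max_offset)


screen = RingScreen(rows=2, cols=3, max_scrollback=4)
screen.put(0, 0, "a")   # write into the active area
screen.newline()        # the row containing "a" becomes scrollback
screen.scroll_back(1)   # move the viewport up by one row
print(screen.viewport()[0][0])
```

The sketch ignores what happens to the viewport offset when old rows fall off the end of the buffer; a real implementation has to handle that.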
This is a very typical approach used by many mainstream terminal emulators. But it has downsides:
Memory Usage - The circular buffer dynamically grew for scrollback, but all scrollback had to be in memory. And whenever memory grew, the entire circular buffer had to be copied, which became progressively slower as the buffer grew.
Slow Row Operations - Since the offset into the circular buffer directly mapped to an (x,y) in the screen, moving rows required copying cells. For example, the terminal “erase lines” control sequence erases N lines and shifts all lines below them up by N. For a typical screen, this could require copying thousands of cells.
Also, each cell within the circular buffer fully contained all state required to render that cell: the codepoint, the fg/bg color, the styles, etc. I’ll touch on this point later, just remember it.
Additionally, Ghostty maintained additional data in look-aside tables for rare features such as extended grapheme clusters.
The renderer thread requires access to the visible screen state. To get it, the renderer thread would acquire a lock shared with the IO thread, copy the visible part of the screen (the viewport), and unlock. We found through empirical analysis that copying all the data associated with the viewport was faster than the processing time on that data, but it’s still a relatively slow operation since memory is all over the place and thus not cache-friendly to copy. This has an additional effect: the longer the renderer copies, the slower the IO becomes because it is blocked on the lock.
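The lock, copy, unlock pattern described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `SharedScreen`, `io_write`, and `render_snapshot` are hypothetical names, and a nested list stands in for the real screen state.

```python
import copy
import threading


class SharedScreen:
    """Hypothetical screen state shared by an IO thread and a renderer."""

    def __init__(self, rows, cols):
        self.lock = threading.Lock()
        self.cells = [[" "] * cols for _ in range(rows)]

    def io_write(self, y, x, ch):
        # The IO side must hold the lock to mutate the screen,
        # so it stalls while a snapshot is being taken.
        with self.lock:
            self.cells[y][x] = ch

    def render_snapshot(self):
        # The renderer holds the lock only for the duration of the copy
        # and does its (slower) processing on the snapshot afterwards.
        with self.lock:
            return copy.deepcopy(self.cells)


shared = SharedScreen(rows=2, cols=2)
shared.io_write(0, 0, "x")
snapshot = shared.render_snapshot()
shared.io_write(0, 0, "y")  # later writes do not affect the snapshot
print(snapshot[0][0], shared.cells[0][0])
```

The time spent inside `render_snapshot` is exactly the time the IO side can be blocked, which is why a cheaper copy helps both threads.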
The Big Ideas
Big Idea 1: Unique Style Counts are Low
The first big thing I noticed: every cell is paying the overhead of every possible style attribute: foreground color, background color, underline color, bold, italic, codepoint, etc. But the total number of unique styles on a viewport is low.
I inserted some logging code and asked a number of beta testers to log their count of unique styles during regular day-to-day activities. Under normal usage, no one had more than 25 unique styles within the visible screen at any time. There were rare exceptions (e.g. btop), but this generally held true for active usage.
So the first big idea: what if cells didn’t have to pay the cost for repeated or unused styles? What if they didn’t waste memory on that?
In the old terminal state, each cell is 20 bytes. For a 300x80 terminal (roughly the fullscreen dimensions on my machine) with 10,000 lines of scrollback, the memory required is: 60.48 MB. Multiply this by multiple tabs, splits, etc. and it’s quite hefty.
In the new terminal state, each cell stores only the codepoint and a style ID. The style is stored in a look-aside table keyed by ID. We reference count the styles. A cell is now 8 bytes (64 bits). For the same 300x80 terminal with 10,000 lines of scrollback, the memory required is now: 24.19 MB, or roughly 60% less.
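To make the style look-aside table concrete, here is a minimal Python sketch of a reference-counted style table. It is an assumption-laden illustration, not the PR's data structure: `StyleTable`, `acquire`, and `release` are hypothetical names, a style is modeled as a plain tuple, and reserving ID 0 for the default style is my own choice.

```python
class StyleTable:
    """Hypothetical reference-counted table mapping styles to small IDs."""

    def __init__(self):
        self.ids = {}      # style tuple -> ID
        self.styles = {}   # ID -> style tuple
        self.refs = {}     # ID -> number of cells using the style
        self.next_id = 1   # assumption: 0 is reserved for the default style

    def acquire(self, style):
        """Return the ID for `style`, adding it on first use."""
        sid = self.ids.get(style)
        if sid is None:
            sid = self.next_id
            self.next_id += 1
            self.ids[style] = sid
            self.styles[sid] = style
            self.refs[sid] = 0
        self.refs[sid] += 1
        return sid

    def release(self, sid):
        """Drop one reference; forget the style when nothing uses it."""
        self.refs[sid] -= 1
        if self.refs[sid] == 0:
            style = self.styles.pop(sid)
            del self.ids[style]
            del self.refs[sid]


table = StyleTable()
bold_red = ("red", None, True)  # (fg, bg, bold): illustrative fields only

# Each cell stores just a codepoint and a style ID.
cells = [(ord(ch), table.acquire(bold_red)) for ch in "hi"]
print(cells, len(table.styles), table.refs[cells[0][1]])

for _, sid in cells:
    table.release(sid)
print(len(table.styles))
```

Two cells with the same style share one table entry, so the per-cell cost is only the ID.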
We may be able to get cells down to 32 bits in the future, which would halve the memory again. This PR does not do that yet since that requires more complexity, and I think we should get this in first.
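The memory figures above can be reproduced with a short calculation. This sketch assumes the totals count the 80 visible rows plus the 10,000 scrollback rows and that MB means 10^6 bytes; under those assumptions the numbers match the ones quoted.

```python
COLS, ROWS, SCROLLBACK = 300, 80, 10_000


def grid_bytes(cell_size):
    """Bytes needed for every cell in the visible grid plus scrollback."""
    return COLS * (ROWS + SCROLLBACK) * cell_size


old = grid_bytes(20)     # old terminal state: 20-byte cells
new = grid_bytes(8)      # this PR: 8-byte cells
future = grid_bytes(4)   # possible future 32-bit cells
saving = 1 - new / old

print(f"old:    {old / 1e6:.2f} MB")
print(f"new:    {new / 1e6:.2f} MB ({saving:.0%} less)")
print(f"future: {future / 1e6:.2f} MB")
```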
Big Idea 2: Don’t Require Cells to be in Row Order
One of the most common terminal operations is moving rows. It is used heavily by editors (e.g. neovim), multiplexers (tmux), and pagers (less, bat, etc.).
The new terminal state now maintains a linear array of “row” metadata. Within the row metadata, we maintain a pointer to the start of the cells for that row. Each Row metadata structure is 8 bytes (64 bits). To move a row, we now just have to shift the row metadata rather than every cell across the full column width.
In the old terminal state, for the same 300x80 terminal, erasing the top line and shifting all rows up required copying 474 KB. The positive point: it was generally linear memory (ignoring circular buffer wraparounds).
In the new terminal state, the same operation requires copying 632 bytes (bytes!), or roughly 0.1% of the old amount. And this is also linear memory.
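The arithmetic behind these two figures, and the row indirection itself, can be sketched as follows. The sketch assumes 300 columns by 80 rows, so 79 rows shift when the top line is erased. The `rows` list of indices is a hypothetical stand-in for the row metadata array, not the PR's actual structure.

```python
COLS, ROWS = 300, 80
OLD_CELL_BYTES, ROW_META_BYTES = 20, 8

# Erasing the top line shifts the remaining ROWS - 1 rows up by one.
old_copy = (ROWS - 1) * COLS * OLD_CELL_BYTES  # every cell is moved
new_copy = (ROWS - 1) * ROW_META_BYTES         # only row metadata is moved
print(old_copy, new_copy, f"{new_copy / old_copy:.2%}")

# Tiny model of the indirection: cell storage stays where it is and
# `rows` records which storage slot each screen line uses.
cells = [[f"r{y}"] * 3 for y in range(4)]
rows = [0, 1, 2, 3]


def erase_top_line(rows, cells):
    freed = rows.pop(0)                         # shift the metadata up
    cells[freed] = [" "] * len(cells[freed])    # clear the freed cells
    rows.append(freed)                          # reuse them as the new bottom line


erase_top_line(rows, cells)
print(rows, cells[rows[0]][0])
```

The cell data for lines that moved is never touched; only the small metadata entries change position.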
Big Idea 3: Only the Viewport and Active is Required in Memory
The only part of the screen that terminal APIs can manipulate is known as the “active” area of the screen, and it is exactly your grid’s cols x rows dimensions. Terminal APIs can’t modify scrollback, so scrollback history becomes read-only.
The only part of the screen that must be read for rendering is the viewport, also of cols x rows dimensions. The viewport is very often identical to the active area, since the viewport is very often at the bottom of the terminal.
The big idea here is: instead of a circular buffer, let’s split up our screen into chunks (this PR calls them “pages”) and create a system that can offload unnecessary pages to disk so they don’t have to be in memory.
To do this, instead of using a circular buffer, this PR uses a doubly-linked list of pages. Only the active/viewport need to be readily available.
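A doubly-linked list of pages could look roughly like the Python sketch below. It is a simplified model with hypothetical names (`Page`, `PageList`, `append_row`, `tail_rows`) and a fixed row capacity per page. It only shows that the bottommost rows can be collected from the tail of the list without touching older pages.

```python
class Page:
    """Hypothetical page: a fixed-capacity chunk of rows."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = []
        self.prev = None
        self.next = None


class PageList:
    """Hypothetical doubly-linked list of pages, newest page last."""

    def __init__(self, page_capacity):
        self.page_capacity = page_capacity
        self.first = self.last = Page(page_capacity)

    def append_row(self, row):
        # When the last page is full, link a fresh page after it.
        if len(self.last.rows) == self.last.capacity:
            page = Page(self.page_capacity)
            page.prev = self.last
            self.last.next = page
            self.last = page
        self.last.rows.append(row)

    def tail_rows(self, n):
        """Collect the last n rows by walking backwards from the last page."""
        out = []
        page = self.last
        while page is not None and len(out) < n:
            need = n - len(out)
            out = page.rows[-need:] + out
            page = page.prev
        return out


pages = PageList(page_capacity=2)
for i in range(5):
    pages.append_row(f"row {i}")

# Only the pages holding the last three rows are visited;
# the first page is never read.
print(pages.tail_rows(3))
```

Because older pages are only reachable by walking `prev` links, they are natural candidates for being offloaded later.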
Importantly, this PR does not implement the disk serialization. This is something I want to work on soon but in the interest of not introducing too much complexity into one PR, I didn’t finish this here. The important thing is that the new architecture enables this core idea.
Big Idea 4: Pointers Suck to Copy and Serialize
The renderer needs to copy the visible screen on every frame. And the future page serialization system needs to encode/decode pages to/from disk.
The old terminal state had many pointers floating around. In order to copy these values, we had to iterate through all pointer values (e.g. in a hash map), construct a new hash map, and copy each value individually. This is slow.
The new terminal state preallocates a contiguous block of memory (typically in ~512KB chunks, or roughly 32 or 128 virtual memory pages on 16KB and 4KB page systems, respectively). Instead of storing pointers, we now store the base address and offsets.
The offsets are 32-bit, so they can address at most 4GB of data. Therefore, our pages are capped at 4GB today. I actually would like to lower that to 16-bit, capped at 65K, but implementing 32-bit was easier to start. At the limit, we must allocate more pages. Not a big deal.
Since the offsets are 32-bit, every pointer in the previous state is now half the size on a 64-bit system, saving more memory.
And to copy the data, we can perform a linear copy of the virtual memory pages, update our single base address, and we’re done. This makes copying fast and serialization much simpler.
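The offset-based layout can be illustrated with a `bytearray` standing in for the preallocated block. This is a toy model with a layout of my own choosing (a table of 32-bit row offsets followed by 32-bit cells); the names `new_page`, `put`, and `get` are hypothetical. The point is that offsets remain valid in a byte-for-byte copy, so no per-pointer fix-up is needed.

```python
import struct

ROW_COUNT, COLS = 2, 4
ROWS_OFFSET = 0                    # row table: one u32 offset per row
CELLS_OFFSET = ROW_COUNT * 4       # cell storage follows the row table
PAGE_SIZE = CELLS_OFFSET + ROW_COUNT * COLS * 4


def new_page():
    page = bytearray(PAGE_SIZE)
    for y in range(ROW_COUNT):
        # Store where each row's cells start, relative to the page base.
        struct.pack_into("<I", page, ROWS_OFFSET + y * 4,
                         CELLS_OFFSET + y * COLS * 4)
    return page


def cell_offset(page, y, x):
    (row_start,) = struct.unpack_from("<I", page, ROWS_OFFSET + y * 4)
    return row_start + x * 4


def put(page, y, x, codepoint):
    struct.pack_into("<I", page, cell_offset(page, y, x), codepoint)


def get(page, y, x):
    return struct.unpack_from("<I", page, cell_offset(page, y, x))[0]


page = new_page()
put(page, 1, 2, ord("é"))

snapshot = bytearray(page)   # one linear copy, offsets still resolve
put(page, 1, 2, ord("z"))    # later writes do not affect the snapshot
print(chr(get(snapshot, 1, 2)), chr(get(page, 1, 2)))

# Sizes quoted in the text: a 512KB block in OS pages, and offset limits.
print(512 * 1024 // (16 * 1024), 512 * 1024 // (4 * 1024))
print(2 ** 32, 2 ** 16)
```

The same property is what would make writing a page to disk straightforward: the bytes can be stored as-is and read back at a different base address.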
This PR implements this idea completely.
Benchmarks 🚀
Memory Usage Under Static Scenarios
All macOS GLFW builds, ReleaseFast, macOS 14.4.
20 empty windows:
5 windows with cat 20MB.txt of Japanese text each:
5 Neovim windows:
(Explanation for Neovim windows: with no scrollback, the old terminal state used the smallest circular buffer necessary. The new terminal state performs a slightly larger preallocation. The net result is roughly no memory changes.)
ASCII IO Throughput
25MB of ASCII
Unicode IO Throughput
25MB of Unicode
vtebench Results
Old:
New:
Future
The terminal state is just one offender of memory usage within Ghostty. Looking at our benchmarks, it was a significant offender, but our memory usage is still much higher than I’d like.
In the future, I believe there are improvements we can continue to make to terminal state (e.g. 32-bit cells vs 64-bit cells). However, I think there is larger low-hanging fruit, such as duplicate font information between multiple terminals, redundant CPU state for the renderer (when it’s already present on the GPU), etc. I plan on addressing those soon, after this PR is merged and somewhat stable.