Low-memory terminal state implementation #1584

Merged
merged 430 commits into from
Mar 27, 2024

@mitchellh
Contributor

mitchellh commented Mar 16, 2024

This PR changes the data structures and memory layout of the core terminal state subsystem. This is the subsystem necessary for representing the grid (rows, columns, cells) of a terminal and the operations on that grid.

The major goal of this work was to address bloated memory usage (#254) and enable requested features and improvements more efficiently (#189, unlimited scrollback, scrollback on terminal restore, scrollback paged to disk, etc.). Improving runtime performance was not a goal, but as you’ll see, we ended up doing that too (since memory access is closely tied to CPU throughput nowadays).

TODO:

I don’t really recommend this for daily use, but I’d love more people to test this branch if possible. I’m quite confident there are some gnarly bugs lurking somewhere. Please build with ReleaseSafe so that assertions can be triggered.

macOS users, this PR will produce a signed and notarized macOS app. URL: https://pr.files.ghostty.dev/5717/ghostty-macos-universal.zip

Background: The Terminal Grid

As a point of background, the terminal grid can be thought of as a simple 2D (rows x cols) grid of monospace cells. And a cell is the combination of state required to render a cell: the content, background color, foreground color, styles (underline, bold, italic, etc.), and other attributes.

In addition to the visible grid, a terminal supports scrollback. Scrollback can be thought of as additional sets of rows and columns of the same dimensions that simply aren’t in view until the viewport moves.

A terminal supports many operations via control sequences such as moving the cursor, scrolling a region, erasing lines, erasing cells, scrolling the screen, etc. All of these operations are parameterized on cells. Thus, the concept of a fixed-size grid is effectively baked into the terminal API and we must build around that fundamental design.

Background: Ghostty’s Previous Terminal Grid Memory Layout

Previously, Ghostty represented the terminal grid in a typical way: a circular buffer of cells. By moving some set of cells back from the current write pointer of the circular buffer, you can define the “active” area (the bottommost part of the screen that terminal APIs operate on). And by moving a viewport pointer, you could “scroll” the screen.

This is a very typical approach used by many mainstream terminal emulators. But it has downsides:

  • Memory Usage - The circular buffer dynamically grew for scrollback, but all scrollback had to be in memory. And whenever memory grew, the entire circular buffer had to be copied, which became progressively slower as the buffer grew.

  • Slow Row Operations - Since the offset into the circular buffer directly mapped to an (x,y) in the screen, moving rows required copying cells. For example, the terminal “erase lines” control sequence erases N lines and shifts all lines below them up by N. For a typical screen, this could require copying thousands of cells.

Also, each cell within the circular buffer fully contained all state required to render that cell: the codepoint, the fg/bg color, the styles, etc. I’ll touch on this point later, just remember it.

Additionally, Ghostty maintained a number of look-aside tables holding extra data for rare features such as extended grapheme clusters.

The renderer thread requires access to the visible screen state. To get it, the renderer thread would acquire a lock shared with the IO thread, copy the visible part of the screen (the viewport), and unlock. We found through empirical analysis that copying all the data associated with the viewport was faster than the processing time on that data, but it’s still a relatively slow operation since memory is all over the place and thus not cache-friendly to copy. This has an additional effect: the longer the renderer copies, the slower the IO becomes because it is blocked on the lock.
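The lock, copy, unlock pattern described above can be sketched roughly as follows. This is an illustrative Python model rather than Ghostty's Zig code; `SharedTerminal`, `write_row`, and `snapshot_viewport` are hypothetical names chosen for the example.

```python
import threading


class SharedTerminal:
    """Toy model of terminal state shared by an IO thread and a renderer."""

    def __init__(self, rows: int, cols: int) -> None:
        self.lock = threading.Lock()
        self.rows = [[" "] * cols for _ in range(rows)]

    def write_row(self, y: int, text: str) -> None:
        # IO side: mutate the grid while holding the lock.
        with self.lock:
            for x, ch in enumerate(text[: len(self.rows[y])]):
                self.rows[y][x] = ch

    def snapshot_viewport(self, top: int, height: int) -> list[list[str]]:
        # Renderer side: hold the lock only long enough to copy the
        # visible rows, then render from the private copy.
        with self.lock:
            return [row.copy() for row in self.rows[top : top + height]]


term = SharedTerminal(rows=4, cols=8)
term.write_row(0, "hello")
frame = term.snapshot_viewport(top=0, height=2)
term.write_row(0, "bye")  # later writes do not affect the copied frame
```

The time spent inside `snapshot_viewport` is time the IO side cannot write, which is why a cheaper copy helps both threads.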

The Big Ideas

Big Idea 1: Unique Style Counts are Low

The first big thing I noticed: every cell is paying the overhead of every possible style attribute: foreground color, background color, underline color, bold, italic, codepoint, etc. But the total number of unique styles on a viewport is low.

I inserted some logging code and asked a number of beta testers to log what their count of unique styles was in regular day-to-day activities. Under normal usage, no one had more than 25 unique styles within the visible screen at any time. There were rare exceptions (e.g. btop) but this generally held true for active usage.

So the first big idea: what if cells didn’t have to pay the cost for repeated or unused styles? What if they didn’t waste memory on that?

In the old terminal state, each cell is 20 bytes. For a 300x80 terminal (roughly the fullscreen dimensions on my machine) with 10,000 lines of scrollback, the memory required is: 60.48 MB. Multiply this by multiple tabs, splits, etc. and it’s quite hefty.

In the new terminal state, each cell stores only the codepoint and a style ID. The style is stored in a look-aside table keyed by ID. We reference count the styles. A cell is now 8 bytes (64 bits). For the same 300x80 terminal with 10,000 lines of scrollback, the memory required is now: 24.19 MB, or roughly 60% less.
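As a rough illustration of a reference-counted style table, here is a minimal Python sketch. It is not Ghostty's implementation: `Style`, `StyleTable`, `acquire`, and `release` are hypothetical names, and reserving ID 0 for the default style is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Style:
    fg: int = 0
    bg: int = 0
    bold: bool = False
    italic: bool = False


class StyleTable:
    """Look-aside table mapping small integer IDs to reference-counted styles."""

    DEFAULT_ID = 0  # assumption: ID 0 means "default style" and is never freed

    def __init__(self) -> None:
        self._ids: dict[Style, int] = {}
        self._styles: dict[int, Style] = {}
        self._refs: dict[int, int] = {}
        self._next_id = 1

    def acquire(self, style: Style) -> int:
        """Return the ID for `style`, adding it on first use, and count the use."""
        if style == Style():
            return self.DEFAULT_ID
        sid = self._ids.get(style)
        if sid is None:
            sid = self._next_id
            self._next_id += 1
            self._ids[style] = sid
            self._styles[sid] = style
            self._refs[sid] = 0
        self._refs[sid] += 1
        return sid

    def release(self, sid: int) -> None:
        """Drop one reference; remove the style once no cell uses it."""
        if sid == self.DEFAULT_ID:
            return
        self._refs[sid] -= 1
        if self._refs[sid] == 0:
            style = self._styles.pop(sid)
            del self._ids[style]
            del self._refs[sid]

    def __len__(self) -> int:
        return len(self._styles)


table = StyleTable()
red_bold = Style(fg=0xFF0000, bold=True)
# Each cell is just (codepoint, style_id); 80 cells share one table entry.
cells = [(ord("x"), table.acquire(red_bold)) for _ in range(80)]
```

With only a couple dozen unique styles on screen, the table stays tiny while every cell carries just a small ID.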

We may be able to get cells down to 32 bits in the future, which would halve the memory again. This PR does not do that yet since that requires more complexity, and I think we should get this in first.
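The figures quoted above can be reproduced with a short calculation, assuming the grid is 300 columns wide with 80 visible rows plus 10,000 scrollback rows, and counting only cell storage (no row metadata or style table). The 4-byte case is the hypothetical future cell size mentioned above.

```python
COLS, ROWS, SCROLLBACK = 300, 80, 10_000


def grid_bytes(cell_size: int) -> int:
    """Bytes needed for every cell in the active area plus scrollback."""
    return COLS * (ROWS + SCROLLBACK) * cell_size


old = grid_bytes(20)     # 60_480_000 bytes = 60.48 MB
new = grid_bytes(8)      # 24_192_000 bytes = 24.19 MB
future = grid_bytes(4)   # 12_096_000 bytes (hypothetical 32-bit cell)

print(f"old:    {old / 1e6:.2f} MB")
print(f"new:    {new / 1e6:.2f} MB ({1 - new / old:.0%} less)")
print(f"future: {future / 1e6:.2f} MB")
```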

Big Idea 2: Don’t Require Cells to be in Row Order

One of the most common terminal operations is moving rows. It is used heavily by editors (e.g. Neovim), multiplexers (tmux), and pagers (less, bat, etc.).

The new terminal state now maintains a linear array of “row” metadata. Within the row metadata, we maintain a pointer to the start of the cells for that row. Each row metadata structure is 8 bytes (64 bits). To move a row, we now just have to shift the row metadata rather than every cell across the full column width.

In the old terminal state, for the same 300x80 terminal, erasing the top line and shifting all rows up required copying 474 KB. The positive point: it was generally linear memory (ignoring circular buffer wraparounds).

In the new terminal state, the same operation requires copying 632 bytes (bytes!), or roughly 0.1% of the old amount. And this is also linear memory.
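A small sketch of why the copy shrinks so much, assuming an 80-row, 300-column screen where erasing the top line shifts the remaining 79 rows. A Python list of row references stands in for the array of row metadata; `erase_top_line` is a hypothetical helper, not a Ghostty function.

```python
COLS, ROWS = 300, 80
OLD_CELL_BYTES = 20
ROW_META_BYTES = 8

# Erasing the top line shifts the remaining ROWS - 1 rows up by one.
old_copy = (ROWS - 1) * COLS * OLD_CELL_BYTES  # every cell moves: 474_000 bytes
new_copy = (ROWS - 1) * ROW_META_BYTES         # only row metadata moves: 632 bytes


def erase_top_line(rows: list[list[str]], cols: int) -> None:
    """Shift row references up by one and reuse the erased row at the bottom."""
    erased = rows.pop(0)        # detach the top row's cell storage
    erased[:] = [" "] * cols    # clear its cells in place
    rows.append(erased)         # it becomes the new blank bottom row


screen = [[str(y)] * 3 for y in range(4)]
bottom_cells = screen[3]
erase_top_line(screen, cols=3)
```

The cell storage for the surviving rows never moves; only the small per-row entries are reordered.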

Big Idea 3: Only the Viewport and Active is Required in Memory

The only part of the screen that terminal APIs can manipulate is known as the “active” area of the screen, and is exactly your grid’s cols x rows dimensions. Terminal APIs can’t modify scrollback, so scrollback history becomes read-only.

The only part of the screen that must be read for rendering is the viewport, also of cols x rows dimensions. The viewport is also very often identical to the active area since the viewport is very often at the bottom of the terminal.

The big idea here is: instead of a circular buffer, let’s split up our screen into chunks (this PR calls them “pages”) and create a system that can offload unnecessary pages to disk if necessary so they don’t have to be in-memory.

To do this, instead of using a circular buffer, this PR uses a doubly-linked list of pages. Only the active/viewport need to be readily available.
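To make the page idea concrete, here is a minimal Python sketch of a doubly-linked list of fixed-capacity pages. The names `Page`, `PageList`, `append_row`, and `tail_rows` are chosen for illustration and are not taken from the PR's code.

```python
from typing import Optional


class Page:
    """One chunk of rows, linked to its neighbors."""

    def __init__(self, prev: Optional["Page"] = None) -> None:
        self.rows: list[str] = []
        self.prev = prev
        self.next: Optional["Page"] = None


class PageList:
    """Doubly-linked list of pages; the newest rows live at the tail."""

    def __init__(self, rows_per_page: int) -> None:
        self.rows_per_page = rows_per_page
        self.first = self.last = Page()

    def append_row(self, text: str) -> None:
        if len(self.last.rows) == self.rows_per_page:
            page = Page(prev=self.last)  # grow by linking a new page,
            self.last.next = page        # never by copying old ones
            self.last = page
        self.last.rows.append(text)

    def tail_rows(self, n: int) -> list[str]:
        """Collect the last n rows by walking backward from the tail."""
        out: list[str] = []
        page: Optional[Page] = self.last
        while page is not None and len(out) < n:
            out = page.rows[-(n - len(out)):] + out
            page = page.prev
        return out


pages = PageList(rows_per_page=4)
for i in range(10):
    pages.append_row(f"line {i}")
```

Reading the active area or viewport touches only the pages near the tail, so older pages are candidates for being moved elsewhere.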

Importantly, this PR does not implement the disk serialization. This is something I want to work on soon but in the interest of not introducing too much complexity into one PR, I didn’t finish this here. The important thing is that the new architecture enables this core idea.

Big Idea 4: Pointers Suck to Copy and Serialize

The renderer needs to copy the visible screen on every frame. And the future page serialization system needs to encode/decode pages to/from disk.

The old terminal state floated around many pointers. In order to copy these values, we had to iterate through all pointer values (e.g. in a hash map) and construct a new hash map and copy each value individually. This is slow.

The new terminal state preallocates a contiguous block of memory (typically in ~512KB chunks, or roughly 32 or 128 virtual memory pages on 16KB and 4KB page systems, respectively). Instead of storing pointers, we now store the base address and offsets.

The offsets are 32-bit, so they can address at most 4GB of data. Therefore, our pages are capped at 4GB today. I would actually like to lower that to 16-bit, capped at 65K, but implementing 32-bit was easier to start. At the limit, we must allocate more pages. Not a big deal.

Since the offsets are 32-bit, every pointer in the previous state is now half the size on a 64-bit system, saving more memory.

And to copy the data, we can perform a linear copy of the virtual memory pages, update our single base address, and we’re done. This makes copying fast and serialization much more trivial.
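The base-plus-offset idea can be sketched in Python with a `bytearray` standing in for one contiguous block. The layout here (a 32-bit offset in slot 0 pointing at a length-prefixed string) is invented for the example and is not Ghostty's page format.

```python
import struct

PAGE_SIZE = 64                # tiny for illustration; the PR mentions ~512KB blocks
OFFSET = struct.Struct("<I")  # one 32-bit unsigned offset


def write_string(page: bytearray, at: int, text: bytes) -> int:
    """Store a length-prefixed string at offset `at`; return that offset."""
    page[at] = len(text)
    page[at + 1 : at + 1 + len(text)] = text
    return at


def read_string(page: bytes, offset: int) -> bytes:
    """Read a length-prefixed string located `offset` bytes from the base."""
    length = page[offset]
    return bytes(page[offset + 1 : offset + 1 + length])


page = bytearray(PAGE_SIZE)
# Slot 0 holds an offset to the string rather than an absolute address.
str_off = write_string(page, at=16, text=b"hello")
OFFSET.pack_into(page, 0, str_off)

# Copying is one linear copy of the block. The stored offset needs no
# fix-up because it is relative to whichever base (buffer) it is read from.
clone = bytes(page)
(stored,) = OFFSET.unpack_from(clone, 0)
```

Because nothing inside the block refers to an absolute address, the same bytes could also be written out and read back later without any pointer rewriting.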

This PR implements this idea completely.

Benchmarks 🚀

Memory Usage Under Static Scenarios

All macOS GLFW builds, ReleaseFast, macOS 14.4.

20 empty windows:

  • Old: 730 MB 🤕
  • New: 146 MB

5 windows with cat 20MB.txt of Japanese text each:

  • Old: 630 MB
  • New: 115 MB

5 Neovim windows:

  • Old: 69 MB
  • New: 67 MB

(Explanation for Neovim windows: with no scrollback, the old terminal state used the smallest circular buffer necessary. The new terminal state performs a slightly larger preallocation. The net result is roughly no memory changes.)

ASCII IO Throughput

25MB of ASCII

Benchmark 1: noop
  Time (mean ± σ):       8.4 ms ±   0.8 ms    [User: 2.2 ms, System: 5.5 ms]
  Range (min … max):     6.9 ms …  10.3 ms    225 runs

Benchmark 2: new
  Time (mean ± σ):     166.9 ms ±   8.5 ms    [User: 158.4 ms, System: 6.8 ms]
  Range (min … max):   152.8 ms … 179.6 ms    16 runs

Benchmark 3: old
  Time (mean ± σ):     191.5 ms ±   1.6 ms    [User: 179.4 ms, System: 9.5 ms]
  Range (min … max):   188.4 ms … 194.6 ms    15 runs

Summary
  'noop' ran
   19.91 ± 2.05 times faster than 'new'
   22.84 ± 2.06 times faster than 'old'

Unicode IO Throughput

25MB of Unicode

Benchmark 1: noop
  Time (mean ± σ):       9.1 ms ±   1.1 ms    [User: 2.4 ms, System: 5.8 ms]
  Range (min … max):     7.2 ms …  12.6 ms    228 runs

Benchmark 2: new
  Time (mean ± σ):      99.0 ms ±   1.5 ms    [User: 90.2 ms, System: 7.2 ms]
  Range (min … max):    97.1 ms … 103.3 ms    29 runs

Benchmark 3: old
  Time (mean ± σ):     560.3 ms ±  13.2 ms    [User: 544.4 ms, System: 11.0 ms]
  Range (min … max):   552.0 ms … 597.0 ms    10 runs

Summary
  'noop' ran
   10.94 ± 1.34 times faster than 'new'
   61.90 ± 7.66 times faster than 'old'
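The hyperfine summaries compare everything against `noop`. For a direct old-versus-new comparison, the mean times reported above can be divided; this is simple arithmetic on the rounded means shown, so the last digit may differ slightly from what unrounded data would give.

```python
# Mean wall-clock times (ms) copied from the hyperfine output above.
ascii_ms = {"noop": 8.4, "new": 166.9, "old": 191.5}
unicode_ms = {"noop": 9.1, "new": 99.0, "old": 560.3}


def speedup(times: dict[str, float]) -> float:
    """How many times faster 'new' is than 'old', using the raw means."""
    return times["old"] / times["new"]


print(f"ASCII:   {speedup(ascii_ms):.2f}x")    # 1.15x
print(f"Unicode: {speedup(unicode_ms):.2f}x")  # 5.66x
```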

vtebench Results

Old:

  dense_cells (871 samples @ 1 MiB):
    11.05ms avg (90% < 11ms) +-0.28ms

  scrolling (98 samples @ 1 MiB):
    85.53ms avg (90% < 87ms) +-1.05ms

  scrolling_bottom_region (120 samples @ 1 MiB):
    83.13ms avg (90% < 84ms) +-1ms

  scrolling_bottom_small_region (120 samples @ 1 MiB):
    83.07ms avg (90% < 84ms) +-1.07ms

  scrolling_fullscreen (58 samples @ 1 MiB):
    157.53ms avg (90% < 159ms) +-1.65ms

  scrolling_top_region (36 samples @ 1 MiB):
    283.22ms avg (90% < 291ms) +-5.82ms

  scrolling_top_small_region (120 samples @ 1 MiB):
    82.89ms avg (90% < 84ms) +-0.82ms

  unicode (1208 samples @ 1.06 MiB):
    7.72ms avg (90% < 8ms) +-0.51ms

New:

  dense_cells (872 samples @ 1 MiB):
    11.01ms avg (90% < 11ms) +-0.15ms

  scrolling (490 samples @ 1 MiB):
    16.79ms avg (90% < 17ms) +-0.54ms

  scrolling_bottom_region (186 samples @ 1 MiB):
    53.23ms avg (90% < 54ms) +-0.49ms

  scrolling_bottom_small_region (187 samples @ 1 MiB):
    53.11ms avg (90% < 54ms) +-0.6ms

  scrolling_fullscreen (383 samples @ 1 MiB):
    22.13ms avg (90% < 23ms) +-0.43ms

  scrolling_top_region (362 samples @ 1 MiB):
    27.18ms avg (90% < 28ms) +-0.81ms

  scrolling_top_small_region (186 samples @ 1 MiB):
    53.36ms avg (90% < 54ms) +-0.52ms

  unicode (1305 samples @ 1.06 MiB):
    7.08ms avg (90% < 8ms) +-0.49ms
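To read the two vtebench runs side by side, the averages above can be turned into per-benchmark ratios (old time divided by new time, so values above 1 mean the new state is faster). The numbers are copied from the output above; nothing else is measured here.

```python
# Average times (ms) copied from the vtebench output above.
old = {
    "dense_cells": 11.05, "scrolling": 85.53,
    "scrolling_bottom_region": 83.13, "scrolling_bottom_small_region": 83.07,
    "scrolling_fullscreen": 157.53, "scrolling_top_region": 283.22,
    "scrolling_top_small_region": 82.89, "unicode": 7.72,
}
new = {
    "dense_cells": 11.01, "scrolling": 16.79,
    "scrolling_bottom_region": 53.23, "scrolling_bottom_small_region": 53.11,
    "scrolling_fullscreen": 22.13, "scrolling_top_region": 27.18,
    "scrolling_top_small_region": 53.36, "unicode": 7.08,
}

ratios = {name: old[name] / new[name] for name in old}
for name, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name:32s} {ratio:5.2f}x")
```

The scrolling benchmarks show the largest ratios, while `dense_cells` and `unicode` are close to unchanged.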

Future

The terminal state is just one offender of memory usage within Ghostty. Looking at our benchmarks, it was a significant offender, but our memory usage is still much higher than I’d like.

In the future, I believe there are improvements we can continue to make to terminal state (e.g. 32-bit cells vs 64-bit cells). However, I think there is larger low-hanging fruit such as duplicate font information between multiple terminals, redundant CPU state for the renderer (when it’s present already on the GPU), etc. I plan on addressing those soon, after this PR is merged and somewhat stable.

@andrewrk
Collaborator

Data point:

xfce4-terminal: 5 open windows
ghostty: 5 open windows (untouched)

image

@mitchellh
Contributor Author

@andrewrk Thanks! I didn’t mark this issue as closing your original memory issue because there are still a lot of areas that need work. But the terminal state at this point should no longer be an issue. GTK for example has huge memory issues that still need to be addressed that macOS doesn’t have.

@andrewrk
Collaborator

got you a nice integer overflow crash too:

thread 1179862 panic: integer overflow
/home/andy/Downloads/ghostty/src/terminal/PageList.zig:2654:47: 0x1c0145a in downOverflow (ghostty)
        const rows = self.page.data.size.rows - (self.y + 1);
                                              ^
/home/andy/Downloads/ghostty/src/terminal/PageList.zig:2629:41: 0x1b3cf87 in down (ghostty)
        return switch (self.downOverflow(n)) {
                                        ^
/home/andy/Downloads/ghostty/src/terminal/PageList.zig:1875:37: 0x1b848f8 in pin (ghostty)
    var p = self.getTopLeft(pt).down(pt.coord().y) orelse return null;
                                    ^
/home/andy/Downloads/ghostty/src/terminal/PageList.zig:2315:28: 0x1ca1cf2 in pageIterator (ghostty)
    const tl_pin = self.pin(tl_pt).?;
                           ^
/home/andy/Downloads/ghostty/src/terminal/PageList.zig:295:31: 0x1d606cd in clone (ghostty)
    var it = self.pageIterator(.right_down, opts.top, opts.bot);
                              ^
/home/andy/Downloads/ghostty/src/terminal/Screen.zig:226:37: 0x1d637d4 in clonePool (ghostty)
    var pages = try self.pages.clone(.{
                                    ^
/home/andy/Downloads/ghostty/src/terminal/Screen.zig:208:30: 0x1d6467a in clone (ghostty)
    return try self.clonePool(alloc, null, top, bot);
                             ^
/home/andy/Downloads/ghostty/src/renderer/OpenGL.zig:693:58: 0x1d901b3 in updateFrame (ghostty)
        var screen_copy = try state.terminal.screen.clone(
                                                         ^
/home/andy/Downloads/ghostty/src/renderer/Thread.zig:463:27: 0x1dd0a18 in callback (ghostty)
    t.renderer.updateFrame(
                          ^
/home/andy/.cache/zig/p/12203116ff408eb48f81c947dfeb06f7feebf6a9efa962a560ac69463098b2c04a96/src/backend/io_uring.zig:768:29: 0x1cff431 in invoke (ghostty)
        return self.callback(self.userdata, loop, self, result);
                            ^
/home/andy/.cache/zig/p/12203116ff408eb48f81c947dfeb06f7feebf6a9efa962a560ac69463098b2c04a96/src/backend/io_uring.zig:159:33: 0x1d02030 in tick___anon_155515 (ghostty)
                switch (c.invoke(self, cqe.res)) {
                                ^
/home/andy/.cache/zig/p/12203116ff408eb48f81c947dfeb06f7feebf6a9efa962a560ac69463098b2c04a96/src/backend/io_uring.zig:60:42: 0x1d021cd in run (ghostty)
            .until_done => try self.tick_(.until_done),
                                         ^
/home/andy/Downloads/ghostty/src/renderer/Thread.zig:212:26: 0x1d023d2 in threadMain_ (ghostty)
    _ = try self.loop.run(.until_done);
                         ^
/home/andy/Downloads/ghostty/src/renderer/Thread.zig:174:21: 0x1cb80d5 in threadMain (ghostty)
    self.threadMain_() catch |err| {
                    ^
/home/andy/misc/zig/lib/std/Thread.zig:406:13: 0x1c75f8a in callFn__anon_151042 (ghostty)
            @call(.auto, f, args);
            ^
/home/andy/misc/zig/lib/std/Thread.zig:674:30: 0x1c19152 in entryFn (ghostty)
                return callFn(f, args_ptr.*);
                             ^
???:?:?: 0x7fe224aca0e3 in ??? (libc.so.6)
Unwind information for `libc.so.6:0x7fe224aca0e3` was not available, trace may be incomplete

Aborted (core dumped)

to reproduce:

  1. run find in a directory with many files
  2. use keyboard shortcuts to scroll up while find is still running
  3. use keyboard shortcuts to scroll down again while find is still running
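The panic in the trace points at the unsigned subtraction `self.page.data.size.rows - (self.y + 1)`. The sketch below only illustrates that class of bug in Python; the root cause in Ghostty is not stated in this thread, so the "stale `y` past the end of the page" scenario is an assumption, and `checked_sub` and `rows_below` are hypothetical helpers.

```python
def checked_sub(a: int, b: int) -> int:
    """Unsigned subtraction that fails loudly instead of wrapping around."""
    if b > a:
        raise OverflowError(f"integer overflow: {a} - {b}")
    return a - b


def rows_below(page_rows: int, y: int) -> int:
    """Rows remaining below row `y` in a page, mirroring `rows - (y + 1)`."""
    return checked_sub(page_rows, y + 1)


ok = rows_below(page_rows=24, y=10)  # 13 rows remain below row 10

try:
    rows_below(page_rows=24, y=24)   # assumed: y points past the last row
    overflowed = False
except OverflowError:
    overflowed = True
```

Building with ReleaseSafe, as requested at the top of the PR, is what turns this kind of wraparound into a visible panic.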

@mitchellh
Contributor Author

@andrewrk Thank you, good bug, fixed with test cases in last commit.

@mitchellh
Contributor Author

I got a stack trace for the vtebench crash I was getting:

thread 5269359 panic: attempt to use null value
/Users/mitchellh/code/go/src/github.com/mitchellh/ghostty/src/terminal/Screen.zig:397:50: 0x103e40f33 in cursorDown (ghostty)
    const page_pin = self.cursor.page_pin.down(n).?;
                                                 ^
/Users/mitchellh/code/go/src/github.com/mitchellh/ghostty/src/terminal/Terminal.zig:1043:31: 0x103e40d6f in index (ghostty)
        self.screen.cursorDown(1);
                              ^
/Users/mitchellh/code/go/src/github.com/mitchellh/ghostty/src/termio/Exec.zig:1961:32: 0x103fb972f in linefeed (ghostty)
        try self.terminal.index();
                               ^
/Users/mitchellh/code/go/src/github.com/mitchellh/ghostty/src/terminal/stream.zig:336:46: 0x103fb9993 in execute (ghostty)
                    try self.handler.linefeed()
                                             ^
/Users/mitchellh/code/go/src/github.com/mitchellh/ghostty/src/terminal/stream.zig:94:41: 0x103fdc847 in nextSliceCapped (ghostty)
                        try self.execute(@intCast(cp));
                                        ^
/Users/mitchellh/code/go/src/github.com/mitchellh/ghostty/src/terminal/stream.zig:60:41: 0x103fdcf2f in nextSlice (ghostty)
                try self.nextSliceCapped(input[i .. i + len], &cp_buf);
                                        ^
/Users/mitchellh/code/go/src/github.com/mitchellh/ghostty/src/termio/Exec.zig:1634:41: 0x103f8913b in threadMainPosix (ghostty)
            ev.terminal_stream.nextSlice(buf) catch |err|
                                        ^
/nix/store/y1rw1hcxc1v6d4l6cjvxxpvgcgfmxi66-zig-0.12.0-dev.3282+da5b16f9e/lib/std/Thread.zig:406:13: 0x103f1df9f in callFn__anon_266051 (ghostty)
            @call(.auto, f, args);
            ^
/nix/store/y1rw1hcxc1v6d4l6cjvxxpvgcgfmxi66-zig-0.12.0-dev.3282+da5b16f9e/lib/std/Thread.zig:674:30: 0x103ea6433 in entryFn (ghostty)
                return callFn(f, args_ptr.*);

I'm still unsure how exactly this is happening but I have a reliable repro on my Mac now:

  1. zig build -Dapp-runtime=glfw run -- do NOT touch the window, default size is important
  2. cargo run --release in vtebench

@Hardy7cc
Collaborator

Hit an assertion

/usr/lib/zig/lib/std/debug.zig:403:14: 0x1cff50c in assert (ghostty)
    if (!ok) unreachable; // assertion failure
             ^
/home/hardy/projects/ghostty/src/terminal/page.zig:302:15: 0x1ea030c in getRowAndCell (ghostty)
        assert(y < self.size.rows);
              ^
/home/hardy/projects/ghostty/src/terminal/PageList.zig:2472:49: 0x1dda339 in rowAndCell (ghostty)
        const rac = self.page.data.getRowAndCell(self.x, self.y);
                                                ^
/home/hardy/projects/ghostty/src/terminal/Screen.zig:450:53: 0x1efc1ab in cursorReload (ghostty)
    const page_rac = self.cursor.page_pin.rowAndCell();
                                                    ^
/home/hardy/projects/ghostty/src/terminal/Screen.zig:829:22: 0x2075fbf in resizeInternal (ghostty)
    self.cursorReload();
                     ^
/home/hardy/projects/ghostty/src/terminal/Screen.zig:787:28: 0x2076053 in resize (ghostty)
    try self.resizeInternal(cols, rows, true);
                           ^
/home/hardy/projects/ghostty/src/terminal/Terminal.zig:2060:35: 0x20764a0 in resize (ghostty)
            try self.screen.resize(cols, rows);
                                  ^
/home/hardy/projects/ghostty/src/termio/Exec.zig:435:33: 0x2076cff in resize (ghostty)
        try self.terminal.resize(
                                ^
/home/hardy/projects/ghostty/src/termio/Thread.zig:353:25: 0x208872c in callback (ghostty)
        self.impl.resize(v.grid_size, v.screen_size, v.padding) catch |err| {
                        ^
/home/hardy/.cache/zig/p/12203116ff408eb48f81c947dfeb06f7feebf6a9efa962a560ac69463098b2c04a96/src/backend/io_uring.zig:768:29: 0x1f9dbe1 in invoke (ghostty)
        return self.callback(self.userdata, loop, self, result);
                            ^
/home/hardy/.cache/zig/p/12203116ff408eb48f81c947dfeb06f7feebf6a9efa962a560ac69463098b2c04a96/src/backend/io_uring.zig:159:33: 0x1fa07f0 in tick___anon_155325 (ghostty)
                switch (c.invoke(self, cqe.res)) {
                                ^
/home/hardy/.cache/zig/p/12203116ff408eb48f81c947dfeb06f7feebf6a9efa962a560ac69463098b2c04a96/src/backend/io_uring.zig:60:42: 0x1fa098d in run (ghostty)
            .until_done => try self.tick_(.until_done),
                                         ^
/home/hardy/projects/ghostty/src/termio/Thread.zig:236:22: 0x1fa4fda in threadMain_ (ghostty)
    try self.loop.run(.until_done);
                     ^
/home/hardy/projects/ghostty/src/termio/Thread.zig:140:21: 0x1f57c91 in threadMain (ghostty)
    self.threadMain_() catch |err| {
                    ^
/usr/lib/zig/lib/std/Thread.zig:406:13: 0x1f156fa in callFn__anon_150870 (ghostty)
            @call(.auto, f, args);
            ^
/usr/lib/zig/lib/std/Thread.zig:674:30: 0x1eb8e92 in entryFn (ghostty)
                return callFn(f, args_ptr.*);
                             ^
???:?:?: 0x74b4a24e8559 in ??? (libc.so.6)
Unwind information for `libc.so.6:0x74b4a24e8559` was not available, trace may be incomplete

Aborted (core dumped)

./zig-out/bin/ghostty
info: ghostty version=0.1.0-HEAD+6fdc985b
info: runtime=apprt.Runtime.gtk
info: font_backend=font.main.Backend.fontconfig_freetype
info: dependency harfbuzz=8.2.2
info: dependency fontconfig=21402
info: renderer=renderer.OpenGL
info: libxev backend=main.Backend.io_uring

to reproduce:

  1. generate some content to have some scrolling e.g. cat README.md
  2. Halve the height of the window, e.g. I'm using sway, and if I then open another window in the bottom half I see all the lines squashed to about half their height

While trying to reproduce this with a floating window in sway and resizing the window in all unreasonable ways I got the following stack trace:

/home/hardy/projects/ghostty/src/terminal/PageList.zig:2383:42: 0x1b451b5 in getBottomRight (ghostty)
                .y = page.data.size.rows - 1,
                                         ^
/home/hardy/projects/ghostty/src/terminal/PageList.zig:2330:28: 0x1ca40b0 in pageIterator (ghostty)
        self.getBottomRight(tl_pt) orelse return .{ .row = null };
                           ^
/home/hardy/projects/ghostty/src/terminal/PageList.zig:564:31: 0x1dd746c in resizeCols (ghostty)
    var it = self.pageIterator(.right_down, .{ .screen = .{} }, null);
                              ^
/home/hardy/projects/ghostty/src/terminal/PageList.zig:528:32: 0x1dd85c2 in resize (ghostty)
            try self.resizeCols(cols, opts.cursor);
                               ^
/home/hardy/projects/ghostty/src/terminal/Screen.zig:813:26: 0x1dd86cd in resizeInternal (ghostty)
    try self.pages.resize(.{
                         ^
/home/hardy/projects/ghostty/src/terminal/Screen.zig:787:28: 0x1dd87b3 in resize (ghostty)
    try self.resizeInternal(cols, rows, true);
                           ^
/home/hardy/projects/ghostty/src/terminal/Terminal.zig:2060:35: 0x1dd8c00 in resize (ghostty)
            try self.screen.resize(cols, rows);
                                  ^
/home/hardy/projects/ghostty/src/termio/Exec.zig:435:33: 0x1dd945f in resize (ghostty)
        try self.terminal.resize(
                                ^
/home/hardy/projects/ghostty/src/termio/Thread.zig:353:25: 0x1deae8c in callback (ghostty)
        self.impl.resize(v.grid_size, v.screen_size, v.padding) catch |err| {
                        ^
/home/hardy/.cache/zig/p/12203116ff408eb48f81c947dfeb06f7feebf6a9efa962a560ac69463098b2c04a96/src/backend/io_uring.zig:768:29: 0x1d00341 in invoke (ghostty)
        return self.callback(self.userdata, loop, self, result);
                            ^
/home/hardy/.cache/zig/p/12203116ff408eb48f81c947dfeb06f7feebf6a9efa962a560ac69463098b2c04a96/src/backend/io_uring.zig:159:33: 0x1d02f50 in tick___anon_155325 (ghostty)
                switch (c.invoke(self, cqe.res)) {
                                ^
/home/hardy/.cache/zig/p/12203116ff408eb48f81c947dfeb06f7feebf6a9efa962a560ac69463098b2c04a96/src/backend/io_uring.zig:60:42: 0x1d030ed in run (ghostty)
            .until_done => try self.tick_(.until_done),
                                         ^
/home/hardy/projects/ghostty/src/termio/Thread.zig:236:22: 0x1d0773a in threadMain_ (ghostty)
    try self.loop.run(.until_done);
                     ^
/home/hardy/projects/ghostty/src/termio/Thread.zig:140:21: 0x1cba3f1 in threadMain (ghostty)
    self.threadMain_() catch |err| {
                    ^
/usr/lib/zig/lib/std/Thread.zig:406:13: 0x1c77e5a in callFn__anon_150870 (ghostty)
            @call(.auto, f, args);
            ^
/usr/lib/zig/lib/std/Thread.zig:674:30: 0x1c1b5f2 in entryFn (ghostty)
                return callFn(f, args_ptr.*);
                             ^
???:?:?: 0x73c0df686559 in ??? (libc.so.6)
Unwind information for `libc.so.6:0x73c0df686559` was not available, trace may be incomplete

Aborted (core dumped)

@echasnovski
Collaborator

A subjectively high memory usage for several opened instances is what keeps me away from daily driving Ghostty. So I am really interested in helping this get improved. At least by providing some data.

Here are some results on my Linux machine with i3:

Current `main` (4b1958b)

ghostty_paged-terminal-compare_main

Current `paged-terminal` (6fdc985)

ghostty_paged-terminal-compare_optimized

For comparison, some other terminal emulators:

Latest `WezTerm` (20240203-110809-5046fc22)

ghostty_paged-terminal-compare_wezterm

My `st` build

ghostty_paged-terminal-compare_st

`xfce4-terminal`

ghostty_paged-terminal-compare_xfce4-terminal

All tests were performed with roughly the same steps:

  • Open one instance with first Neovim session.
  • Open second instance (in separate window) with second Neovim session.

So my current observations are:

  • The paged-terminal branch definitely improves (at least) the memory usage of the first instance. Current main has 224 MB and 158 MB, while paged-terminal has around 162 MB and 161 MB.
  • WezTerm performs worse here: 205 MB and 176 MB.
  • Both st and xfce4-terminal have much less memory usage, which is totally expected as they don't have that many features. What I noticed, though, is that xfce4-terminal seems to have a single process running for both instances in separate windows (according to btop). Is there some magic involved that Ghostty can benefit from?

@mitchellh
Contributor Author

@echasnovski

> A subjectively high memory usage for several opened instances is what keeps me away from daily driving Ghostty.

> What I noticed, though, is that xfce4-terminal seems to have a single process running for both instances in separate windows (according to btop). Is there some magic involved that Ghostty can benefit from?

Ghostty does this already, but by default only when it detects it was launched from a desktop environment. You can force this behavior using gtk-single-instance = true (either in a config file or via the CLI flags). For example ghostty --gtk-single-instance=true.

So your experience may be due to launching Ghostty in a way that makes its default detection conclude it was CLI-launched. You will indeed get significant memory (and startup time) savings by using single-instance mode.

The reason we have this detection at all is because single instance mode shares a base set of configuration. For CLI-launched Ghostty instances, I want to allow users to use CLI flags to override settings like font sizes, colors, etc. By launching a separate instance we can do that.

> Both st and xfce4-terminal have much less memory usage, which is totally expected as they don't have that many features.

Otherwise, thank you for your data points! This is helpful.

I want to reiterate that this PR addresses terminal state memory usage, but there are many other components in Ghostty whose memory usage can be optimized. I don't think st/xfce necessarily have less memory due to their feature choices, I think Ghostty still simply has improvements to make 😄

@qwerasd205
Collaborator

I made an issue, #1591, which tracks a handful of panics/asserts run into on this PR. (Making this comment because I forgot to mention the PR in the issue, so it didn't get listed here.)

@mitchellh
Contributor Author

@Hardy7cc I believe your issue is fixed in 4fe49c7

@echasnovski
Collaborator

It seems that on paged-terminal the cursor block is highlighted so that it does not show the underlying text (i.e. it looks like the foreground and background colors are the same). This is with the default config, and main does not have this.

Here is a screenshot (there is an a under cursor):

ghostty_paged-terminal_cursor-colors

@mitchellh
Contributor Author

> It seems that on paged-terminal the cursor block is highlighted so that it does not show the underlying text (i.e. it looks like the foreground and background colors are the same). This is with the default config, and main does not have this.
>
> Here is a screenshot (there is an a under cursor):
>
> ghostty_paged-terminal_cursor-colors

Fixed.

@mitchellh
Contributor Author

mitchellh commented Mar 20, 2024

Status update on this PR: I'm now able to run cat /dev/urandom for 30+ minutes without a crash. This simply means that the very obvious crashes due to assertions or memory corruption are now gone. When I first opened this PR, I could get a crash in seconds, then dozens of seconds, then a minute, then a few minutes, then 10 minutes, and now we're up to 30 minutes and running still (no crash at the moment)...

What this says about the stability of this PR: crashes due to terminal sequence state changes are in an acceptably rare state. I'm sure fuzzing will find more but for the reality of daily usage we're probably doing really well.

What this doesn't say about the stability of this PR: this doesn't address how stable UI interactions are such as resizing, keybindings such as screen clears, prompt jumping, selection, etc. That can't be so naively fuzzed with random bytes. Of all the aforementioned features, resizing is the most important thing to test.

If anyone got a crash previously, please test with the latest builds and report back (or open a new issue) if you can get one. Thank you! ❤️

@mitchellh
Contributor Author

mitchellh commented Mar 20, 2024

There is still some sort of bug with some escape sequence around not clearing rows:

CleanShot 2024-03-19 at 22 01 00@2x

I don't have a reliable reproduction for this yet, but noting this exists. I haven't really noticed it in day to day usage. This is the only mis-rendering I've noticed at all so far on this branch.

@stgarf
Collaborator

stgarf commented Mar 20, 2024

> If anyone got a crash previously, please test with the latest builds and report back (or open a new issue) if you can get one. Thank you! ❤️

I think you've fixed my "easily reproducible" crash with the latest two commits (at least, I can't just generally muck around in the terminal and cause the crash anymore).

$ /Applications/GhosttyBeta.app/Contents/MacOS/ghostty +version
Ghostty 0.1.0-paged-terminal+a4d3af65

Build Config
  - Zig version: 0.12.0-dev.3342+f88a971e4
  - build mode : builtin.OptimizeMode.ReleaseSafe
  - app runtime: apprt.Runtime.none
  - font engine: font.main.Backend.coretext
  - renderer   : renderer.Metal
  - libxev     : main.Backend.kqueue

@mitchellh
Contributor Author

As of the last few commits, all outstanding todos are complete with the exception of a CSI performance regression fix which is coming shortly. At the time of writing this, there are no known bugs hit by testers. I believe we're very close to being able to merge this PR.

@mitchellh
Contributor Author

I'm preparing to merge this, so here are some last-minute benchmarks...

ASCII Stream

Benchmark 1: noop
  Time (mean ± σ):       7.9 ms ±   1.0 ms    [User: 2.2 ms, System: 5.3 ms]
  Range (min … max):     6.5 ms …  12.6 ms    264 runs

Benchmark 2: new
  Time (mean ± σ):     146.6 ms ±   3.3 ms    [User: 138.7 ms, System: 6.4 ms]
  Range (min … max):   140.5 ms … 151.5 ms    20 runs

Benchmark 3: old
  Time (mean ± σ):     205.1 ms ±  74.6 ms    [User: 194.1 ms, System: 8.7 ms]
  Range (min … max):   183.7 ms … 474.7 ms    15 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  'noop' ran
   18.45 ± 2.33 times faster than 'new'
   25.82 ± 9.92 times faster than 'old'

UTF8 Stream

Benchmark 1: noop
  Time (mean ± σ):       8.3 ms ±   0.8 ms    [User: 2.2 ms, System: 5.5 ms]
  Range (min … max):     6.7 ms …  10.3 ms    251 runs

Benchmark 2: new
  Time (mean ± σ):      97.8 ms ±   2.2 ms    [User: 89.9 ms, System: 6.6 ms]
  Range (min … max):    96.2 ms … 106.5 ms    29 runs

Benchmark 3: old
  Time (mean ± σ):     349.5 ms ±   2.7 ms    [User: 337.7 ms, System: 9.1 ms]
  Range (min … max):   345.3 ms … 352.8 ms    10 runs

Summary
  'noop' ran
   11.72 ± 1.19 times faster than 'new'
   41.89 ± 4.14 times faster than 'old'

Random Bytes Stream

Benchmark 1: noop
  Time (mean ± σ):       5.8 ms ±   1.0 ms    [User: 1.8 ms, System: 3.7 ms]
  Range (min … max):     4.8 ms …  10.9 ms    310 runs

Benchmark 2: new
  Time (mean ± σ):     538.4 ms ±   3.9 ms    [User: 351.8 ms, System: 184.4 ms]
  Range (min … max):   535.2 ms … 545.8 ms    10 runs

Benchmark 3: old
  Time (mean ± σ):     854.9 ms ±   6.7 ms    [User: 663.4 ms, System: 188.0 ms]
  Range (min … max):   848.3 ms … 871.1 ms    10 runs

Summary
  'noop' ran
   92.30 ± 16.16 times faster than 'new'
  146.56 ± 25.67 times faster than 'old'
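
The hyperfine summaries above compare each binary against noop, so the new-versus-old ratio has to be derived from the reported means. A small sketch of that arithmetic follows; the numbers are copied from the output above, while the `MEANS_MS` and `speedup` names are only illustrative.

```python
# Mean wall-clock times in milliseconds, copied from the hyperfine output above.
MEANS_MS = {
    "ASCII": {"noop": 7.9, "new": 146.6, "old": 205.1},
    "UTF8": {"noop": 8.3, "new": 97.8, "old": 349.5},
    "Random bytes": {"noop": 5.8, "new": 538.4, "old": 854.9},
}


def speedup(old_ms: float, new_ms: float) -> float:
    """How many times faster the new run is, as a plain ratio of means."""
    return old_ms / new_ms


# Ratio of old mean to new mean for each stream type.
SPEEDUPS = {name: speedup(m["old"], m["new"]) for name, m in MEANS_MS.items()}

if __name__ == "__main__":
    for name, ratio in SPEEDUPS.items():
        print(f"{name}: new is {ratio:.2f}x faster than old")
```

This works out to roughly 1.4x for ASCII, 3.6x for UTF8, and 1.6x for random bytes. The ASCII figure is the least reliable of the three, because the old run had a standard deviation of 74.6 ms and hyperfine warned about outliers.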

vtebench Old

Results:

  dense_cells (867 samples @ 1 MiB):
    10.99ms avg (90% < 11ms) +-0.32ms

  scrolling (98 samples @ 1 MiB):
    85.63ms avg (90% < 87ms) +-1.08ms

  scrolling_bottom_region (120 samples @ 1 MiB):
    83.39ms avg (90% < 85ms) +-0.97ms

  scrolling_bottom_small_region (120 samples @ 1 MiB):
    83.19ms avg (90% < 84ms) +-0.96ms

  scrolling_fullscreen (58 samples @ 1 MiB):
    157.45ms avg (90% < 159ms) +-1.33ms

  scrolling_top_region (36 samples @ 1 MiB):
    280.44ms avg (90% < 287ms) +-3.96ms

  scrolling_top_small_region (120 samples @ 1 MiB):
    82.98ms avg (90% < 84ms) +-0.69ms

  unicode (1190 samples @ 1.06 MiB):
    7.82ms avg (90% < 8ms) +-0.49ms

vtebench New

Results:

  dense_cells (869 samples @ 1 MiB):
    11.01ms avg (90% < 11ms) +-0.12ms

  scrolling (473 samples @ 1 MiB):
    17.21ms avg (90% < 18ms) +-0.43ms

  scrolling_bottom_region (385 samples @ 1 MiB):
    25.46ms avg (90% < 26ms) +-0.56ms

  scrolling_bottom_small_region (385 samples @ 1 MiB):
    25.5ms avg (90% < 26ms) +-0.58ms

  scrolling_fullscreen (372 samples @ 1 MiB):
    22.81ms avg (90% < 23ms) +-0.52ms

  scrolling_top_region (386 samples @ 1 MiB):
    25.36ms avg (90% < 26ms) +-0.58ms

  scrolling_top_small_region (386 samples @ 1 MiB):
    25.45ms avg (90% < 26ms) +-0.51ms

  unicode (1348 samples @ 1.06 MiB):
    6.94ms avg (90% < 7ms) +-0.35ms
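
To make the two vtebench runs easier to compare, the averages above can be reduced to one old/new ratio per benchmark. This is only a sketch of the arithmetic: the numbers are copied from the results above, and the `OLD_MS`, `NEW_MS`, and `ratios` names are illustrative.

```python
# Average times in milliseconds, copied from the two vtebench runs above.
OLD_MS = {
    "dense_cells": 10.99,
    "scrolling": 85.63,
    "scrolling_bottom_region": 83.39,
    "scrolling_bottom_small_region": 83.19,
    "scrolling_fullscreen": 157.45,
    "scrolling_top_region": 280.44,
    "scrolling_top_small_region": 82.98,
    "unicode": 7.82,
}
NEW_MS = {
    "dense_cells": 11.01,
    "scrolling": 17.21,
    "scrolling_bottom_region": 25.46,
    "scrolling_bottom_small_region": 25.5,
    "scrolling_fullscreen": 22.81,
    "scrolling_top_region": 25.36,
    "scrolling_top_small_region": 25.45,
    "unicode": 6.94,
}


def ratios(old: dict, new: dict) -> dict:
    """Old average divided by new average; values above 1 mean new is faster."""
    return {name: old[name] / new[name] for name in old}


RATIOS = ratios(OLD_MS, NEW_MS)

if __name__ == "__main__":
    for name, r in sorted(RATIOS.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:32s} {r:5.2f}x")
```

On these numbers the scrolling benchmarks improve by roughly 3x to 11x (scrolling_top_region shows the largest gain), dense_cells is essentially unchanged, and unicode is slightly faster.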

@mitchellh mitchellh merged commit caf2742 into main Mar 27, 2024
10 checks passed
@mitchellh mitchellh deleted the paged-terminal branch March 27, 2024 03:00

9 participants