TextArea crashes Python on html language #4152

juftin · 2024-02-13T04:49:47Z

The TextArea widget crashes the Python process when trying to render the attached HTML file with the html language: snapshot_report.html.zip.

There typically isn't any stack trace for me; the entire terminal window closes with a "Python quit unexpectedly" message. I was able to observe the following error though:

I used the pytest-textual-snapshot package to create the original HTML file. I've tested this on an ARM Mac, with the syntax extras installed. When the language param is removed, the file renders correctly.

"""
TextArea Example
"""

import os
import pathlib
from typing import Any

from textual.app import ComposeResult, App
from textual.widgets import TextArea


class TextAreaApp(App):

    def __init__(self, path: os.PathLike, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        self.path = pathlib.Path(path)
        self.text_area = TextArea(
            language="html",
            theme="monokai",
            soft_wrap=False,
            show_line_numbers=True,
        )

    def compose(self) -> ComposeResult:
        yield self.text_area

    def on_mount(self) -> None:
        self.text_area.text = self.path.read_text(encoding="utf-8")


html_path = pathlib.Path(".") / "snapshot_report.html"
app = TextAreaApp(path=html_path)

if __name__ == "__main__":
    app.run()
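
Since the file renders correctly once the language param is removed, one possible stopgap is to enable highlighting only for small documents. The sketch below is an assumption rather than anything Textual provides: `pick_language` and `max_lines` are hypothetical names, and the default threshold is an arbitrary cut-off.

```python
from __future__ import annotations


def pick_language(text: str, language: str, max_lines: int = 1_000) -> str | None:
    """Return `language` for small documents, or None to disable highlighting.

    Hypothetical helper: the 1,000-line default is an arbitrary cut-off,
    not a limit documented by Textual or tree-sitter.
    """
    line_count = text.count("\n") + 1
    return language if line_count <= max_lines else None
```

In the example above this could replace the hard-coded `language="html"`, for instance by reading the file first in `on_mount` and assigning `self.text_area.language = pick_language(text, "html")` before setting `self.text_area.text`. This only avoids the crash; it does not address the underlying cause.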


TomJGooding · 2024-02-14T16:32:21Z

I think the problem is the size of the html file. I experimented with reducing this to ~1,000 lines and eventually the app does load.

Running the original example with gdb shows a segfault related to treesitter:

Starting program: /home/tom/Projects/textual-sandbox/.venv/bin/python3 repro_4152.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7ffff29ff6c0 (LWP 82877)]
[New Thread 0x7ffff21fe6c0 (LWP 82878)]

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
ts_query_cursor__compare_captures (self=self@entry=0x5555556c9e40, left_state=left_state@entry=0x555556e897d0, right_state=right_state@entry=0x555556e897e0, left_contains_right=left_contains_right@entry=0x7fffffffc438, right_contains_left=right_contains_left@entry=0x7fffffffc490) at tree_sitter/core/lib/src/./query.c:3183
3183	tree_sitter/core/lib/src/./query.c: Bad file descriptor.

Backtrace

#0  ts_query_cursor__compare_captures (self=self@entry=0x5555556c9e40, left_state=left_state@entry=0x555556e897d0, right_state=right_state@entry=0x555556e897e0, left_contains_right=left_contains_right@entry=0x7fffffffc438, right_contains_left=right_contains_left@entry=0x7fffffffc490) at tree_sitter/core/lib/src/./query.c:3183
#1  0x00007ffff5bed90c in ts_query_cursor__advance (self=self@entry=0x5555556c9e40, stop_on_definite_step=stop_on_definite_step@entry=true) at tree_sitter/core/lib/src/./query.c:3868
#2  0x00007ffff5bee60a in ts_query_cursor_next_capture (self=0x5555556c9e40, match=match@entry=0x7fffffffc680, capture_index=capture_index@entry=0x7fffffffc664) at tree_sitter/core/lib/src/./query.c:4116
#3  0x00007ffff5bcf763 in query_captures (self=0x7ffff5b480f0, args=<optimized out>, kwargs=<optimized out>) at tree_sitter/binding.c:1994
#4  0x00007ffff79f9ea1 in ?? () from /usr/lib/libpython3.11.so.1.0
#5  0x00007ffff7a13d35 in PyObject_Call () from /usr/lib/libpython3.11.so.1.0
#6  0x00007ffff79eb237 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#7  0x00007ffff7a09560 in _PyFunction_Vectorcall () from /usr/lib/libpython3.11.so.1.0
#8  0x00007ffff79f1327 in PyObject_Vectorcall () from /usr/lib/libpython3.11.so.1.0
#9  0x00007ffff7aa60fa in ?? () from /usr/lib/libpython3.11.so.1.0
#10 0x00007ffff79d1565 in _PyObject_GenericSetAttrWithDict () from /usr/lib/libpython3.11.so.1.0
#11 0x00007ffff79d0da3 in PyObject_SetAttr () from /usr/lib/libpython3.11.so.1.0
#12 0x00007ffff79e6faa in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#13 0x00007ffff7a2b583 in ?? () from /usr/lib/libpython3.11.so.1.0
#14 0x00007ffff7a2aaf8 in ?? () from /usr/lib/libpython3.11.so.1.0
#15 0x00007ffff79e722b in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#16 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#17 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#18 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#19 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#20 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#21 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#22 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#23 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#24 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#25 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#26 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#27 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#28 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#29 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#30 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#31 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#32 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#33 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#34 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#35 0x00007ffff6dd8886 in ?? () from /usr/lib/python3.11/lib-dynload/_asyncio.cpython-311-x86_64-linux-gnu.so
#36 0x00007ffff79f9165 in ?? () from /usr/lib/libpython3.11.so.1.0
#37 0x00007ffff798ff41 in ?? () from /usr/lib/libpython3.11.so.1.0
#38 0x00007ffff7990a35 in ?? () from /usr/lib/libpython3.11.so.1.0
#39 0x00007ffff79f210a in ?? () from /usr/lib/libpython3.11.so.1.0
#40 0x00007ffff79eb237 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#41 0x00007ffff7a9cd1a in ?? () from /usr/lib/libpython3.11.so.1.0
#42 0x00007ffff7a9c72c in PyEval_EvalCode () from /usr/lib/libpython3.11.so.1.0
#43 0x00007ffff7aba893 in ?? () from /usr/lib/libpython3.11.so.1.0
#44 0x00007ffff7ab6dda in ?? () from /usr/lib/libpython3.11.so.1.0
#45 0x00007ffff7acd0d3 in ?? () from /usr/lib/libpython3.11.so.1.0
#46 0x00007ffff7acc905 in _PyRun_SimpleFileObject () from /usr/lib/libpython3.11.so.1.0
#47 0x00007ffff7acb2f8 in _PyRun_AnyFileObject () from /usr/lib/libpython3.11.so.1.0
#48 0x00007ffff7ac5b98 in Py_RunMain () from /usr/lib/libpython3.11.so.1.0
#49 0x00007ffff7a8ef8b in Py_BytesMain () from /usr/lib/libpython3.11.so.1.0
#50 0x00007ffff7645cd0 in ?? () from /usr/lib/libc.so.6
#51 0x00007ffff7645d8a in __libc_start_main () from /usr/lib/libc.so.6
#52 0x0000555555555045 in _start ()
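
For anyone without gdb to hand, a rough way to confirm that a script dies from a signal rather than a Python exception is to run it in a child process with the interpreter's `-X faulthandler` option and inspect the return code. This is a minimal sketch using only the standard library; `run_isolated` and `describe_exit` are hypothetical helper names, not part of Textual or tree-sitter.

```python
import signal
import subprocess
import sys


def run_isolated(args: list[str], timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run `python -X faulthandler <args>` in a child process and capture its output."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", *args],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"
```

With faulthandler enabled, the child's stderr should include the Python-level traceback at the point of the fault, so `print(describe_exit(result.returncode))` followed by `print(result.stderr)` gives a quick summary. This suits non-interactive scripts; a full-screen Textual app needs a real terminal, so there it may be simpler to run `python -X faulthandler repro_4152.py` directly.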

davep · 2024-02-14T16:35:34Z

Yeah, I had a very quick play yesterday (not enough to draw any useful conclusions) and it felt like it was a tree-sitter issue; for example when I loaded the file as-is but said it was python code it loaded pretty much instantly. Didn't seem to be an actual problem with TextArea itself.

TomJGooding · 2024-02-14T18:41:18Z

Delving deeper...

from pathlib import Path
from time import perf_counter

from tree_sitter_languages import get_language, get_parser

html_path = Path("snapshot_report_snip.html")
# https://github.com/Textualize/textual/issues/4152
html_text = html_path.read_text(encoding="utf-8")

language = get_language("html")
parser = get_parser("html")

syntax_tree = parser.parse(bytes(html_text, "utf-8"))

textual_query_path = Path("html_textual.scm")
# https://github.com/Textualize/textual/blob/main/src/textual/tree-sitter/highlights/html.scm
textual_query_scm = textual_query_path.read_text()

test_query_path = Path("html_test.scm")
# https://github.com/tree-sitter/tree-sitter-html/blob/master/queries/highlights.scm
test_query_scm = test_query_path.read_text()

print("Running 'html_test.scm' highlight query...")
start = perf_counter()
test_query = language.query(test_query_scm)
test_captures = test_query.captures(syntax_tree.root_node)
end = perf_counter()
print(f"{len(test_captures)} captures took {end - start:.4f} seconds")
print("=====")

print("Running 'html_textual.scm' highlight query...")
start = perf_counter()
textual_query = language.query(textual_query_scm)
textual_captures = textual_query.captures(syntax_tree.root_node)
end = perf_counter()
print(f"{len(textual_captures)} captures took {end - start:.4f} seconds")
print("=====")

The issue seems to be a combination of the query patterns in tree-sitter/highlights/html.scm and the size of the file. With the example snapshot html reduced to ~1,000 lines the query eventually completes, whereas a different scm file runs in less than a second even on the original example file.

TomJGooding · 2024-02-20T19:28:36Z

I haven't delved deep enough yet to understand the `((element (start_tag (tag_name) @_tag)` patterns, but these seem to be the culprit. Even including just one of them will segfault with the original example. With all of them removed, the query runs in less than a second.

from pathlib import Path
from time import perf_counter

from tree_sitter_languages import get_language, get_parser

highlights_query = """
(tag_name) @tag
(erroneous_end_tag_name) @html.end_tag_error
(comment) @comment
(attribute_name) @tag.attribute
(attribute
  (quoted_attribute_value) @string)
(text) @text @spell

((attribute
   (attribute_name) @_attr
   (quoted_attribute_value (attribute_value) @text.uri))
 (#any-of? @_attr "href" "src"))

[
 "<"
 ">"
 "</"
 "/>"
] @tag.delimiter

"=" @operator

(doctype) @constant

"<!" @tag.delimiter
"""


html_path = Path("snapshot_report.html")
# https://github.com/Textualize/textual/issues/4152
html_text = html_path.read_text(encoding="utf-8")

language = get_language("html")
parser = get_parser("html")

syntax_tree = parser.parse(bytes(html_text, "utf-8"))

start = perf_counter()
test_query = language.query(highlights_query)
test_captures = test_query.captures(syntax_tree.root_node)
end = perf_counter()
print(f"{len(test_captures)} captures took {end - start:.4f} seconds")
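
To narrow down which patterns in a highlights file are expensive, it can help to split the query into its top-level patterns and run them one at a time. The function below is a rough sketch of that splitting step, not a full parser for the tree-sitter query syntax: `split_top_level_patterns` is a hypothetical name, and it assumes every top-level pattern starts with `(`, `[`, or a string literal.

```python
def split_top_level_patterns(scm: str) -> list[str]:
    """Split a tree-sitter query string into its top-level patterns.

    Simplified sketch: tracks bracket depth, skips `;` comments, and treats
    double-quoted strings as opaque so brackets inside them are ignored.
    """
    patterns: list[str] = []
    buf: list[str] = []
    depth = 0
    i, n = 0, len(scm)

    def flush() -> None:
        text = "".join(buf).strip()
        if text:
            patterns.append(text)
        buf.clear()

    while i < n:
        ch = scm[i]
        if ch == ";":
            # Comment: skip to the end of the line.
            while i < n and scm[i] != "\n":
                i += 1
            continue
        if ch == '"':
            # String literal: copy it whole, honouring backslash escapes.
            j = i + 1
            while j < n and scm[j] != '"':
                j += 2 if scm[j] == "\\" else 1
            if depth == 0:
                flush()  # a literal at depth 0 starts a new pattern
            buf.append(scm[i : j + 1])
            i = j + 1
            continue
        if ch in "([":
            if depth == 0:
                flush()  # an opening bracket at depth 0 starts a new pattern
            depth += 1
        elif ch in ")]":
            depth -= 1
        buf.append(ch)
        i += 1

    flush()
    return patterns
```

Each returned string can then be passed to `language.query(...)` and timed with `perf_counter()` exactly as in the script above, ideally in a separate process so that a segfault in one pattern does not end the whole run.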

Fixes Textualize#4152 by removing all `((element (start_tag (tag_name) @_tag)` patterns from the `html.scm` highlights query file. These patterns will cause a segfault on relatively large documents, and from some quick testing even a single one seems massively expensive. All tests pass after removing them, and I couldn't see that they were actually used anywhere in syntax highlighting, but please correct me if I'm wrong!

* fix(tree-sitter): remove slow html highlight patterns
* run tests in ci
* Update changelog

Co-authored-by: Darren Burns

github-actions · 2024-02-22T13:11:39Z

Don't forget to star the repository!

Follow @textualizeio for Textual updates.

Textualize deleted a comment from github-actions bot Feb 13, 2024

TomJGooding mentioned this issue Feb 20, 2024

fix(tree-sitter): remove slow html highlight patterns #4195

Merged

3 tasks

darrenburns closed this as completed in #4195 Feb 22, 2024
