TextArea crashes Python on html language #4152

Closed
juftin opened this issue Feb 13, 2024 · 5 comments · Fixed by #4195

juftin commented Feb 13, 2024

The TextArea widget crashes the Python process when trying to render the attached HTML file with the html language: snapshot_report.html.zip.

There typically isn't any stack trace for me; the entire terminal window closes with a "Python quit unexpectedly" message. I was able to observe the following error, though:
[image: error message]

I used the pytest-textual-snapshot package to create the original HTML file. I've tested this on an ARM Mac, with the syntax extras installed. When the language parameter is removed, the file renders correctly.

"""
TextArea Example
"""

import os
import pathlib
from typing import Any

from textual.app import ComposeResult, App
from textual.widgets import TextArea


class TextAreaApp(App):

    def __init__(self, path: os.PathLike, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        self.path = pathlib.Path(path)
        self.text_area = TextArea(
            language="html",
            theme="monokai",
            soft_wrap=False,
            show_line_numbers=True,
        )

    def compose(self) -> ComposeResult:
        yield self.text_area

    def on_mount(self) -> None:
        self.text_area.text = self.path.read_text(encoding="utf-8")


html_path = pathlib.Path(".") / "snapshot_report.html"
app = TextAreaApp(path=html_path)

if __name__ == "__main__":
    app.run()
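
Since the file renders correctly without the language parameter, one possible stopgap is to skip highlighting for large documents. This is only a minimal sketch: `choose_language` and `MAX_HIGHLIGHT_LINES` are hypothetical names, and the threshold is a placeholder, not a measured safe limit.

```python
from typing import Optional

# Assumption: placeholder threshold, not a measured safe limit.
MAX_HIGHLIGHT_LINES = 1_000


def choose_language(
    text: str, language: str, max_lines: int = MAX_HIGHLIGHT_LINES
) -> Optional[str]:
    """Return `language` for small documents, or None to skip highlighting."""
    line_count = text.count("\n") + 1
    return language if line_count <= max_lines else None
```

In the example above, the result could be passed as `language=` when constructing the `TextArea`, so that a large file falls back to plain text instead of going through the html highlight query.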
TomJGooding commented

I think the problem is the size of the HTML file. When I experimented with reducing it to ~1,000 lines, the app did eventually load.

Running the original example with gdb shows a segfault related to tree-sitter:

Starting program: /home/tom/Projects/textual-sandbox/.venv/bin/python3 repro_4152.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7ffff29ff6c0 (LWP 82877)]
[New Thread 0x7ffff21fe6c0 (LWP 82878)]

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
ts_query_cursor__compare_captures (self=self@entry=0x5555556c9e40, left_state=left_state@entry=0x555556e897d0, right_state=right_state@entry=0x555556e897e0, left_contains_right=left_contains_right@entry=0x7fffffffc438, right_contains_left=right_contains_left@entry=0x7fffffffc490) at tree_sitter/core/lib/src/./query.c:3183
3183	tree_sitter/core/lib/src/./query.c: Bad file descriptor. 
Backtrace
#0  ts_query_cursor__compare_captures (self=self@entry=0x5555556c9e40, left_state=left_state@entry=0x555556e897d0, right_state=right_state@entry=0x555556e897e0, left_contains_right=left_contains_right@entry=0x7fffffffc438, right_contains_left=right_contains_left@entry=0x7fffffffc490) at tree_sitter/core/lib/src/./query.c:3183
#1  0x00007ffff5bed90c in ts_query_cursor__advance (self=self@entry=0x5555556c9e40, stop_on_definite_step=stop_on_definite_step@entry=true) at tree_sitter/core/lib/src/./query.c:3868
#2  0x00007ffff5bee60a in ts_query_cursor_next_capture (self=0x5555556c9e40, match=match@entry=0x7fffffffc680, capture_index=capture_index@entry=0x7fffffffc664) at tree_sitter/core/lib/src/./query.c:4116
#3  0x00007ffff5bcf763 in query_captures (self=0x7ffff5b480f0, args=<optimized out>, kwargs=<optimized out>) at tree_sitter/binding.c:1994
#4  0x00007ffff79f9ea1 in ?? () from /usr/lib/libpython3.11.so.1.0
#5  0x00007ffff7a13d35 in PyObject_Call () from /usr/lib/libpython3.11.so.1.0
#6  0x00007ffff79eb237 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#7  0x00007ffff7a09560 in _PyFunction_Vectorcall () from /usr/lib/libpython3.11.so.1.0
#8  0x00007ffff79f1327 in PyObject_Vectorcall () from /usr/lib/libpython3.11.so.1.0
#9  0x00007ffff7aa60fa in ?? () from /usr/lib/libpython3.11.so.1.0
#10 0x00007ffff79d1565 in _PyObject_GenericSetAttrWithDict () from /usr/lib/libpython3.11.so.1.0
#11 0x00007ffff79d0da3 in PyObject_SetAttr () from /usr/lib/libpython3.11.so.1.0
#12 0x00007ffff79e6faa in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#13 0x00007ffff7a2b583 in ?? () from /usr/lib/libpython3.11.so.1.0
#14 0x00007ffff7a2aaf8 in ?? () from /usr/lib/libpython3.11.so.1.0
#15 0x00007ffff79e722b in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#16 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#17 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#18 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#19 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#20 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#21 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#22 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#23 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#24 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#25 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#26 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#27 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#28 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#29 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#30 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#31 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#32 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#33 0x00007ffff79e6305 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#34 0x00007ffff7ab06f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#35 0x00007ffff6dd8886 in ?? () from /usr/lib/python3.11/lib-dynload/_asyncio.cpython-311-x86_64-linux-gnu.so
#36 0x00007ffff79f9165 in ?? () from /usr/lib/libpython3.11.so.1.0
#37 0x00007ffff798ff41 in ?? () from /usr/lib/libpython3.11.so.1.0
#38 0x00007ffff7990a35 in ?? () from /usr/lib/libpython3.11.so.1.0
#39 0x00007ffff79f210a in ?? () from /usr/lib/libpython3.11.so.1.0
#40 0x00007ffff79eb237 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#41 0x00007ffff7a9cd1a in ?? () from /usr/lib/libpython3.11.so.1.0
#42 0x00007ffff7a9c72c in PyEval_EvalCode () from /usr/lib/libpython3.11.so.1.0
#43 0x00007ffff7aba893 in ?? () from /usr/lib/libpython3.11.so.1.0
#44 0x00007ffff7ab6dda in ?? () from /usr/lib/libpython3.11.so.1.0
#45 0x00007ffff7acd0d3 in ?? () from /usr/lib/libpython3.11.so.1.0
#46 0x00007ffff7acc905 in _PyRun_SimpleFileObject () from /usr/lib/libpython3.11.so.1.0
#47 0x00007ffff7acb2f8 in _PyRun_AnyFileObject () from /usr/lib/libpython3.11.so.1.0
#48 0x00007ffff7ac5b98 in Py_RunMain () from /usr/lib/libpython3.11.so.1.0
#49 0x00007ffff7a8ef8b in Py_BytesMain () from /usr/lib/libpython3.11.so.1.0
#50 0x00007ffff7645cd0 in ?? () from /usr/lib/libc.so.6
#51 0x00007ffff7645d8a in __libc_start_main () from /usr/lib/libc.so.6
#52 0x0000555555555045 in _start ()
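
For anyone without gdb to hand, the standard library's `faulthandler` module can dump the Python-level traceback when the process receives SIGSEGV. The sketch below writes it to a log file rather than the terminal, on the assumption that terminal output may be lost when the window closes. `enable_crash_log` and the default file name are hypothetical.

```python
import faulthandler
import tempfile
from pathlib import Path
from typing import Optional, TextIO


def enable_crash_log(log_path: Optional[Path] = None) -> TextIO:
    """Dump Python tracebacks to a log file on fatal signals such as SIGSEGV.

    The returned file object must stay open for as long as faulthandler
    is enabled, so the caller should keep a reference to it.
    """
    if log_path is None:
        # Hypothetical default location, chosen only for illustration.
        log_path = Path(tempfile.gettempdir()) / "textual_crash.log"
    log_file = open(log_path, "w", encoding="utf-8")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling this before `app.run()` should record which Python frame was executing at the time of the crash. It only shows Python frames, not the C frames inside tree-sitter, so it complements the gdb backtrace rather than replacing it. Running the script with `python -X faulthandler` enables the same handler on stderr.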

davep commented Feb 14, 2024

Yeah, I had a very quick play yesterday (not enough to draw any useful conclusions) and it felt like it was a tree-sitter issue; for example, when I loaded the file as-is but said it was Python code, it loaded pretty much instantly. It didn't seem to be an actual problem with TextArea itself.

TomJGooding commented Feb 14, 2024

Delving deeper...

from pathlib import Path
from time import perf_counter

from tree_sitter_languages import get_language, get_parser

html_path = Path("snapshot_report_snip.html")
# https://github.com/Textualize/textual/issues/4152
html_text = html_path.read_text()

language = get_language("html")
parser = get_parser("html")

syntax_tree = parser.parse(bytes(html_text, "utf-8"))

textual_query_path = Path("html_textual.scm")
# https://github.com/Textualize/textual/blob/main/src/textual/tree-sitter/highlights/html.scm
textual_query_scm = textual_query_path.read_text()

test_query_path = Path("html_test.scm")
# https://github.com/tree-sitter/tree-sitter-html/blob/master/queries/highlights.scm
test_query_scm = test_query_path.read_text()

print("Running 'html_test.scm' highlight query...")
start = perf_counter()
test_query = language.query(test_query_scm)
test_captures = test_query.captures(syntax_tree.root_node)
end = perf_counter()
print(f"{len(test_captures)} captures took {end - start:.4f} seconds")
print("=====")

print("Running 'html_textual.scm' highlight query...")
start = perf_counter()
textual_query = language.query(textual_query_scm)
textual_captures = textual_query.captures(syntax_tree.root_node)
end = perf_counter()
print(f"{len(textual_captures)} captures took {end - start:.4f} seconds")
print("=====")

The issue seems to be a combination of the query patterns in tree-sitter/highlights/html.scm and the size of the file. With the example snapshot HTML reduced to ~1,000 lines, the Textual query does eventually complete, but the other scm file (the highlights.scm from tree-sitter-html) runs in less than a second even with the original example file.
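
Because a segfault takes the whole interpreter down, comparisons like this are easier to script if each query runs in a child process. A minimal sketch, assuming a POSIX system where a child killed by a signal reports a negative return code; `run_isolated` is a hypothetical helper, not part of Textual or tree-sitter.

```python
import signal
import subprocess
import sys


def run_isolated(source: str, timeout: float = 30.0) -> str:
    """Run Python source in a child process and classify how it ended."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child was killed after exceeding the time limit.
        return "timeout"
    if result.returncode == 0:
        return "ok"
    # On POSIX, a negative return code is the number of the fatal signal.
    if result.returncode == -signal.SIGSEGV:
        return "segfault"
    return f"exit {result.returncode}"
```

Passing the body of the timing script as `source` would let the parent report "segfault" or "timeout" for one query file and carry on to the next.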

TomJGooding commented Feb 20, 2024

I haven't delved deep enough yet to understand the ((element (start_tag (tag_name) @_tag) patterns, but these seem to be the culprit. Including even one of them will segfault with the original example. With all of them removed, the query runs in less than a second.

from pathlib import Path
from time import perf_counter

from tree_sitter_languages import get_language, get_parser

highlights_query = """
(tag_name) @tag
(erroneous_end_tag_name) @html.end_tag_error
(comment) @comment
(attribute_name) @tag.attribute
(attribute
  (quoted_attribute_value) @string)
(text) @text @spell

((attribute
   (attribute_name) @_attr
   (quoted_attribute_value (attribute_value) @text.uri))
 (#any-of? @_attr "href" "src"))

[
 "<"
 ">"
 "</"
 "/>"
] @tag.delimiter

"=" @operator

(doctype) @constant

"<!" @tag.delimiter
"""


html_path = Path("snapshot_report.html")
# https://github.com/Textualize/textual/issues/4152
html_text = html_path.read_text()

language = get_language("html")
parser = get_parser("html")

syntax_tree = parser.parse(bytes(html_text, "utf-8"))

start = perf_counter()
test_query = language.query(highlights_query)
test_captures = test_query.captures(syntax_tree.root_node)
end = perf_counter()
print(f"{len(test_captures)} captures took {end - start:.4f} seconds")
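
To repeat this experiment without editing the query by hand, the query source can be split into its top-level patterns and the suspect ones filtered out. This is a rough sketch rather than a full parser for the query syntax: it assumes every top-level pattern starts with `(`, `[`, or a quoted string, and that comments start with `;`. Both function names are hypothetical.

```python
from typing import List


def split_top_level_patterns(scm: str) -> List[str]:
    """Split tree-sitter query source into top-level patterns (comments dropped)."""
    patterns: List[str] = []
    current: List[str] = []
    depth = 0
    in_string = in_comment = escaped = False

    def flush() -> None:
        text = "".join(current).strip()
        if text:
            patterns.append(text)
        current.clear()

    for ch in scm:
        if in_comment:
            # Comments run to the end of the line.
            if ch == "\n":
                in_comment = False
                current.append(ch)
            continue
        if in_string:
            # Brackets inside string literals must not affect nesting depth.
            current.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == ";":
            in_comment = True
            continue
        if depth == 0 and ch in '(["':
            # A new top-level node begins, so the previous pattern is complete.
            flush()
        if ch == '"':
            in_string = True
        elif ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        current.append(ch)
    flush()
    return patterns


def drop_element_tag_patterns(scm: str) -> str:
    """Remove patterns beginning with ((element (start_tag (tag_name) @_tag)."""
    prefix = "((element(start_tag(tag_name)@_tag)"
    kept = [
        pattern
        for pattern in split_top_level_patterns(scm)
        if not "".join(pattern.split()).startswith(prefix)
    ]
    return "\n\n".join(kept) + "\n"
```

Running `drop_element_tag_patterns` over the contents of html.scm should give a query close to the hand-trimmed one above, and re-adding the dropped patterns one at a time is a way to check each of them individually.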

TomJGooding added a commit to TomJGooding/textual that referenced this issue Feb 20, 2024
Fixes Textualize#4152 by removing all
`((element (start_tag (tag_name) @_tag)` patterns from the `html.scm`
highlights query file.

These patterns will cause a segfault on relatively large documents
and even just one seems a massively expensive operation from some
quick testing.

All tests pass after removing these and I couldn't see they were
actually used anywhere in syntax highlighting, but please correct me
if I'm wrong!
darrenburns added a commit that referenced this issue Feb 22, 2024
* fix(tree-sitter): remove slow html highlight patterns

* run tests in ci

* Update changelog

---------

Co-authored-by: Darren Burns