Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure get_contents() method of Node objects return 'bytes' and fix decoding errors in Value.get_csig() #3738

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

eugenhu
Copy link

@eugenhu eugenhu commented Jul 7, 2020

Fix #3737

Using Value as a target node sometimes fails in the calculation of its content signature in get_csig().

Most existing get_contents() methods already return bytes/bytearrays, but a few are returning strs. This is causing problems in some functions (e.g. Value.get_text_contents()) as strs do not have a decode() method.

The current implementation of Value.get_csig() uses the return of Value.get_text_contents() as the csig. The method Value.get_text_contents() calculates its return value by converting the Value's value attribute to a string, and then appending its children's content. Child node content are calculated by calling their get_contents() to get their binary data, and then calling .decode() on that. This can cause a UnicodeDecodeError if the binary content of a child node is not valid utf-8.

To fix this, I have looked through the get_contents() method in each Node subclass (the ones I could find at least) and made appropriate changes to make sure they always return bytes. The Dir and Alias nodes construct their binary content entirely as a string, they now return the utf-8 encoding of their original return values.

I have also reimplemented Value.get_contents() similarly to how Value.get_text_contents() works. I don't think this changes the output of Value.get_contents().

I've also reimplemented Value.get_csig() to use Value.get_contents() and decode() that (with errors='backslashreplace') to use as the csig.

Alternatively, Value.get_text_contents() could be modified to work around decoding errors (using something like errors='backslashreplace'). I'm not sure if get_text_contents() is meant to always return something, even if the binary contents of the node is not encoded text. The implementation of File.get_text_contents() uses errors='backslashreplace' to handle decoding errors.

I've added two rough tests with duplicated code to test that Value.get_contents() and Value.get_csig() work even when it has a child node with non-utf8 binary contents.

I've also modified tests that were comparing get_contents() against strings for equality, and some mock objects that were returning strings in their get_contents() method. I don't think all the modifications to the test code were necessary to get my changes to pass the tests, but I've modified them anyway for completeness.

Contributor Checklist:

  • I have created a new test or updated the unit tests to cover the new/changed functionality.
  • I have updated CHANGES.txt (and read the README.rst)
  • I have updated the appropriate documentation

eugenhu added 3 commits July 7, 2020 15:57
Dir, Alias, ActionCaller compute their contents. The get_contents()
of Dir and Alias now return a utf-8 formatted string. The content
of ActionCaller is usually the 'co_code' attribute of the code object of
the function it wraps, when this is unavailable, return the 'repr' of
the wrapped function encoded with default options (currently 'utf-8').
Some mock objects' get_contents() were returning strings, modify them to
return bytes. Some tests were asserting get_contents() equals a string,
modify expected values to be bytes. For tests that check the value of
Dir.get_contents() and Alias.get_contents(), make the tests explicitly
expect 'utf-8' encoded content.
Value.get_contents() is currently calculating its binary contents by
getting its text contents via Value.get_text_contents() and encoding the
returned text content. This can be a problem if get_text_contents()
fails.

The current implementation of File.get_text_contents() never fails, as a
last resort, undecodable bytes are handled by 'backslashreplace', but
Value.get_text_contents() calculates its text content by decoding child
node binary contents manually using bytes.decode() with default
options (encoding='utf-8' and errors='strict') so has the potential to
fail (e.g. a child File node contains non-utf8 binary data).

Reimplement get_contents() to mirror get_text_contents() implementation
as a potential solution.

The current get_csig() also uses the return of get_text_contents() as
the content signature. Because get_text_contents() might fail, make
get_csig() return get_contents() decoded with default encoding and
errors='backslashreplace'.
return text_contents
contents = str(self.value).encode('utf-8')
for kid in self.children(None):
contents = contents + kid.get_contents()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change this to contents = contents + kid.get_csig()? Otherwise the memory usage of Value targets gets out of control.

Copy link
Author

@eugenhu eugenhu Jul 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a few extra commits, first changing Value.get_csig() to use child csigs instead of content, and then making Value.get_text_contents() to just be a concatenation of child csigs prepended with the stringified value attribute like you suggest. The Value.get_contents() then just return an encoded get_text_contents() and Value.get_csig() just wraps get_text_contents() as well.

I'm not sure where the 'contents' of a Value node is actually used, but the second change failed these tests:

SCons/SConfTests.py
test/Configure/ConfigureDryRunError.py
test/Configure/VariantDir-SConscript.py
test/Configure/VariantDir.py
test/Configure/basic.py
test/Configure/cache-not-ok.py
test/Configure/cache-ok.py
test/Configure/config-h.py
test/Configure/custom-tests.py
test/Configure/issue-3469/issue-3469.py
test/Configure/option--config.py
test/Value.py
test/explain/basic.py
test/option-n.py
test/question/Configure.py
test/sconsign/script/Configure.py
test/textfile/textfile.py

The first change passes existing tests (after they've been modified to expect bytes from get_contents() calls in the previous PR commits).

I am running tests with Python 3.8.2 on Ubuntu 20.04 in a virtual environment.

# Already encoded as python2 str are bytes
return text_contents
contents = str(self.value).encode('utf-8')
for kid in self.children(None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

child seems to be in use everywhere else

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer child as well but it looks like kid is still used frequently in SCons/Node/*.

@bdbaddog
Copy link
Contributor

bdbaddog commented Jul 7, 2020

Please add a blurb to CHANGES.txt

@dragon512
Copy link
Contributor

Just a note this seems to be related to #3384 as well.

I would love to have these fixes in

eugenhu added 5 commits July 8, 2020 12:44
Current implementation might lead to high memory usage. New
implementation depends on children csigs instead.

Remove test that makes sure Value.get_csig() works even if child binary
contents are invalid utf-8 since get_csig() no longer depends on
children content.
Current implementation of Value.get_text_contents() returns a
concatenation of all child contents. Instead, make the contents of a
Value similar to an Alias, except prepend the stringified 'value'
attribute as well.

Since Value.get_text_contents() will no longer fail (as long as child
nodes implement a working get_csig() which they should), make
get_contents() return get_text_contents() but utf-8 encoded. Also
make get_csig() just return get_text_contents()

Add new tests for testing get_contents() and get_text_contents() only
rely on child node csigs, and not directly on child content.
@eugenhu
Copy link
Author

eugenhu commented Jul 8, 2020

The matches variable in ActionCallerTestCase.test_get_contents() also looks unused after being redefined on line 2043:

scons/SCons/ActionTests.py

Lines 2043 to 2055 in 9861322

matches = [
b"<built-in function str>",
b"<type 'str'>",
]
af = SCons.Action.ActionFactory(str, strfunc)
ac = SCons.Action.ActionCaller(af, [], {})
c = ac.get_contents([], [], Environment())
assert c == b"<built-in function str>" or \
c == b"<type 'str'>" or \
c == b"<class 'str'>", repr(c)
# ^^ class str for python3

Should I remove it?

contents = str(self.value)
for kid in self.children(None):
contents = contents + kid.get_contents().decode()
contents += ''.join(n.get_csig() for n in self.children())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I've said before I'm not sure concatenating the contents of the child was correct, and this seems even less correct.

@mwichmann
Copy link
Collaborator

Does this also cover issue #3093?

@bdbaddog
Copy link
Contributor

bdbaddog commented Jul 9, 2020

I'd say hold off on any more work on this until we determine if anything breaks if we remove including child nodes contents in get_contents().

@mwichmann
Copy link
Collaborator

Seeing this stalled out completely. I've been gradually sneaking in typing annotations in the codebase - we can do more now that 3.6 is the baseline Python versions, I could really only add return types before. One of the things that shows up as a consistency check problem - it would be nice to declare all get_contents as returning bytes, and all get_text_contents as returning str. Is there a way to move this PR forward? Or do we need to start over? And sneak up on it incrementally?

@eugenhu
Copy link
Author

eugenhu commented Jan 19, 2022

I believe the sticking point from last time was what Value.get_text_contents() should return: concatenation of all child contents (old behaviour, possibly large memory usage) or concatenation of child csigs (breaks documented behaviour).

Using concat of child csigs leads to some tests to fail, and I'm not sure if it was figured out what the rationale for using the concat of child contents was. I had run out of time to look into these two issues unfortunately.

On the topic of ensuring get_contents() returns bytes and get_text_contents() returns str, I can roll back the commits messing with Value.get_text_contents() and that should be good to go. Just need to make sure it integrates with the existing codebase now.

Deciding what Value.get_text_contents() should return can be discussed in a separate issue.

@bdbaddog
Copy link
Contributor

Honestly it should do neither. The node's children shouldn't be added into it's contents.
The scanner/taskmaster should check that the children of the Value() are out of date or not..
I think I have a WIP somewhere for this. Just need to get back to it..

@mwichmann mwichmann added this to the NextRelease milestone Aug 3, 2022
@mwichmann mwichmann removed this from the 4.5 milestone Mar 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

get_contents_entry() and get_contents_dir() returning 'str' instead of 'bytes'
6 participants