Ensure get_contents() method of Node objects return 'bytes' and fix decoding errors in Value.get_csig() #3738

eugenhu · 2020-07-07T14:58:40Z

Using Value as a target node sometimes fails in the calculation of its content signature in get_csig().

Most existing get_contents() methods already return bytes/bytearrays, but a few are returning strs. This is causing problems in some functions (e.g. Value.get_text_contents()) as strs do not have a decode() method.

The current implementation of Value.get_csig() uses the return value of Value.get_text_contents() as the csig. The method Value.get_text_contents() calculates its return value by converting the Value's value attribute to a string, and then appending its children's contents. Each child's contents are calculated by calling its get_contents() to get the binary data, and then calling .decode() on that. This can raise a UnicodeDecodeError if the binary content of a child node is not valid utf-8.
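To make the two failure modes concrete, here is a minimal sketch that uses only built-in types (no SCons imports). The sample byte string is made up for illustration.

```python
# A str has no decode() method, so code that expects bytes breaks on it.
text = "hello"
try:
    text.decode()
    str_failed = False
except AttributeError:
    str_failed = True

# Bytes that are not valid UTF-8 fail under the default errors='strict'.
raw = b"abc\xff"  # hypothetical child contents; 0xff is never valid in UTF-8
try:
    raw.decode()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='backslashreplace' does not raise; bad bytes become escape sequences.
lenient = raw.decode("utf-8", errors="backslashreplace")
```

Here `lenient` ends up as the six visible characters `abc` followed by a literal backslash escape for the bad byte, which is why this handler is attractive for a signature that only needs to be stable, not human-readable.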

To fix this, I have looked through the get_contents() method in each Node subclass (the ones I could find at least) and made appropriate changes to make sure they always return bytes. The Dir and Alias nodes construct their binary content entirely as a string, so they now return the utf-8 encoding of their original return values.

I have also reimplemented Value.get_contents() similarly to how Value.get_text_contents() works. I don't think this changes the output of Value.get_contents().

I've also reimplemented Value.get_csig() to use Value.get_contents() and decode() that (with errors='backslashreplace') to use as the csig.
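The shape of the two reimplemented methods can be sketched roughly as follows. `FakeNode` and `FakeValue` are hypothetical stand-ins written for this illustration, not the actual SCons classes, and the real `children()` call takes arguments that are omitted here.

```python
class FakeNode:
    """Hypothetical child node that only exposes binary contents."""

    def __init__(self, data: bytes):
        self._data = data

    def get_contents(self) -> bytes:
        return self._data


class FakeValue:
    """Hypothetical stand-in for a Value node with children."""

    def __init__(self, value, children=()):
        self.value = value
        self._children = list(children)

    def children(self):
        return self._children

    def get_contents(self) -> bytes:
        # Stay in bytes the whole way: encode the value, append raw child bytes.
        contents = str(self.value).encode("utf-8")
        for kid in self.children():
            contents += kid.get_contents()
        return contents

    def get_csig(self) -> str:
        # Decode leniently so non-UTF-8 child data cannot raise.
        return self.get_contents().decode(errors="backslashreplace")


v = FakeValue("v1", [FakeNode(b"ok"), FakeNode(b"\xfe\xff")])
contents = v.get_contents()
csig = v.get_csig()
```

Because `get_contents()` never decodes anything, only `get_csig()` has to deal with invalid bytes, and it does so with a handler that cannot fail.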

Alternatively, Value.get_text_contents() could be modified to work around decoding errors (using something like errors='backslashreplace'). I'm not sure if get_text_contents() is meant to always return something, even if the binary contents of the node are not encoded text. The implementation of File.get_text_contents() uses errors='backslashreplace' to handle decoding errors.

I've added two rough tests with duplicated code to test that Value.get_contents() and Value.get_csig() work even when it has a child node with non-utf8 binary contents.

I've also modified tests that were comparing get_contents() against strings for equality, and some mock objects that were returning strings in their get_contents() method. I don't think all the modifications to the test code were necessary to get my changes to pass the tests, but I've modified them anyway for completeness.

Contributor Checklist:

I have created a new test or updated the unit tests to cover the new/changed functionality.
I have updated CHANGES.txt (and read the README.rst)
I have updated the appropriate documentation

Dir, Alias, ActionCaller compute their contents. The get_contents() methods of Dir and Alias now return utf-8 encoded bytes. The content of ActionCaller is usually the 'co_code' attribute of the code object of the function it wraps; when this is unavailable, return the 'repr' of the wrapped function encoded with default options (currently 'utf-8').
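The ActionCaller fallback described in this commit message can be sketched as below. `callable_contents` is a hypothetical helper written for illustration; the real ActionCaller.get_contents() may try more lookups before falling back.

```python
def callable_contents(func) -> bytes:
    """Return bytes describing a callable: bytecode if present, else its repr."""
    code = getattr(func, "__code__", None)
    if code is not None:
        # Plain Python functions carry their bytecode as a bytes object.
        return code.co_code
    # Built-ins and types have no __code__; encode the repr (utf-8 by default).
    return repr(func).encode()


def sample():
    return 1


from_function = callable_contents(sample)
from_builtin = callable_contents(str)
```

Either branch yields bytes, so callers never need to check the type of the result.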

Some mock objects' get_contents() were returning strings; modify them to return bytes. Some tests were asserting get_contents() equals a string; modify expected values to be bytes. For tests that check the value of Dir.get_contents() and Alias.get_contents(), make the tests explicitly expect 'utf-8' encoded content.

Value.get_contents() currently calculates its binary contents by getting its text contents via Value.get_text_contents() and encoding the result. This is a problem if get_text_contents() fails. The current implementation of File.get_text_contents() never fails: as a last resort, undecodable bytes are handled by 'backslashreplace'. Value.get_text_contents(), however, calculates its text content by decoding child node binary contents manually using bytes.decode() with default options (encoding='utf-8' and errors='strict'), so it can fail (e.g. when a child File node contains non-utf8 binary data). Reimplement get_contents() to mirror the get_text_contents() implementation as a potential solution. The current get_csig() also uses the return value of get_text_contents() as the content signature. Because get_text_contents() might fail, make get_csig() return get_contents() decoded with the default encoding and errors='backslashreplace'.

grossag · 2020-07-07T15:03:20Z

SCons/Node/Python.py

-            return text_contents
+        contents = str(self.value).encode('utf-8')
+        for kid in self.children(None):
+            contents = contents + kid.get_contents()


Can you change this to contents = contents + kid.get_csig()? Otherwise the memory usage of Value targets gets out of control.

I've made a few extra commits, first changing Value.get_csig() to use child csigs instead of content, and then making Value.get_text_contents() just a concatenation of child csigs prepended with the stringified value attribute, like you suggest. Value.get_contents() then just returns an encoded get_text_contents(), and Value.get_csig() just wraps get_text_contents() as well.

I'm not sure where the 'contents' of a Value node is actually used, but the second change failed these tests:

SCons/SConfTests.py
test/Configure/ConfigureDryRunError.py
test/Configure/VariantDir-SConscript.py
test/Configure/VariantDir.py
test/Configure/basic.py
test/Configure/cache-not-ok.py
test/Configure/cache-ok.py
test/Configure/config-h.py
test/Configure/custom-tests.py
test/Configure/issue-3469/issue-3469.py
test/Configure/option--config.py
test/Value.py
test/explain/basic.py
test/option-n.py
test/question/Configure.py
test/sconsign/script/Configure.py
test/textfile/textfile.py

The first change passes existing tests (after they've been modified to expect bytes from get_contents() calls in the previous PR commits).

I am running tests with Python 3.8.2 on Ubuntu 20.04 in a virtual environment.
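For reference, the csig-based variant discussed in this thread (the second change, which failed the tests listed above) looks roughly like the sketch below. `FakeFile` and `CsigValue` are hypothetical stand-ins, and SHA-256 is an arbitrary choice for the child signature in this illustration.

```python
import hashlib


class FakeFile:
    """Hypothetical child node with large binary contents."""

    def __init__(self, data: bytes):
        self._data = data

    def get_csig(self) -> str:
        # Fixed-size hex digest regardless of how large the contents are.
        return hashlib.sha256(self._data).hexdigest()


class CsigValue:
    """Hypothetical Value-like node whose contents depend on child csigs."""

    def __init__(self, value, children=()):
        self.value = value
        self._children = list(children)

    def children(self):
        return self._children

    def get_text_contents(self) -> str:
        return str(self.value) + "".join(n.get_csig() for n in self.children())

    def get_contents(self) -> bytes:
        return self.get_text_contents().encode("utf-8")

    def get_csig(self) -> str:
        return self.get_text_contents()


big_child = FakeFile(b"\xff" * 1_000_000)
node = CsigValue("x", [big_child])
size = len(node.get_contents())
```

The contents of `node` stay at 65 bytes (one character for the value plus a 64-character digest) even though the child holds a megabyte of non-UTF-8 data, which is the memory argument made in the review comment. Nothing is ever decoded, so invalid bytes in a child cannot raise.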

djh82 · 2020-07-07T15:09:37Z

SCons/Node/Python.py

-            # Already encoded as python2 str are bytes
-            return text_contents
+        contents = str(self.value).encode('utf-8')
+        for kid in self.children(None):


child seems to be in use everywhere else

I would prefer child as well but it looks like kid is still used frequently in SCons/Node/*.

bdbaddog · 2020-07-07T17:03:03Z

Please add a blurb to CHANGES.txt

dragon512 · 2020-07-07T17:51:24Z

Just a note this seems to be related to #3384 as well.

I would love to have these fixes in

Current implementation might lead to high memory usage. New implementation depends on children csigs instead. Remove test that makes sure Value.get_csig() works even if child binary contents are invalid utf-8 since get_csig() no longer depends on children content.

Current implementation of Value.get_text_contents() returns a concatenation of all child contents. Instead, make the contents of a Value similar to an Alias, except prepend the stringified 'value' attribute as well. Since Value.get_text_contents() will no longer fail (as long as child nodes implement a working get_csig(), which they should), make get_contents() return get_text_contents() but utf-8 encoded. Also make get_csig() just return get_text_contents(). Add new tests checking that get_contents() and get_text_contents() rely only on child node csigs, and not directly on child content.

eugenhu · 2020-07-08T03:59:17Z

The matches variable in ActionCallerTestCase.test_get_contents() also looks unused after being redefined on line 2043:

scons/SCons/ActionTests.py

Lines 2043 to 2055 in 9861322

    
           matches = [ 
        
               b"<built-in function str>", 
        
               b"<type 'str'>", 
        
           ] 
        
           af = SCons.Action.ActionFactory(str, strfunc) 
        
           ac = SCons.Action.ActionCaller(af, [], {}) 
        
           c = ac.get_contents([], [], Environment()) 
        
           assert c == b"<built-in function str>" or \ 
        
                  c == b"<type 'str'>" or \ 
        
                  c == b"<class 'str'>", repr(c) 
        
           # ^^ class str for python3

Should I remove it?

bdbaddog · 2020-07-08T04:20:17Z

SCons/Node/Python.py

        contents = str(self.value)
-        for kid in self.children(None):
-            contents = contents + kid.get_contents().decode()
+        contents += ''.join(n.get_csig() for n in self.children())


As I've said before I'm not sure concatenating the contents of the child was correct, and this seems even less correct.

mwichmann · 2020-07-08T14:22:20Z

Does this also cover issue #3093?

bdbaddog · 2020-07-09T06:05:40Z

I'd say hold off on any more work on this until we determine if anything breaks if we remove including child nodes contents in get_contents().

mwichmann · 2022-01-19T00:20:32Z

Seeing this stalled out completely. I've been gradually sneaking typing annotations into the codebase. We can do more now that 3.6 is the baseline Python version; before, I could really only add return types. One of the things that shows up as a consistency check problem: it would be nice to declare all get_contents as returning bytes, and all get_text_contents as returning str. Is there a way to move this PR forward? Or do we need to start over? And sneak up on it incrementally?
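The kind of declaration being asked for might look like the sketch below. `NodeLike` and `BytesNode` are hypothetical classes for illustration, not the SCons Node hierarchy, and the lenient decode in the base class is an assumption borrowed from the File behaviour described earlier in this PR.

```python
from typing import get_type_hints


class NodeLike:
    """Hypothetical base class stating the intended return types."""

    def get_contents(self) -> bytes:
        raise NotImplementedError

    def get_text_contents(self) -> str:
        # Assumed default: decode leniently so this always returns a str.
        return self.get_contents().decode("utf-8", errors="backslashreplace")


class BytesNode(NodeLike):
    """Hypothetical subclass holding its contents as raw bytes."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def get_contents(self) -> bytes:
        return self._data


node = BytesNode(b"caf\xc3\xa9")
hints = get_type_hints(NodeLike.get_contents)
```

With annotations like these in place, a type checker can flag any subclass whose get_contents() returns str.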

eugenhu · 2022-01-19T01:21:57Z

I believe the sticking point from last time was what Value.get_text_contents() should return: concatenation of all child contents (old behaviour, possibly large memory usage) or concatenation of child csigs (breaks documented behaviour).

Using a concatenation of child csigs causes some tests to fail, and I'm not sure if the rationale for using the concatenation of child contents was ever figured out. Unfortunately, I ran out of time to look into these two issues.

On the topic of ensuring get_contents() returns bytes and get_text_contents() returns str, I can roll back the commits messing with Value.get_text_contents() and that should be good to go. Just need to make sure it integrates with the existing codebase now.

Deciding what Value.get_text_contents() should return can be discussed in a separate issue.

bdbaddog · 2022-01-19T01:29:30Z

Honestly it should do neither. The node's children shouldn't be added into its contents.
The scanner/taskmaster should check whether the children of the Value() are out of date.
I think I have a WIP somewhere for this. Just need to get back to it.

eugenhu added 3 commits July 7, 2020 15:57

grossag reviewed Jul 7, 2020

djh82 reviewed Jul 7, 2020

eugenhu added 5 commits July 8, 2020 12:44

Update CHANGES.txt

4e79f07

Update CHANGES.txt

8530029

Update CHANGES.txt

9861322

bdbaddog reviewed Jul 8, 2020

Merge branch 'master' into master

bbb0fab

mwichmann added the Value Nodes label Jan 11, 2021

mwichmann added this to the NextRelease milestone Aug 3, 2022

mwichmann removed this from the 4.5 milestone Mar 7, 2023
