
Generic/ByteOrderMark: improve code coverage #278

Merged

Conversation

rodrigoprimo
Contributor

Description

This PR improves code coverage for the Generic.Files.ByteOrderMark sniff. It adds two more tests, each in a separate file, to exercise the code that handles files that do not contain a BOM character.

There are no tests for the UTF-16 BOM characters, and the sniff does not support BOM characters used in other UTF encodings. I don't know much about this topic, so I'm not sure if we should add that or not.

Related issues/external references

Part of #146

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
    • This change is only breaking for integrators, not for external standards or end-users.
  • Documentation improvement

PR checklist

  • I have checked there is no other PR open for the same change.
  • I have read the Contribution Guidelines.
  • I grant the project the right to include and distribute the code under the BSD-3-Clause license (and I have the right to grant these rights).
  • I have added tests to cover my changes.
  • I have verified that the code complies with the project's coding standards.
  • [Required for new sniffs] I have added XML documentation for the sniff.

Member

@jrfnl jrfnl left a comment


Thanks for working on this @rodrigoprimo. Looking good.

There are no tests for the UTF-16 BOM characters, and the sniff does not support BOM characters used in other UTF encodings. I don't know much about this topic, so I'm not sure if we should add that or not.

I'd very much like to see tests added for the UTF-16 BOM characters too. Let me know if you need help generating those files.

As for other encodings - as the sniff does not look for those, we don't need to bother with tests for that.
Also keep in mind that projects can specify the file encoding used in their files, but that PHPCS defaults to UTF-8 (since PHPCS 3.0.0). In my estimation, it will be exceedingly rare to see PHP files encoded in anything other than ANSI/iso-8859-1 or UTF-8, though it is, of course, possible.
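For context on the default mentioned above: since PHPCS 3.0.0 the assumed encoding is UTF-8, and projects can override it per run or per ruleset. A hypothetical ruleset fragment (the standard name is made up; the `encoding` argument and sniff reference are real PHPCS features) might look like:

```xml
<?xml version="1.0"?>
<ruleset name="MyStandard">
    <!-- PHPCS assumes utf-8 by default since 3.0.0; this makes it explicit. -->
    <arg name="encoding" value="utf-8"/>
    <!-- The sniff under discussion in this PR. -->
    <rule ref="Generic.Files.ByteOrderMark"/>
</ruleset>
```

The same setting is available on the command line as `--encoding=utf-8`.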

@rodrigoprimo rodrigoprimo force-pushed the test-coverage-byte-order-mask branch 2 times, most recently from cf65b45 to 470b05f Compare January 24, 2024 20:27
@rodrigoprimo
Contributor Author

I'd very much like to see tests added for the UTF-16 BOM characters too. Let me know if you need help generating those files.

I think I managed to create the files with the UTF-16 BE and UTF-16 LE BOM characters, but there is a significant chance that I missed something, as this was the first time that I was playing with these encodings. So any help is appreciated.

I used the following script to create the files (changing the encoding in the iconv() call):

<?php

// Create a test case file that starts with a UTF-16 BE byte order mark.
$file = fopen('src/Standards/Generic/Tests/Files/ByteOrderMarkUnitTest.5.inc', 'w');

// Write the UTF-16 BE BOM; for the LE variant this would be "\xFF\xFE".
fwrite($file, "\xFE\xFF");

$phpCode = "<?php\n// File with a UTF-16 Big Endian byte order mark (BOM) character before the opening PHP tag.";
// Convert the UTF-8 source to UTF-16 BE ('UTF-16LE' for the LE variant).
$phpCodeUtf16 = iconv('UTF-8', 'UTF-16BE', $phpCode);
fwrite($file, $phpCodeUtf16);

fclose($file);

And I checked the files using file and od -c:

$ file src/Standards/Generic/Tests/Files/ByteOrderMarkUnitTest.5.inc
src/Standards/Generic/Tests/Files/ByteOrderMarkUnitTest.5.inc: PHP script, Unicode text, UTF-16, big-endian text
$ od -c src/Standards/Generic/Tests/Files/ByteOrderMarkUnitTest.5.inc
0000000 376 377  \0   <  \0   ?  \0   p  \0   h  \0   p  \0  \n  \0   /
0000020  \0   /  \0      \0   F  \0   i  \0   l  \0   e  \0      \0   w
0000040  \0   i  \0   t  \0   h  \0      \0   a  \0      \0   U  \0   T
0000060  \0   F  \0   -  \0   1  \0   6  \0      \0   B  \0   i  \0   g
0000100  \0      \0   E  \0   n  \0   d  \0   i  \0   a  \0   n  \0    
0000120  \0   b  \0   y  \0   t  \0   e  \0      \0   o  \0   r  \0   d
0000140  \0   e  \0   r  \0      \0   m  \0   a  \0   r  \0   k  \0    
0000160  \0   (  \0   B  \0   O  \0   M  \0   )  \0      \0   c  \0   h
0000200  \0   a  \0   r  \0   a  \0   c  \0   t  \0   e  \0   r  \0    
0000220  \0   b  \0   e  \0   f  \0   o  \0   r  \0   e  \0      \0   t
0000240  \0   h  \0   e  \0      \0   o  \0   p  \0   e  \0   n  \0   i
0000260  \0   n  \0   g  \0      \0   P  \0   H  \0   P  \0      \0   t
0000300  \0   a  \0   g  \0   .  \0   ;
0000310

One potential problem with the files that I created is that when using iconv(), I specified that the encoding should be BE or LE. I don't know whether that defeats the purpose of the BOM character. I first tried creating the file using iconv('UTF-8', 'UTF-16', $phpCode), but that resulted in a file that the file utility did not identify as UTF-16 encoded, so I didn't proceed with that approach.
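For reference, the detection the sniff performs amounts to comparing the first bytes of the file against the known BOM signatures. The sketch below is an assumption about the approach, not the sniff's actual code; the function name detectBom is made up for illustration:

```php
<?php

// The three BOM signatures the sniff looks for; other encodings are ignored.
$bomDefinitions = [
    'UTF-8'       => "\xEF\xBB\xBF",
    'UTF-16 (BE)' => "\xFE\xFF",
    'UTF-16 (LE)' => "\xFF\xFE",
];

// Return the name of the BOM found at the very start of $contents, or null.
function detectBom(string $contents, array $bomDefinitions): ?string
{
    foreach ($bomDefinitions as $name => $bom) {
        if (strncmp($contents, $bom, strlen($bom)) === 0) {
            return $name;
        }
    }

    return null; // No BOM found.
}

echo detectBom("\xFE\xFF\x00<", $bomDefinitions); // UTF-16 (BE)
```

The first bytes of the od -c output above ("376 377", i.e. 0xFE 0xFF) are exactly the UTF-16 BE signature this comparison matches.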

@jrfnl
Member

jrfnl commented Jan 25, 2024

@rodrigoprimo UTF-16 always needs to be LE or BE; UTF-16 without an endianness indicator is not a valid encoding.

I have checked the encoding of the files (using NotePad++ which has a great little feature for that) and they look fine to me.

I also checked the encoding of the other files - the 1.inc file is, as advertised, UTF-8 with BOM; 2.inc and 3.inc are both ANSI/iso-8859-1 (UTF-8 compliant). We could consider changing one of these to UTF-8 without BOM, as that looks to be the only case not covered now.

@rodrigoprimo
Contributor Author

I have checked the encoding of the files (using NotePad++ which has a great little feature for that) and they look fine to me.

Thanks for checking!

I also checked the encoding of the other files - the 1.inc file is, as advertised, UTF-8 with BOM; 2.inc and 3.inc are both ANSI/iso-8859-1 (UTF-8 compliant). We could consider changing one of these to UTF-8 without BOM, as that looks to be the only case not covered now.

Sure, I can do that. I guess the easiest way is to add a character outside of the ASCII range (together with a comment explaining why it is there and why it should not be removed) to either 2.inc or 3.inc. Or do you have a different approach in mind?

Outside the scope of this PR, while checking the sniff code once again, I realized that it might make sense to return the number of tokens instead of void if $stackPtr !== 0. This way, the sniff is not unnecessarily called multiple times if there are multiple T_INLINE_HTML in a file. A tiny performance improvement. Happy to create an issue or a PR for that if you agree.

@jrfnl
Member

jrfnl commented Jan 25, 2024

Sure, I can do that. I guess the easiest way is to add a character outside of the ASCII range (together with a comment explaining why it is there and why it should not be removed) to either 2.inc or 3.inc. Or do you have a different approach in mind?

That would very much depend on what editor you use to make that change and whether it automatically changes the file encoding if needed. It would also depend on the character used as files in iso-8859-1 can contain characters outside ASCII too...

Shall I make it easier and just add a commit to sort this out with a file encoding conversion?

Outside the scope of this PR, while checking the sniff code once again, I realized that it might make sense to return the number of tokens instead of void if $stackPtr !== 0. This way, the sniff is not unnecessarily called multiple times if there are multiple T_INLINE_HTML in a file. A tiny performance improvement. Happy to create an issue or a PR for that if you agree.

💯 Good catch! I made a change like that for a number of other sniffs not too long ago (PR #95).

For this sniff, I think that change should be made for all exit points of the function - not just the $stackPtr !== 0 case - as any file only needs to be checked once for a BOM.

Go right ahead and create a PR for this 👍🏻
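The mechanism being discussed can be simulated outside of PHPCS. This is an assumption about the shape of the dispatch loop, not PHPCS's real code: when process() returns a stack pointer, the runner skips ahead to it, so returning the token count means the sniff runs at most once per file even when the file contains several T_INLINE_HTML tokens.

```php
<?php

// Simplified simulation of how a sniff's return value short-circuits
// further calls. $process stands in for ByteOrderMark::process().
$tokens = ['T_INLINE_HTML', 'T_OPEN_TAG', 'T_INLINE_HTML', 'T_INLINE_HTML'];
$calls  = 0;

$process = function (int $stackPtr) use ($tokens, &$calls): int {
    $calls++;
    // The BOM check would happen here, and only when $stackPtr === 0.
    return count($tokens); // Skip past the last token: done with this file.
};

for ($ptr = 0; $ptr < count($tokens); $ptr++) {
    if ($tokens[$ptr] === 'T_INLINE_HTML') {
        $ptr = $process($ptr); // Dispatcher jumps to the returned pointer.
    }
}

echo $calls; // 1 (without the early return it would be 3)
```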

@rodrigoprimo
Contributor Author

Shall I make it easier and just add a commit to sort this out with a file encoding conversion?

Works for me. Thanks.

(I failed to provide more details, but I had checked that my editor defaults to UTF-8 when characters outside of the ASCII range are added to a file.)

💯 Good catch! I made a change like that for a number of other sniffs not too long ago (PR #95).
For this sniff, I think that change should be made for all exit points of the function - not just the $stackPtr !== 0 case - as any file only needs to be checked once for a BOM.
Go right ahead and create a PR for this 👍🏻

Ok, I will create a PR for this.

@jrfnl
Member

jrfnl commented Jan 26, 2024

Shall I make it easier and just add a commit to sort this out with a file encoding conversion?

Works for me. Thanks.

I'll do that now and will rebase the PR (and squash some commits) before merging.

To add more tests that need to go into separate files.
This commit adds four more test case files:
* 2 to cover the cases where the sniff is triggered but the file does not contain a BOM character.
    One of these files is in UTF-8 encoding without a BOM, the other is in ANSI/iso-8859-1 encoding.
* 2 to cover the different UTF-16 encodings with a BOM character.

Includes adding a comment to the original test case file to document the encoding (UTF-8 with BOM) of that file.
@jrfnl jrfnl force-pushed the test-coverage-byte-order-mask branch from 470b05f to a9642e5 Compare January 26, 2024 14:33
@jrfnl jrfnl merged commit 6f8efd4 into PHPCSStandards:master Jan 26, 2024
38 checks passed
@rodrigoprimo rodrigoprimo deleted the test-coverage-byte-order-mask branch January 26, 2024 17:23