
Generic/ByteOrderMark: improve code coverage #278

Merged

Conversation

rodrigoprimo
Contributor

Description

This PR improves code coverage for the Generic.Files.ByteOrderMark sniff. It adds two more tests, each in a separate file, to exercise the code that handles files that do not contain a BOM character.

There are no tests for the UTF-16 BOM characters, and the sniff does not support BOM characters used in other UTF encodings. I don't know much about this topic, so I'm not sure if we should add that or not.

Related issues/external references

Part of #146

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
    • This change is only breaking for integrators, not for external standards or end-users.
  • Documentation improvement

PR checklist

  • I have checked there is no other PR open for the same change.
  • I have read the Contribution Guidelines.
  • I grant the project the right to include and distribute the code under the BSD-3-Clause license (and I have the right to grant these rights).
  • I have added tests to cover my changes.
  • I have verified that the code complies with the project's coding standards.
  • [Required for new sniffs] I have added XML documentation for the sniff.

Member

@jrfnl jrfnl left a comment


Thanks for working on this @rodrigoprimo. Looking good.

There are no tests for the UTF-16 BOM characters, and the sniff does not support BOM characters used in other UTF encodings. I don't know much about this topic, so I'm not sure if we should add that or not.

I'd very much like to see tests added for the UTF-16 BOM characters too. Let me know if you need help generating those files.

As for other encodings - as the sniff does not look for those, we don't need to bother with tests for that.
Also keep in mind that projects can specify the file encoding used in their files, but that PHPCS defaults to UTF-8 (since PHPCS 3.0.0). In my estimation, it will be exceedingly rare to see PHP files encoded in anything other than ANSI/iso-8859-1 or UTF-8, though it is, of course, possible.
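For context on the default mentioned above: since PHPCS 3.0.0 the assumed encoding is UTF-8, and projects can override it per run or per ruleset. A hypothetical ruleset fragment (the standard name is made up; the `encoding` argument and sniff reference are real PHPCS features) might look like:

```xml
<?xml version="1.0"?>
<ruleset name="MyStandard">
    <!-- PHPCS assumes utf-8 by default since 3.0.0; this makes it explicit. -->
    <arg name="encoding" value="utf-8"/>
    <!-- The sniff under discussion in this PR. -->
    <rule ref="Generic.Files.ByteOrderMark"/>
</ruleset>
```

The same setting is available on the command line as `--encoding=utf-8`.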

@rodrigoprimo rodrigoprimo force-pushed the test-coverage-byte-order-mask branch 2 times, most recently from cf65b45 to 470b05f Compare January 24, 2024 20:27
@rodrigoprimo
Contributor Author

I'd very much like to see tests added for the UTF-16 BOM characters too. Let me know if you need help generating those files.

I think I managed to create the files with the UTF-16 BE and UTF-16 LE BOM characters, but there is a significant chance that I missed something, as this was the first time that I was playing with these encodings. So any help is appreciated.

I used the following script to create the files (changing the encoding in the iconv() call):

<?php

// Create a test case file that starts with a UTF-16 BE byte order mark.
$file = fopen('src/Standards/Generic/Tests/Files/ByteOrderMarkUnitTest.5.inc', 'w');

// Write the UTF-16 BE BOM; for the LE variant this would be "\xFF\xFE".
fwrite($file, "\xFE\xFF");

$phpCode = "<?php\n// File with a UTF-16 Big Endian byte order mark (BOM) character before the opening PHP tag.";
// Convert the UTF-8 source to UTF-16 BE ('UTF-16LE' for the LE variant).
$phpCodeUtf16 = iconv('UTF-8', 'UTF-16BE', $phpCode);
fwrite($file, $phpCodeUtf16);

fclose($file);

And I checked the files using file and od -c:

$ file src/Standards/Generic/Tests/Files/ByteOrderMarkUnitTest.5.inc
src/Standards/Generic/Tests/Files/ByteOrderMarkUnitTest.5.inc: PHP script, Unicode text, UTF-16, big-endian text
$ od -c src/Standards/Generic/Tests/Files/ByteOrderMarkUnitTest.5.inc
0000000 376 377  \0   <  \0   ?  \0   p  \0   h  \0   p  \0  \n  \0   /
0000020  \0   /  \0      \0   F  \0   i  \0   l  \0   e  \0      \0   w
0000040  \0   i  \0   t  \0   h  \0      \0   a  \0      \0   U  \0   T
0000060  \0   F  \0   -  \0   1  \0   6  \0      \0   B  \0   i  \0   g
0000100  \0      \0   E  \0   n  \0   d  \0   i  \0   a  \0   n  \0    
0000120  \0   b  \0   y  \0   t  \0   e  \0      \0   o  \0   r  \0   d
0000140  \0   e  \0   r  \0      \0   m  \0   a  \0   r  \0   k  \0    
0000160  \0   (  \0   B  \0   O  \0   M  \0   )  \0      \0   c  \0   h
0000200  \0   a  \0   r  \0   a  \0   c  \0   t  \0   e  \0   r  \0    
0000220  \0   b  \0   e  \0   f  \0   o  \0   r  \0   e  \0      \0   t
0000240  \0   h  \0   e  \0      \0   o  \0   p  \0   e  \0   n  \0   i
0000260  \0   n  \0   g  \0      \0   P  \0   H  \0   P  \0      \0   t
0000300  \0   a  \0   g  \0   .  \0   ;
0000310

One potential problem with the files that I created is that when using iconv(), I specified that the encoding should be BE or LE. I don't know whether that defeats the purpose of the BOM character. I first tried creating the file using iconv('UTF-8', 'UTF-16', $phpCode), but that resulted in a file that the file utility did not identify as UTF-16 encoded, so I didn't proceed with that approach.
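For reference, the detection the sniff performs amounts to comparing the first bytes of the file against the known BOM signatures. The sketch below is an assumption about the approach, not the sniff's actual code; the function name detectBom is made up for illustration:

```php
<?php

// The three BOM signatures the sniff looks for; other encodings are ignored.
$bomDefinitions = [
    'UTF-8'       => "\xEF\xBB\xBF",
    'UTF-16 (BE)' => "\xFE\xFF",
    'UTF-16 (LE)' => "\xFF\xFE",
];

// Return the name of the BOM found at the very start of $contents, or null.
function detectBom(string $contents, array $bomDefinitions): ?string
{
    foreach ($bomDefinitions as $name => $bom) {
        if (strncmp($contents, $bom, strlen($bom)) === 0) {
            return $name;
        }
    }

    return null; // No BOM found.
}

echo detectBom("\xFE\xFF\x00<", $bomDefinitions); // UTF-16 (BE)
```

The first bytes of the od -c output above ("376 377", i.e. 0xFE 0xFF) are exactly the UTF-16 BE signature this comparison matches.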

@jrfnl
Member

jrfnl commented Jan 25, 2024

@rodrigoprimo UTF-16 always needs to be LE or BE; UTF-16 without an endianness indicator is not a valid encoding.

I have checked the encoding of the files (using NotePad++ which has a great little feature for that) and they look fine to me.

I also checked the encoding of the other files - the 1.inc file is, as advertised, UTF-8 with BOM; 2.inc and 3.inc are both ANSI/iso-8859-1 (UTF-8 compliant). We could consider changing one of these to UTF-8 without BOM, as that looks to be the only case not covered now.

@rodrigoprimo
Contributor Author

I have checked the encoding of the files (using NotePad++ which has a great little feature for that) and they look fine to me.

Thanks for checking!

I also checked the encoding of the other files - the 1.inc file is, as advertised, UTF-8 with BOM; 2.inc and 3.inc are both ANSI/iso-8859-1 (UTF-8 compliant). We could consider changing one of these to UTF-8 without BOM, as that looks to be the only case not covered now.

Sure, I can do that. I guess the easiest way is to add a character outside of the ASCII range (together with a comment explaining why it is there and why it should not be removed) to either 2.inc or 3.inc. Or do you have a different approach in mind?

Outside the scope of this PR, while checking the sniff code once again, I realized that it might make sense to return the number of tokens instead of void if $stackPtr !== 0. This way, the sniff is not unnecessarily called multiple times if there are multiple T_INLINE_HTML in a file. A tiny performance improvement. Happy to create an issue or a PR for that if you agree.

@jrfnl
Member

jrfnl commented Jan 25, 2024

Sure, I can do that. I guess the easiest way is to add a character outside of the ASCII range (together with a comment explaining why it is there and why it should not be removed) to either 2.inc or 3.inc. Or do you have a different approach in mind?

That would very much depend on what editor you use to make that change and whether it automatically changes the file encoding if needed. It would also depend on the character used as files in iso-8859-1 can contain characters outside ASCII too...

Shall I make it easier and just add a commit to sort this out with a file encoding conversion?

Outside the scope of this PR, while checking the sniff code once again, I realized that it might make sense to return the number of tokens instead of void if $stackPtr !== 0. This way, the sniff is not unnecessarily called multiple times if there are multiple T_INLINE_HTML in a file. A tiny performance improvement. Happy to create an issue or a PR for that if you agree.

💯 Good catch! I made a change like that for a number of other sniffs not too long ago (PR #95).

For this sniff, I think that change should be made for all exit points of the function - not just the $stackPtr !== 0 case - as any file only needs to be checked once for a BOM.

Go right ahead and create a PR for this 👍🏻
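The mechanism being discussed can be simulated outside of PHPCS. This is an assumption about the shape of the dispatch loop, not PHPCS's real code: when process() returns a stack pointer, the runner skips ahead to it, so returning the token count means the sniff runs at most once per file even when the file contains several T_INLINE_HTML tokens.

```php
<?php

// Simplified simulation of how a sniff's return value short-circuits
// further calls. $process stands in for ByteOrderMark::process().
$tokens = ['T_INLINE_HTML', 'T_OPEN_TAG', 'T_INLINE_HTML', 'T_INLINE_HTML'];
$calls  = 0;

$process = function (int $stackPtr) use ($tokens, &$calls): int {
    $calls++;
    // The BOM check would happen here, and only when $stackPtr === 0.
    return count($tokens); // Skip past the last token: done with this file.
};

for ($ptr = 0; $ptr < count($tokens); $ptr++) {
    if ($tokens[$ptr] === 'T_INLINE_HTML') {
        $ptr = $process($ptr); // Dispatcher jumps to the returned pointer.
    }
}

echo $calls; // 1 (without the early return it would be 3)
```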

@rodrigoprimo
Contributor Author

Shall I make it easier and just add a commit to sort this out with a file encoding conversion?

Works for me. Thanks.

(I failed to provide more details, but I had checked that my editor defaults to UTF-8 when characters outside of the ASCII range are added to a file.)

💯 Good catch! I made a change like that for a number of other sniffs not too long ago (PR #95).
For this sniff, I think that change should be made for all exit points of the function - not just the $stackPtr !== 0 case - as any file only needs to be checked once for a BOM.
Go right ahead and create a PR for this 👍🏻

Ok, I will create a PR for this.

@jrfnl
Member

jrfnl commented Jan 26, 2024

Shall I make it easier and just add a commit to sort this out with a file encoding conversion?

Works for me. Thanks.

I'll do that now and will rebase the PR (and squash some commits) before merging.

To add more tests that need to go into separate files.
This commit adds four more test case files:
* 2 to cover the cases where the sniff is triggered but the file does not contain a BOM character.
    One of these files is in UTF-8 encoding without a BOM, the other is in ANSI/iso-8859-1 encoding.
* 2 to cover the different UTF-16 encodings with a BOM character.

Includes adding a comment to the original test case file to document the encoding (UTF-8 with BOM) of that file.
@jrfnl jrfnl force-pushed the test-coverage-byte-order-mask branch from 470b05f to a9642e5 Compare January 26, 2024 14:33
@jrfnl jrfnl merged commit 6f8efd4 into PHPCSStandards:master Jan 26, 2024
38 checks passed
@rodrigoprimo rodrigoprimo deleted the test-coverage-byte-order-mask branch January 26, 2024 17:23