Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect types passed to Py_BuildValue and PyArg_ParseTupleAndKeywords #193

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

QuLogic
Copy link

@QuLogic QuLogic commented Oct 19, 2024

Always using i is broken, as that is 4 bytes, but Py_ssize_t and Py_uintptr_t are 8 bytes.

On little endian machines, this is usually fine, since the first 4 bytes (of 4-byte or 8-byte integers) will be the same, but on big-endian machines, a 4-byte integer corresponds with the last 4 bytes of the 8-byte integer

For small numbers on big-endian machines, it is thus likely that the result will be 0. If the value is over 32 bits, then the result will be wrong on little-endian machines as well.

Fixes #142

In all cases, these calls told Python it was getting ints (`i`), which
are 4 bytes, but several cases passed either `Py_ssize_t` or
`Py_uintptr_t`, which are 8 bytes.

On little-endian machines, this will work, as the bytes align, but on
big-endian machine, Python will most probably see 0, since the top bytes
are more likely to be unfilled unless values are very large.
For a similar reason as `Py_BuildValue`, telling Python to use a smaller
sized value than actually passed may cause strange values to end up in
the result.
@QuLogic
Copy link
Author

QuLogic commented Oct 19, 2024

Here's a downstream build with this PR applied that works on all (little- and big-endian) architectures: https://src.fedoraproject.org/rpms/python-pyahocorasick/pull-request/3

@QuLogic
Copy link
Author

QuLogic commented Oct 27, 2024

There seems to be one more oddity at:

#ifdef PY3K
#ifdef AHOCORASICK_UNICODE
"(u#O)", /*key*/ iter->buffer + 1, depth,
#else
"(y#O)", /*key*/ iter->buffer + 1, depth,
#endif
#else
"(s#O)", /*key*/ iter->char_buffer + 1, depth,
#endif

vs:
// setup supported character set
#ifdef AHOCORASICK_UNICODE
# if defined PEP393_UNICODE || defined Py_UNICODE_WIDE
// Either Python uses UCS-4 or we don't know what Python uses,
// but we use UCS-4
# define TRIE_LETTER_TYPE uint32_t
# define TRIE_LETTER_SIZE 4
# else
// Python use UCS-2
# define TRIE_LETTER_TYPE uint16_t
# define TRIE_LETTER_SIZE 2
# define VARIABLE_LEN_CHARCODES 1
# endif
#else
// only bytes are supported
# define TRIE_LETTER_TYPE uint16_t
# define TRIE_LETTER_SIZE 2
#endif

where if AHOCORASICK_UNICODE is not defined, it states that it is const char * to Python, but the TRIE_LETTER_TYPE is uint16_t. I'm not sure if it's the former or latter that is wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tests test_iter3 and test_iter2 fails on big endian
1 participant