
String tensor vs utf8 encoding #422

Open
sir-earl opened this issue Feb 16, 2024 · 4 comments

Comments

@sir-earl

I'm trying to use raw_ops::decode_image to load an image directly from a u8 slice (as opposed to from a file, as in the example), but it seems I must first convert the slice to a scalar string tensor.

It appears I can make this work with something like:

let s = unsafe { String::from_utf8_unchecked(image_bytes.to_vec()) };

My concern is that Rust expects every String to be valid UTF-8, which these image bytes certainly are not.

Am I missing something obvious? Is there a better way to approach this?

@dskkato
Contributor

dskkato commented Feb 16, 2024

There is no need to validate the bytes as UTF-8, since TensorFlow uses strings as byte buffer containers.

How about using raw_ops::read_file if you want to decode the image through this TensorFlow wrapper?

https://github.com/tensorflow/rust/blob/master/examples%2Fmobilenetv3.rs#L38-L38

@sir-earl
Author

sir-earl commented Feb 16, 2024

For my use case, the file is already in memory (received via network) so it would be wasteful to load it from disk with raw_ops::read_file.

My concern with putting non-UTF-8 bytes into a String is that Rust is likely to be unhappy, and may for instance return the wrong length, causing data corruption or other undefined behaviour.
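To make the concern concrete, here is a minimal standard-library sketch (no TensorFlow involved). The helper names `is_valid_utf8` and `lossy_len` are made up for illustration. It shows that the safe conversions either reject typical image bytes or rewrite them, which is why only the unchecked conversion appears to "work":

```rust
/// Returns true when `bytes` could be stored in a `String` without
/// violating its UTF-8 invariant. (Hypothetical helper for illustration.)
pub fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Byte length of the text produced by a lossy conversion. Invalid
/// sequences are replaced with U+FFFD, so the result can differ from
/// `bytes.len()`, meaning the original data is no longer intact.
pub fn lossy_len(bytes: &[u8]) -> usize {
    String::from_utf8_lossy(bytes).len()
}

/// First four bytes of a typical JPEG file, used here as sample input.
pub const JPEG_MAGIC: [u8; 4] = [0xFF, 0xD8, 0xFF, 0xE0];
```

With `JPEG_MAGIC` as input, `is_valid_utf8` returns false and `lossy_len` no longer matches the input length, so neither safe path preserves the image data.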

It feels like using a different type to represent the string data type might be more sensible, especially given the hoops required to convert a Rust byte container to a Rust String.

@dskkato
Contributor

dskkato commented Feb 16, 2024

As the raw_ops namespace indicates, this Op is merely a thin Rust wrapper around TensorFlow's functionality. For further details on this API, please refer to the following documentation:

https://www.tensorflow.org/api_docs/python/tf/raw_ops/DecodeImage

While it might be possible to wrap raw_ops to create a more Rust-like API, currently, nobody seems to have undertaken that effort.

@adamcrume
Contributor

I've tried to convert a rank-1 tensor of dtype=uint8 to a rank-0 tensor of dtype=string using Cast, ReduceJoin, and DecodeRaw, and they all fail. There doesn't seem to be any way to convert individual bytes to a string in TensorFlow (i.e. the inverse of BytesSplit).

I think we actually need to introduce either Tensor<Vec<u8>> or TString and Tensor<TString> (analogous to OsString or CString) and deprecate Tensor<String> since TensorFlow strings are not necessarily UTF-8.
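As a rough idea of what such a type could look like, here is a self-contained sketch. This `TString` is hypothetical: it is not an existing type in the tensorflow crate, and the method names are assumptions modelled loosely on OsString and CString. The point is only that a byte-backed type needs no unsafe code to hold arbitrary data:

```rust
/// Hypothetical byte-string type (NOT part of the tensorflow crate).
/// It owns arbitrary bytes and never promises that they are UTF-8.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct TString(Vec<u8>);

impl TString {
    /// Borrows the raw bytes exactly as they were stored.
    pub fn as_bytes(&self) -> &[u8] {
        &self.0
    }

    /// Attempts to view the bytes as text; fails if they are not valid UTF-8.
    pub fn to_str(&self) -> Result<&str, std::str::Utf8Error> {
        std::str::from_utf8(&self.0)
    }

    /// Consumes the value and returns the underlying buffer.
    pub fn into_bytes(self) -> Vec<u8> {
        self.0
    }
}

impl From<Vec<u8>> for TString {
    fn from(bytes: Vec<u8>) -> Self {
        TString(bytes)
    }
}

impl From<&[u8]> for TString {
    fn from(bytes: &[u8]) -> Self {
        TString(bytes.to_vec())
    }
}

impl From<&str> for TString {
    fn from(text: &str) -> Self {
        TString(text.as_bytes().to_vec())
    }
}
```

With a type like this, in-memory image bytes could be wrapped with `TString::from(image_bytes)` and UTF-8 validation would only happen when a caller explicitly asks for text via `to_str`.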

Your concern about calling from_utf8_unchecked on something that is not valid UTF-8 is quite valid. The docs say:

Constructing a non-UTF-8 string slice is not immediate undefined behavior, but any function called on a string slice may assume that it is valid UTF-8, which means that a non-UTF-8 string slice can lead to undefined behavior down the road.
