[Bug]: Unicode replacement characters cause crash upon closing #570

Open
3 tasks done
Knopfi02 opened this issue Oct 31, 2024 · 34 comments
Labels
Priority: High An important issue requiring attention System: Windows For Microsoft Windows TagStudio: Library Relating to the TagStudio library system Type: Bug Something isn't working as intended Type: File System File system interactions

Comments

@Knopfi02

Checklist

  • I am using an up-to-date version.
  • I have read the documentation.
  • I have searched existing issues.

TagStudio Version

Alpha 9.4.1

Operating System & Version

Windows 10, 22H2 Build:19045.5011

Description

I had some shortcuts in the database that linked to other files in the database. TagStudio was able to show the files, but upon closing it crashed with a UnicodeEncodeError whose message included "'utf-8' codec can't encode character" and "surrogates not allowed". This also prevented TagStudio from saving the database entirely, and the saved file was empty.
I made a small Python script that tried decoding every file and encoding it with UTF-8, and found out that the .lnk shortcut files were causing the issue. After deleting them my script didn't show any errors and TagStudio didn't crash anymore (and was also able to actually save the database).
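
The reporter's script is not shown in the thread. A minimal sketch of that kind of check, assuming it tests file and directory names (not file contents) and using the hypothetical helper names `is_utf8_encodable` and `find_unencodable_names`, could look like this:

```python
import os


def is_utf8_encodable(name):
    """Return True if the string can be encoded as UTF-8 without errors."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def find_unencodable_names(root):
    """Walk root and collect every path whose name fails UTF-8 encoding."""
    bad_paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_utf8_encodable(name):
                bad_paths.append(os.path.join(dirpath, name))
    return bad_paths
```

Calling `find_unencodable_names` on the library directory would list candidate files to rename or remove before saving.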

Expected Behavior

Not crash / successfully save the database

Steps to Reproduce

  1. Have file shortcuts in database
  2. Close TagStudio
    or
  3. Try saving the database and check if it was successful

Logs

No response

@Knopfi02 Knopfi02 added the Type: Bug Something isn't working as intended label Oct 31, 2024
@CyanVoxel CyanVoxel added TagStudio: Library Relating to the TagStudio library system Priority: High An important issue requiring attention Type: File System File system interactions System: Windows For Microsoft Windows labels Oct 31, 2024
@CyanVoxel CyanVoxel moved this to 🛠 Ready for Development in TagStudio Development Oct 31, 2024
@Knopfi02
Author

Knopfi02 commented Nov 21, 2024

I've tried it with some other random files and their .lnk shortcuts, and those didn't seem to cause an issue. Maybe it only happens with certain filenames?

@seakrueger
Collaborator

Would you mind sharing the file names that you deleted to get your script to pass?

@Computerdores
Contributor

I would guess that this has nothing to do with .lnk files in particular.
NTFS (the default filesystem on Windows) stores filenames in UTF-16, so judging by the error message the .lnk probably had UTF-16 surrogate characters in its name, which then couldn't be encoded into UTF-8 when TagStudio tried to save the library.
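
As an illustration of that guess (not a confirmed reproduction of the reporter's filename), Python refuses to encode a string containing an unpaired surrogate as UTF-8. The filename below is made up for the example:

```python
# Hypothetical filename containing a lone high surrogate,
# i.e. one half of a UTF-16 surrogate pair.
lone_surrogate_name = "shortcut_\ud83d.lnk"

try:
    lone_surrogate_name.encode("utf-8")
    error_reason = None
except UnicodeEncodeError as exc:
    error_reason = exc.reason  # the "surrogates not allowed" part of the message

# Two standard error handlers that avoid the exception:
# "replace" is lossy, "surrogatepass" keeps the surrogate as raw bytes.
replaced = lone_surrogate_name.encode("utf-8", errors="replace")
passed = lone_surrogate_name.encode("utf-8", errors="surrogatepass")
```

Whether either handler would be appropriate for a library file is a design decision for the maintainers; `surrogatepass` output is not strictly valid UTF-8 and would need the same handler when read back.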

@Knopfi02
Author

Knopfi02 commented Nov 21, 2024

I would guess that this has nothing to do with .lnk files in particular. NTFS (the default filesystem on Windows) stores filenames in UTF-16, so judging by the error message the .lnk probably had UTF-16 surrogate characters in its name, which then couldn't be encoded into UTF-8 when TagStudio tried to save the library.

Wouldn't the original file that the .lnk is named after also cause the problem?

Ok, so I tracked down a single U+1F534 (the red circle emoji) that was in a filename.
HOWEVER, that wasn't the problem! When making a shortcut of that file, the red circle gets encoded as two replacement characters (U+FFFD) instead of its actual UTF-16 encoding, 0xD83D 0xDD34.

THAT is why only the .lnk files were a problem. Because only those had "��" in them!
(I'm guessing that this might also happen with other emojis or special unicode characters)
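
For reference, the surrogate pair quoted above follows from the standard UTF-16 rule for code points above U+FFFF. A small sketch (the function name `utf16_surrogates` is made up for this example):

```python
def utf16_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = utf16_surrogates(0x1F534)  # the red circle emoji
```

Here `high` and `low` come out as 0xD83D and 0xDD34, matching the bytes of `"\U0001F534".encode("utf-16-be")`.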

@python357-1
Collaborator

@Knopfi02 two things:

  1. are you still having this issue?
  2. would you be willing to share your library with us? (not the files, just the json file - <library_directory>/.TagStudio/ts_library.json)

@python357-1
Collaborator

additionally, would you be willing to provide a screenshot of the error you are getting?

@CyanVoxel
Member

This seems related to, if not the same as, the issue in #350. Unfortunately I have not been able to reproduce these results so far (Win 10, TS 9.4.1 + 9.5).

@Knopfi02
Author

@python357-1 since the issue causes it to not save properly, the ts_library file is empty.
image

All I need to do to cause the crash is have a file with a � in the filename in the library. If I then close it (which means it tries to save the library), the error appears. Manually saving the library (or hitting the "close library" button) instead of closing the entire program gets it stuck on the "Saving Library..." (or "Closing & Saving Library...") text in the bottom left. At that point, closing the program also makes the error appear.
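
TagStudio's actual save code is not shown in this thread. As a general sketch of how a failed save can be kept from leaving an empty library file, one common pattern is to write to a temporary file and only swap it in on success. The function name `save_json_atomically` and the use of `ensure_ascii=False` are assumptions made for this example:

```python
import json
import os
import tempfile


def save_json_atomically(data, target_path):
    """Write JSON to a temp file, then replace target_path only on success."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            # An unencodable string raises UnicodeEncodeError here,
            # before the existing target file has been touched.
            json.dump(data, handle, ensure_ascii=False)
        os.replace(tmp_path, target_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern the error would still surface, but the previous contents of the library file would survive the failed save.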

I have no idea why the lnk turns emojis into �� on my system. I have version 15 of the Noto Color Emoji font installed, but that also doesn't seem to be the cause.

@python357-1
Collaborator

Would you be able to do a screen recording of you making the shortcut to the file, or a list of steps you're doing to make the link?

@Knopfi02
Author

Have a file with emoji in name -> use context menu (or drag n drop while holding alt) to create shortcut -> shortcut file has �� in its name instead of the emoji

Also, apart from the shortcut weirdness, do the unicode replacement characters cause the issue for you as well?

@python357-1
Collaborator

Unfortunately I'm not on windows, but cyan is trying the same stuff and tagstudio is not having issues. However, he is able to alt-drag to create a shortcut and the shortcut file has the actual emoji character in the name, not the "��" characters, which is interesting to me. Possibly unrelated, but still interesting

@CyanVoxel
Member

Have a file with emoji in name -> use context menu (or drag n drop while holding alt) to create shortcut -> shortcut file has �� in its name instead of the emoji

Also, apart from the shortcut weirdness, do the unicode replacement characters cause the issue for you as well?

image
image

Here are my tests of this - strangely not running into the same issue... (Windows 10 22H2 19045.5131)

@python357-1
Collaborator

python357-1 commented Nov 24, 2024

@Knopfi02 Can you try going to cmd/powershell and typing pip list and sending the result?

@python357-1
Collaborator

python357-1 commented Nov 24, 2024

@Knopfi02 nevermind, not that - try going to the directory where you extracted the downloaded TagStudio to and you should find a directory called _internal. Open that in cmd/powershell (you can right-click if you are in explorer), and type dir in cmd or ls in powershell.

It may be easier to send the results in a file - you can type ls > output.txt and send output.txt instead.

@Knopfi02
Author

@CyanVoxel what happens if you just put a � in the filename directly? Since that is the actual thing causing the unicode error.

@python357-1
These are all of them:
_asyncio.pyd
_bz2.pyd
_ctypes.pyd
_decimal.pyd
_elementtree.pyd
_hashlib.pyd
_lzma.pyd
_multiprocessing.pyd
_overlapped.pyd
_pillow_heif.cp312-win_amd64.pyd
_queue.pyd
_socket.pyd
_ssl.pyd
_testcapi.pyd
_testinternalcapi.pyd
_tkinter.pyd
_uuid.pyd
_wmi.pyd

@CyanVoxel
Member

@CyanVoxel what happens if you just put a � in the filename directly? Since that is the actual thing causing the unicode error.

I've done that as well, no issues. Both with the red circle and that character.

@python357-1
Collaborator

@Knopfi02 There should be a lot more than that - try the ls > output.txt command and send the contents of that file, maybe through something like https://pastebin.com/

@Knopfi02
Author

Oh, yeah my bad. Here's the output https://pastebin.com/KJSsQvfv

@python357-1
Collaborator

Can you try running Get-UICulture in Powershell?

@Knopfi02
Author

Knopfi02 commented Nov 24, 2024

LCID         Name         DisplayName
-----        -------      ---------------
1033         en-US        English (United States)

@python357-1
Collaborator

I know I've just been asking you to do a lot, so I'll tell you a little bit about what I'm thinking - there is another issue (#350) that seems to be having similar problems. My thought was that you were also using a different locale, because your dates are formatted DD.MM.YYYY instead of the default MM/DD/YYYY for en-US. Since you aren't on a different locale, I am kind of stumped here. I'm gonna keep looking into this, but for now we may be stuck.

@python357-1
Collaborator

one last try - what is the complete filename of the file that includes U+1F534?

@Knopfi02
Author

Literally just the red circle, nothing else.
Even if it's normal text with the circle somewhere in the middle, it still gets replaced when making a shortcut.

@python357-1
Collaborator

what type of file is it?

@Knopfi02
Author

txt, but same happens for other file types.

@python357-1
Collaborator

what happens if you name a file "🎞" (U+1F39E) and try to save the library?

@Knopfi02
Author

works, no problem

@python357-1
Collaborator

how did you create the file with the red circle (walk me through the steps)? did you create the U+1F39E file the same way? (if not, walk me through the steps)

@Knopfi02
Author

Knopfi02 commented Nov 24, 2024

I copy-paste the symbol into the filename, that's it. And yes, I created the other file the same way.

@python357-1
Collaborator

What I'm looking for is the difference in how the files got saved. Both U+1F534 and U+1F39E, when encoded as UTF-16, include a code unit that is invalid in UTF-8 (0xD83D and 0xD83C, respectively). The fact that U+1F39E works and U+1F534 doesn't tells me that the files somehow got saved differently: the ones causing you trouble were encoded as UTF-16, and the new film one got saved as UTF-8.

What application did you originally save the red circle file with? What about the film one?

There is sometimes an option in the "Save as..." dialog to save a file as UTF-16. Might you have done that at some point?
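
As a quick check of the code units mentioned above, the following sketch (the helper name `encodings_of` is made up) prints nothing but collects the UTF-16 and UTF-8 forms of both emojis. In Python, at least, a complete surrogate pair decodes to a single code point that encodes to UTF-8 without error; only an unpaired half is rejected:

```python
def encodings_of(character):
    """Return the UTF-16 (big-endian) and UTF-8 bytes of one character."""
    return {
        "utf-16-be": character.encode("utf-16-be"),
        "utf-8": character.encode("utf-8"),
    }


red_circle = encodings_of("\U0001F534")
film_frames = encodings_of("\U0001F39E")

# Decoding the full UTF-16 pair gives back a single code point.
round_trip = red_circle["utf-16-be"].decode("utf-16-be")
```

Both characters start with a high surrogate in UTF-16 (0xD83D and 0xD83C) and both have a valid four-byte UTF-8 form.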

@CyanVoxel
Member

@Knopfi02 If you're interested in and comfortable with compiling TagStudio from source, there's a couple things I could have you try. One would be to try building from main (9.5 experimental) to see if the issue still persists there. The other would be to build from a new branch I could set up to test one or two changes. If not, that's totally understandable. I just wish I was able to reproduce this error myself - everything just works as intended on my system.

@Knopfi02
Author

The red circle DOES work. All emojis that I've tried have worked. It's just that as soon as I make a shortcut in Windows, the emojis appear as Unicode replacement characters in the shortcut's filename.
Those replacement characters are what's causing TagStudio to crash. So if I make a file and just put a replacement character in its name directly, the issue is there.

@CyanVoxel I'd be happy to try that, but I'll go to sleep now since it's 5am in my timezone :P
I'll try it tomorrow

@CyanVoxel
Member

The red circle DOES work. All emojis that I've tried have worked. It's just that as soon as I make a shortcut in Windows, the emojis appear as Unicode replacement characters in the shortcut's filename. Those replacement characters are what's causing TagStudio to crash. So if I make a file and just put a replacement character in its name directly, the issue is there.

Wait, so does Windows itself create the odd characters when creating the .lnk filename? It's not just an issue of TagStudio misreading the filename?

@CyanVoxel I'd be happy to try that, but I'll go to sleep now since it's 5am in my timezone :P I'll try it tomorrow

Thank you! And have a good night!

@Knopfi02
Author

Knopfi02 commented Nov 24, 2024

Yeah! The only problem with TagStudio is that it has issues with files that have Unicode replacement characters in their name.
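
One way to check what such a filename actually contains is to list its code points. The sketch below uses made-up helper names (`code_points`, `has_surrogate`). Note that in Python a genuine U+FFFD encodes to UTF-8 without error, so a "surrogates not allowed" failure may point to an unpaired surrogate that is merely displayed as �. That is an assumption about this report, not a confirmed diagnosis:

```python
def code_points(name):
    """List the code points of a filename as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in name]


def has_surrogate(name):
    """Return True if the string contains any surrogate code point."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in name)


# A real replacement character is ordinary text and encodes cleanly.
replacement_bytes = "\ufffd".encode("utf-8")

# Hypothetical shortcut names for comparison.
with_replacement = "\ufffd\ufffd.lnk"
with_surrogate = "\ud83d.lnk"
```

Running `code_points` on the name that TagStudio reads from disk would show whether it holds U+FFFD or something in the U+D800 to U+DFFF range.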

@Knopfi02 Knopfi02 changed the title [Bug]: Windows Shortcut files (.lnk) cause crash upon closing [Bug]: Unicode replacement characters cause crash upon closing Nov 24, 2024
Projects
Status: 🛠 Ready for Development
Development

No branches or pull requests

5 participants