[BUG] search strings containing umlaut fails to find any results #1535

j-lakeman · 2024-04-12T14:53:57Z

Checks

I have read the troubleshooting section and still think this is a bug.

Describe the bug you encountered:

[I] ➜ ~ fd gung.html finds

Bestätigung.html
Downloads/Bestätigung.html

as expected.

But [I] ➜ ~ fd bestätigung doesn't find anything, even if run with --unrestricted.

This does not seem to be a general Unicode issue, as files and folders containing emoji are found properly.

Describe what you expected to happen:

Same output as first command

What version of `fd` are you using?

fd 9.0.0

Which operating system / distribution are you on?

Darwin 20.6.0 x86_64

Great application nevertheless! Love it!

tmccombs · 2024-04-12T17:35:29Z

This is likely a duplicate of #638

Is the search using U+0075 and U+0308 (a "u" followed by a combining diaeresis), but the filename uses U+00FC (a single ü character), or vice versa?

tavianator · 2024-04-12T18:25:55Z

@tmccombs Yeah it would be vice versa. macOS stores filenames in normalization form NFD (D for decomposed), so the actual filenames will have combining characters while most everything else uses the precomposed characters.

tavianator · 2024-04-12T18:29:55Z

Oh I guess my info is out of date. That's true for HFS+, but APFS is normalization-insensitive rather than actually normalizing. So file paths will use whatever normalization you used to create the file, but you can access it by other normalizations too (kinda like how touch foo; cat Foo would work on a case-insensitive FS).

Finder still uses NFD though.

j-lakeman · 2024-04-13T14:51:43Z

@tavianator I'm on a case-sensitive APFS
@tmccombs [I] ➜ ~ printf %x\n "'ä'" outputs e4, and [I] ➜ ~ printf \ue4\n outputs ä again.

However it seems I can't pipe to fd to be able to test the individual characters (#1346).

tmccombs · 2024-04-13T17:04:24Z

What does printf "%x\n" $(ls) give in the folder that contains the Bestätigung file?

tavianator · 2024-04-13T18:15:36Z

Just copy-pasting from the OP shows what's happening:

> [I] ➜ ~ fd gung.html finds
> Bestätigung.html
> Downloads/Bestätigung.html
> as expected.

tavianator@graphene $ echo 'Bestätigung' | xxd
00000000: 4265 7374 61cc 8874 6967 756e 670a       Besta..tigung.

> But [I] ➜ ~ fd bestätigung doesn't find anything, even if run with --unrestricted.

tavianator@graphene $ echo 'bestätigung' | xxd
00000000: 6265 7374 c3a4 7469 6775 6e67 0a         best..tigung.

The difference (apart from the case of B) is that fd outputs 61 cc 88 for ä, which is UTF-8 for U+0061 U+0308, while the OP typed c3 a4 for ä, AKA U+00E4.
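To make the byte difference concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `describe` is made up for illustration; it just prints the code points and UTF-8 bytes of the two spellings of ä.

```python
import unicodedata


def describe(text: str) -> tuple[list[str], str]:
    """Return the code points and the UTF-8 bytes (as hex) of `text`."""
    code_points = [f"U+{ord(ch):04X}" for ch in text]
    utf8_hex = text.encode("utf-8").hex(" ")
    return code_points, utf8_hex


# NFC: a single precomposed code point, which is what was typed in the search.
precomposed = unicodedata.normalize("NFC", "\u00e4")
# NFD: base letter plus combining diaeresis, which is what fd printed.
decomposed = unicodedata.normalize("NFD", "\u00e4")

print(describe(precomposed))      # (['U+00E4'], 'c3 a4')
print(describe(decomposed))       # (['U+0061', 'U+0308'], '61 cc 88')
print(precomposed == decomposed)  # False: they render alike but compare unequal
```

Both strings render identically, which is why the mismatch is invisible in the terminal.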

I suspect if you manually search for the decomposed form, something like

$ fd $'besta\xcc\x88tigung'

it will find it.
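The same effect can be reproduced outside of fd. The sketch below uses Python's `re` module as a stand-in for fd's matching (an assumption; fd uses the Rust regex crate), and `decomposed_query` is a hypothetical helper name. It shows the typed query missing the decomposed filename, and the NFD-normalized query finding it.

```python
import re
import unicodedata


def decomposed_query(query: str) -> str:
    """Rewrite a literal search string into NFD (base letters + combining marks)."""
    return unicodedata.normalize("NFD", query)


# Filename as stored in decomposed form: "a" followed by U+0308.
filename = "Besta\u0308tigung.html"
# Query as typed on the keyboard: precomposed U+00E4.
typed = "best\u00e4tigung"

# Case-insensitive literal search, roughly what a lowercase pattern does.
miss = re.search(re.escape(typed), filename, re.IGNORECASE)
hit = re.search(re.escape(decomposed_query(typed)), filename, re.IGNORECASE)

print(miss)             # None
print(hit is not None)  # True
```

This only helps when the filenames are known to be in NFD; on a filesystem that preserves whatever form was used at creation time, either form may occur.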

j-lakeman · 2024-04-13T20:24:08Z

Yep, that's right! Cheers!
What makes fd outstanding apart from its efficiency is its ease of use IMHO. Though this is quite a workaround, don't you think? Similar characters like that can be found in many European languages.

tmccombs · 2024-04-13T21:54:14Z

I agree that this is not a good user experience.

Unfortunately, it is also a very difficult problem to solve.

The library we use for regex doesn't support normalization, and probably won't anytime soon. See rust-lang/regex#404 (comment). The workaround there of normalizing the regex and input is much easier said than done. Normalizing all the filenames significantly hurts performance. And normalizing the regex isn't as straightforward as normalizing the string of the regex.

For example, "ä?" would need to be converted to "(a\u0308)?".
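A small sketch of why this matters, again using Python's `re` as a stand-in for the Rust regex crate (an assumption for illustration only). Normalizing the pattern string naively leaves the `?` attached to the combining mark alone, so the base letter becomes mandatory.

```python
import re
import unicodedata

pattern = "\u00e4?"  # "ä?" typed in precomposed form: an optional ä

# Naive approach: normalize the pattern text itself.
naive = unicodedata.normalize("NFD", pattern)  # "a\u0308?"
# Correct rewrite: keep the decomposed sequence together in a group.
grouped = "(?:a\u0308)?"

# In the naive pattern the "?" binds only to U+0308, so a bare "a" matches
# and the empty string no longer does.
print(re.fullmatch(naive, "a") is not None)    # True
print(re.fullmatch(naive, "") is not None)     # False

# The grouped pattern keeps the original meaning.
print(re.fullmatch(grouped, "") is not None)   # True
print(re.fullmatch(grouped, "a") is not None)  # False
```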

Perhaps the best path would be to have an option to transform the regex to accept either equivalent form. So for example ä would be transformed into "(ä|a\u0308)".

I'm not familiar enough with unicode to know how feasible that would be in general, or how to create those transformation tables.
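As a rough illustration of the "accept either form" idea, the following Python sketch builds such an alternation for a plain literal. It is not fd code: `accept_both_forms` is a hypothetical name, it treats its input as a literal rather than as regex syntax, and it relies on the standard library's `unicodedata` tables instead of hand-built ones. It does not cover every canonically equivalent spelling (for example, differently ordered stacks of combining marks).

```python
import re
import unicodedata


def accept_both_forms(literal: str) -> str:
    """Build a regex that matches `literal` in precomposed or decomposed form."""
    parts = []
    # Compose first so each accented letter is one code point where possible.
    for ch in unicodedata.normalize("NFC", literal):
        nfd = unicodedata.normalize("NFD", ch)
        if nfd == ch:
            # No canonical decomposition: emit the character as-is.
            parts.append(re.escape(ch))
        else:
            # Accept either the single code point or its decomposition.
            parts.append(f"(?:{re.escape(ch)}|{re.escape(nfd)})")
    return "".join(parts)


pattern = accept_both_forms("best\u00e4tigung")

for name in ("Best\u00e4tigung.html", "Besta\u0308tigung.html"):
    print(re.search(pattern, name, re.IGNORECASE) is not None)  # True, True
```

Because only characters with a canonical decomposition are expanded, ASCII-only patterns come out unchanged.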

tavianator · 2024-04-14T15:18:35Z

> Perhaps the best path would be to have an option to transform the regex to accept either equivalent form. So for example ä would be transformed into "(ä|a\u0308)".

I think the worst case here is character classes like [ä-ë]. We'd have to iterate over every code point in the range, apply NFD, and construct a new alternation. It could blow up the regex gigantically.
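For a sense of scale, here is a hedged Python sketch of expanding a class range by brute force. `expand_class_range` is a hypothetical helper, and unlike the NFD-only rewrite described above it keeps both the precomposed and the decomposed form of each code point.

```python
import re
import unicodedata


def expand_class_range(first: str, last: str) -> list[str]:
    """List regex alternatives covering every code point from first to last,
    adding the NFD spelling wherever a canonical decomposition exists."""
    alternatives = []
    for cp in range(ord(first), ord(last) + 1):
        ch = chr(cp)
        alternatives.append(re.escape(ch))
        nfd = unicodedata.normalize("NFD", ch)
        if nfd != ch:
            alternatives.append(re.escape(nfd))
    return alternatives


# [ä-ë] covers U+00E4..U+00EB: eight code points, seven of which decompose
# (æ, U+00E6, has no canonical decomposition).
alts = expand_class_range("\u00e4", "\u00eb")
pattern = "(?:" + "|".join(alts) + ")"

print(len(alts))                                     # 15
print(re.fullmatch(pattern, "e\u0308") is not None)  # True: decomposed ë
```

The number of alternatives grows with the width of the range, so a class spanning a large block of code points would produce a correspondingly large alternation.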

tavianator · 2024-04-14T15:29:21Z

Here is a quick proof of concept for NFD-izing a regex: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=2ed2dfc074864bbffa5b85f685349d71

tmccombs · 2024-04-28T03:20:11Z

Perhaps we could just do the replacement on literals, and not worry about ranges?

j-lakeman added the bug label Apr 12, 2024

tmccombs added the duplicate and unicode labels Apr 12, 2024

BurntSushi mentioned this issue Apr 16, 2024

support for equivalence classes rust-lang/regex#404

Closed
