linux-kernel - Re: vfat: Broken case-insensitive support for UTF-8

Message-ID: <20200120162701.guxcrmqysejaqw6y@pali>
Date:   Mon, 20 Jan 2020 17:27:01 +0100
From:   Pali Rohár <pali.rohar@...il.com>
To:     David Laight <David.Laight@...LAB.COM>
Cc:     OGAWA Hirofumi <hirofumi@...l.parknet.co.jp>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "Theodore Y. Ts'o" <tytso@....edu>,
        Namjae Jeon <linkinjeon@...il.com>,
        Gabriel Krisman Bertazi <krisman@...labora.com>
Subject: Re: vfat: Broken case-insensitive support for UTF-8

On Monday 20 January 2020 15:47:22 David Laight wrote:
> From: Pali Rohár
> > Sent: 20 January 2020 15:20
> ...
> > This is not possible. There is 1:1 mapping between UTF-8 sequence and
> > Unicode code point. wchar_t in kernel represent either one Unicode code
> > point (limited up to U+FFFF in NLS framework functions) or 2bytes in
> > UTF-16 sequence (only in utf8s_to_utf16s() and utf16s_to_utf8s()
> > functions).
> 
> Unfortunately there is neither a 1:1 mapping of all possible byte sequences
> to wchar_t (or unicode code points),

I was talking about valid UTF-8 sequences (invalid, ill-formed ones are
out of the game and would certainly always cause problems).

> nor a 1:1 mapping of all possible wchar_t values to UTF-8.

This is not true. There is exactly one way to convert a sequence of
Unicode code points to UTF-8. UTF stands for Unicode Transformation
Format, and it defines exactly how Unicode is transformed.

If you have a valid UTF-8 sequence, then it describes one exact sequence
of Unicode code points. And if you have a sequence (ordinals) of Unicode
code points, it has exactly one representation in UTF-8.
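
To illustrate the one-to-one mapping, here is a minimal sketch in Python
(userspace illustration only, not kernel code). The helper names
utf8_encode() and is_valid_utf8() are hypothetical and written just for
this example. The hand-written encoder is compared against Python's
built-in codec, and an overlong byte sequence is shown to be rejected
rather than treated as a second encoding of the same code point.

```python
def utf8_encode(cp):
    """Encode one Unicode scalar value as UTF-8 (hand-written for illustration)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def is_valid_utf8(raw):
    """Return True if raw decodes under Python's strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Each code point has one encoding, and decoding gives the code point back.
for cp in (0x41, 0xE1, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")
    assert ord(encoded.decode("utf-8")) == cp

# An overlong two-byte form of U+0001 is ill-formed, not an alternative encoding.
assert not is_valid_utf8(b"\xc0\x81")
```

The shortest-form rule is what makes the mapping unique: a decoder that
accepted overlong forms would map several byte sequences to one code
point.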

I would suggest reading the Unicode Standard, section 2.5 Encoding Forms.

> Really both need to be defined - even for otherwise 'invalid' sequences.
> 
> Even the 16-bit values above 0xd000 can appear on their own in
> windows filesystems (according to wikipedia).

If you are talking about UTF-16 (which is _not_ a fixed-width 16-bit
encoding, contrary to what you wrote), look at my previous email:

"MS FAT32 implementations allows half of UTF-16 surrogate pair stored in FS."
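
A minimal Python sketch of both points (userspace illustration only; the
helper names utf16_units(), decodes_strictly() and to_utf8_or_none() are
hypothetical). It shows that a code point above U+FFFF takes two 16-bit
units in UTF-16, and that half of a surrogate pair, as such a filesystem
may store it, is neither valid UTF-16 nor convertible to UTF-8 by a
strict codec.

```python
def utf16_units(text):
    """Return the 16-bit code units of text encoded as UTF-16LE."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def decodes_strictly(raw):
    """Return True if raw is well-formed UTF-16LE for Python's strict decoder."""
    try:
        raw.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8_or_none(text):
    """Strictly encode text as UTF-8, or return None if it contains a lone surrogate."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None


# U+1F600 needs a surrogate pair: two 16-bit units, so UTF-16 is variable-width.
assert utf16_units("\U0001F600") == [0xD83D, 0xDE00]

# Only the high half, followed by an ordinary character: ill-formed UTF-16.
lone_high = (0xD83D).to_bytes(2, "little") + "A".encode("utf-16-le")
assert not decodes_strictly(lone_high)

# A lone surrogate has no UTF-8 representation under a strict encoder.
assert to_utf8_or_none("\ud83d") is None
```

Python can carry such values through with the "surrogatepass" error
handler, but the resulting bytes are then no longer valid UTF-8.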

> It is all too easy to get sequences of values that cannot be converted
> to/from UTF-8.
> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

-- 
Pali Rohár
pali.rohar@...il.com
