linux-kernel - Re: [PATCH] console UTF-8 fixes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4616A2C7.3030000@zytor.com>
Date:	Fri, 06 Apr 2007 12:43:03 -0700
From:	"H. Peter Anvin" <hpa@...or.com>
To:	Egmont Koblinger <egmont@...linux.hu>
CC:	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] console UTF-8 fixes

Egmont Koblinger wrote:
> 
> - If a certain (otherwise valid UTF-8) character is not found in the glyph
>   table, the current code does one of these two (depending on other
>   circumstances):
> 
>   - Either it displays the replacement character U+FFFD, falling back to a
>     simple question mark. Note that the Unicode replacement character U+FFFD
>     is to be used for invalid sequences. However, it shouldn't necessarily
>     be used when replacing a valid but undisplayable character. Think of
>     Pango for example that renders these as four hex digits inside a square.
>     To be able to visually distinguish between illegal sequences and legal
>     but undisplayable characters, I think U+FFFD or the question mark are
>     bad choices. In fact, any symbol that may normally occur in the text is
>     a bad choice if is displayed simply. Hence I chose to display an
>     inverted dot.
> 

I strongly disagree.  First of all, you're changing the semantics of a 
13-year-old API.  The semantics of the Linux console is that by 
specifying U+FFFD SUBSTITUTION GLYPH in your unicode table, you have 
specified the fallback glyph.

What's worse, you've hard-coded the uses of specific visual 
representations.  That is completely unacceptable.

>   - Another possible thing the current code may do (for latin1-compatible
>     characters) is to simply display the glyph loaded in that position.
>     Suppose I have loaded a latin2 font. In latin2, 0xFB is an "u with
>     double accent". An applications prints U+00FB, which is an "u with
>     circumflex". Since this glyph is not present in latin2, it cannot be
>     printed with the current font. Still, the current code falls back to
>     printing the glyph from the 0xFB position of the glyph table. Hence my
>     app asked to print "u with circumflex" but an "u with double accent"
>     appears on the screen. This is totally contrary to the goals of Unicode
>     and shouldn't ever happen.

When does that happen?  That is clearly a bug.

> - The replacement character for invalid UTF-8 sequences is U+FFFD, falling
>   back to a question mark. I've changed the fallback version to an inverted
>   question mark. This way it's more similar to the common glyph of U+FFFD,
>   and it's more trivial to the user that it's not a literal question mark
>   but rather some erroneous situation.

Brilliant.  You've picked a fallback glyph which is unlikely to exist in 
all fonts.  The whole point of falling back to ? is that it's an ASCII 
character, which means that if the font designer failed to designate a 
fallback glyph -- which is an error!!! -- there is at least some hope of 
conveying the error back to the user.

> - Overlong sequences are not caught currently, they're displayed as if these
>   were valid representations. This may even have security impacts.
> 
> - Lone continuation bytes (section 3.1 of the UTF-8 stress test) are
>   currently displayed as some "random" glyphs rather than the replacement
>   character.
> 
> - Incomplete sequences (sections 3.2 and 3.3) emit no replacement character,
>   but rather cause the subsequent valid character to be displayed more
>   times(!).

These are valid issues.

> - There's no concept of double-width characters. It's way beyond the scope
>   of my patch to try to display them, but at least I think it's important
>   for the cursor to jump two positions when printing such characters, since
>   this is what applications (such as text editors) expect. If the cursor
>   didn't jump two positions, applications would suffer from displaying and
>   refreshing problems, and editing some English letters that are preceded by
>   some CJK characters in the same line became a nightmare. With my patch an
>   inverted dot followed by an inverted space is displayed for double-width
>   characters so it's quite easy to see that they are tied together.

To be able to do CJK you need something like Kon anyway.  This feels 
like bloat.

> - There's no concept of zero-width characters (such as combining accents)
>   either. Yet again it's beyond the scope of my patch to properly handle
>   them. Instead of the current behavior (write a replacement character) I
>   just ignore them so that full-screen applications can keep track of the
>   cursor position correctly.

There is a concept of combining sequences.  Anything else, I suspect 
it's better to let the user know that something bad is happening.

> - I believe (at least I do hope) that my code is cleaner, more
>   straightforward, easier to understand, and is slightly better documented
>   than the current version. The current code doesn't separate UTF-8 decoding
>   and glyph displaying parts. I clearly separated them. First I perform
>   UTF-8 decoding (this emits U+FFFD for invalid sequences), then check for
>   the width of the resulting character, change it to U+FFFD if it's
>   unprintable (e.g. an UTF-16 surrogate), and finally comes the part that
>   does its best in displaying the character on the screen.
> 
> I hope you like it. :)

Please see above comments.

	-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/