Message-ID: <4616A2C7.3030000@zytor.com>
Date: Fri, 06 Apr 2007 12:43:03 -0700
From: "H. Peter Anvin" <hpa@...or.com>
To: Egmont Koblinger <egmont@...linux.hu>
CC: linux-kernel@...r.kernel.org
Subject: Re: [PATCH] console UTF-8 fixes
Egmont Koblinger wrote:
>
> - If a certain (otherwise valid UTF-8) character is not found in the glyph
> table, the current code does one of these two (depending on other
> circumstances):
>
> - Either it displays the replacement character U+FFFD, falling back to a
> simple question mark. Note that the Unicode replacement character U+FFFD
> is to be used for invalid sequences. However, it shouldn't necessarily
> be used when replacing a valid but undisplayable character. Think of
> Pango for example that renders these as four hex digits inside a square.
> To be able to visually distinguish between illegal sequences and legal
> but undisplayable characters, I think U+FFFD or the question mark are
> bad choices. In fact, any symbol that may normally occur in the text is
> a bad choice if it is displayed simply. Hence I chose to display an
> inverted dot.
>
I strongly disagree. First of all, you're changing the semantics of a
13-year-old API. The semantics of the Linux console are that by
mapping U+FFFD REPLACEMENT CHARACTER in your Unicode table, you have
specified the fallback glyph.

What's worse, you've hard-coded specific visual representations. That
is completely unacceptable.
> - Another possible thing the current code may do (for latin1-compatible
> characters) is to simply display the glyph loaded in that position.
> Suppose I have loaded a latin2 font. In latin2, 0xFB is a "u with
> double accent". An application prints U+00FB, which is a "u with
> circumflex". Since this glyph is not present in latin2, it cannot be
> printed with the current font. Still, the current code falls back to
> printing the glyph from the 0xFB position of the glyph table. Hence my
> app asked to print "u with circumflex" but a "u with double accent"
> appears on the screen. This is totally contrary to the goals of Unicode
> and shouldn't ever happen.
When does that happen? That is clearly a bug.
> - The replacement character for invalid UTF-8 sequences is U+FFFD, falling
> back to a question mark. I've changed the fallback version to an inverted
> question mark. This way it's more similar to the common glyph of U+FFFD,
> and it's more obvious to the user that it's not a literal question mark
> but rather some erroneous situation.
Brilliant. You've picked a fallback glyph which is unlikely to exist in
all fonts. The whole point of falling back to ? is that it's an ASCII
character, which means that if the font designer failed to designate a
fallback glyph -- which is an error!!! -- there is at least some hope of
conveying the error back to the user.
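
As an illustration of the fallback order discussed here, the following is a
minimal sketch, not the kernel's actual code. The type uni_map_entry and the
functions lookup_glyph() and glyph_or_fallback() are hypothetical names made
up for this example. The point is that the code never picks a visual
representation itself: it uses whatever glyph the font maps for the
character, then whatever the font maps for U+FFFD, then ASCII '?'.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical font map entry: one Unicode code point -> one glyph slot. */
struct uni_map_entry {
    uint32_t ucs;
    int glyph;
};

/* Linear search; returns the glyph slot, or -1 if the font does not map ucs. */
static int lookup_glyph(const struct uni_map_entry *map, size_t n, uint32_t ucs)
{
    for (size_t i = 0; i < n; i++)
        if (map[i].ucs == ucs)
            return map[i].glyph;
    return -1;
}

/*
 * Fallback order:
 *   1. the glyph the font maps for the character itself,
 *   2. the glyph the font maps for U+FFFD,
 *   3. the glyph the font maps for ASCII '?'.
 * Returns -1 only if the font maps none of them.
 */
static int glyph_or_fallback(const struct uni_map_entry *map, size_t n,
                             uint32_t ucs)
{
    int g = lookup_glyph(map, n, ucs);

    if (g < 0)
        g = lookup_glyph(map, n, 0xFFFD);
    if (g < 0)
        g = lookup_glyph(map, n, '?');
    return g;
}
```

Note that the code point is never used directly as an index into the glyph
table, so a latin2 font that lacks U+00FB yields the fallback glyph rather
than whatever happens to sit in slot 0xFB.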
> - Overlong sequences are not caught currently, they're displayed as if these
> were valid representations. This may even have security impacts.
>
> - Lone continuation bytes (section 3.1 of the UTF-8 stress test) are
> currently displayed as some "random" glyphs rather than the replacement
> character.
>
> - Incomplete sequences (sections 3.2 and 3.3) emit no replacement character,
> but rather cause the subsequent valid character to be displayed multiple
> times(!).
These are valid issues.
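
To make the three cases concrete, here is a minimal sketch of a strict
decoder, assuming the RFC 3629 limit of U+10FFFF (four-byte sequences at
most). It is not the patch under discussion nor the existing console code;
utf8_decode() and UCS_REPLACEMENT are names invented for this example. It
emits U+FFFD for lone continuation bytes, for incomplete sequences, and for
overlong forms, and it never swallows the byte that follows an incomplete
sequence.

```c
#include <stddef.h>
#include <stdint.h>

#define UCS_REPLACEMENT 0xFFFDu

/*
 * Decode one character from s[0..len). Stores the code point (or U+FFFD
 * on malformed input) in *out and returns the number of bytes consumed.
 * Returns 0 only when len == 0.
 */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    uint32_t c, min;
    size_t need, i;

    if (len == 0)
        return 0;

    c = s[0];
    if (c < 0x80) {                 /* ASCII */
        *out = c;
        return 1;
    } else if (c < 0xC0) {          /* lone continuation byte */
        *out = UCS_REPLACEMENT;
        return 1;
    } else if (c < 0xE0) {          /* 2-byte lead */
        need = 1; min = 0x80; c &= 0x1F;
    } else if (c < 0xF0) {          /* 3-byte lead */
        need = 2; min = 0x800; c &= 0x0F;
    } else if (c < 0xF8) {          /* 4-byte lead */
        need = 3; min = 0x10000; c &= 0x07;
    } else {                        /* 0xF8..0xFF are rejected here */
        *out = UCS_REPLACEMENT;
        return 1;
    }

    for (i = 1; i <= need; i++) {
        if (i >= len || (s[i] & 0xC0) != 0x80) {
            /* Incomplete: consume only the bytes seen so far, not s[i]. */
            *out = UCS_REPLACEMENT;
            return i;
        }
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (c < min || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        c = UCS_REPLACEMENT;

    *out = c;
    return need + 1;
}
```

A caller would loop, advancing by the return value each time, so that a
truncated sequence followed by a valid character produces one replacement
character and then that valid character exactly once.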
> - There's no concept of double-width characters. It's way beyond the scope
> of my patch to try to display them, but at least I think it's important
> for the cursor to jump two positions when printing such characters, since
> this is what applications (such as text editors) expect. If the cursor
> didn't jump two positions, applications would suffer from displaying and
> refreshing problems, and editing some English letters that are preceded by
> some CJK characters in the same line would become a nightmare. With my patch an
> inverted dot followed by an inverted space is displayed for double-width
> characters so it's quite easy to see that they are tied together.
To be able to do CJK you need something like Kon anyway. This feels
like bloat.
> - There's no concept of zero-width characters (such as combining accents)
> either. Yet again it's beyond the scope of my patch to properly handle
> them. Instead of the current behavior (write a replacement character) I
> just ignore them so that full-screen applications can keep track of the
> cursor position correctly.
There is a concept of combining sequences. For anything else, I suspect
it's better to let the user know that something bad is happening.
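
For reference, the cursor behaviour proposed in the quoted text (advance two
cells for double-width characters, none for zero-width ones) can be sketched
roughly as below. This is only an illustration under stated assumptions:
char_width() and cursor_after() are hypothetical names, and the width
classifier is deliberately incomplete, covering just three well-known
Unicode ranges rather than a full width table.

```c
#include <stdint.h>

/*
 * Hypothetical, incomplete width classifier. Returns the number of
 * columns (0, 1 or 2) a character is assumed to occupy.
 */
static int char_width(uint32_t ucs)
{
    if (ucs >= 0x0300 && ucs <= 0x036F)      /* Combining Diacritical Marks */
        return 0;
    if ((ucs >= 0x4E00 && ucs <= 0x9FFF) ||  /* CJK Unified Ideographs */
        (ucs >= 0xAC00 && ucs <= 0xD7A3))    /* Hangul Syllables */
        return 2;
    return 1;                                /* everything else: one cell */
}

/* Column the cursor would occupy after printing ucs at column x. */
static int cursor_after(int x, uint32_t ucs)
{
    return x + char_width(ucs);
}
```

Line wrapping and the question of what to draw in the skipped cell are left
out on purpose; they are exactly the parts under dispute in this thread.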
> - I believe (at least I do hope) that my code is cleaner, more
> straightforward, easier to understand, and is slightly better documented
> than the current version. The current code doesn't separate UTF-8 decoding
> and glyph displaying parts. I clearly separated them. First I perform
> UTF-8 decoding (this emits U+FFFD for invalid sequences), then check for
> the width of the resulting character, change it to U+FFFD if it's
> unprintable (e.g. a UTF-16 surrogate), and finally comes the part that
> does its best in displaying the character on the screen.
>
> I hope you like it. :)
Please see the comments above.
-hpa