linux-kernel - Re: [PATCH] console UTF-8 fixes

Message-ID: <Pine.LNX.4.61.0704112048240.20436@yvahk01.tjqt.qr>
Date:	Wed, 11 Apr 2007 21:00:49 +0200 (MEST)
From:	Jan Engelhardt <jengelh@...ux01.gwdg.de>
To:	Egmont Koblinger <egmont@...linux.hu>
cc:	Alan Cox <alan@...rguk.ukuu.org.uk>,
	"H. Peter Anvin" <hpa@...or.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] console UTF-8 fixes


On Apr 11 2007 20:28, Egmont Koblinger wrote:

>I send a reworked version of the patch.
>
>Removed from the first version:
>  - any sign of '.' as substitute glyph
>  - don't ignore zero-width characters (except for a few zero-width spaces
>    that are ignored in the current kernel too). However, I kept the code
>    organized and commented so that someone can have the other behavior very
>    easily (by removing a pair of comment signs).
>
>Kept features, fixes:
>  - lots of UTF-8 decoder fixes. Emit one U+FFFD for every standalone
>    continuation byte and for every incomplete sequence, as Markus Kuhn
>    recommends. Reject overlong sequences too.
>  - D800..DFFF and FFFE..FFFF are substituted by FFFD too, since these are
>    not valid Unicode code points.
>  - no "random" replacement glyph (e.g. u with double acute instead of
>    u with circumflex) in UTF-8 mode
>  - if U+FFFD is not found in the font, the fallback replacement '?' (ascii
>    question mark) is printed with inverse color attributes
>  - U+200A was ignored so far as a zero-width space character. I think it
>    was a mistake, it's not zero-width.
>  - print an extra space for double-wide characters for the cursor to stand
>    at the right place. Yet again the code is organized so that it is very
>    easy to change to jump only one character cell, should someone prefer
>    that behavior (which I still see no good reason to).
>
>Signed-off-by: Egmont Koblinger <egmont@...linux.hu>
>
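The decoder rules listed above can be sketched in user space roughly as follows. This is only an illustration, not the patch's code: the name utf8_decode_one() and the min_for_len[] table are made up here, and the sketch stops at 4-byte sequences and treats longer lead bytes as invalid, whereas the patch's table covers up to 6 bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFD

/*
 * Hypothetical helper: decode one code point from buf[0..len), len > 0.
 * Stores the result in *out and returns the number of bytes consumed.
 * Every malformed unit yields exactly one U+FFFD.
 */
static size_t utf8_decode_one(const unsigned char *buf, size_t len,
			      uint32_t *out)
{
	/* smallest value that legitimately needs a sequence of this length */
	static const uint32_t min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };
	unsigned char c = buf[0];
	uint32_t ucs;
	size_t need, i;

	if (c < 0x80) {
		*out = c;
		return 1;
	}
	if ((c & 0xe0) == 0xc0) {
		need = 2;
		ucs = c & 0x1f;
	} else if ((c & 0xf0) == 0xe0) {
		need = 3;
		ucs = c & 0x0f;
	} else if ((c & 0xf8) == 0xf0) {
		need = 4;
		ucs = c & 0x07;
	} else {
		/* standalone continuation byte or unsupported lead byte */
		*out = REPLACEMENT;
		return 1;
	}
	for (i = 1; i < need; i++) {
		if (i >= len || (buf[i] & 0xc0) != 0x80) {
			/* incomplete sequence: emit one U+FFFD and let the
			 * caller rescan the offending byte */
			*out = REPLACEMENT;
			return i;
		}
		ucs = (ucs << 6) | (buf[i] & 0x3f);
	}
	/* overlong forms, surrogates and FFFE/FFFF are substituted too */
	if (ucs < min_for_len[need] ||
	    (ucs >= 0xD800 && ucs <= 0xDFFF) ||
	    ucs == 0xFFFE || ucs == 0xFFFF)
		ucs = REPLACEMENT;
	*out = ucs;
	return need;
}
```

A caller would loop over the buffer, advancing by the return value each time, so a byte that interrupts a sequence is decoded again on its own.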
>@@ -1934,6 +1943,99 @@
> char con_buf[CON_BUF_SIZE];
> DECLARE_MUTEX(con_buf_sem);
> 
>+/* is_{zero,double}_width() are based on the wcwidth() implementation by
>+ * Markus Kuhn -- 2003-05-20 (Unicode 4.0)
>+ * Latest version: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
>+ */
>+struct interval {
>+  int first;
>+  int last;
>+};

CodingStyle? uint16_t instead of int?
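To illustrate the uint16_t idea, here is a rough user-space sketch. The names interval16 and in_table16() are hypothetical, and in the kernel the type would more likely be spelled u16. Note that a 16-bit bound can only hold entries up to 0xFFFF, so it only works if the tables stay within that range. Where int is 32 bits, each entry shrinks from 8 to 4 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-bit variant of the patch's struct interval. */
struct interval16 {
	uint16_t first;
	uint16_t last;
};

/*
 * Binary search over a sorted, non-overlapping table.
 * Returns 1 if ucs falls inside one of the n intervals, else 0.
 */
static int in_table16(uint32_t ucs, const struct interval16 *table, size_t n)
{
	size_t lo = 0, hi = n;

	if (ucs > 0xFFFF)
		return 0;	/* cannot be represented in the table */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ucs > table[mid].last)
			lo = mid + 1;
		else if (ucs < table[mid].first)
			hi = mid;
		else
			return 1;
	}
	return 0;
}

/* Excerpt only: the first three ranges quoted from the patch. */
static const struct interval16 zero_width_sample[] = {
	{ 0x0300, 0x0357 }, { 0x035D, 0x036F }, { 0x0483, 0x0486 },
};
```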

>+static int is_zero_width(long ucs)
>+{
>+  static const struct interval zero_width[] = {
>+    { 0x0300, 0x0357 }, { 0x035D, 0x036F }, { 0x0483, 0x0486 },
[...]
>+    { 0xFB1E, 0xFB1E }, { 0xFE00, 0xFE0F }, { 0xFE20, 0xFE23 },
>+    { 0xFEFF, 0xFEFF }, { 0xFFF9, 0xFFFB }, { 0x1D167, 0x1D169 },
>+    { 0x1D173, 0x1D182 }, { 0x1D185, 0x1D18B }, { 0x1D1AA, 0x1D1AD },
>+    { 0xE0001, 0xE0001 }, { 0xE0020, 0xE007F }, { 0xE0100, 0xE01EF }
>+  };

Since code points above 0xFFFF are unsupported, could these entries not be dropped?

>+static int is_double_width(long ucs)
>+{
>+  static const struct interval double_width[] = {
>+    { 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
>+    { 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
>+    { 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 }, { 0xFFE0, 0xFFE6 },
>+    { 0x20000, 0x2FFFD }, { 0x30000, 0x3FFFD }
>+  };

Similarly.
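For illustration, a trimmed double-width check might look like the sketch below. It keeps only the ranges at or below 0xFFFF from the quoted table. The function name is_double_width_bmp() and the 16-bit struct are made up for this example, and a linear scan stands in for whatever lookup the patch really uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-bit interval; sufficient once the table ends at 0xFFFF. */
struct interval16 {
	uint16_t first;
	uint16_t last;
};

static int is_double_width_bmp(uint32_t ucs)
{
	/* Ranges copied from the quoted patch, minus those above 0xFFFF. */
	static const struct interval16 double_width[] = {
		{ 0x1100, 0x115F }, { 0x2329, 0x232A }, { 0x2E80, 0x303E },
		{ 0x3040, 0xA4CF }, { 0xAC00, 0xD7A3 }, { 0xF900, 0xFAFF },
		{ 0xFE30, 0xFE6F }, { 0xFF00, 0xFF60 }, { 0xFFE0, 0xFFE6 },
	};
	size_t i;

	if (ucs > 0xFFFF)
		return 0;
	for (i = 0; i < sizeof(double_width) / sizeof(double_width[0]); i++) {
		if (ucs < double_width[i].first)
			return 0;	/* table is sorted, no later match */
		if (ucs <= double_width[i].last)
			return 1;
	}
	return 0;
}
```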

>@@ -1950,6 +2052,10 @@
> 	unsigned int currcons;
> 	unsigned long draw_from = 0, draw_to = 0;
> 	struct vc_data *vc;
>+	unsigned char vc_attr;
>+	int rescan;
unsigned int rescan:1;
>+	int inverse;
unsigned int inverse:1;
>+	int width;
unsigned int width; or even uint8_t.
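One caveat: C only allows bit-fields as members of a struct or union, so the :1 declarations cannot be written as plain local variables. A sketch of how the three values could be grouped, with the struct and function names invented purely for illustration:

```c
#include <stdint.h>

/* Hypothetical grouping of the per-character state; not part of the patch. */
struct write_state {
	unsigned int rescan:1;	/* last byte must be processed again */
	unsigned int inverse:1;	/* draw the fallback glyph inverted */
	uint8_t width;		/* character cells to advance: 1 or 2 */
};

/* Mirrors the per-iteration reset in the quoted hunk. */
static void write_state_reset(struct write_state *st)
{
	st->rescan = 0;
	st->inverse = 0;
	st->width = 1;
}
```

Whether this beats plain ints is a judgment call: it saves a little stack but bit-field accesses can compile to extra masking.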

> 	u16 himask, charmask;
> 	const unsigned char *orig_buf = NULL;
> 	int orig_count;

>@@ -2012,51 +2118,81 @@
> 		buf++;
> 		n++;
> 		count--;
>+		rescan = 0;
>+		inverse = 0;
>+		width = 1;
> 
> 		/* Do no translation at all in control states */
> 		if (vc->vc_state != ESnormal) {
> 			tc = c;
> 		} else if (vc->vc_utf && !vc->vc_disp_ctrl) {
>-		    /* Combine UTF-8 into Unicode */
>-		    /* Malformed sequences as sequences of replacement glyphs */
>+		    /* Combine UTF-8 into Unicode in vc_utf_char */
>+		    /* vc_utf_count is the number of continuation bytes still expected to arrive */
>+		    /* vc_npar is the number of continuation bytes arrived so far */
> rescan_last_byte:
>-		    if(c > 0x7f) {
>+		    if ((c & 0xc0) == 0x80) {
>+			/* Continuation byte received */
>+			static const int utf8_length_changes[] = { 0x0000007f, 0x000007ff, 0x0000ffff, 0x001fffff, 0x03ffffff, 0x7fffffff };

I would not mind unsigned.
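A sketch of the unsigned variant, using uint32_t for user-space testing (the kernel spelling would be u32). How the patch indexes the table is not visible in the quoted hunk, so the helper utf8_is_overlong() and its nbytes parameter are assumptions made for this example only.

```c
#include <stdint.h>

/* Largest code point encodable in 1, 2, ... 6 bytes of UTF-8. */
static const uint32_t utf8_length_changes[] = {
	0x0000007f, 0x000007ff, 0x0000ffff,
	0x001fffff, 0x03ffffff, 0x7fffffff
};

/*
 * Hypothetical helper: nbytes is the total length of the sequence
 * that produced ucs.  A value that would already fit in a shorter
 * sequence is an overlong encoding.
 */
static int utf8_is_overlong(uint32_t ucs, unsigned int nbytes)
{
	if (nbytes < 2 || nbytes > 6)
		return 0;	/* 1-byte sequences cannot be overlong */
	return ucs <= utf8_length_changes[nbytes - 2];
}
```

All six constants fit in a signed 32-bit int, so this is about intent rather than correctness: the values are compared against code points built up with shifts and masks, which are more natural on unsigned types.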


Jan