Message-ID: <45A19F3A.5030800@imap.cc>
Date:	Mon, 08 Jan 2007 02:32:42 +0100
From:	Tilman Schmidt <tilman@...p.cc>
To:	Willy Tarreau <w@....eu>
CC:	Adrian Bunk <bunk@...sta.de>,
	Jan Engelhardt <jengelh@...ux01.gwdg.de>,
	Russell King <rmk+lkml@....linux.org.uk>,
	David Woodhouse <dwmw2@...radead.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: OT: character encodings

On 08.01.2007 01:38, Willy Tarreau wrote:
> I'm not blaming UTF-8 per se, but people who still believe in encoding
> *whole documents*. Copy-paste, text insertion, git output, etc... everything
> has a good reason not to be in the same encoding as what your MUA believes.
> If major MUAs still have problems with UTF-8 10 years after it was introduced,
> that's clear proof of a flaw in the initial design.

Actually, it's working amazingly well. In the past 15 years the frequency of
character encoding problems has gone from "frequent, that's just the way it
is, if you want your text to go through unharmed then stick to 7-bit ASCII"
to "rare, it normally works, if it doesn't then we've got a problem to solve".
Copy/paste? Works for me! Mail forwarding? Works for me!

The only thing that doesn't and cannot work is automatic conversion of data
with unknown encoding or with incorrect encoding information, and that has
to be solved at the source. If you store differently encoded text in git
without labelling it accordingly then there is just no way to retrieve it
consistently. There are two ways out of that: a complicated one, storing
encoding information with every piece of text, and a simple one, declaring
a single internal encoding to which everything must be converted on entry
into the database. Guess which one has a better chance of working well.
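
To make that concrete, here is a throwaway sketch of the second approach in
Python (the helper names and the fallback list are my own invention, nothing
from git or any real MUA): decode whatever arrives, using the declared
encoding when there is one, and only ever store the single internal form.

  # Hypothetical helper, for illustration only: normalise incoming text to
  # one internal encoding (UTF-8) at the point of entry into the store.
  def to_internal(data, declared_encoding=None):
      candidates = []
      if declared_encoding:
          candidates.append(declared_encoding)
      candidates += ["utf-8", "iso-8859-1"]   # assumed fallbacks, pick your own
      for enc in candidates:
          try:
              return data.decode(enc)
          except (UnicodeDecodeError, LookupError):
              continue
      # Last resort: keep the text readable rather than dropping it.
      return data.decode("utf-8", errors="replace")

  def store(text):
      # Everything written out is encoded exactly once, consistently.
      return text.encode("utf-8")

Once everything inside is UTF-8, retrieval never has to guess; the
per-item labelling scheme, by contrast, has to get the label right every
single time.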

> And I'm not even
> discussing the stupidity which requires that you read a whole text to get
> its number of characters !

Personally I find the requirement to know the number of characters in a text
rather unusual, so I wouldn't base the choice of an encoding on that. In
fact, I cannot remember ever really wanting to know the actual number of
characters in a text. The number of bytes occupied on storage, ok. The
number of letters, of words, of lines, perhaps even the number of printable
characters, all potentially interesting, depending on the application.
But the raw number of characters? I don't know what purpose that would serve.
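
For what it's worth, the distinction is easy to see in a toy Python snippet
(purely illustrative):

  s = "Tür"                                # one non-ASCII letter
  print(len(s.encode("utf-8")))            # 4 -> bytes occupied on storage
  print(len(s))                            # 3 -> code points; you have to walk the whole text
  print(len(s.split()))                    # 1 -> words
  print(sum(c.isprintable() for c in s))   # 3 -> printable characters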

HTH
Tilman

-- 
Tilman Schmidt                          E-Mail: tilman@...p.cc
Bonn, Germany
This message consists of 100% recycled bits.
Unopened, best before: (see reverse side)

