linux-kernel - Re: OT: character encodings (was: Linux 2.6.20-rc4)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070107170656.GC21133@flint.arm.linux.org.uk>
Date:	Sun, 7 Jan 2007 17:06:56 +0000
From:	Russell King <rmk+lkml@....linux.org.uk>
To:	David Woodhouse <dwmw2@...radead.org>
Cc:	Tilman Schmidt <tilman@...p.cc>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: OT: character encodings (was: Linux 2.6.20-rc4)

On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> On Sun, 2007-01-07 at 15:38 +0000, Russell King wrote:
> > When a text file is stored on disk, there's no way to tell what
> > character set the characters in that file belong to.  As a result,
> > ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
> > UTF-8 folk assume all text files are UTF-8 encoded.  This leads to
> > utter confusion.
> 
> Only if you are making different assumptions about the _same_ set of
> files, on the _same_ system. But that would be silly.

$ git log | head -n 1000 | tail -n 200 > o
$ file -i o
o: text/plain; charset=us-ascii
$ git log | head -n 1000 | tail -n 300 > o
$ file -i o
o: text/plain; charset=us-ascii
$ git log | head -n 1000 | tail -n 400 > o
$ file -i o
o: text/plain; charset=utf-8

(and you know what charset the file is thought to have with all 1000
lines in it.)

All on a system with LANG set to en_GB (iow ISO-8859-1).

> > To see what I mean, try the following:
> > 
> > $ git log | head -n 1000 > o
> > $ file -i o
> > o: text/x-c; charset=iso-8859-1
> > 
> > According to that, the charset of the 'git log' output (which on that
> > test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
> > was right to include it as ISO-8859-1.
> 
> Yes. When you stored it on disk, the character set information was lost.

The same thing actually happens when I look at it via:

  $ git log | head -n 1000 | less

but in this case the output is always interpreted by the terminal to be
in its character set.

> If you were running a mixed-charset system then attempting to recreating
> the lost information with heuristics and assumptions is obviously going
> to be problematic.

I'm not - I'm running a pure ISO-8859-1 system:

$ echo $LANG
en_GB
$ locale -k LC_CTYPE | grep charmap
charmap="ISO-8859-1"

> Actually, because UTF-8 allows me to run a system which is purely based
> on a single character set, I get better results when I try the same
> trick:
> 	shinybook /shiny/git/mtd-2.6 $ git log | head -n 1000 > o
> 	shinybook /shiny/git/mtd-2.6 $ file -i o
> 	o: text/plain; charset=utf-8

$ LANG=en_GB.UTF-8 locale -k LC_CTYPE | grep charmap
charmap="UTF-8"
$ LANG=en_GB.UTF-8 git log | head -n 1000 > o
$ LANG=en_GB.UTF-8 file -i o
o: text/x-c; charset=iso-8859-1
$ git version
git version 1.4.4.2

Looks like the output is iso-8859-1 even with UTF-8!

> > In reality, the output from git log contains an ad-hoc collection of
> > character sets making its interpretation under any one character set
> > incorrect.
> 
> No, the contents of the git log ought to be UTF-8, unless people have
> been misusing it. Git stores its text in UTF-8 (by default), and is
> capable of converting to and from legacy character sets on input
> (git-commit) and output (git-log).

Git may store its text internally in UTF-8 (I don't know but I have no
evidence to suggest it does - in fact I have some evidence in this test
that it doesn't care about charsets.)  git log output on a non-UTF-8
system certainly is not in the hosts character set.  For example:

$ LANG=en_GB.UTF-8 git log | head -n 1000 > o
$ LANG=en_GB git log | head -n 1000 > o2
$ diff -u o o2

That includes the UTF-8 encoded part of Leonard name.  It also includes
Rafa? Bilski's name which is non-UTF-8 encoded.

So, in both cases, exactly the same output bytestream was created
independent of the character set _actually_ being used, which both
includes untranslated UTF-8 and non-UTF-8 sequences.

There is obviously no character set translation going on with the output.
So we can add 'git' to my list of charset-broken programs.

Also, since we have recent data in the git repository which is non-UTF-8
as well, it is clear that there is no character set translation going on
at input time either.

Looking at the git-commit script, there appears to be no character set
conversion going on in there either.

So, I think you'll find that the contents of git _is_ an ad-hoc collection
of character sets which people happen to have in use on their machines.

> > So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> > is UTF-8 enabled.  If you're operating in a mixed charset environment
> > it's one bloody big pain in the butt.
> 
> A mixed charset environment was _already_ a pain in the butt, because
> almost nobody got labelling right. It's wrong to blame that on UTF-8.

I'm not talking about a mixed charset environment.  I'm talking about
non-UTF-8 single charset environments being broken by programs which
universally think the universe is UTF-8 only.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/