Message-ID: <20210512122247.5c00c4e4@coco.lan>
Date:   Wed, 12 May 2021 12:22:47 +0200
From:   Mauro Carvalho Chehab <mchehab+huawei@...nel.org>
To:     David Woodhouse <dwmw2@...radead.org>
Cc:     Gabriel Krisman Bertazi <krisman@...labora.com>,
        Linux Doc Mailing List <linux-doc@...r.kernel.org>,
        "Daniel W. S. Almeida" <dwlsalmeida@...il.com>,
        Jonathan Corbet <corbet@....net>,
        Arnd Bergmann <arnd@...db.de>, Borislav Petkov <bp@...en8.de>,
        David Howells <dhowells@...hat.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        James Morse <james.morse@....com>,
        Kees Cook <keescook@...omium.org>,
        Mauro Carvalho Chehab <mchehab@...nel.org>,
        Robert Richter <rric@...nel.org>,
        Thorsten Leemhuis <linux@...mhuis.info>,
        Tony Luck <tony.luck@...el.com>, keyrings@...r.kernel.org,
        linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 06/53] docs: admin-guide: avoid using UTF-8 chars

On Wed, 12 May 2021 10:25:35 +0100
David Woodhouse <dwmw2@...radead.org> wrote:

> On Wed, 2021-05-12 at 10:44 +0200, Mauro Carvalho Chehab wrote:
> > The main point here is that a large amount of those UTF-8 characters
> > appeared as result of document conversion from DocBook/LaTeX/Markdown.
> > 
> > As the conversion ended, I don't expect the need of re-doing a series
> > like that in the near future.
> > 
> > There are even some cases where the UTF-8 were doing wrong things, like
> > using an EN DASH instead of an hyphen in order to pass a command line
> > parameter, and the addition of non-printable BOM characters.
> > 
> > So, IMO, this is a necessarily cleanup after the conversion.  
> 
> That part — fixing characters that are *wrong*, such as converting a
> UTF-8 U+2014 EM DASH to a UTF-8 U+002D HYPHEN-MINUS, is reasonable
> enough.
> 
> But you're not "avoiding using UTF-8 chars" there, as it says in the
> title of this patch. HYPHEN-MINUS encoded as 0x2D *is* UTF-8.

Yeah, you're right: ASCII is a subset of UTF-8, just as it is a
subset of other charsets[1].

[1] ASCII is a subset of all charsets mentioned at:
       https://man7.org/linux/man-pages/man7/charsets.7.html
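
To make the point concrete, here is a minimal Python sketch (not part of
the patch series; the helper names `utf8_hex` and `is_ascii_only` are made
up for illustration). It shows that HYPHEN-MINUS has the same single-byte
encoding in ASCII and in UTF-8, while the dashes need multi-byte sequences:

```python
# Illustration only: compare the UTF-8 bytes of an ASCII hyphen and two dashes.
HYPHEN_MINUS = "\u002d"
EN_DASH = "\u2013"
EM_DASH = "\u2014"


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of ch as a hex string."""
    return ch.encode("utf-8").hex()


def is_ascii_only(text: str) -> bool:
    """True when every character is in the 7-bit ASCII range."""
    return all(ord(c) < 128 for c in text)


if __name__ == "__main__":
    samples = [
        ("HYPHEN-MINUS", HYPHEN_MINUS),
        ("EN DASH", EN_DASH),
        ("EM DASH", EM_DASH),
    ]
    for name, ch in samples:
        print(f"{name}: U+{ord(ch):04X} -> {utf8_hex(ch)} "
              f"ascii={is_ascii_only(ch)}")
```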

A more precise title would be something like:

	Use ASCII instead of non-ASCII UTF-8 alternate symbols
or
	Use ASCII subset instead of UTF-8 alternate symbols

See, the goal of this series is to address the cases where there are
multiple UTF-8 alternate symbols with the same meaning as a character
in the original ASCII set. Most of them were introduced by tools like
DocBook/LaTeX/pandoc during document conversions[2], not by design,
but just because the non-ASCII UTF-8 symbols produce nicer output
in HTML or PDF. In other words, it was a toolset decision to change
them, diverging from what the author originally typed.

[2] I suspect that a few of them could have been introduced as a result
    of someone using an editor like LibreOffice (or equivalent),
    which has a similar behavior.
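
The kind of substitution discussed above can be sketched in Python as
follows. This is only an assumption of what such a mapping might look
like: the table `ASCII_ALTERNATES` and the function `to_ascii_alternates`
are hypothetical and are not the ones used to produce the series.

```python
# Hypothetical mapping for illustration; NOT the table used by the patch series.
ASCII_ALTERNATES = {
    "\u2013": "-",   # EN DASH -> HYPHEN-MINUS
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u00a0": " ",   # NO-BREAK SPACE
    "\ufeff": "",    # BOM / ZERO WIDTH NO-BREAK SPACE is dropped
}

_TABLE = str.maketrans(ASCII_ALTERNATES)


def to_ascii_alternates(text: str) -> str:
    """Replace the mapped symbols; every other character is left untouched."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(to_ascii_alternates("\ufeffrun \u2013\u2013help \u201cnow\u201d"))
```

Characters outside the table (accented letters in names, for instance) pass
through unchanged, which matches the stated goal of replacing only symbols
that have an ASCII equivalent.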

With ReST, there's no need to use any of those, as the build tools will
already do such conversions when generating HTML/PDF output.

So, it is better to stick with the ASCII subset in such cases, as it
makes tools like grep more useful and makes it easier to edit such files
in editors like vi, nano, emacs, etc.
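
As a rough illustration of how such characters can be located, a small
Python sketch follows. The function name `find_non_ascii` and the output
format are assumptions made for this example; it is not the script used
for the series.

```python
import sys
import unicodedata


def find_non_ascii(text: str):
    """Return (line, column, code point, name) for each non-ASCII character."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((lineno, col, f"U+{ord(ch):04X}", name))
    return hits


if __name__ == "__main__":
    # Usage: pass one or more UTF-8 encoded files on the command line.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            for lineno, col, code, name in find_non_ascii(f.read()):
                print(f"{path}:{lineno}:{col}: {code} {name}")
```

Reporting the code point name makes it easier to tell a symbol that merely
duplicates an ASCII character from one that is legitimately needed.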

Thanks,
Mauro
