Message-ID: <b3th52nczdpeokggs2ogdnxq36m3jfhrw72ogjhlvnn53ocxy2@s6uhcbdgaowg>
Date:   Thu, 14 Dec 2023 01:06:15 +0000
From:   Alvin Šipraga <ALSI@...g-olufsen.dk>
To:     Joe Perches <joe@...ches.com>
CC:     Duje Mihanović <duje.mihanovic@...le.hr>,
        Alvin Šipraga <alvin@...s.dk>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Konstantin Ryabitsev <konstantin@...uxfoundation.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] get_maintainer: correctly parse UTF-8 encoded names in
 files

Hi again,

Sorry to be a nuisance, but could you please have another look below and
reconsider this patch? Otherwise NAK is fine, but I wanted to follow up
on this as it solves an actual, albeit minor, issue for people with
unusual names when sending and receiving patches.

Thanks!

Kind regards,
Alvin

On Mon, Oct 16, 2023 at 11:56:32PM +0000, Alvin Šipraga wrote:
> Hi Joe,
> 
> On Mon, Oct 16, 2023 at 03:17:56PM -0700, Joe Perches wrote:
> > On Mon, 2023-10-16 at 16:37 +0200, Duje Mihanović wrote:
> > > On Saturday, October 14, 2023 7:22:44 PM CEST Alvin Šipraga wrote:
> > > > From: Alvin Šipraga <alsi@...g-olufsen.dk>
> > > > 
> > > > While the script correctly extracts UTF-8 encoded names from the
> > > > MAINTAINERS file, the regular expressions damage my name when parsing
> > > > from .yaml files. Fix this by replacing the Latin-1-compatible regular
> > > > expressions with the unicode property matcher \p{Latin}.
> > 
> > Well, OK
> > 
> > > >  It's also
> > > > necessary to instruct Perl to open all files with UTF-8 encoding.
> > 
> > But I'm not at all sure this is actually desired.
> 
> The whole patch, or just this last part?
> 
> Regarding the last part, it's necessary because Perl defaults to opening files
> with (I think) Latin-1/ISO-8859-1, and this prevents the script from correctly
> parsing UTF-8 encoded strings. It seemed the most practical solution was to just
> open everything as UTF-8, including stdin/out.
> 
> Are you worried that this will cause breakage elsewhere? Indeed, while Latin-1
> and UTF-8 both have the same encoding for printable ASCII, the former is not a
> strict subset of the latter. But I assumed that UTF-8 was used
> everywhere in the source tree.
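
A minimal sketch of the encoding point above, in Python purely for illustration (get_maintainer.pl is Perl, and this does not reproduce Perl's I/O layers). It assumes the name is stored as UTF-8 bytes and shows what a Latin-1 reader sees, plus why Latin-1 is not a subset of UTF-8:

```python
# Illustration only: Python stands in for the byte-level behaviour here.
name = "Alvin Šipraga"

# The bytes as they would be stored in a UTF-8 encoded file.
utf8_bytes = name.encode("utf-8")

# Decoding as Latin-1 never fails, but 'Š' (two bytes in UTF-8, C5 A0)
# becomes two unrelated characters: 'Å' and a no-break space.
misread = utf8_bytes.decode("latin-1")

# Printable ASCII is byte-for-byte identical in both encodings.
ascii_text = "get_maintainer.pl"
same_ascii = ascii_text.encode("latin-1") == ascii_text.encode("utf-8")

# Latin-1 is not a subset of UTF-8: a lone byte >= 0x80 is invalid UTF-8.
latin1_bytes = "é".encode("latin-1")  # b'\xe9'
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(repr(misread), same_ascii, latin1_is_valid_utf8)
```

So a file that really is Latin-1 with non-ASCII characters would be the case to worry about when everything is opened as UTF-8.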
> 
> I have now checked whether that is the case using the encguess tool; see below.
> It is a basic test, but it suggests that the vast majority of the tree is ASCII
> or UTF-8.
> 
> For your reference, below is also a test sequence that shows the different
> results with/without my patch, and with modifications to the encoding Perl uses
> when opening files. I hope you reconsider.
> 
> Kind regards,
> Alvin
> 
> ----8<--------- FILE ENCODINGS IN THE TREE -------8<-------------
> 
> linux $ make mrproper
> linux $ find . -type f -not -path './.git/*' \
>         | parallel encguess                  \
>         | grep -v -e US-ASCII -e UTF-8       \
>         > out.txt
> linux $ head -n 2 out.txt  # output is <file> <detected encoding>
> ./tools/include/linux/nmi.h	unknown
> ./tools/testing/selftests/tc-testing/plugins/__init__.py	unknown
> linux $ cat out.txt | cut -f1 | xargs wc
>      0      0      0 ./tools/include/linux/nmi.h
> # comment: this file is empty so encguess says unknown; ditto the others
>      0      0      0 ./tools/testing/selftests/tc-testing/plugins/__init__.py
>      0      0      0 ./tools/testing/selftests/powerpc/primitives/asm/processor.h
>      0      0      0 ./tools/testing/selftests/powerpc/primitives/asm/ppc-opcode.h
>      0      0      0 ./tools/testing/selftests/powerpc/primitives/asm/firmware.h
>      0      0      0 ./tools/testing/selftests/powerpc/primitives/linux/stringify.h
>      0      0      0 ./tools/testing/selftests/powerpc/copyloops/asm/processor.h
>      0      0      0 ./tools/testing/selftests/powerpc/copyloops/asm/kasan.h
>      0      0      0 ./tools/testing/selftests/powerpc/copyloops/asm/feature-fixups.h
>      0      0      0 ./tools/testing/selftests/powerpc/copyloops/asm/asm-compat.h
>      0      0      0 ./tools/testing/kunit/test_data/test_insufficient_memory.log
>     66    168   1668 ./tools/perf/util/top.h
> # comment: has a console escape sequence in macro CONSOLE_CLEAR
>      0      0      0 ./tools/perf/util/help-unknown-cmd.h
>    334   1950 141644 ./tools/perf/tests/pe-file.exe.debug
>     58    594  75595 ./tools/perf/tests/pe-file.exe
> # comment: these are binary files
>      0      0      0 ./tools/virtio/linux/hrtimer.h
>      0      0      0 ./tools/virtio/generated/autoconf.h
>      0      0      0 ./tools/virtio/crypto/hash.h
>      0      0      0 ./tools/build/tests/ex/empty/Build
>    252   1088   5563 ./arch/m68k/hp300/hp300map.map
> # comment: seems deliberately crafted, probably OK to ignore
>      0      0      0 ./arch/riscv/Kconfig.debug
>      0      0      0 ./drivers/s390/crypto/zcrypt_cex2c.h
>      0      0      0 ./drivers/s390/crypto/zcrypt_cex2c.c
>      0      0      0 ./drivers/s390/crypto/zcrypt_cex2a.h
>      0      0      0 ./drivers/s390/crypto/zcrypt_cex2a.c
>      0      0      0 ./drivers/staging/axis-fifo/README
>    358   1709  12218 ./drivers/tty/vt/defkeymap.map
> # comment: seems deliberately crafted, probably OK to ignore
>      0      0      0 ./drivers/gpu/drm/ci/xfails/virtio_gpu-none-flakes.txt
>      0      0      0 ./drivers/gpu/drm/ci/xfails/mediatek-mt8173-flakes.txt
>     89    482  16335 ./Documentation/images/logo.gif
> # comment: this is an image
>      0      0      0 ./Documentation/devicetree/bindings/media/s5p-mfc.txt
>      0      0      0 ./scripts/dummy-tools/dummy-plugin-dir/include/plugin-version.h
>   1190   6057 254726 total
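
For anyone who wants to repeat a similar check without encguess, here is a rough sketch in Python. The classify() helper is hypothetical and is not how encguess works internally; it only sorts byte strings into the same three buckets used in the filter above, and it assumes (as in the listing) that empty input should be reported as unknown:

```python
def classify(data: bytes) -> str:
    """Hypothetical stand-in for the check above; not encguess itself."""
    if not data:
        # Assumption: mirror the "unknown" reported for empty files above.
        return "unknown"
    try:
        data.decode("ascii")
        return "US-ASCII"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Neither ASCII nor valid UTF-8: binary data, Latin-1 text, etc.
        return "unknown"


samples = {
    "empty": b"",
    "ascii": b"#include <linux/nmi.h>\n",
    "utf8": "Alvin Šipraga\n".encode("utf-8"),
    "binary": b"MZ\x90\x00\xff\xfe",
}
for label, data in samples.items():
    print(label, classify(data))
```

Note that this sketch treats control bytes such as ESC as plain ASCII, so it would not flag a file merely for containing a console escape sequence.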
> 
> 
> ----8<--------- TEST SEQUENCE FOR THIS PATCH -----8<-------------
> 
> # fetch reference patch which exhibits this issue
> #   => name is corrupted
> linux $ git checkout master
> linux $ b4 shazam -P _ 20231014-alvin-clk-si5351-no-pll-reset-v4-1-a3567024007d@...g-olufsen.dk
> ...
> Applying: dt-bindings: clock: si5351: convert to yaml
> linux $ git format-patch HEAD^
> 0001-dt-bindings-clock-si5351-convert-to-yaml.patch
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi
> grep: (standard input): binary file matches
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
> " ipraga" <alsi@...g-olufsen.dk> (in file)
> 
> 
> # apply my patch to get_maintainer.pl
> #   => name is OK
> linux $ b4 shazam 20231014-get-maintainers-utf8-v1-1-3af8c7aeb239@...g-olufsen.dk
> ...
> Applying: get_maintainer: correctly parse UTF-8 encoded names in files
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
> Alvin Šipraga <alsi@...g-olufsen.dk> (in file)
> 
> 
> # remove 'use open qw(:std :encoding(UTF-8))'
> #   => name is still corrupted, slightly differently
> linux $ sed -i '/^use open/d' -i ./scripts/get_maintainer.pl
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
> ipraga <alsi@...g-olufsen.dk> (in file)
> 
> 
> # remove only the :std part
> #   => name is OK(?), but perl complains about wide char
> linux $ git restore .
> linux $ sed -i 's/:std //' -i ./scripts/get_maintainer.pl
> linux $ ./scripts/get_maintainer.pl 0001-dt-bindings-clock-si5351-convert-to-yaml.patch | grep alsi -a
> Wide character in print at ./scripts/get_maintainer.pl line 2522.
> Alvin Šipraga <alsi@...g-olufsen.dk> (in file)
