linux-kernel - RE: [PATCH 2/2] lib/string.c: Optimize memchr()

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <658586b208ea4339b3ead19378484434@AcuMS.aculab.com>
Date:   Wed, 13 Jul 2022 10:24:15 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Andrey Semashev' <andrey.semashev@...il.com>,
        'Yu-Jen Chang' <arthurchang09@...il.com>
CC:     "andy@...nel.org" <andy@...nel.org>,
        "akinobu.mita@...il.com" <akinobu.mita@...il.com>,
        Ching-Chun Huang <jserv@...s.ncku.edu.tw>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH 2/2] lib/string.c: Optimize memchr()

From: Andrey Semashev
> Sent: 13 July 2022 11:03
> 
> On 7/13/22 12:49, Andrey Semashev wrote:
> > On 7/13/22 12:39, David Laight wrote:
> >> From: Yu-Jen Chang
> >>> Sent: 12 July 2022 15:59
> >> ...
> >>>> I think you're missing the point. Loads at unaligned addresses may not
> >>>> be allowed by hardware using conventional load instructions or may be
> >>>> inefficient. Given that this memchr implementation is used as a fallback
> >>>> when no hardware-specific version is available, you should be
> >>>> conservative wrt. hardware capabilities and behavior. You should
> >>>> probably have a pre-alignment loop.
> >>>
> >>> Got it. I add  pre-alignment loop. It aligns the address to 8 or 4bytes.
> >>
> >> That should be predicated on !HAS_EFFICIENT_UNALIGNED_ACCESS.
> >>
> >> ...
> >>>         for (; p <= end - 8; p += 8) {
> >>>             val = *(u64*)p ^ mask;
> >>>             if ((val + 0xfefefefefefefeffull)
> >>> & (~val & 0x8080808080808080ull))
> >>>                 break;
> >>
> >> I would add a couple of comments, like:
> >> 	// Convert to check for zero byte.
> >> 	// Standard check for a zero byte in a word.
> >> (But not the big 4 line explanation you had.
> >>
> >> It is also worth looking at how that code compiles
> >> on 32bit arch that don't have a carry flag.
> >> That is everything based on MIPS, including riscv.
> >
> > It may be worth looking at how glibc does it:
> >
> >
> https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memchr.c;h=422bcd0cd646ea46711a57fa3cbdb8a3329
> fc302;hb=refs/heads/release/2.35/master#l46
> >
> > They do use 32-bit words on 32-bit targets and 64-bit on 64-bit ones. I
> > think memchr in the kernel should follow this.
> 
> Also, if by chance this optimization is aimed for x86-64, it may be
> worth adding an arch-specific version that uses ERMS.

Don't believe everything you see in glibc.
The common cases in the kernel are different from the ones they
tend to optimise for..

You might try using:
	#define GEN_MASK(x) (x * (unsigned long)0x0101010101010101ull)
for the mask and the two constants.
Then make all the variables 'long'.

I'm not at all sure what the test for fast multiply is about.
It may be very historic, for modern cpu generating the 64bit
constant is likely to be more problematic (check arm64).
If the input value is 'unsigned char' (or masked) then the
compiler may decide to do the repeated <<= itself.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)