[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Y1HRxNpxYcFHf/8R@shell.armlinux.org.uk>
Date: Thu, 20 Oct 2022 23:55:00 +0100
From: "Russell King (Oracle)" <linux@...linux.org.uk>
To: Yury Norov <yury.norov@...il.com>
Cc: Catalin Marinas <catalin.marinas@....com>,
Mark Rutland <mark.rutland@....com>,
Will Deacon <will@...nel.org>,
linux-arm-kernel@...ts.infradead.org,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Alexey Klimov <klimov.linux@...il.com>,
Andy Shevchenko <andy.shevchenko@...il.com>,
Andy Whitcroft <apw@...onical.com>,
Dennis Zhou <dennis@...nel.org>,
Geert Uytterhoeven <geert@...ux-m68k.org>,
Guenter Roeck <linux@...ck-us.net>,
Kees Cook <keescook@...omium.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Rasmus Villemoes <linux@...musvillemoes.dk>
Subject: Re: [RFC PATCH 0/2] Switch ARM to generic find_bit() API
On Thu, Oct 20, 2022 at 01:02:07PM -0700, Yury Norov wrote:
> On Thu, Oct 20, 2022 at 05:51:34PM +0100, Russell King (Oracle) wrote:
> > On Wed, Oct 19, 2022 at 08:20:22PM -0700, Yury Norov wrote:
> > > Hi Russell, all,
> > >
> > > I'd like to respin a patch that switches ARM to generic find_bit()
> > > functions.
> > >
> > > Generic code works on par with arch or better, according to my
> > > testing [1], and with recent improvements merged in v6.1, it should
> > > be even faster.
> > >
> > > ARM already uses many generic find_bit() functions - those that it
> > > doesn't implement. So we are talking about migrating a subset of the
> > > API; most of find_bit() family has only generic implementation on ARM.
> > >
> > > The only concern about this migration is that ARM code supports
> > > byte-aligned bitmap addresses, while generic code is optimized for
> > > word-aligned bitmaps.
> > >
> > > In my practice, I've never seen unaligned bitmaps. But to check that on
> > > ARM, I added a run-time check for bitmap alignment. I gave it run on
> > > several architectures and found nothing.
> > >
> > > Can you please check that on your hardware and compare performance of
> > > generic vs arch code for you? If everything is OK, I suggest switching
> > > ARM to generic find_bit() completely.
> > >
> > > Thanks,
> > > Yury
> > >
> > > [1] https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/
> >
> > I _really_ don't want to play around with this stuff right now... 6.0
> > appears to have a regression on arm32 early on during boot:
> >
> > [ 1.410115] EXT4-fs error (device sda1): htree_dirblock_to_tree:1093: inode #256: block 8797: comm systemd: bad entry in directory: rec_len % 4 != 0 - offset=0, inode=33188, rec_len=35097, size=4096 fake=0
> >
> > Booting 5.19 with the same filesystem works without issue and without
> > even a fsck, but booting 6.0 always results in some problem that
> > prevents it booting.
> >
> > Debugging this is not easy, because there also seems to be something
> > up with the bloody serial console - sometimes I get nothing, other
> > times I get nothing more than:
> >
> > [ 2.929502] EXT4-fs error (de
> >
> > and then the output stops. Is the console no longer synchronous? If it
> > isn't, that's a huge mistake which can be seen right here with the
> > partial message output... so I also need to work out how to make the
> > console output synchronous again.
>
> Got it.
>
> I you think that EXT4 problems are due to unaligned bitmaps, you can take
> 1st patch from this series to check.
Got to the bottom of it, it wasn't the bit array functions, it was DMA
API issues.
Okay, I've now tested the generic ops vs my updated optimised ops,
and my version still comes out faster (based on three runs). The
random-filled show less difference, but the sparse bitmaps show a
much better win for my optimised code over the generic code where
they exist:
arm: [ 694.614773] find_next_bit: 40078 ns, 656 iterations
gen: [ 88.611973] find_next_bit: 69943 ns, 655 iterations
arm: [ 694.625436] find_next_zero_bit: 3939309 ns, 327025 iterations
gen: [ 88.624227] find_next_zero_bit: 5529553 ns, 327026 iterations
arm: [ 694.646236] find_first_bit: 7301363 ns, 656 iterations
gen: [ 88.645313] find_first_bit: 7589120 ns, 655 iterations
These figures appear to be pretty consistent across the three
runs.
For completness, here's the random-filled results:
arm: [ 694.109190] find_next_bit: 2242618 ns, 163949 iterations
gen: [ 88.167340] find_next_bit: 2632859 ns, 163743 iterations
arm: [ 694.117968] find_next_zero_bit: 2049129 ns, 163732 iterations
gen: [ 88.176912] find_next_zero_bit: 2844221 ns, 163938 iterations
arm: [ 694.151421] find_first_bit: 17778911 ns, 16448 iterations
gen: [ 88.211167] find_first_bit: 18596622 ns, 16401 iterations
So, I don't see much reason to switch to the generic ops for
these, not when we have such a significant speedup on the
find_next_* functions for sparse-filled results..
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Powered by blists - more mailing lists