linux-kernel - Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20240627-laufschuhe-hergibt-8158b7b6b206@brauner>
Date: Thu, 27 Jun 2024 17:33:08 +0200
From: Christian Brauner <brauner@...nel.org>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Jan Kara <jack@...e.cz>, "Ma, Yu" <yu.ma@...el.com>, 
	viro@...iv.linux.org.uk, edumazet@...gle.com, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, pan.deng@...el.com, tianyou.li@...el.com, tim.c.chen@...el.com, 
	tim.c.chen@...ux.intel.com
Subject: Re: [PATCH v2 1/3] fs/file.c: add fast path in alloc_fd()

On Wed, Jun 26, 2024 at 09:13:07PM GMT, Mateusz Guzik wrote:
> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@...e.cz> wrote:
> > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > into cache. So I'm afraid that using full_fds_bits in the condition would
> > destroy your performance gains. Thinking about this with a fresh head how
> > about putting implementing your optimization like:
> >
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> >         unsigned int maxbit = maxfd / BITS_PER_LONG;
> >         unsigned int bitbit = start / BITS_PER_LONG;
> >
> > +       /*
> > +        * Optimistically search the first long of the open_fds bitmap. It
> > +        * saves us from loading full_fds_bits into cache in the common case
> > +        * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > +        * a good chance there's a bit free in there.
> > +        */
> > +       if (start < BITS_PER_LONG) {
> > +               unsigned int bit;
> > +
> > +               bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > +               if (bit < BITS_PER_LONG)
> > +                       return bit;
> > +       }
> > +
> >         bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> >         if (bitbit >= maxfd)
> >                 return maxfd;
> >
> > Plus your optimizations with likely / unlikely. This way the code flow in
> > alloc_fd() stays more readable, we avoid loading the first open_fds long
> > into cache if it is full, and we should get all the performance benefits?
> >
> 
> Huh.
> 
> So when I read the patch previously I assumed this is testing the bit
> word for the map containing next_fd (whatever it is), avoiding looking
> at the higher level bitmap and inlining the op (instead of calling the
> fully fledged func for bit scans).
> 
> I did not mentally register this is in fact only checking for the
> beginning of the range of the entire thing. So apologies from my end
> as based on my feedback some work was done and I'm going to ask to
> further redo it.
> 
> blogbench spawns 100 or so workers, say total fd count hovers just
> above 100. say this lines up with about half of more cases in practice
> for that benchmark.
> 
> Even so, that's a benchmark-specific optimization. A busy web server
> can have literally tens of thousands of fds open (and this is a pretty
> mundane case), making the 0-63 range not particularly interesting.
> 
> That aside I think the patchset is in the wrong order -- first patch
> tries to not look at the higher level bitmap, while second reduces
> stores made there. This makes it quite unclear how much is it worth to
> reduce looking there if atomics are conditional.
> 
> So here is what I propose in terms of the patches:
> 1. NULL check removal, sprinkling of likely/unlikely and expand_files
> call avoidance; no measurements done vs stock kernel for some effort
> saving, just denote in the commit message there is less work under the
> lock and treat it as baseline
> 2. conditional higher level bitmap clear as submitted; benchmarked against 1
> 3. open_fds check within the range containing fd, avoiding higher
> level bitmap if a free slot is found. this should not result in any
> func calls if successful; benchmarked against the above
> 
> Optionally the bitmap routines can grow variants which always inline
> and are used here. If so that would probably land between 1 and 2 on
> the list.
> 
> You noted you know about blogbench bugs and have them fixed. Would be
> good to post a link to a pull request or some other spot for a
> reference.
> 
> I'll be best if the vfs folk comment on what they want here.

Optimizing only the < BIT_PER_LONG seems less desirable then making it
work for arbitrary next_fd. Imho, it'll also be easier to follow if
everything follows the same logic.