linux-kernel - Re: [linux-next:master] [lockref] d042dae6ad: unixbench.throughput -33.7% regression

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <CAHk-=wgJRoSW=RN=r=YGpG1nBa44=wkDLZ92SZnuuzA_6inJrA@mail.gmail.com>
Date: Wed, 3 Jul 2024 09:47:32 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Christian Brauner <brauner@...nel.org>, kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev, 
	lkp@...el.com, Linux Memory Management List <linux-mm@...ck.org>, linux-kernel@...r.kernel.org, 
	ying.huang@...el.com, feng.tang@...el.com, fengwei.yin@...el.com
Subject: Re: [linux-next:master] [lockref] d042dae6ad: unixbench.throughput
 -33.7% regression

On Wed, 3 Jul 2024 at 06:53, Mateusz Guzik <mjguzik@...il.com> wrote:
>
> Now I'm confused mate. Based on the convo so far I would expect you
> would consider the xfs thing a no-go for the machinery.
>
> You were rightfully pointing out the relationship dentry<->inode is not
> stable and care is needed to grab the pointer, and even then the pointer
> may be wrong by the time one finishes the work.
>
> I presume you are also worried about callbacks not taking proper steps
> when looking at the inode itself -- after all they can be racing with
> teardown and have to handle it gracefully by returning an error.

No. I'm *assuming* the callbacks don't take proper steps.

IOW,. the reason I think the callback model is the right model is
exactly because I do not believe any user will reasonably understand
and get all the RCU pathwalking rules right.

So my mental picture of the callback model is that it is entirely
speculative. It will *speculatively* fill in the stat data. And it
obviously will *not* fill it in in user space - because you can't do
user space accesses while in an RCU-locked region.

So the stat callback would purely fill in a speculative kernel buffer.

And then the path walking would confirm the sequence numbers *after*
calling the callback, and override the return to ECHILD and finish the
path walk non-speculatively if the sequence numbers don't match.

> Inode changing identities adds potential trouble which does not need to
> be there.

I agree that the XFS stuff may be questionable, but I still don't see
the problem wrt any stat callback. The sequence number tests would be
EXACTLY THE SAME ones that we currently use for regular file open etc.
If they are wrong for one case, they'd be wrong for another one.

I think you are coming into this from a backgroudn that would do the
stat buffer _without_ doing proper validation afterwards.

> Suppose the inode got reused and is now representing a device, i_rdev is
> some funky value.

Tell me how the inode gets re-used with the sequence numbers still
matching, and then tell me why regular path lookup doesn't have this
issue.

> There is also potential trouble with security modules as they
> unfortunately have a hook for getattr.

Oh, absolutel;y. The stat security callbacks would need fixing.
Exactly the same way we had to fix 'permission()' for RCU lookup with
the whole MAY_NOT_BLOCK thing, where they just say "I can't do that",
and we'd have to fall back on the non-RCU case.

And yes, filesystems might disable RCU stat - the same way filesystems
can currently react to revalidate etc under RCU with -ECHILD.

This is all *EXACTLY* why I think that "callback with error cases"
makes sense. Because the callback may well say exactly "do the old
thing" because it can't handle the RCU case, and it will depend on
things that are outside the control of path walking or stat itself.

                 Linus