linux-kernel - Re: [PATCH v2] vfs: shave work on failed file open

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAGudoHHwvOMFqYoBQAoFwD9mMmtq12=EvEGQWeToYT0AMg9V0A@mail.gmail.com>
Date:   Fri, 29 Sep 2023 23:23:04 +0200
From:   Mateusz Guzik <mjguzik@...il.com>
To:     Christian Brauner <brauner@...nel.org>
Cc:     Jann Horn <jannh@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        viro@...iv.linux.org.uk, linux-kernel@...r.kernel.org,
        linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH v2] vfs: shave work on failed file open

On 9/29/23, Christian Brauner <brauner@...nel.org> wrote:
> On Fri, Sep 29, 2023 at 03:31:29PM +0200, Jann Horn wrote:
>> On Fri, Sep 29, 2023 at 11:20 AM Christian Brauner <brauner@...nel.org>
>> wrote:
>> > > But yes, that protection would be broken by SLAB_TYPESAFE_BY_RCU,
>> > > since then the "f_count is zero" is no longer a final thing.
>> >
>> > I've tried coming up with a patch that is simple enough so the pattern
>> > is easy to follow and then converting all places to rely on a pattern
>> > that combine lookup_fd_rcu() or similar with get_file_rcu(). The
>> > obvious
>> > thing is that we'll force a few places to now always acquire a
>> > reference
>> > when they don't really need one right now and that already may cause
>> > performance issues.
>>
>> (Those places are probably used way less often than the hot
>> open/fget/close paths though.)
>>
>> > We also can't fully get rid of plain get_file_rcu() uses itself because
>> > of users such as mm->exe_file. They don't go from one of the rcu
>> > fdtable
>> > lookup helpers to the struct file obviously. They rcu replace the file
>> > pointer in their struct ofc so we could change get_file_rcu() to take a
>> > struct file __rcu **f and then comparing that the passed in pointer
>> > hasn't changed before we managed to do atomic_long_inc_not_zero().
>> > Which
>> > afaict should work for such cases.
>> >
>> > But overall we would introduce a fairly big and at the same time subtle
>> > semantic change. The idea is pretty neat and it was fun to do but I'm
>> > just not convinced we should do it given how ubiquitous struct file is
>> > used and now to make the semanics even more special by allowing
>> > refcounts.
>> >
>> > I've kept your original release_empty_file() proposal in vfs.misc which
>> > I think is a really nice change.
>> >
>> > Let me know if you all passionately disagree. ;)
>
> So I'm appending the patch I had played with and a fix from Jann on top.
> @Linus, if you have an opinion, let me know what you think.
>
> Also available here:
> https://gitlab.com/brauner/linux/-/commits/vfs.file.rcu
>
> Might be interesting if this could be perfed to see if there is any real
> gain for workloads with massive numbers of fds.
>

I would feel safer with a guaranteed way to tell that the file was reallocated.

I think this could track allocs/frees with a sequence counter embedded
into the object, say odd means deallocated and even means allocated.

Then you would know for a fact whether you raced with the file getting
whacked and would never have to wonder if you double-checked
everything you needed (like that f_mode) thing.

This would also mean that consumers which get away with poking around
the file without getting a ref could still do it, this is at least
true for tid_fd_mode. All of them would need patching though.

Extending struct file is not ideal by any means, but the good news is that:
1. there is a 4 byte hole in there, if one is fine with an int-sized counter
2. if one insists on 8 bytes, the struct is 232 bytes on my kernel
(debian). still some room up to 256, so it may be tolerable?

-- 
Mateusz Guzik <mjguzik gmail.com>