[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240915193443.GK2825852@ZenIV>
Date: Sun, 15 Sep 2024 20:34:43 +0100
From: Al Viro <viro@...iv.linux.org.uk>
To: Alice Ryhl <aliceryhl@...gle.com>
Cc: Paul Moore <paul@...l-moore.com>, James Morris <jmorris@...ei.org>,
"Serge E. Hallyn" <serge@...lyn.com>,
Miguel Ojeda <ojeda@...nel.org>,
Christian Brauner <brauner@...nel.org>,
Alex Gaynor <alex.gaynor@...il.com>,
Wedson Almeida Filho <wedsonaf@...il.com>,
Boqun Feng <boqun.feng@...il.com>, Gary Guo <gary@...yguo.net>,
Björn Roy Baron <bjorn3_gh@...tonmail.com>,
Benno Lossin <benno.lossin@...ton.me>,
Andreas Hindborg <a.hindborg@...sung.com>,
Peter Zijlstra <peterz@...radead.org>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Arve Hjønnevåg <arve@...roid.com>,
Todd Kjos <tkjos@...roid.com>, Martijn Coenen <maco@...roid.com>,
Joel Fernandes <joel@...lfernandes.org>,
Carlos Llamas <cmllamas@...gle.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Dan Williams <dan.j.williams@...el.com>,
Matthew Wilcox <willy@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>, Daniel Xu <dxu@...uu.xyz>,
Martin Rodriguez Reboredo <yakoyoku@...il.com>,
Trevor Gross <tmgross@...ch.edu>, linux-kernel@...r.kernel.org,
linux-security-module@...r.kernel.org,
rust-for-linux@...r.kernel.org, linux-fsdevel@...r.kernel.org,
Kees Cook <kees@...nel.org>
Subject: Re: [PATCH v10 6/8] rust: file: add `FileDescriptorReservation`
On Sun, Sep 15, 2024 at 07:39:05PM +0100, Al Viro wrote:
> 2) calling thread MUST NOT unshare descriptor table while it has
> any reserved descriptors. I.e.
> fd = get_unused_fd();
> unshare_files();
> fd_install(fd, file);
> is a bug. Reservations are discarded by that. Getting rid of that
> constraint would require tracking the sets of reserved descriptors
> separately for each thread that happens to share the descriptor table.
> Conceptually they *are* per-thread - the same thread that has done
> reservation must either discard it or use it. However, it's easier to
> keep the "it's reserved by some thread" represented in descriptor table
> itself (bit set in ->open_fds bitmap, file reference in ->fd[] array is
> NULL) than try and keep track of who's reserved what. The constraint is
> basically "all reservations can stay with the old copy", i.e. "caller has
> no reservations of its own to transfer into the new private copy it gets".
FWIW, I toyed with the idea of having reservations kept per-thread;
it is possible and it simplifies some things, but I hadn't been able to
find a way to do that without buggering syscall latency for open() et.al.
It would keep track of the set of reservations in task_struct (count,
two-element array for the first two + page pointer for spillovers,
for the rare threads that need more than two reserved simultaneously).
Representation in fdtable:
state open_fds bit value in ->fd[] array
free clear 0UL
reserved set 0UL
uncommitted set 1UL|(unsigned long)file
open set (unsigned long)file
with file lookup treating any odd value as 0 (i.e. as reserved)
fd_install() switching reserved to uncommitted *AND* separate
"commit" operation that does this:
if current->reservation_count == 0
return
if failure
for each descriptor in our reserved set
v = fdtable->fd[descriptor]
if (v) {
fdtable->fd[descriptor] = 0;
fput((struct file *)(v & ~1);
}
clear bit in fdtable->open_fds[]
else
for each descriptor in our reserved set
v = fdtable->fd[descriptor]
if (v)
fdtable->fd[descriptor] = v & ~1;
else
BUG
current->reservation_count = 0
That "commit" thing would be called on return from syscall
for userland threads and would be called explicitly in
kernel threads that have a descriptor table and work with
it.
The benefit would be that fd_install() would *NOT* have to be done
after the last possible failure exit - something that installs
a lot of files wouldn't have to play convoluted games on cleanup.
Simply returning an error would do the right thing.
Two stores into ->fd[] instead of one is not a big deal;
however, anything like task_work_add() to arrange the call
of "commit" ends up being bloody awful.
We could have it called from syscall glue directly, but
that means touching assembler for quite a few architectures,
and I hadn't gotten around to that. Can be done, but it's
not a pleasant work...
Powered by blists - more mailing lists