[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240915193443.GK2825852@ZenIV>
Date: Sun, 15 Sep 2024 20:34:43 +0100
From: Al Viro <viro@...iv.linux.org.uk>
To: Alice Ryhl <aliceryhl@...gle.com>
Cc: Paul Moore <paul@...l-moore.com>, James Morris <jmorris@...ei.org>,
	"Serge E. Hallyn" <serge@...lyn.com>,
	Miguel Ojeda <ojeda@...nel.org>,
	Christian Brauner <brauner@...nel.org>,
	Alex Gaynor <alex.gaynor@...il.com>,
	Wedson Almeida Filho <wedsonaf@...il.com>,
	Boqun Feng <boqun.feng@...il.com>, Gary Guo <gary@...yguo.net>,
	Björn Roy Baron <bjorn3_gh@...tonmail.com>,
	Benno Lossin <benno.lossin@...ton.me>,
	Andreas Hindborg <a.hindborg@...sung.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Arve Hjønnevåg <arve@...roid.com>,
	Todd Kjos <tkjos@...roid.com>, Martijn Coenen <maco@...roid.com>,
	Joel Fernandes <joel@...lfernandes.org>,
	Carlos Llamas <cmllamas@...gle.com>,
	Suren Baghdasaryan <surenb@...gle.com>,
	Dan Williams <dan.j.williams@...el.com>,
	Matthew Wilcox <willy@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>, Daniel Xu <dxu@...uu.xyz>,
	Martin Rodriguez Reboredo <yakoyoku@...il.com>,
	Trevor Gross <tmgross@...ch.edu>, linux-kernel@...r.kernel.org,
	linux-security-module@...r.kernel.org,
	rust-for-linux@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	Kees Cook <kees@...nel.org>
Subject: Re: [PATCH v10 6/8] rust: file: add `FileDescriptorReservation`
On Sun, Sep 15, 2024 at 07:39:05PM +0100, Al Viro wrote:
> 	2) calling thread MUST NOT unshare descriptor table while it has
> any reserved descriptors.  I.e.
> 	fd = get_unused_fd();
> 	unshare_files();
> 	fd_install(fd, file);
> is a bug.  Reservations are discarded by that.  Getting rid of that
> constraint would require tracking the sets of reserved descriptors
> separately for each thread that happens to share the descriptor table.
> Conceptually they *are* per-thread - the same thread that has done
> reservation must either discard it or use it.  However, it's easier to
> keep the "it's reserved by some thread" represented in descriptor table
> itself (bit set in ->open_fds bitmap, file reference in ->fd[] array is
> NULL) than try and keep track of who's reserved what.  The constraint is
> basically "all reservations can stay with the old copy", i.e. "caller has
> no reservations of its own to transfer into the new private copy it gets".
FWIW, I toyed with the idea of having reservations kept per-thread;
it is possible and it simplifies some things, but I hadn't been able to
find a way to do that without buggering syscall latency for open() et.al.
It would keep track of the set of reservations in task_struct (count,
two-element array for the first two + page pointer for spillovers,
for the rare threads that need more than two reserved simultaneously).
Representation in fdtable:
state		open_fds bit	value in ->fd[] array
free		clear		0UL
reserved	set		0UL
uncommitted	set		1UL|(unsigned long)file
open		set		(unsigned long)file
with file lookup treating any odd value as 0 (i.e. as reserved)
fd_install() switching reserved to uncommitted *AND* separate
"commit" operation that does this:
	if current->reservation_count == 0
		return
	if failure
		for each descriptor in our reserved set
			v = fdtable->fd[descriptor]
			if (v) {
				fdtable->fd[descriptor] = 0;
				fput((struct file *)(v & ~1);
			}
			clear bit in fdtable->open_fds[]
	else
		for each descriptor in our reserved set
			v = fdtable->fd[descriptor]
			if (v)
				fdtable->fd[descriptor] = v & ~1;
			else
				BUG
	current->reservation_count = 0
That "commit" thing would be called on return from syscall
for userland threads and would be called explicitly in
kernel threads that have a descriptor table and work with
it.
The benefit would be that fd_install() would *NOT* have to be done
after the last possible failure exit - something that installs
a lot of files wouldn't have to play convoluted games on cleanup.
Simply returning an error would do the right thing.
Two stores into ->fd[] instead of one is not a big deal;
however, anything like task_work_add() to arrange the call
of "commit" ends up being bloody awful.
We could have it called from syscall glue directly, but
that means touching assembler for quite a few architectures,
and I hadn't gotten around to that.  Can be done, but it's
not a pleasant work...
Powered by blists - more mailing lists
 
