Message-ID: <ZO4t6pnCokUoEsoi@tycho.pizza>
Date: Tue, 29 Aug 2023 11:42:02 -0600
From: Tycho Andersen <tycho@...ho.pizza>
To: Miklos Szeredi <miklos@...redi.hu>
Cc: Jürg Billeter <j@...ron.ch>,
"Eric W. Biederman" <ebiederm@...ssion.com>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
regressions@...ts.linux.dev
Subject: Re: [REGRESSION] fuse: execve() fails with ETXTBSY due to async
fuse_flush
On Mon, Aug 21, 2023 at 05:31:48PM +0200, Miklos Szeredi wrote:
(Apologies for the delay, I have been away without cell signal for
some time.)
> > I think the idea is that they're saving snapshots of their own threads
> > to the fs for debugging purposes.
>
> This seems a fairly special situation. Have they (whoever they may
> be) thought about fixing this in their server?
Sorry, "we" here is an internal team at my employer, Netflix. We
can't use IMAP clients with our corporate e-mail, whee.
> > Whether this is a sane thing to do or not, it doesn't seem like it
> > should deadlock pid ns destruction.
>
> True. So the suggested solution is to allow wait_event_killable() to
> return if a terminal signal is pending in the exiting state and only
> in that case turn the flush into a background request? That would
> still allow for regressions like the one reported, but that would be
> much less likely to happen in real life. Okay, I said this for the
> original solution as well, so this may turn out to be wrong as well.
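If I'm following, in fuse_flush() that would look something like the
below (just a sketch, not even compiled; for the background path the
args would really need to be heap allocated with an ->end callback to
free them, which I've left out):

	if (fatal_signal_pending(current) && (current->flags & PF_EXITING)) {
		/*
		 * The task is dying anyway, and the synchronous wait is
		 * what can wedge pid ns destruction, so fall back to a
		 * background request in just this case.
		 */
		err = fuse_simple_background(fm, &args, GFP_KERNEL);
	} else {
		err = fuse_simple_request(fm, &args);
	}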
I wonder if there's room here for a completion that doesn't use the
wait primitives. Something like an atomic + queuing in task_work()
would both fix this bug and not exhibit this regression, IIUC.
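Roughly what I have in mind (names invented, completely untested):
the submitter takes a reference and queues a callback on itself
instead of sleeping, and whichever of the reply or the task_work
drops the last reference does the teardown, so nobody ever sits in
wait_event_killable():

#include <linux/atomic.h>
#include <linux/slab.h>
#include <linux/task_work.h>

struct flush_done {
	atomic_t		refs;	/* submitter's task_work + reply */
	struct callback_head	twork;
	/* ... whatever per-request state the teardown needs ... */
};

static void flush_done_put(struct flush_done *fd)
{
	if (atomic_dec_and_test(&fd->refs))
		kfree(fd);	/* stand-in for the real teardown */
}

static void flush_done_twork(struct callback_head *cb)
{
	/*
	 * Runs in the submitter's context on its way back to userspace,
	 * or from exit_task_work() if it was killed -- no sleeping wait.
	 */
	flush_done_put(container_of(cb, struct flush_done, twork));
}

/* called from the request's ->end callback when the reply arrives */
static void flush_done_reply(struct flush_done *fd)
{
	flush_done_put(fd);
}

static int flush_done_submit(struct flush_done *fd)
{
	atomic_set(&fd->refs, 2);
	init_task_work(&fd->twork, flush_done_twork);
	/* TWA_RESUME: run on return to userspace */
	return task_work_add(current, &fd->twork, TWA_RESUME);
	/* (on failure the caller would have to drop the extra ref) */
}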
> Anyway, I'd prefer if this was fixed in the server code, as it looks
> fairly special and adding complexity to the kernel for this case might
> not be justifiable. But I'm also open to suggestions on fixing this
> in the kernel in a not too complex manner.
I don't think this is specific to the server-accessing-its-own-file
case. My reproducer uses that only because I didn't fully understand
the bug at the time. I believe *any* task that is killed with an
in-flight fuse request will exhibit this. We have seen it (fairly
rarely) with another fuse fs we use throughout the fleet,
https://github.com/lxc/lxcfs, which doesn't do anything strange and
is mounted from the host's pid ns.
Tycho