[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87ha1ic8rd.fsf@x220.int.ebiederm.org>
Date: Tue, 12 Aug 2014 03:17:10 -0700
From: ebiederm@...ssion.com (Eric W. Biederman)
To: Al Viro <viro@...IV.linux.org.uk>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
"Serge E. Hallyn" <serge@...lyn.com>,
Linux-Fsdevel <linux-fsdevel@...r.kernel.org>,
Kernel Mailing List <linux-kernel@...r.kernel.org>,
Andy Lutomirski <luto@...capital.net>,
Rob Landley <rob@...dley.net>,
Miklos Szeredi <miklos@...redi.hu>,
Christoph Hellwig <hch@...radead.org>,
Karel Zak <kzak@...hat.com>,
"J. Bruce Fields" <bfields@...ldses.org>,
Fengguang Wu <fengguang.wu@...el.com>, tytso@....edu
Subject: Re: [GIT PULL] Detaching mounts on unlink for 3.15
Al Viro <viro@...IV.linux.org.uk> writes:
> On Sun, May 11, 2014 at 05:45:30PM +0100, Al Viro wrote:
>> Sigh... It's really messy.
>> All versions since lazy fput introduction have acct_auto_close()
>> doing the wrong thing on r/o remount of superblock; we want the damn file
>> closed *before* we go further than acct_auto_close(). Worse, we are
>> holding ->s_umount there, so any kind of waiting would have to be very
>> careful to avoid deadlocks. What's more, prevention of open for write
>> hits acct_auto_close(), so even if we wait there, we still have a window
>> when new acct file could be opened and not auto-closed.
>> All versions have problems with acct_process() in the middle of
>> umount(); originally it was a blatant call of ->write() happening without
>> any regard for file getting closed, then it was file getting written to
>> and closed in the middle of fs shutdown, then - write/close capable of
>> pushing fs shutdown past the return from umount(2).
>> All versions have problems with acct(NULL) vs. umount - the latter
>> does not wait for the former. Eric's patches plug that one, but there's
>> a serious deadlock potential.
>
> OK, I think I've sorted that one out. Eric, could you take a look at
> vfs.git#for-eric? That's for-next + fix that ought to go into -stable +
> delayed-mntput() thing. The real PITA had been kernel/acct.c mess;
> that's dealt with in -next.
>
> I think it solves the problem with "mntput in deep call chain" cases
> added in your series. Final mntput() does fs shutdown, etc. on a shallow
> stack, via task_work_add() if at all possible. MNT_INTERNAL vfsmounts
> are dealt with synchronously, which solves the problem of failure exits
> halfway through module_init needing to tear down an internal vfsmount, etc.
> But those call sites are all on fairly shallow stack anyway. And such
> vfsmounts are not mounted on anything, so it's not something your changes
> could possibly step into. No extra context switches per syscall, at that...
>
> I hadn't added mntput_sync() - no visible use cases. If one shows up,
> it wouldn't be hard to add such primitive. And unlike fput() we do not
> try to support mntput() from interrupt, etc. - too much PITA with no
> obvious use cases. We'd need to decide whether we want to disable IRQs
> on lock_mount_hash(), etc. It's doable, but let's leave that until we
> get a serious reason to mess with all that.
I have read through it all and I nothing jumps out at me.
It takes a pretty special call of mntput to know that it is the final
reference count deference and care if we are synchronous or not, and all
of the once I have found that do care except do_unmount are MNT_INTERNAL
so we are good. Because do_unmount triggers the task_work_add path
in mntput that is also good.
The acct.c changes don't always have the best names (i.e. pin_get seems
a bit too generic) and the locking/refcounting makes me want to stare
at the the code and see if it is possible to solve things without
both a reference count and a mutex. But I did not see any problems
with that code.
My biggest technical comment is that the count in struct fs_pin
has a maximum value of just over PID_MAX_LIMIT (4*1024*1024) so it does
not need to be a long, a simple 32 bit integer would be fine.
So to the whole thing:
Acked-by: "Eric W. Biederman" <ebiederm@...ssion.com>
I have rebased my changes against vfs.git#for-eric and my changes work
just fine on top of the base you have built. The changes are avaiable
in user-namespace.git#vfs-detach-mounts10 so you just be able to just
pull the changes in.
Reading your pile #1 pull request to Linus it sounds like you are
planning to suck all of this into the vfs tree. If you have another
plan please let me know so I don't step on your toes by accident.
Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists