lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 9 Dec 2020 16:54:11 +0000
From:   Al Viro <viro@...iv.linux.org.uk>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>
Cc:     linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        criu@...nvz.org, bpf@...r.kernel.org,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Christian Brauner <christian.brauner@...ntu.com>,
        Oleg Nesterov <oleg@...hat.com>,
        Cyrill Gorcunov <gorcunov@...il.com>,
        Jann Horn <jann@...jh.net>, Kees Cook <keescook@...omium.org>,
        Daniel P . Berrangé <berrange@...hat.com>,
        Jeff Layton <jlayton@...hat.com>,
        Miklos Szeredi <miklos@...redi.hu>,
        Matthew Wilcox <willy@...radead.org>,
        "J. Bruce Fields" <bfields@...ldses.org>,
        Trond Myklebust <trond.myklebust@...merspace.com>,
        Chris Wright <chrisw@...hat.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
        Andrii Nakryiko <andriin@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...omium.org>,
        "Paul E. McKenney" <paulmck@...nel.org>
Subject: Re: [PATCH v2 09/24] file: Replace fcheck_files with
 files_lookup_fd_rcu

[paulmck Cc'd]

On Mon, Dec 07, 2020 at 10:49:04PM +0000, Al Viro wrote:
> On Mon, Dec 07, 2020 at 10:46:57PM +0000, Al Viro wrote:
> > On Fri, Nov 20, 2020 at 05:14:26PM -0600, Eric W. Biederman wrote:
> > 
> > >  /*
> > >   * Check whether the specified fd has an open file.
> > >   */
> > > -#define fcheck(fd)	fcheck_files(current->files, fd)
> > > +#define fcheck(fd)	files_lookup_fd_rcu(current->files, fd)
> > 
> > Huh?
> > fs/file.c:1113: file = fcheck(oldfd);
> > 	dup3(), under ->file_lock, no rcu_read_lock() in sight
> > 
> > fs/locks.c:2548:                f = fcheck(fd);
> > 	fcntl_setlk(), ditto
> > 
> > fs/locks.c:2679:                f = fcheck(fd);
> > 	fcntl_setlk64(), ditto
> > 
> > fs/notify/dnotify/dnotify.c:330:        f = fcheck(fd);
> > 	fcntl_dirnotify(); this one _is_ under rcu_read_lock().
> > 
> > 
> > IOW, unless I've missed something earlier in the series, this is wrong.
> 
> I have missed something, all right.  Ignore that comment...

Actually, I take that back - the use of fcheck() in dnotify _is_ interesting.
At the very least it needs to be commented upon; what that code is trying
to prevent is a race between fcntl_dirnotify() and close(2)/dup2(2) closing
the descriptor in question.  The problem is, dnotify marks are removed
when we detach from descriptor table; that's done by filp_close() calling
dnotify_flush().

Suppose fcntl(fd, F_NOTIFY, 0) in one thread races with close(fd) in another
(both sharing the same descriptor table).  If the former had created and
inserted a mark *after* the latter has gotten past dnotify_flush(), there
would be nothing to evict that mark.

That's the reason that fcheck() is there.  rcu_read_lock() used to be
sufficient, but the locking has changed since then and even if it is
still enough, that's not at all obvious.

Exclusion is not an issue; barriers, OTOH...  Back then we had
->i_lock taken both by dnotify_flush() before any checks and
by fcntl_dirnotify() around the fcheck+insertion.  So on close
side we had
	store NULL into descriptor table
	acquire ->i_lock
	fetch ->i_dnotify
	...
	release ->i_lock
while on fcntl() side we had
	acquire ->i_lock
	fetch from descriptor table, sod off if not our file
	...
	store ->i_dnotify
	...
	release ->i_lock
Storing NULL into descriptor table could get reordered into
->i_lock-protected area in dnotify_flush(), but it could not
get reordered past releasing ->i_lock.  So fcntl_dirnotify()
either grabbed ->i_lock before dnotify_flush() (in which case
missing the store of NULL into descriptor table wouldn't
matter, since dnotify_flush() would've observed the mark
we'd inserted) or it would've seen that store to descriptor
table.

Nowadays it's nowhere near as straightforward; in fcntl_dirnotify()
we have
        /* this is needed to prevent the fcntl/close race described below */
        mutex_lock(&dnotify_group->mark_mutex);
and it would appear to be similar to the original situation, with
->mark_mutex serving in place of ->i_lock.  However, dnotify_flush()
might not take that mutex at all - it has
        fsn_mark = fsnotify_find_mark(&inode->i_fsnotify_marks, dnotify_group);
        if (!fsn_mark)
                return;
before grabbing that thing.  So the things are trickier - we actually
rely upon the barriers in fsnotify_find_mark().  And those are complicated.
The case when it returns non-NULL is not a problem - there we have that
mutex providing the barriers we need.  NULL can be returned in two cases:
	a) ->i_fsnotify_marks is not empty, but it contains no
dnotify marks.  In that case we have ->i_fsnotify_marks.lock acquired
and released.  By the time it gets around to fcheck(), fcntl_dirnotify() has
either found or created and inserted a dnotify mark into that list, with
->i_fsnotify_marks.lock acquired/released around the insertion, so we
are fine - either fcntl_dirnotify() gets there first (in which case
dnotify_flush() will observe it), or release of that lock by
fsnotify_find_mark() called by dnotify_flush() will serve as a barrier,
making sure that store to descriptor table will be observed.
	b) fsnotify_find_mark() (fsnotify_grab_connector() it calls,
actually) finds ->i_fsnotify_marks empty.  That's where the things
get interesting; we have
        idx = srcu_read_lock(&fsnotify_mark_srcu);
        conn = srcu_dereference(*connp, &fsnotify_mark_srcu);
        if (!conn)
                goto out;
on dnotify_flush() side.  The matching store of fcntl_dirnotify()
side would be in fsnotify_attach_connector_to_object(), where
we have
        /*
         * cmpxchg() provides the barrier so that readers of *connp can see
         * only initialized structure
         */
        if (cmpxchg(connp, NULL, conn)) {
                /* Someone else created list structure for us */

So we have
A:
	store NULL to descriptor table
	srcu_read_lock
	srcu_dereference fetches NULL from ->i_fsnotify_marks
vs.
B:
	cmpxchg replaces NULL with non-NULL in ->i_fsnotify_marks
	fetch from descriptor table, can't miss the store done by A

Which might be safe, but the whole thing *RELLY* needs to be discussed
in fcntl_dirnotify() in more details.  fs/notify/* guts are convoluted
enough to confuse anyone unfamiliar with them.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ