linux-kernel - Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240131075740.660e7634@rorschach.local.home>
Date: Wed, 31 Jan 2024 07:57:40 -0500
From: Steven Rostedt <rostedt@...dmis.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Al Viro <viro@...iv.linux.org.uk>, Masami Hiramatsu
 <mhiramat@...nel.org>, linux-kernel@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org
Subject: Re: [PATCH 5/6] eventfs: get rid of dentry pointers without
 refcounts

On Tue, 30 Jan 2024 22:20:24 -0800
Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> On Tue, 30 Jan 2024 at 21:57, Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
> >
> > Ugh.  
> 
> Oh, and double-ugh on that tracefs_syscall_mkdir/rmdir(). I hate how
> it does that "unlock and re-lock inode" thing.

I'd figured you'd like that one.

> 
> It's a disease, and no, it's not an acceptable response to "lockdep
> shows there's a locking problem".
> 
> The comment says "It is up to the individual mkdir routine to handle
> races" but that's *completely* bogus and shows a fundamental
> misunderstanding of locking.
> 
> Dropping the lock in the middle of a locked section does NOT affect
> just the section that you dropped the lock around.
> 
> It affects the *caller*, who took the lock in the first place, and who
> may well assume that the lock protects things for the whole duration
> of the lock.
> 
> And so dropping the lock in the middle of an operation is a bad BAD
> *BAD* thing to do.
> 
> Honestly, I had only been looking at the eventfs_inode.c lookup code.
> But now that I started looking at mkdir/rmdir, I'm seeing the same
> signs of horrible mistakes there too.
> 
> And yes, it might be a real problem. For example, for the rmdir case,
> the actual lock was taken by 'vfs_rmdir()', and it does *not* protect
> only the ->rmdir() call itself.
> 
> It also, for example, is supposed to make the ->rmdir be atomic wrt things like
> 
>         dentry->d_inode->i_flags |= S_DEAD;
> 
> and
> 
>         dont_mount(dentry);
>         detach_mounts(dentry);
> 
> but now because the inode lock was dropped in the middle, for all we
> know a "mkdir()" could come in on the same name, and make a mockery of
> the whole inode locking.
> 
> The inode lock on that directory that is getting removed is also what
> is supposed to make it impossible to add new files to the directory
> while the rmdir is going on.
> 
> Again, dropping the lock violates those kinds of guarantees.
> 
> "git blame" actually fingers Al for that "unlock and relock", but
> that's because Al did a search-and-replace on
> "mutex_[un]lock(&inode->i_mutex)" and replaced it with
> "inode_[un]lock(inode)" back in 2016.
> 
> The real culprit is you, and that sh*t-show goes back to commit
> 277ba04461c2 ("tracing: Add interface to allow multiple trace
> buffers").
> 
> Christ. Now I guess I need to start looking if there is anything else
> horrid lurking in that mkdir/rmdir path.
> 
> I doubt the inode locking problem ends up mattering in this situation.
> Mostly since this is only tracefs, and normal users shouldn't be able
> to access these things anyway. And I think (hope?) that you only
> accept mkdir/rmdir in specific places.

Yes, this was very deliberate. It was for a very "special" feature, and
thus very limited.

> 
> But this really is another case of "This code smells *fundamentally* wrong".
> 
> Adding Al, in the hopes that he will take a look at this too.

This is something I asked Al about when I wrote it. This isn't a normal
rmdir. I remember telling Al what this was doing. Basically, it's just
a way to tell the tracing infrastructure to create a new instance. It
doesn't actually create a normal directory. It's similar to the
kprobe_events interface, where writing to a file would create
directories and files. Ideally I wanted a mkdir interface as it felt
more realistic, and I was ready to have another "echo foo > make_instance"
if this didn't work. But after talking with Al, I felt that it could
work. The main issue is that mkdir doesn't just make a directory, it
creates the entire tree (like what is done in /sys/fs/cgroup). If this
was more like kernfs instead of debugfs, I may not have had this
problem. That was another time I tried to understand how krenfs worked.

This is the reason all the opens in the tracing code has:

        struct trace_array *tr = inode->i_private;
        int ret;

        ret = tracing_check_open_get_tr(tr);
	if (ret)
		return ret;


In kernel/trace/trace.c:

LIST_HEAD(ftrace_trace_arrays);

int trace_array_get(struct trace_array *this_tr)
{
        struct trace_array *tr;
        int ret = -ENODEV;
        
        mutex_lock(&trace_types_lock);
        list_for_each_entry(tr, &ftrace_trace_arrays, list) {
                if (tr == this_tr) {
                        tr->ref++;
                        ret = 0;
                        break;
                }
        }
        mutex_unlock(&trace_types_lock);
                
        return ret;
}       

static void __trace_array_put(struct trace_array *this_tr)
{
        WARN_ON(!this_tr->ref);
        this_tr->ref--;       
}

void trace_array_put(struct trace_array *this_tr)
{
        if (!this_tr)
                return;

        mutex_lock(&trace_types_lock);
        __trace_array_put(this_tr);
        mutex_unlock(&trace_types_lock);
}

int tracing_check_open_get_tr(struct trace_array *tr)
{
        int ret;

        ret = security_locked_down(LOCKDOWN_TRACEFS);
        if (ret)
                return ret;

        if (tracing_disabled)
                return -ENODEV;

        if (tr && trace_array_get(tr) < 0)
                return -ENODEV;

        return 0;
}

The rmdir code that tracefs calls is also in kernel/trace/trace.c:

static int instance_rmdir(const char *name)
{
        struct trace_array *tr;
        int ret;

        mutex_lock(&event_mutex);
        mutex_lock(&trace_types_lock);

        ret = -ENODEV;
        tr = trace_array_find(name);
        if (tr)
                ret = __remove_instance(tr);

        mutex_unlock(&trace_types_lock);
        mutex_unlock(&event_mutex);

        return ret;
}

The mkdir creates a new trace_array (tr) and rmdir destroys it. If
there's any reference on a trace array the rmdir fails. On every open,
a check is made to see if the trace array that was added to the inode
still exists, and if it does, it ups its ref count to prevent a rmdir
from happening.

It's a very limited mkdir and rmdir. I remember Al asking me questions
about what it can and can't do, and me telling him to make sure the
lock release wasn't going to cause problems.

I wrote several fuzzers on this code that perform mkdir and rmdir on the
instances while other tasks are trying to access them (reading and
writing). In the beginning, it found a few bugs. But I was finally able
to get my fuzzers to work non-stop for over a month and that's when I
felt that this is fine.

I was always weary of this code because of the locking situation. The
kernel selftests has a stress test on this code too.

  tools/testing/selftests/ftrace/test.d/instances/instance-event.tc
  tools/testing/selftests/ftrace/test.d/instances/instance.tc

Which is run as part of the selftest suite, which is run by most people
testing the kernel, including KernelCI.

I haven't had a problem with this code in years, and unless we find a
real bug that needs to be fixed, I'm afraid to touch it.

-- Steve