Message-ID: <CALvZod5CpPhvzB99VZTc33Sb5YCbJNHFe3k33k+HwNfJvJbpJQ@mail.gmail.com>
Date: Tue, 1 Dec 2020 12:53:46 -0800
From: Shakeel Butt <shakeelb@...gle.com>
To: Axel Rasmussen <axelrasmussen@...gle.com>,
Tejun Heo <tj@...nel.org>
Cc: Greg Thelen <gthelen@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Chinwen Chang <chinwen.chang@...iatek.com>,
Daniel Jordan <daniel.m.jordan@...cle.com>,
David Rientjes <rientjes@...gle.com>,
Davidlohr Bueso <dbueso@...e.de>,
Ingo Molnar <mingo@...hat.com>, Jann Horn <jannh@...gle.com>,
Laurent Dufour <ldufour@...ux.ibm.com>,
Michel Lespinasse <walken@...gle.com>,
Stephen Rothwell <sfr@...b.auug.org.au>,
Steven Rostedt <rostedt@...dmis.org>,
Vlastimil Babka <vbabka@...e.cz>,
Yafang Shao <laoar.shao@...il.com>,
"David S . Miller" <davem@...emloft.net>, dsahern@...nel.org,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Jakub Kicinski <kuba@...nel.org>, liuhangbin@...il.com,
LKML <linux-kernel@...r.kernel.org>,
Linux MM <linux-mm@...ck.org>
Subject: Re: [PATCH] mm: mmap_lock: fix use-after-free race and css ref leak
in tracepoints
+Tejun Heo
On Tue, Dec 1, 2020 at 11:14 AM Axel Rasmussen <axelrasmussen@...gle.com> wrote:
>
> On Tue, Dec 1, 2020 at 10:42 AM Shakeel Butt <shakeelb@...gle.com> wrote:
> >
> > On Tue, Dec 1, 2020 at 9:56 AM Greg Thelen <gthelen@...gle.com> wrote:
> > >
> > > Axel Rasmussen <axelrasmussen@...gle.com> wrote:
> > >
> > > > On Mon, Nov 30, 2020 at 5:34 PM Shakeel Butt <shakeelb@...gle.com> wrote:
> > > >>
> > > >> On Mon, Nov 30, 2020 at 3:43 PM Axel Rasmussen <axelrasmussen@...gle.com> wrote:
> > > >> >
> > > >> > syzbot reported[1] a use-after-free introduced in 0f818c4bc1f3. The bug
> > > >> > is that an ongoing trace event might race with the tracepoint being
> > > >> > disabled (and therefore the _unreg() callback being called). Consider
> > > >> > this ordering:
> > > >> >
> > > >> > T1: trace event fires, get_mm_memcg_path() is called
> > > >> > T1: get_memcg_path_buf() returns a buffer pointer
> > > >> > T2: trace_mmap_lock_unreg() is called, buffers are freed
> > > >> > T1: cgroup_path() is called with the now-freed buffer
> > > >>
> > > >> Any reason to use the cgroup_path instead of the cgroup_ino? There are
> > > >> other examples of trace points using cgroup_ino and no need to
> > > >> allocate buffers. Also cgroup namespace might complicate the path
> > > >> usage.
> > > >
> > > > Hmm, so in general I would love to use a numeric identifier instead of a string.
> > > >
> > > > I did some reading, and it looks like the cgroup_ino() mainly has to
> > > > do with writeback, instead of being just a general identifier?
> > > > https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> >
> > I think you are confusing cgroup inodes with real filesystem inodes in that doc.
> >
> > > >
> > > > There is cgroup_id() which I think is almost what I'd want, but there
> > > > are a couple problems with it:
> > > >
> > > > - I don't know of a way for userspace to translate IDs -> paths, to
> > > > make them human readable?
> > >
> > > The id => name map can be built from user space with a tree walk.
> > > Example:
> > >
> > > $ find /sys/fs/cgroup/memory -type d -printf '%i %P\n' # ~ [main]
> > > 20387 init.scope
> > > 31 system.slice
> > >
> > > > - Also I think the ID implementation we use for this is "dense",
> > > > meaning if a cgroup is removed, its ID is likely to be quickly reused.
> > > >
> >
> > The ID for cgroup nodes (underlying it is kernfs) are allocated from
> > idr_alloc_cyclic() which gives new ID after the last allocated ID and
> > wrap after around INT_MAX IDs. So, likeliness of repetition is very
> > low. Also the file_handle returned by name_to_handle_at() for cgroupfs
> > returns the inode ID which gives confidence to the claim of low chance
> > of ID reusing.
>
> Ah, for some reason I remembered it using idr_alloc(), but you're
> right, it does use cyclical IDs. Even so, tracepoints which expose
> these IDs would still be difficult to use I think.
The writeback tracepoints in include/trace/events/writeback.h already
use cgroup IDs. They used to use cgroup_path but were converted to
cgroup_ino.
Tejun, how do you use these tracepoints?
> Say we're trying to
> collect a histogram of lock latencies over the course of some test
> we're running. At the end, we want to produce some kind of
> human-readable report.
>
I am assuming the test infra and the tracing infra are decoupled
entities, and that the test infra is orchestrating the cgroups as well.
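If that assumption holds, one option is for the test infra to note each
cgroup's ID at the moment it creates the cgroup, so no polling is needed.
A rough sketch follows. It assumes, as in Greg's find example, that the
inode number of the cgroup directory is the ID the tracepoint would report.
The helper name and the dict used as the ID->path table are made up for
illustration, and the demo runs against a temporary directory rather than
a real cgroup mount.

```python
import os
import tempfile


def create_cgroup_and_record(root, name, id_to_path):
    """Create directory `name` under `root` and record its inode number.

    Hypothetical helper: `root` would be a cgroup mount point in real use.
    The inode number is assumed to match the ID seen in trace output.
    """
    path = os.path.join(root, name)
    os.mkdir(path)
    ino = os.stat(path).st_ino
    id_to_path[ino] = name
    return ino


# Demo against a throwaway directory standing in for the cgroup mount.
with tempfile.TemporaryDirectory() as demo_root:
    demo_table = {}
    demo_id = create_cgroup_and_record(demo_root, "job1", demo_table)
    print(demo_id, demo_table[demo_id])
```

The table can then be dumped next to the trace output at the end of the
test and used to translate IDs back to names.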
> cgroups may come and go throughout the test. Even if we never re-use
> IDs, in order to be able to map all of them to human-readable paths,
> it seems like we'd need some background process to poll the
> /sys/fs/cgroup/memory directory tree as Greg described, keeping track
> of the ID<->path mapping. This seems expensive, and even if we poll
> relatively frequently we might still miss short-lived cgroups.
>
> Trying to aggregate such statistics across physical machines, or
> reboots of the same machine, is further complicated. The machine(s)
> may be running the same application, which runs in a container with
> the same path, but it'll end up with different IDs. So we'd have to
> collect the ID<->path mapping from each, and then try to match up the
> names for aggregation.
How about adding another tracepoint in cgroup_create which will output
the ID along with the name or path? With a little post-processing you
can get the same information. Also note that if the test deletes and
re-creates a cgroup with the same name, filtering by path alone cannot
tell the two instances apart.
IMHO cgroup IDs will make the kernel code much simpler, at the cost of
a bit more work in user space.
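
To give an idea of the post-processing, here is a rough sketch. The event
shapes are assumptions: creation events are taken to be (id, path) pairs
from the proposed cgroup_create tracepoint, and lock events to be
(id, latency_ns) pairs. Neither format exists today, and the IDs and
paths in the demo are invented.

```python
from collections import defaultdict


def aggregate_by_path(create_events, lock_events):
    """Group lock latencies by cgroup path, then by cgroup ID.

    create_events: iterable of (cgroup_id, path)       (hypothetical format)
    lock_events:   iterable of (cgroup_id, latency_ns) (hypothetical format)
    """
    id_to_path = dict(create_events)

    per_id = defaultdict(list)
    for cgroup_id, latency_ns in lock_events:
        per_id[cgroup_id].append(latency_ns)

    report = defaultdict(dict)
    for cgroup_id, latencies in per_id.items():
        # IDs with no creation event (cgroup predates tracing) stay visible.
        path = id_to_path.get(cgroup_id, "<unknown id %d>" % cgroup_id)
        report[path][cgroup_id] = latencies
    return {path: dict(ids) for path, ids in report.items()}


# The same path created twice (delete + re-create) shows up as two IDs.
demo_creates = [(31, "/test/job"), (47, "/test/job")]
demo_locks = [(31, 100), (47, 250), (31, 300), (99, 5)]
demo_report = aggregate_by_path(demo_creates, demo_locks)
print(demo_report)
```

Keying the inner level by ID keeps the two instances of "/test/job"
separate, while the outer level still lets you aggregate by path across
machines or reboots.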