[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20190625162156.GA6128@tower.DHCP.thefacebook.com>
Date: Tue, 25 Jun 2019 16:22:00 +0000
From: Roman Gushchin <guro@...com>
To: Alexei Starovoitov <ast@...com>
CC: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Tejun Heo <tj@...nel.org>,
"bpf@...r.kernel.org" <bpf@...r.kernel.org>,
Kernel Team <Kernel-team@...com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH bpf-next] bpf: fix cgroup bpf release synchronization
On Tue, Jun 25, 2019 at 08:50:55AM -0700, Alexei Starovoitov wrote:
> On 6/23/19 9:02 PM, Roman Gushchin wrote:
> > On Sun, Jun 23, 2019 at 08:29:21PM -0700, Alexei Starovoitov wrote:
> >> On 6/23/19 7:30 PM, Roman Gushchin wrote:
> >>> Since commit 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf
> >>> from cgroup itself"), cgroup_bpf release occurs asynchronously
> >>> (from a worker context), and before the release of the cgroup itself.
> >>>
> >>> This introduced a previously non-existing race between the release
> >>> and update paths. E.g. if a leaf's cgroup_bpf is released and a new
> >>> bpf program is attached to the one of ancestor cgroups at the same
> >>> time. The race may result in double-free and other memory corruptions.
> >>>
> >>> To fix the problem, let's protect the body of cgroup_bpf_release()
> >>> with cgroup_mutex, as it was effectively previously, when all this
> >>> code was called from the cgroup release path with cgroup mutex held.
> >>>
> >>> Also make sure, that we don't leave already freed pointers to the
> >>> effective prog arrays. Otherwise, they can be released again by
> >>> the update path. It wasn't necessary before, because previously
> >>> the update path couldn't see such a cgroup, as cgroup_bpf and cgroup
> >>> itself were released together.
> >>
> >> I thought dying cgroup won't have any children cgroups ?
> >
> > It's not completely true, a dying cgroup can't have living children.
> >
> >> It should have been empty with no tasks inside it?
> >
> > Right.
> >
> >> Only some resources are still held?
> >
> > Right.
> >
> >> mutex and zero init are highly suspicious.
> >> It feels that cgroup_bpf_release is called too early.
> >
> > An alternative solution is to bump the refcounter on
> > every update path, and explicitly skip de-bpf'ed cgroups.
> >
> >>
> >> Thinking from another angle... if child cgroups can still attach then
> >> this bpf_release is broken.
> >
> > Hm, what do you mean under attach? It's not possible to attach
> > a new prog, but if a prog is attached to a parent cgroup,
> > a pointer can spill through "effective" array.
> >
> > But I agree, it's broken. Update path should ignore such
> > cgroups (cgroups, which cgroup_bpf was released). I'll take a look.
> >
> >> The code should be
> >> calling __cgroup_bpf_detach() one by one to make sure
> >> update_effective_progs() is called, since descendant are still
> >> sort-of alive and can attach?
> >
> > Not sure I get you. Dying cgroup is a leaf cgroup.
> >
> >>
> >> My money is on 'too early'.
> >> May be cgroup is not dying ?
> >> Just cgroup_sk_free() is called on the last socket and
> >> this auto-detach logic got triggered incorrectly?
> >
> > So, once again, what's my picture:
> >
> > A
> > A/B
> > A/B/C
> >
> > cpu1: cpu2:
> > rmdir C attach new prog to A
> > C got dying update A, update B, update C...
> > C's cgroup_bpf is released C's effective progs is replaced with new one
> > old is double freed
> >
> > It looks like it can be reproduced without any sockets.
>
> I see.
> Does it mean that css_for_each_descendant walks dying cgroups ?
Yes.
> I guess the fix then is to avoid walking them in update_effective_progs ?
>
Yes, this is close to what I'm testing now. We basically need to skip cgroups,
which bpf refcounter is 0 (and in atomic mode). These cgroups can't invoke bpf
programs, so there is no point in updates.
Powered by blists - more mailing lists