netdev - Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone

Message-ID: <20200630224829.GC37586@carbon.dhcp.thefacebook.com>
Date:   Tue, 30 Jun 2020 15:48:29 -0700
From:   Roman Gushchin <guro@...com>
To:     Cong Wang <xiyou.wangcong@...il.com>
CC:     Cameron Berkenpas <cam@...-zeon.de>, Zefan Li <lizefan@...wei.com>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Peter Geis <pgwipeout@...il.com>,
        Lu Fengqi <lufq.fnst@...fujitsu.com>,
        Daniël Sonck <dsonck92@...il.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Tejun Heo <tj@...nel.org>
Subject: Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

On Tue, Jun 30, 2020 at 03:22:34PM -0700, Cong Wang wrote:
> On Sat, Jun 27, 2020 at 4:41 PM Roman Gushchin <guro@...com> wrote:
> >
> > On Fri, Jun 26, 2020 at 10:58:14AM -0700, Cong Wang wrote:
> > > On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas <cam@...-zeon.de> wrote:
> > > >
> > > > Hello,
> > > >
> > > > Somewhere along the way I got the impression that it generally takes
> > > > those affected hours before their systems lock up. I'm (generally) able
> > > > to reproduce this issue much faster than that. Regardless, I can help test.
> > > >
> > > > Are there any patches that need testing or is this all still pending
> > > > discussion around the best way to resolve the issue?
> > >
> > > Yes. I came up with a (hopefully) much better patch in the attachment.
> > > Can you help to test it? You need to unapply the previous patch before
> > > applying this one.
> > >
> > > (Just in case of any confusion: I still believe we should check NULL on
> > > top of this refcnt fix. But it should be a separate patch.)
> > >
> > > Thank you!
> >
> > Not opposing the patch, but the Fixes tag is still confusing me.
> > Do we have an explanation for what's wrong with 4bfc0bb2c60e?
> >
> > It looks like we have cgroup_bpf_get()/put() exactly where we have
> > cgroup_get()/put(), so it would be nice to understand what's different
> > if the problem is bpf-related.
> 
> Hmm, I think it is Zefan who believes cgroup refcnt is fine, the bug
> is just in cgroup bpf refcnt, in our previous discussion.
> 
> Although I agree cgroup refcnt is buggy too, it may not necessarily
> cause any real problem, otherwise we would have received bug reports
> much earlier than just recently, right?
> 
> If the Fixes tag is confusing, I can certainly remove it, but this also
> means the patch will not be backported to stable. I am fine either
> way; this crash was only reported after Zefan's recent change anyway.

Well, I'm not trying to protect my commit, I just don't understand
the whole picture and what I see doesn't make complete sense to me.

I understand a problem which can be described as copying the cgroup pointer
on socket cloning without bumping the reference counter.
It seems that this problem is not caused by the bpf changes, so if we're adding
a Fixes tag, it must point at an earlier commit. Most likely, it was there from
the start, i.e. from bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup").
Do we know why Zefan's change made it reproducible?
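
To make the pointer-copy problem concrete, here is a minimal user-space sketch.
It is not kernel code: every toy_* name is hypothetical, and it only models the
general pattern of cloning a structure that holds a refcounted pointer, not the
actual behavior of cgroup_sk_alloc() or sk_clone_lock().

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for a refcounted cgroup and a socket that points at it. */
struct toy_cgroup {
    int refcnt; /* number of holders */
    int freed;  /* set once the last reference is dropped */
};

struct toy_sock {
    struct toy_cgroup *cgrp;
};

static void toy_cgroup_get(struct toy_cgroup *c)
{
    c->refcnt++;
}

static void toy_cgroup_put(struct toy_cgroup *c)
{
    if (--c->refcnt == 0)
        c->freed = 1; /* a real implementation would release the object here */
}

/* Clone by plain struct copy; optionally take a reference for the copy. */
static void toy_sock_clone(struct toy_sock *dst, const struct toy_sock *src,
                           int take_ref)
{
    *dst = *src; /* the cgroup pointer is copied along with everything else */
    if (take_ref)
        toy_cgroup_get(dst->cgrp);
}

static void toy_sock_free(struct toy_sock *sk)
{
    toy_cgroup_put(sk->cgrp);
    sk->cgrp = NULL;
}

/*
 * Returns 1 if the cgroup was released while the clone still pointed at it,
 * 0 if the clone's own reference kept it alive.
 */
static int toy_clone_leaves_dangling(int take_ref)
{
    struct toy_cgroup cg = { .refcnt = 1, .freed = 0 }; /* held by parent */
    struct toy_sock parent = { .cgrp = &cg };
    struct toy_sock child;

    toy_sock_clone(&child, &parent, take_ref);
    toy_sock_free(&parent); /* parent drops its reference */

    return cg.freed && child.cgrp == &cg;
}
```

In this model, toy_clone_leaves_dangling(0) reports that the clone is left
pointing at a released object, while toy_clone_leaves_dangling(1) does not,
because the clone holds its own reference.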

Btw if we want to backport the fix but can't blame a specific commit,
we can always use something like "Cc: <stable@...r.kernel.org>    [3.1+]".

Thanks!