netdev - Re: BUG: kernel NULL pointer dereference in __cgroup_bpf_run_filter

Message-ID: <CAM_iQpXuNX+75gybo3Qo9HhiZVPaDgwo3oEQuRS-ExDZGRCUCQ@mail.gmail.com>
Date:   Mon, 15 Jun 2020 14:34:14 -0700
From:   Cong Wang <xiyou.wangcong@...il.com>
To:     Daniël Sonck <dsonck92@...il.com>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: BUG: kernel NULL pointer dereference in __cgroup_bpf_run_filter_skb

On Mon, Jun 15, 2020 at 6:06 AM Daniël Sonck <dsonck92@...il.com> wrote:
>
> On Sun, Jun 14, 2020 at 10:43 PM Daniël Sonck <dsonck92@...il.com> wrote:
> >
> > Hello,
> >
> > On Sun, Jun 14, 2020 at 8:29 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
> > >
> > > Hello,
> > >
> > > On Sun, Jun 14, 2020 at 5:39 AM Daniël Sonck <dsonck92@...il.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I found in the archive that the bug I encountered has also happened
> > > > to others. I have a very similar stack trace. The issue I'm
> > > > experiencing is as follows:
> > > >
> > > > Whenever I fully boot my cluster, the host crashes after some time
> > > > with the __cgroup_bpf_run_filter_skb NULL pointer dereference.
> > > > Previously it was sporadic enough not to cause real issues. Lately,
> > > > however, the bug is triggered much more frequently. I've changed my
> > > > server hardware so I could capture serial output in order to get the
> > > > trace. This trace looks very similar to the one reported by Lu
> > > > Fengqi. As it currently stands, I cannot run the cluster, as it
> > > > crashes the host almost instantly.
> > >
> > > This has been reported multiple times. Are you able to test the
> > > attached patch? Let me know if everything goes fine with it.
> >
> > I will try out the patch. Since the host reliably crashed each time I
> > booted up the cluster VMs, I will be able to tell whether it has any
> > positive effect.
> > >
> > > I suspect we may still leak some cgroup refcnt even with the patch,
> > > but it might be much harder to trigger with this patch applied.
> >
> > I'm currently applying the patch to the kernel and compiling, so I
> > should know in a few hours.
>
> The compilation with the patch has finished, and I rebooted into the new
> kernel about 12 hours ago. So far the bug has not triggered, whereas
> without the patch it would have triggered by this time. Regardless, I will
> keep my serial connection open in case something pops up.

That is great. Please keep it running, as this is a race condition that
is not easy to trigger reliably.

Thanks for testing!