Date:   Tue, 26 Mar 2019 09:47:05 +0100
From:   Dmitry Vyukov <dvyukov@...gle.com>
To:     manfred <manfred@...orfullife.com>
Cc:     syzbot <syzbot+c92d3646e35bc5d1a909@...kaller.appspotmail.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Arnd Bergmann <arnd@...db.de>,
        Davidlohr Bueso <dave@...olabs.net>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        LKML <linux-kernel@...r.kernel.org>, linux@...inikbrodowski.net,
        syzkaller-bugs <syzkaller-bugs@...glegroups.com>
Subject: Re: BUG: corrupted list in freeary

On Mon, Dec 3, 2018 at 3:53 PM Dmitry Vyukov <dvyukov@...gle.com> wrote:
>
> On Sat, Dec 1, 2018 at 9:22 PM Manfred Spraul <manfred@...orfullife.com> wrote:
> >
> > Hi Dmitry,
> >
> > On 11/30/18 6:58 PM, Dmitry Vyukov wrote:
> > > On Thu, Nov 29, 2018 at 9:13 AM, Manfred Spraul
> > > <manfred@...orfullife.com> wrote:
> > >> Hello everyone,
> > >>
> > >> On 11/27/18 4:52 PM, syzbot wrote:
> > >>
> > >> Hello,
> > >>
> > >> syzbot found the following crash on:
> > >>
> > >> HEAD commit:    e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel...
> > >> git tree:       upstream
> > >> console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a3400000
> > [...]
> > >> Isn't this a kernel stack overrun?
> > >>
> > >> RSP: 0x..83e008. Assuming an 8 kB kernel stack and 8 kB alignment,
> > >> we have used up everything.
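For reference, a quick way to check that arithmetic: with 8 kB stacks
aligned to 8 kB, the offset of RSP within the stack is its value modulo
0x2000. A minimal userspace sketch, using only the low bits of the
reported RSP (the elided high bits don't affect the offset):

#include <stdio.h>

int main(void)
{
	/* Low bits of the RSP reported above. */
	unsigned long rsp = 0x83e008;
	/* Offset within an 8 kB (0x2000-byte) aligned stack: prints 0x8,
	 * i.e. only 8 bytes left before the bottom of the stack. */
	printf("offset: 0x%lx\n", rsp & (0x2000 - 1));
	return 0;
}
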
> > > I don't have an exact answer; that's just the kernel output that we
> > > captured from the console.
> > >
> > > FWIW with KASAN stacks are 16K:
> > > https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64_types.h#L10
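For reference, the definitions behind that link can be mirrored as a
runnable userspace sketch (x86-64 values of that era; PAGE_SIZE is
4 kB, and CONFIG_KASAN selects a larger stack order, so the stack is
well above the 8 kB assumed earlier):

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Mirrors arch/x86/include/asm/page_64_types.h. */
#ifdef CONFIG_KASAN
#define KASAN_STACK_ORDER 1
#else
#define KASAN_STACK_ORDER 0
#endif

#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)

int main(void)
{
	/* 16 kB at order 2; one order more with CONFIG_KASAN. */
	printf("THREAD_SIZE = %lu kB\n", THREAD_SIZE / 1024);
	return 0;
}
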
> > Ok, thanks. And stack overrun detection is enabled as well -> a real
> > stack overrun is unlikely.
> > > Well, generally everything except for kernel crashes is expected.
> > >
> > > We actually sandbox it with memcg quite aggressively:
> > > https://github.com/google/syzkaller/blob/master/executor/common_linux.h#L2159
> > > But the workload seems to manage either to break the limits or to
> > > cause some massive memory leaks. The nature of that is as yet unknown.
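The sandboxing is roughly in this spirit (a minimal sketch, not the
actual executor code; the cgroup name and limit below are made up, and
cgroup v1 paths are assumed):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const char *path, const char *data)
{
	int fd = open(path, O_WRONLY);
	if (fd < 0)
		return;
	write(fd, data, strlen(data));
	close(fd);
}

int main(void)
{
	char pid[32];

	/* Create a memory cgroup and cap it at ~200 MB. */
	mkdir("/sys/fs/cgroup/memory/syz-sandbox", 0777);
	write_file("/sys/fs/cgroup/memory/syz-sandbox/memory.limit_in_bytes",
		   "209715200");
	/* Move the current process into the group, then run the test. */
	snprintf(pid, sizeof(pid), "%d", (int)getpid());
	write_file("/sys/fs/cgroup/memory/syz-sandbox/tasks", pid);
	/* ... fork/exec the fuzzed workload here ... */
	return 0;
}
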
> >
> > Is it possible to start from that side?
> >
> > Are there other syzkaller runs where the OOM killer triggers that much?
>
> Lots of them:
>
> https://groups.google.com/forum/#!searchin/syzkaller-upstream-moderation/lowmem_reserve
> https://groups.google.com/forum/#!searchin/syzkaller-bugs/lowmem_reserve
>
> But nobody has got a handle on the reasons yet.
>
>
> > >> - Which stress tests are enabled? By chance, I found:
> > >>
> > >> [  433.304586] FAULT_INJECTION: forcing a failure.
> > >> [  433.304586] name fail_page_alloc, interval 1, probability 0, space 0, times 0
> > >> [  433.316471] CPU: 1 PID: 19653 Comm: syz-executor4 Not tainted 4.20.0-rc3+ #348
> > >> [  433.323841] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > >>
> > >> I need some more background, then I can review the code.
> > > What exactly do you mean by "Which stress tests"?
> > > Fault injection is enabled. Also random workload from userspace.
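For context, fail_page_alloc is configured through debugfs knobs that
match the values printed in the log above; a minimal sketch of a
typical setup (paths per Documentation/fault-injection; the values are
illustrative):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void write_file(const char *path, const char *data)
{
	int fd = open(path, O_WRONLY);
	if (fd < 0)
		return;
	write(fd, data, strlen(data));
	close(fd);
}

int main(void)
{
	/* Only fail allocations in explicitly opted-in tasks. */
	write_file("/sys/kernel/debug/fail_page_alloc/task-filter", "Y");
	/* Fail every eligible allocation, at most once. */
	write_file("/sys/kernel/debug/fail_page_alloc/probability", "100");
	write_file("/sys/kernel/debug/fail_page_alloc/interval", "1");
	write_file("/sys/kernel/debug/fail_page_alloc/times", "1");
	/* Opt the current task in to fault injection. */
	write_file("/proc/self/make-it-fail", "1");
	return 0;
}
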
> > >
> > >
> > >> Right now, I would put it into my "unknown syzkaller finding" folder.
> >
> > One more idea: Are there further syzkaller runs that end up with
> > 0x010000 in a pointer?
>
> Hard to say. syzbot has triggered millions of crashes. I can't say
> that I remember this as a distinctive pattern that has come up before.
>
> > From what I see, the sysv sem code that is used is trivial; I don't
> > see how it could cause the observed behavior.
>
> I propose that we postpone further investigation of this until we have
> a reproducer, this happens more than once, or we gather some other
> information.
> Half of the bugs are simple, so even for a crash that happened only
> once it makes sense to spend 10 minutes looking at the code in case
> the root cause is easy to spot; hundreds of bugs have been fixed this
> way. But I assume you already did this.
> The thing is that there are 100+ known bugs in the kernel that lead to
> memory corruptions:
> https://syzkaller.appspot.com/#upstream-open
> We try to catch them reliably with KASAN, but KASAN does not give a
> 100% guarantee. So if just one instance of a known bug goes unnoticed
> and leads to a memory corruption, it can later cause an unexplainable
> one-off crash like this. At this point the higher ROI will probably
> come from spending more time on the hundreds of other known bugs that
> have reproducers, have happened lots of times, or are just simpler.
> Once we get rid of most of them, hopefully such unexplainable crashes
> will go down too.

The working hypothesis for this bug is as follows.

semget provokes OOMs, and the OOMs in turn cause kernel stack
overflow/corruption in wb_workfn. So semget is something of a red
herring.
Since we now sandbox test processes with the sem sysctl and friends, I
think we can close this report.

#syz invalid

Though the kernel memory corruption on OOMs is still there.
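
For reference, the sem-sysctl sandboxing mentioned above amounts to
shrinking the SysV semaphore limits before the test processes run; a
minimal sketch (the values are illustrative, not the ones syzkaller
actually sets):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* kernel.sem takes four fields: SEMMSL SEMMNS SEMOPM SEMMNI.
	 * Tiny values keep semget from allocating much of anything. */
	const char *limits = "1 2 1 2";
	int fd = open("/proc/sys/kernel/sem", O_WRONLY);
	if (fd < 0)
		return 1;
	write(fd, limits, strlen(limits));
	close(fd);
	/* ... start the sandboxed workload here ... */
	return 0;
}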
