linux-kernel - Re: x86: WARNING: kernel stack frame pointer in kworker has bad value

Message-ID: <CACT4Y+YLXGoo4qoGVTwpEmY1OqPNsw+Hvw57Hvvxi_+bV60S8g@mail.gmail.com>
Date:   Tue, 24 Jan 2017 17:06:15 +0100
From:   Dmitry Vyukov <dvyukov@...gle.com>
To:     Josh Poimboeuf <jpoimboe@...hat.com>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        "x86@...nel.org" <x86@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        syzkaller <syzkaller@...glegroups.com>
Subject: Re: x86: WARNING: kernel stack frame pointer in kworker has bad value

On Tue, Jan 24, 2017 at 4:48 PM, Josh Poimboeuf <jpoimboe@...hat.com> wrote:
> On Tue, Jan 24, 2017 at 02:37:28PM +0100, Dmitry Vyukov wrote:
>> On Mon, Jan 23, 2017 at 11:50 PM, Josh Poimboeuf <jpoimboe@...hat.com> wrote:
>> > On Mon, Jan 23, 2017 at 07:43:18PM +0100, Dmitry Vyukov wrote:
>> >> Hello,
>> >>
>> >> I am getting the following warnings while running syzkaller fuzzer on
>> >> 7a308bb3016f57e5be11a677d15b821536419d36:
>> >>
>> >> WARNING: kernel stack frame pointer at ffff880065f7fef8 in
>> >> kworker/1:3:21075 has bad value ffffffff81598080
>> >> unwind stack type:0 next_sp:          (null) mask:2 graph_idx:0
>> >> ...
>> >>
>> >> Here are 4 examples:
>> >>
>> >> https://gist.githubusercontent.com/dvyukov/7258fca11e5c3db6e8f8fb684e6168ff/raw/2ca14f3c7294c7e466611f8ecfb3072c676a657c/gistfile1.txt
>> >> https://gist.githubusercontent.com/dvyukov/cc442b144ae712f0e1bdeb065085e36a/raw/f713cef7e9689914fc94838a0a7d4192224fb6d2/gistfile1.txt
>> >> https://gist.githubusercontent.com/dvyukov/9058cce2647e3d37157c283c6400498c/raw/7a8ec8d71b9e95eeb24156073eeef96ba61d95ea/gistfile1.txt
>> >> https://gist.githubusercontent.com/dvyukov/b899dfa72d611b511a674b63676921f3/raw/90203ed26b6674dedba8c1c16dd76309ef21fbea/gistfile1.txt
>> >>
>> >> Unfortunately this is not reproducible, but there is a stack dump, so
>> >> maybe it's enough to figure out the root cause.
>> >
>> > This seems like some kind of stack corruption.  Here are the stack
>> > traces:
>> >
>> >> https://gist.githubusercontent.com/dvyukov/7258fca11e5c3db6e8f8fb684e6168ff/raw/2ca14f3c7294c7e466611f8ecfb3072c676a657c/gistfile1.txt
>> >
>> >   ffff8800520eef68: ffffffff812b148e (__save_stack_trace+0x6e/0xd0)
>> >   ffff8800520eefe8: ffffffff812b1506 (save_stack_trace+0x16/0x20)
>> >   ffff8800520eeff8: ffffffff81a0c843 (save_stack+0x43/0xd0)
>> >   ffff8800520ef228: ffffffff81a0d17f (kasan_slab_free+0x6f/0xb0)
>> >   ffff8800520ef250: ffffffff81a0a4e1 (kmem_cache_free+0x71/0x240)
>> >   ffff8800520ef280: ffffffff822924d4 (ioc_release_fn+0x2e4/0x380)
>> >   ffff8800520ef550: ffffffff814b4900 (process_one_work+0xbd0/0x1c10)
>> >   ffff8800520ef9a8: ffffffff814b5b63 (worker_thread+0x223/0x1990)
>> >   ffff8800520eff00: ffffffff814ce526 (kthread+0x326/0x3f0)
>> >
>> >> https://gist.githubusercontent.com/dvyukov/cc442b144ae712f0e1bdeb065085e36a/raw/f713cef7e9689914fc94838a0a7d4192224fb6d2/gistfile1.txt
>> >
>> >   ffff88005b40ef68: ffffffff812b260e (__save_stack_trace+0x6e/0xd0)
>> >   ffff88005b40efe8: ffffffff812b2686 (save_stack_trace+0x16/0x20)
>> >   ffff88005b40eff8: ffffffff81a0e8c3 (save_stack+0x43/0xd0)
>> >   ffff88005b40f228: ffffffff81a0f1ff (kasan_slab_free+0x6f/0xb0)
>> >   ffff88005b40f250: ffffffff81a0c561 (kmem_cache_free+0x71/0x240)
>> >   ffff88005b40f280: ffffffff822943b4 (ioc_release_fn+0x2e4/0x380)
>> >   ffff88005b40f550: ffffffff814b6980 (process_one_work+0xbd0/0x1c10)
>> >   ffff88005b40f9a8: ffffffff814b7be3 (worker_thread+0x223/0x1990)
>> >   ffff88005b40ff00: ffffffff814d05a6 (kthread+0x326/0x3f0)
>> >
>> >> https://gist.githubusercontent.com/dvyukov/9058cce2647e3d37157c283c6400498c/raw/7a8ec8d71b9e95eeb24156073eeef96ba61d95ea/gistfile1.txt
>> >
>> >   ffff880065f7ef68: ffffffff812b260e (__save_stack_trace+0x6e/0xd0)
>> >   ffff880065f7efe8: ffffffff812b2686 (save_stack_trace+0x16/0x20)
>> >   ffff880065f7eff8: ffffffff81a0e8c3 (save_stack+0x43/0xd0)
>> >   ffff880065f7f228: ffffffff81a0f1ff (kasan_slab_free+0x6f/0xb0)
>> >   ffff880065f7f250: ffffffff81a0c561 (kmem_cache_free+0x71/0x240)
>> >   ffff880065f7f280: ffffffff822943b4 (ioc_release_fn+0x2e4/0x380)
>> >   ffff880065f7f550: ffffffff814b6980 (process_one_work+0xbd0/0x1c10)
>> >   ffff880065f7f9a8: ffffffff814b7be3 (worker_thread+0x223/0x1990)
>> >   ffff880065f7ff00: ffffffff814d05a6 (kthread+0x326/0x3f0)
>> >
>> >> https://gist.githubusercontent.com/dvyukov/b899dfa72d611b511a674b63676921f3/raw/90203ed26b6674dedba8c1c16dd76309ef21fbea/gistfile1.txt
>> >
>> >   ffff880065e57220: ffffffff812b148e (__save_stack_trace+0x6e/0xd0)
>> >   ffff880065e572a0: ffffffff812b1506 (save_stack_trace+0x16/0x20)
>> >   ffff880065e572b0: ffffffff81a0c843 (save_stack+0x43/0xd0)
>> >   ffff880065e574e0: ffffffff81a0cb0a (kasan_kmalloc+0xaa/0xd0)
>> >   ffff880065e57520: ffffffff81a0d102 (kasan_slab_alloc+0x12/0x20)
>> >   ffff880065e57530: ffffffff81a08704 (kmem_cache_alloc+0xf4/0x680)
>> >   ffff880065e57598: ffffffff823293b3 (ida_pre_get+0x1f3/0x3b0)
>> >   ffff880065e57720: ffffffff823296d8 (ida_simple_get+0x168/0x320)
>> >   ffff880065e578e8: ffffffff814ad926 (create_worker+0x96/0x620)
>> >   ffff880065e579a8: ffffffff814b68ad (worker_thread+0xf6d/0x1990)
>> >   ffff880065e57f00: ffffffff814ce526 (kthread+0x326/0x3f0)
>> >
>> > In all four cases, the frame pointer pushed by worker_thread() is
>> > somehow changed to be the address at the end of
>> > debug_check_no_locks_freed():
>> >
>> >   ffff880065f7fef8: ffffffff81598080 (debug_check_no_locks_freed+0x5c0/0x5c0)
>> >
>> > Can you please share the following:
>> >
>> > - The output of "grep ffffffff81598080 System.map"
>> > - The .config file
>>
>>
>> Hi,
>>
>> Thanks for looking at this.
>>
>> I've attached my config.
>> Unfortunately I don't have that System.map/vmlinux anymore; I've rebased my
>> kernel and rebuilt it. But here is what comes after
>> debug_check_no_locks_freed now:
>>
>> vmlinux:ffffffff815956a0 0000000000000467 T lockdep_count_backward_deps
>> vmlinux:ffffffff81595b10 00000000000005b6 T debug_check_no_locks_freed
>> vmlinux:ffffffff815960d0 0000000000003428 t __lock_acquire
>>
>>
>> But if you suspect a corruption, then I would suggest not spending
>> more time on this now, at least until I have more data.
>> This happened while I was stress testing KVM, and I've seen a low rate of
>> what looks like random memory corruption (e.g. corrupted slab
>> freelists and other unexplainable things). So let's write it off to
>> KVM for now.
>
> Ok.  But there are some interesting clues there, so if you see it again,
> please provide the System.map and .config and I'll take another look.

ack

> What really has me confused is why the worker_thread() stack seems so
> big.  According to the stack dumps, worker_thread() had a stack size of
> 1368 bytes, but when I build with your .config, it's only 208 bytes.


My compiler supports KASAN stack instrumentation (--param
asan-stack=1) and KASAN use-after-scope detection
(-fsanitize-address-use-after-scope). Both increase frame sizes. But you
need gcc 7.0 or so for that.
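
For reference, the 1368-byte figure can be recovered from the stack dumps quoted above: it is the distance between the stack slot holding the return address into kthread() and the slot holding the return address into worker_thread(). A minimal sketch of that arithmetic, assuming the usual x86-64 frame-pointer layout where the callee pushes the saved frame pointer directly below the caller's return address (frame_span and saved_fp_slot are hypothetical helper names, not kernel APIs):

```python
WORD = 8  # bytes per stack slot on x86-64


def frame_span(callee_ret_slot: int, caller_ret_slot: int) -> int:
    """Bytes between two return-address slots, i.e. the frame in between."""
    return caller_ret_slot - callee_ret_slot


def saved_fp_slot(caller_ret_slot: int) -> int:
    """Slot where the callee's prologue pushes the saved frame pointer."""
    return caller_ret_slot - WORD


# Slots taken from the first quoted dump:
#   ffff8800520ef9a8 holds the return address into worker_thread()
#   ffff8800520eff00 holds the return address into kthread()
worker_thread_span = frame_span(0xFFFF8800520EF9A8, 0xFFFF8800520EFF00)
print(worker_thread_span)

# Third quoted dump: the slot just below the kthread() return address
# is the one named in the WARNING line.
bad_slot = saved_fp_slot(0xFFFF880065F7FF00)
print(hex(bad_slot))
```

The second value matches the address in the warning (ffff880065f7fef8), which is consistent with the clobbered word being the frame pointer pushed by worker_thread().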


> Other interesting features of this corruption:
>
> - always after kthread() called worker_thread()
> - always the same value written, presumably the address of
>   __lock_acquire()
> - always to the same offset on the stack (the location of the frame
>   pointer pushed by worker_thread())
>
> ... so it's not exactly random memory corruption.  In fact it looks like
> "normal" stack corruption.  But then again, KASAN is enabled, which
> should catch most cases of stack corruption (buffer overflows).

Good catch. The other corruptions that I saw were more typical
-- e.g. a freelist link was NULL or completely random
garbage.
My current (weak) hypothesis is that the corruptions are KVM-related,
so there is a good chance that KASAN won't catch them (e.g. writes
from the guest corrupting the host kernel).
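
As a side note on the "presumably the address of __lock_acquire()" point: an address reported as function+0x5c0/0x5c0 lies one byte past the end of that function, so it resolves to the start of whatever symbol follows it. The sketch below illustrates this with the three symbol lines quoted earlier. Those lines come from the rebuilt kernel, so the absolute addresses differ from the crashing one; it is an assumption that the symbol order is the same in both builds. parse_symbols and resolve are hypothetical helpers written for this example.

```python
import bisect

# Symbol lines as quoted in this thread (rebuilt kernel, not the crashing one).
NM_LINES = """\
vmlinux:ffffffff815956a0 0000000000000467 T lockdep_count_backward_deps
vmlinux:ffffffff81595b10 00000000000005b6 T debug_check_no_locks_freed
vmlinux:ffffffff815960d0 0000000000003428 t __lock_acquire
"""


def parse_symbols(text):
    """Return a list of (start, size, name) tuples sorted by start address."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip anything that is not "addr size type name"
        start = int(fields[0].split(":")[-1], 16)  # drop the "vmlinux:" prefix
        size = int(fields[1], 16)
        symbols.append((start, size, fields[3]))
    return sorted(symbols)


def resolve(symbols, addr):
    """Map addr to (name, offset) using the closest symbol at or below it."""
    starts = [start for start, _, _ in symbols]
    idx = bisect.bisect_right(starts, addr) - 1
    if idx < 0:
        return None
    start, _, name = symbols[idx]
    return name, addr - start


symbols = parse_symbols(NM_LINES)
base = 0xFFFFFFFF81595B10  # debug_check_no_locks_freed in the rebuilt kernel

print(resolve(symbols, base + 0x10))   # still inside the function
print(resolve(symbols, base + 0x5C0))  # one past its padded end
```

With these addresses, base + 0x5c0 is ffffffff815960d0, which is exactly where __lock_acquire starts in the rebuilt kernel. The listed size is 0x5b6; the remaining bytes up to 0x5c0 are alignment padding before the next symbol.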