linux-kernel - Re: [RFC PATCH] mm: kasan: suppress soft lockup in slub when !CONFIG

Message-ID: <3fabaa44-4767-bfcf-bf86-f1fce573d5e1@alibaba-inc.com>
Date:   Tue, 12 Dec 2017 02:00:10 +0800
From:   "Yang Shi" <yang.s@...baba-inc.com>
To:     Andrey Ryabinin <aryabinin@...tuozzo.com>,
        Dmitry Vyukov <dvyukov@...gle.com>,
        Matthew Wilcox <willy@...radead.org>
Cc:     Alexander Potapenko <glider@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux-MM <linux-mm@...ck.org>,
        kasan-dev <kasan-dev@...glegroups.com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH] mm: kasan: suppress soft lockup in slub when
 !CONFIG_PREEMPT



On 12/8/17 1:16 AM, Andrey Ryabinin wrote:
> On 12/08/2017 11:26 AM, Dmitry Vyukov wrote:
>> On Fri, Dec 8, 2017 at 12:40 AM, Matthew Wilcox <willy@...radead.org> wrote:
>>> On Fri, Dec 08, 2017 at 07:30:07AM +0800, Yang Shi wrote:
>>>> When running stress test with KASAN enabled, the below softlockup may
>>>> happen occasionally:
>>>>
>>>> NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s!
>>>> hardirqs last  enabled at (0): [<          (null)>]      (null)
>>>> hardirqs last disabled at (0): [] copy_process.part.30+0x5c6/0x1f50
>>>> softirqs last  enabled at (0): [] copy_process.part.30+0x5c6/0x1f50
>>>> softirqs last disabled at (0): [<          (null)>]      (null)
>>>
>>>> Call Trace:
>>>>   [] __slab_free+0x19c/0x270
>>>>   [] ___cache_free+0xa6/0xb0
>>>>   [] qlist_free_all+0x47/0x80
>>>>   [] quarantine_reduce+0x159/0x190
>>>>   [] kasan_kmalloc+0xaf/0xc0
>>>>   [] kasan_slab_alloc+0x12/0x20
>>>>   [] kmem_cache_alloc+0xfa/0x360
>>>>   [] ? getname_flags+0x4f/0x1f0
>>>>   [] getname_flags+0x4f/0x1f0
>>>>   [] getname+0x12/0x20
>>>>   [] do_sys_open+0xf9/0x210
>>>>   [] SyS_open+0x1e/0x20
>>>>   [] entry_SYSCALL_64_fastpath+0x1f/0xc2
>>>
>>> This feels like papering over a problem.  KASAN only calls
>>> quarantine_reduce() when it's allowed to block.  Presumably it has
>>> millions of entries on the free list at this point.  I think the right
>>> thing to do is for qlist_free_all() to call cond_resched() after freeing
>>> every N items.
>>
>>
>> Agree. Adding touch_softlockup_watchdog() to a random low-level
>> function looks like a wrong thing to do.
>> quarantine_reduce() already has this logic. Look at
>> QUARANTINE_BATCHES. It's meant to do exactly this -- limit amount of
>> work in quarantine_reduce() and in quarantine_remove_cache() to
>> reasonably-sized batches. We could simply increase number of batches
>> to make them smaller. But it would be good to understand what exactly
>> happens in this case. Batches should be on the order of ~1MB. Why does
>> freeing 1MB worth of objects (smallest of which is 32b) take 22 seconds?
>>
> 
> I think the problem here is that kernel 4.9.44-003.ali3000.alios7.x86_64.debug
> doesn't have 64abdcb24351 ("kasan: eliminate long stalls during quarantine reduction").
> 
> We probably should ask that commit to be included in stable, but it would be good to hear
> a confirmation from Yang that it really helps.

Thanks, folks. Yes, my kernel doesn't have this commit. It sounds like
the commit splits the quarantine into smaller batches. I will run some
tests with this commit applied to see whether it helps. From reading the
code, it is likely to help.

Yang
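
For reference, Matthew's suggestion quoted above (have qlist_free_all()
call cond_resched() after freeing every N items) can be sketched in
userspace C. This is only an illustration under stated assumptions, not
the kernel code and not a description of what commit 64abdcb24351 does:
struct qnode, qlist_build(), qlist_free_all_batched() and FREE_BATCH are
hypothetical names, and sched_yield() stands in for cond_resched().

```c
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical batch size: yield once per this many freed nodes. */
#define FREE_BATCH 1024

/* Hypothetical stand-in for a quarantine list entry. */
struct qnode {
	struct qnode *next;
};

/*
 * Build a singly linked list of n nodes.
 * Returns NULL if n is 0 or if an allocation fails.
 */
static struct qnode *qlist_build(size_t n)
{
	struct qnode *head = NULL;

	for (size_t i = 0; i < n; i++) {
		struct qnode *node = malloc(sizeof(*node));

		if (!node) {
			/* Undo the partial list before reporting failure. */
			while (head) {
				struct qnode *next = head->next;

				free(head);
				head = next;
			}
			return NULL;
		}
		node->next = head;
		head = node;
	}
	return head;
}

/*
 * Free every node on the list, giving up the CPU after each
 * FREE_BATCH nodes. Returns the number of yields performed and
 * stores the number of freed nodes in *freed_out (if non-NULL).
 */
static size_t qlist_free_all_batched(struct qnode *head, size_t *freed_out)
{
	size_t freed = 0;
	size_t yields = 0;

	while (head) {
		struct qnode *next = head->next;

		free(head);
		head = next;

		if (++freed % FREE_BATCH == 0) {
			/* Userspace stand-in for cond_resched(). */
			sched_yield();
			yields++;
		}
	}

	if (freed_out)
		*freed_out = freed;
	return yields;
}
```

With FREE_BATCH set to 1024, freeing a list of N nodes yields N / 1024
times (integer division), so the work done between two scheduling points
is bounded by the batch size rather than by the length of the list.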
