linux-kernel - Re: INFO: rcu detected stall in shmem


Date:   Wed, 10 Oct 2018 11:33:14 +0200
From:   Dmitry Vyukov <dvyukov@...gle.com>
To:     Michal Hocko <mhocko@...nel.org>
Cc:     David Rientjes <rientjes@...gle.com>,
        Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
        syzbot <syzbot+77e6b28a7a7106ad0def@...kaller.appspotmail.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Andrew Morton <akpm@...ux-foundation.org>, guro@...com,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>,
        syzkaller-bugs <syzkaller-bugs@...glegroups.com>,
        Yang Shi <yang.s@...baba-inc.com>
Subject: Re: INFO: rcu detected stall in shmem_fault

On Wed, Oct 10, 2018 at 11:13 AM, Michal Hocko <mhocko@...nel.org> wrote:
> On Wed 10-10-18 09:55:57, Dmitry Vyukov wrote:
>> On Wed, Oct 10, 2018 at 6:11 AM, 'David Rientjes' via syzkaller-bugs
>> <syzkaller-bugs@...glegroups.com> wrote:
>> > On Wed, 10 Oct 2018, Tetsuo Handa wrote:
>> >
>> >> syzbot is hitting RCU stall due to memcg-OOM event.
>> >> https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
>> >>
>> >> What should we do if memcg-OOM found no killable task because the allocating task
>> >> was oom_score_adj == -1000 ? Flooding printk() until RCU stall watchdog fires
>> >> (which seems to be caused by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper
>> >> OOM header when no eligible victim left") because syzbot was terminating the test
>> >> upon WARN(1) removed by that commit) is not a good behavior.
>>
>>
>> You want to say that most of the recent hangs and stalls are actually
>> caused by our attempt to sandbox test processes with memory cgroup?
>> The process with oom_score_adj == -1000 is not supposed to consume any
>> significant memory; we have another (test) process with oom_score_adj
>> == 0 that's actually consuming memory.
>> But should we refrain from using -1000? Perhaps it would be better to
>> use -500/500 for control/test process, or -999/1000?
>
> oom disable on a task (especially when this is the only task in the
> memcg) is tricky. Look at the memcg report
> [  935.562389] Memory limit reached of cgroup /syz0
> [  935.567398] memory: usage 204808kB, limit 204800kB, failcnt 6081
> [  935.573768] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> [  935.580650] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
> [  935.586923] Memory cgroup stats for /syz0: cache:152KB rss:176336KB rss_huge:163840KB shmem:344KB mapped_file:264KB dirty:0KB writeback:0KB swap:0KB inactive_anon:260KB active_anon:176448KB inactive_file:4KB active_file:0KB
>
> There is still somebody holding anonymous (THP) memory. If there is no
> other eligible oom victim then it must be some of the oom disabled ones.
> You have suppressed the task list information so we do not know who that
> might be though.
>
> So it looks like there is some misconfiguration or a bug in the oom
> victim selection.


I'm afraid KASAN can interfere with memory accounting/OOM killing too.
KASAN quarantines up to 1/32 of physical memory (in our case
7.5GB/32 = 230MB) that has already been freed by the task but, as far as I
understand, is still accounted against the memcg. So maybe making the cgroup
limit >> quarantine size will help to resolve this too.
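
To make the arithmetic concrete, here is a rough sketch comparing the quarantine cap with the memcg limit from the quoted report (204800kB). The 1/32 fraction and the 7.5GB machine size are the figures given above; the helper names are made up for illustration, and the comparison assumes the worst case where every quarantined object is charged to this one cgroup, which may well not hold in practice.

```python
# Worst-case comparison of the KASAN quarantine cap against a memcg limit.
# Assumption: all quarantined memory is charged to a single cgroup.

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

QUARANTINE_DIVISOR = 32  # quarantine is capped at 1/32 of physical memory


def quarantine_cap_bytes(phys_mem_bytes: int) -> int:
    """Upper bound on quarantined memory for a machine of the given size."""
    return phys_mem_bytes // QUARANTINE_DIVISOR


def limit_to_cap_ratio(limit_bytes: int, phys_mem_bytes: int) -> float:
    """memcg limit divided by quarantine cap; <= 1 means the quarantine
    alone could, in the worst case, fill the whole cgroup limit."""
    return limit_bytes / quarantine_cap_bytes(phys_mem_bytes)


phys = int(7.5 * GIB)   # machine size from the message above
limit = 204800 * KIB    # "limit 204800kB" from the quoted memcg report
cap = quarantine_cap_bytes(phys)

print(f"quarantine cap: {cap // MIB} MiB")
print(f"memcg limit:    {limit // MIB} MiB")
print(f"limit / cap:    {limit_to_cap_ratio(limit, phys):.2f}")
```

With binary units the cap comes out somewhat above the 230MB quoted above, and in either case it is larger than the 200MB limit, which is why a limit well above the quarantine size looks safer.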

But of course there can be a plain memory leak too.
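
Coming back to the -999/1000 question above, this is a minimal sketch of how the control and test processes could be given different scores through /proc/<pid>/oom_score_adj. The function name and the proc_root parameter are hypothetical (the latter only exists so the helper can be exercised against a scratch directory); this is not how syzkaller actually sets things up.

```python
import os

OOM_SCORE_ADJ_MIN = -1000  # task is exempt from OOM killing
OOM_SCORE_ADJ_MAX = 1000   # task is the most preferred OOM victim


def set_oom_score_adj(pid: int, value: int, proc_root: str = "/proc") -> None:
    """Write an OOM score adjustment for the given pid.

    Lowering the value below its current setting normally requires
    CAP_SYS_RESOURCE, so this may fail with PermissionError.
    """
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj out of range: {value}")
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as f:
        f.write(str(value))


# Hypothetical usage: keep the control process hard to kill but still
# eligible, and make the test process the obvious victim.
#   set_oom_score_adj(control_pid, -999)
#   set_oom_score_adj(test_pid, 1000)
```

With -999 instead of -1000 the control process stays killable as a last resort, so the memcg OOM killer should always find at least one eligible task.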