linux-kernel - Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node


Message-ID: <4029c079-b1f3-f290-26b6-a819c52f5200@suse.cz>
Date:   Thu, 5 Nov 2020 13:53:24 +0100
From:   Vlastimil Babka <vbabka@...e.cz>
To:     Michal Hocko <mhocko@...e.com>, Feng Tang <feng.tang@...el.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Matthew Wilcox <willy@...radead.org>,
        Mel Gorman <mgorman@...e.de>, dave.hansen@...el.com,
        ying.huang@...el.com, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable
 zone only node

On 11/5/20 1:08 PM, Michal Hocko wrote:
> On Thu 05-11-20 09:40:28, Feng Tang wrote:
>> > 
>> > Could you be more specific? This sounds like a bug. Allocations
>> > shouldn't spill over to a node which is not in the cpuset. There are a few
>> > exceptions like IRQ context but that shouldn't happen regularly.
>> 
>> I mean when the docker starts, it will spawn many processes which obey
>> the mem binding set, and they have some kernel page requests, which got
>> successfully allocated, like the following callstack:
>> 
>> 	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G        W I       5.9.0-rc8+ #6
>> 	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
>> 	[  567.044958] Call Trace:
>> 	[  567.044972]  dump_stack+0x74/0x9a
>> 	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
>> 	[  567.044986]  alloc_pages_current+0x87/0xe0
>> 	[  567.044991]  allocate_slab+0x2e5/0x4f0
>> 	[  567.044996]  ___slab_alloc+0x380/0x5d0
>> 	[  567.045021]  __slab_alloc+0x20/0x40
>> 	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
>> 	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
>> 	[  567.045041]  alloc_inode+0x22/0xa0
>> 	[  567.045045]  new_inode_pseudo+0x12/0x60
>> 	[  567.045049]  new_inode+0x17/0x30
>> 	[  567.045052]  mqueue_get_inode+0x45/0x3b0
>> 	[  567.045060]  mqueue_fill_super+0x41/0x70
>> 	[  567.045067]  vfs_get_super+0x7f/0x100
>> 	[  567.045074]  get_tree_keyed+0x1d/0x20
>> 	[  567.045080]  mqueue_get_tree+0x1c/0x20
>> 	[  567.045086]  vfs_get_tree+0x2a/0xc0
>> 	[  567.045092]  fc_mount+0x13/0x50
>> 	[  567.045099]  mq_create_mount+0x92/0xe0
>> 	[  567.045102]  mq_init_ns+0x3b/0x50
>> 	[  567.045106]  copy_ipcs+0x10a/0x1b0
>> 	[  567.045113]  create_new_namespaces+0xa6/0x2b0
>> 	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
>> 	[  567.045124]  ksys_unshare+0x19f/0x360
>> 	[  567.045129]  __x64_sys_unshare+0x12/0x20
>> 	[  567.045135]  do_syscall_64+0x38/0x90
>> 	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> 
>> For it, __alloc_pages_nodemask() will first try the process's target
>> nodemask (unmovable node here), and there is no available zone, so it
>> goes with the NULL nodemask, and gets a page in the slowpath.
> 
> OK, I see your point now. I was not aware of the slab allocator not
> following cpusets. Sounds like a bug to me.

SLAB and SLUB seem to not care about cpusets in the fast path. But this stack 
shows that it went all the way to the page allocator, so the cpusets should have 
been obeyed there at least.
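
To make the fallback Feng describes concrete, here is a toy user-space model, not kernel code. All names (toy_zone, toy_pick_zone, toy_alloc_node) are made up for illustration, and the zone walk is heavily simplified. It assumes only what is stated above: the first attempt uses the task's bound nodemask, and if no zone is eligible the attempt is repeated with no nodemask restriction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy zone: which node it is on, whether it is
 * movable-only, and how many pages are free. */
struct toy_zone {
	int node;
	bool movable_only;
	unsigned long free_pages;
};

/* Return the index of the first usable zone, or -1 if there is none.
 * nodemask is a bitmask of allowed nodes; 0 stands in for a NULL
 * nodemask, i.e. no restriction. */
static int toy_pick_zone(const struct toy_zone *zones, size_t nzones,
			 unsigned long nodemask, bool movable_request)
{
	for (size_t i = 0; i < nzones; i++) {
		assert(zones[i].node >= 0);
		if (nodemask && !(nodemask & (1UL << zones[i].node)))
			continue;	/* node is outside the binding */
		if (zones[i].movable_only && !movable_request)
			continue;	/* unmovable request, movable-only zone */
		if (zones[i].free_pages == 0)
			continue;
		return (int)i;
	}
	return -1;
}

/* Model of the reported behaviour: try the bound nodemask first, then
 * retry unrestricted. Returns the node that served the request, or -1. */
static int toy_alloc_node(const struct toy_zone *zones, size_t nzones,
			  unsigned long bound_mask, bool movable_request)
{
	int idx = toy_pick_zone(zones, nzones, bound_mask, movable_request);

	if (idx < 0)
		idx = toy_pick_zone(zones, nzones, 0, movable_request);
	return idx < 0 ? -1 : zones[idx].node;
}
```

With node 0 holding a normal zone and node 1 holding only a movable zone, an unmovable request from a task bound to node 1 is served from node 0 in this model, which is the spill-over being discussed. A movable request with the same binding stays on node 1.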