linux-kernel - Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20201104084021.GB15700@shbuild999.sh.intel.com>
Date:   Wed, 4 Nov 2020 16:40:21 +0800
From:   Feng Tang <feng.tang@...el.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Matthew Wilcox <willy@...radead.org>,
        Mel Gorman <mgorman@...e.de>, dave.hansen@...el.com,
        ying.huang@...el.com, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable
 zone only node

On Wed, Nov 04, 2020 at 08:58:19AM +0100, Michal Hocko wrote:
> On Wed 04-11-20 15:38:26, Feng Tang wrote:
> [...]
> > > Could you be more specific about the usecase here? Why do you need a
> > > binding to a pure movable node? 
> > 
> > One common configuration for a platform is small size of DRAM plus huge
> > size of PMEM (which is slower but cheaper), and my guess of their use
> > is to try to lead the bulk of user space allocation (GFP_HIGHUSER_MOVABLE)
> > to PMEM node, and only let DRAM be used as less as possible. 
> 
> While this is possible, it is a tricky configuration. It is essentially 
> get us back to 32b and highmem...

:) Another possible case is similar binding on a memory hotplugable
platform, which has one unplugable node and several other nodes configured
as movable only to be hot removable when needed

> As I've said in reply to your second patch. I think we can make the oom
> killer behavior more sensible in this misconfigured cases but I do not
> think we want break the cpuset isolation for such a configuration.

Do you mean we skip the killing and just let the allocation fail? We've
checked the oom killer code first, when the oom happens, both DRAM
node and unmovable node have lots of free memory, and killing process
won't improve the situation.

(Folloing is copied from your comments for 2/2) 
> This allows to spill memory allocations over to any other node which
> has Normal (or other lower) zones and as such it breaks cpuset isolation.
> As I've pointed out in the reply to your cover letter it seems that
> this is more of a misconfiguration than a bug.

For the usage case (docker container running), the spilling is already
happening, I traced its memory allocation requests, many of them are
movable, and got fallback to the normal node naturally with current
code, only a few got blocked, as many of __alloc_pages_nodemask are
called witih 'NULL' nodemask parameter.

And I made this RFC patch inspired by code in __alloc_pages_may_oom():

	if (gfp_mask & __GFP_NOFAIL)
		page = __alloc_pages_cpuset_fallback(gfp_mask, order,
				ALLOC_NO_WATERMARKS, ac);

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs