linux-kernel - Re: [PATCH] mm: page_alloc: consume available CMA space first

Message-ID: <167455a3-3a9f-6064-4063-5b74231141f9@suse.cz>
Date:   Thu, 27 Jul 2023 10:33:20 +0200
From:   Vlastimil Babka <vbabka@...e.cz>
To:     Roman Gushchin <roman.gushchin@...ux.dev>,
        Johannes Weiner <hannes@...xchg.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Rik van Riel <riel@...riel.com>,
        Joonsoo Kim <js1304@...il.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: page_alloc: consume available CMA space first

On 7/27/23 01:38, Roman Gushchin wrote:
> On Wed, Jul 26, 2023 at 10:53:04AM -0400, Johannes Weiner wrote:
>> On a memcache setup with heavy anon usage and no swap, we routinely
>> see premature OOM kills with multiple gigabytes of free space left:
>> 
>>     Node 0 Normal free:4978632kB [...] free_cma:4893276kB
>> 
>> This free space turns out to be CMA. We set CMA regions aside for
>> potential hugetlb users on all of our machines, figuring that even if
>> there aren't any, the memory is available to userspace allocations.
>> 
>> When the OOMs trigger, it's from unmovable and reclaimable allocations
>> that aren't allowed to dip into CMA. The non-CMA regions meanwhile are
>> dominated by the anon pages.
>> 
>> Because we have more options for CMA pages, change the policy to
>> always fill up CMA first. This reduces the risk of premature OOMs.
> 
> I suspect it might cause regressions on small(er) devices where
> a relatively small CMA area (a few MB) is often reserved for use by various
> device drivers, which can't handle allocation failures well (even transient
> ones). Startup time can regress too: migrating pages out of CMA takes time.

Agreed, we should be more careful here.
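
For scale, the zone line quoted above can be decomposed with a few lines of
Python. This is only a sketch: the helper name parse_kb_fields is made up for
illustration, and it assumes that "free" already includes "free_cma", so the
difference is what remains for allocations that cannot use CMA.

```python
import re


def parse_kb_fields(report_line):
    """Return {name: kB} for every `name:<number>kB` field in a zone line."""
    return {name: int(kb) for name, kb in re.findall(r"(\w+):(\d+)kB", report_line)}


# The line from the OOM report quoted in this thread ("[...]" kept as-is).
line = "Node 0 Normal free:4978632kB [...] free_cma:4893276kB"
fields = parse_kb_fields(line)

# Assumption: "free" counts CMA pages too, so subtract them out.
non_cma_free_kb = fields["free"] - fields["free_cma"]
cma_share = fields["free_cma"] / fields["free"]

print(f"free outside CMA: {non_cma_free_kb} kB")
print(f"CMA share of free memory: {cma_share:.1%}")
```

Under that assumption only 85356 kB of the free memory lies outside CMA, which
is consistent with unmovable and reclaimable allocations failing while the zone
as a whole still looks mostly free.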

> And given how slowly kernels get upgraded on such devices, we won't learn
> about it for the next couple of years.
> 
>> Movable pages can be migrated out of CMA when necessary, but we don't
>> have a mechanism to migrate them *into* CMA to make room for unmovable
>> allocations. The only recourse we have for these pages is reclaim,
>> which due to a lack of swap is unavailable in our case.
> 
> Idk, should we introduce such a mechanism? Or use some alternative heuristic
> that is a better compromise between those who need CMA allocations to always
> succeed and those who use large CMA areas for opportunistic huge page
> allocations? Of course, we can add a boot flag/sysctl/per-CMA-area flag, but
> I doubt we really want this.

At some point the solution was supposed to be ZONE_MOVABLE:
https://lore.kernel.org/linux-mm/1512114786-5085-1-git-send-email-iamjoonsoo.kim@lge.com/

But it was reverted, IIRC due to some bugs and Joonsoo going MIA.
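
To make the trade-off Roman raises above a bit more concrete, here is a toy
userspace model of the placement decision for a movable allocation. Everything
in it is hypothetical: the function prefers_cma, the policy names and the page
counts are made up for illustration, and none of it is an existing kernel
interface or the code from the patch. "cma_first" stands for the proposed
policy, "balanced" for a threshold rule that prefers CMA only while it holds
more than half of the zone's free pages, and "cma_last" for treating CMA as a
last resort.

```python
POLICIES = ("cma_first", "balanced", "cma_last")  # hypothetical knob values


def prefers_cma(free_pages, free_cma_pages, policy):
    """Return True if a movable allocation should be tried in CMA first.

    free_pages is assumed to include free_cma_pages.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if free_cma_pages == 0:
        return False  # nothing to take from CMA
    if policy == "cma_first":
        return True  # always consume CMA space first
    if policy == "balanced":
        return free_cma_pages > free_pages // 2
    return False  # "cma_last": never prefer CMA


# Made-up numbers: a small device with a small CMA area, and a large
# machine whose free memory is almost entirely CMA.
scenarios = {
    "small device": (100_000, 4_096),
    "large machine": (1_244_658, 1_223_319),
}
for name, (free_pages, free_cma_pages) in scenarios.items():
    for policy in POLICIES:
        choice = prefers_cma(free_pages, free_cma_pages, policy)
        print(f"{name:13s} {policy:9s} prefers CMA: {choice}")
```

In this model the small device only has its CMA area filled with movable pages
under "cma_first", which is the case where later driver allocations would have
to wait for migration.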

> Thanks!