Message-ID: <b594795a-7f26-4c0d-80f4-88d242fdc0fb@suse.cz>
Date: Thu, 24 Oct 2024 10:16:12 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Yu Zhao <yuzhao@...gle.com>, Mel Gorman <mgorman@...hsingularity.net>
Cc: Michal Hocko <mhocko@...e.com>, Andrew Morton
<akpm@...ux-foundation.org>, David Rientjes <rientjes@...gle.com>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, Link Lin
<linkl@...gle.com>, Matt Fleming <mfleming@...udflare.com>
Subject: Re: [PATCH mm-unstable v1] mm/page_alloc: try not to overestimate
free highatomic
On 10/24/24 06:35, Yu Zhao wrote:
> On Wed, Oct 23, 2024 at 1:35 AM Vlastimil Babka <vbabka@...e.cz> wrote:
>>
>> On 10/23/24 08:36, Yu Zhao wrote:
>> > On Tue, Oct 22, 2024 at 4:53 AM Vlastimil Babka <vbabka@...e.cz> wrote:
>> >>
>> >> +Cc Mel and Matt
>> >>
>> >> On 10/21/24 19:25, Michal Hocko wrote:
>> >>
>> >> Hm I don't think it's completely WAI. The intention is that we should be
>> >> able to unreserve the highatomic pageblocks before going OOM, and there
>> >> seems to be an unintended corner case that if the pageblocks are fully
>> >> exhausted, they are not reachable for unreserving.
>> >
>> > I still think unreserving should only apply to highatomic PBs that
>> > contain free pages. Otherwise, it seems to me that it'd be
>> > self-defeating because:
>> > 1. Unreserving fully used highatomic PBs can't fulfill the alloc
>> > demand immediately.
>>
>> I thought the alloc demand is only blocked on the pessimistic watermark
>> calculation. Usable free pages exist, but the allocation is not allowed to
>> use them.
>
> I think we are talking about two different problems here:
> 1. The estimation problem.
> 2. The unreserving policy problem.
>
> What you said here is correct w.r.t. the first problem, and I was
> talking about the second problem.
OK, but the current unreserving behavior makes the estimation problem
worse and unfixable.
>> > 2. More importantly, it only takes one alloc failure in
>> > __alloc_pages_direct_reclaim() to reset nr_reserved_highatomic to 2MB,
>> > from as high as 1% of a zone (in this case 1GB). IOW, it makes more
>> > sense to me that highatomic only unreserves what it doesn't fully use
>> > each time unreserve_highatomic_pageblock() is called, not everything
>> > it got (except the last PB).
>>
>> But if the highatomic pageblocks are already full, we are not really
>> removing any actual highatomic reserves just by changing the migratetype and
>> decreasing nr_reserved_highatomic?
>
> If we change the MT, they can be fragmented a lot faster, i.e., from
> the next near OOM condition to upon becoming free. Trying to persist
> over time is what actually makes those PBs more fragmentation
> resistant.
If we assume the allocations there have similar sizes and lifetimes, then I
guess yeah.
>> In fact that would allow the reserves
>> grow with some actual free pages in the future.
>
> Good point. I think I can explain it better along this line.
>
> If highatomic is under the limit, both your proposal and the current
> implementation would try to grow, making not much difference. However,
> the current implementation can also reuse previously full PBs when
> they become available. So there is a clear winner here: the current
> implementation.
I'd say it depends on the user of the highatomic blocks (the workload),
which way ends up better.
> If highatomic has reached the limit, with your proposal, the growth
> can only happen after unreserve, and unreserve only happens under
> memory pressure. This means it's likely that it tries to grow under
> memory pressure, which is more difficult than the condition where
> there is plenty of memory. For the current implementation, it doesn't
> try to grow, rather, it keeps what it already has, betting those full
> PBs becoming available for reuse. So I don't see a clear winner
> between trying to grow under memory pressure and betting on becoming
> available for reuse.
Understood. But also note there are many conditions where the current
implementation and my proposal behave the same. If highatomic pageblocks
become full and then only one or a few pages from each are freed, it suddenly
becomes possible to unreserve them under memory pressure, and those
pageblocks are no longer available for reuse by highatomic allocations. This
very different outcome hinges only on whether a single page is free for the
unreserve to work, yet in terms of the pageblock reuse efficiency you
describe above, a single page makes only a minor difference. My proposal
would at least remove the sudden change of behavior when going from a single
free page to no free page.
>> Hm that assumes we're adding some checks in free fastpath, and for that to
>> work also that there will be a freed page in highatomic PC in near enough
>> future from the decision we need to unreserve something. Which is not so
>> much different from the current assumption we'll find such a free page
>> already in the free list immediately.
>>
>> > To summarize, I think this is an estimation problem, which I would
>> > categorize as a lesser problem than accounting problems. But it sounds
>> > to me that you think it's a policy problem, i.e., the highatomic
>> > unreserving policy is wrong or not properly implemented?
>>
>> Yeah I'd say not properly implemented, but that sounds like a mechanism, not
>> policy problem to me :)
>
> What about adding a new counter to keep track of the size of free
> pages reserved for highatomic?
That's doable but not trivial. It means handling the highatomic pageblocks
much more carefully, like we do with CMA pageblocks and the
NR_FREE_CMA_PAGES counter; otherwise we risk the counter drifting
unrecoverably.
> Mel?