linux-kernel - Re: [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250630153933.3942464-1-joshua.hahnjy@gmail.com>
Date: Mon, 30 Jun 2025 08:39:33 -0700
From: Joshua Hahn <joshua.hahnjy@...il.com>
To: Joshua Hahn <joshua.hahnjy@...il.com>
Cc: Gregory Price <gourry@...rry.net>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Alistair Popple <apopple@...dia.com>,
	Byungchul Park <byungchul@...com>,
	David Hildenbrand <david@...hat.com>,
	Matthew Brost <matthew.brost@...el.com>,
	Rakie Kim <rakie.kim@...com>,
	Ying Huang <ying.huang@...ux.alibaba.com>,
	Zi Yan <ziy@...dia.com>,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	kernel-team@...a.com
Subject: Re: [PATCH 2/2] mm/mempolicy: Skip extra call to __alloc_pages_bulk in weighted interleave

On Fri, 27 Jun 2025 09:13:18 -0700 Joshua Hahn <joshua.hahnjy@...il.com> wrote:

> On Fri, 27 Jun 2025 00:28:33 -0400 Gregory Price <gourry@...rry.net> wrote:
> 
> Hi Gregory,

[...snip...]

> > I will say that, at least at the time, I took the core math and
> > validated the edge conditions in a separate program.
> > 
> > If you get it wrong in the kernel, you'd either fail to allocate - or more
> > likely just get the wrong distribution of pages.  The latter is
> > non-obvious unless you go looking for it, so it might be good to at
> > least add this test result in the change log.  It's hard to write this
> > in LTP or kernel selftest unfortunately.
> 
> I think so too : -(
> 
> One test that I wanted to do while developing this feature was to see if
> I could figure out how many pages are really allocated from each node. The
> difficulty in doing this (as you pointed out above) is that because there are
> other ways that move the round robin forward (without necessarily calling the
> bulk alloc function), it's hard to directly attribute the page allocations.
> 
> If this was the only place that we were modifying these values, then a
> correctness check would be equivalent to just adding a printk of each node
> and how many pages were allocated on it, then adding all the numbers up to
> see if it matches the weight ratios in sysfs.
> 
> So I think I will do what you did as well -- I think that performing some
> tests, at least on the edge cases in a separate program will help give
> some confidence that the code works as intended. I'll also continue to think
> about if there are better ways to be testing this instead.

Like you suggested, I decided to run a simulation just to see if the number
of nodes allocated from each page lined up with the old version, and if the
numbers made sense for both cases. I found a few issues with my version of the
code:

The math is just incorrect when rounds == 0. Let's say there's a 2-node machine
with weights [10, 7]. We should start allocating from node0 with 7 remaining
pages, and we want to allocate 14 pages. Here's how the new math goes:

- First node should allocate 7 pages, let carryover = 7
- Then remaining pages = 14 - 7 = 7
- Allocate rounds * weight + min(weight, delta) + carryover:
    = 0 * 10 + min(10, 7) + 7
    = 7 + 7
    = 14

This is incorrect, since we will be allocating all 14 pages from the first
node, and the real distribution should be 7 pages from the first node, and
7 pages from the second node. I think this can also lead to some overallocation
issues.

So there are a few options now:
- Change the addition to be:
        rounds * weight + min(min(weight, delta + carryover), weight)
  and adjust the remaining math accordingly. But this looks very bad and
  is not intuitive at all, so I don't like this idea. 
- This can easily be solved, if instead of figuring out how many pages have to
  be allocated as we iterate through the nodes, we do one pass and figure out
  how many pages must be allocated. The problem here is that we re-introduce
  another loop, which adds to the code complexity and may actually be a
  performance decrease instead. So again, I don't think this is the best idea.
- Skip this patch! This is a small optimization on performance, and I think it
  simplifies the code, but turns out the original math to do is just much harder
  without doing two separate calculations. I'll keep this running in the back
  of my mind to see if I can figure out a way to solve it later...

I also think it makes sense to drop the first patch as well, since there's no
optimization included with the cleanup. But since it already got a few reviews,
maybe it makes to keep that one : -)

Anyways, I really appreciate the review Gregory! Maybe I'll have a brekathrough
idea on how to correctly do the math here sometime in the future.

Have a great day!
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)