Message-ID: <428ec528-5471-4152-bcb6-d36dd32c3311@suse.cz>
Date: Tue, 14 Oct 2025 11:38:00 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Joshua Hahn <joshua.hahnjy@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: Chris Mason <clm@...com>, Kiryl Shutsemau <kirill@...temov.name>,
Brendan Jackman <jackmanb@...gle.com>, Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...e.com>, Suren Baghdasaryan <surenb@...gle.com>,
Zi Yan <ziy@...dia.com>, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
kernel-team@...a.com
Subject: Re: [PATCH v4 3/3] mm/page_alloc: Batch page freeing in
free_frozen_page_commit
On 10/13/25 21:08, Joshua Hahn wrote:
> Before returning, free_frozen_page_commit calls free_pcppages_bulk using
> nr_pcp_free to determine how many pages can appropriately be freed,
> based on the tunable parameters stored in pcp. While this number is an
> accurate representation of how many pages should be freed in total, it
> is not an appropriate number of pages to free at once using
> free_pcppages_bulk, since we have seen the value consistently go above
> 2000 in the Meta fleet on larger machines.
>
> As such, perform batched page freeing via free_pcppages_bulk, using the
> pcp->batch member as the batch size. To ensure that other processes are not
> starved of the zone lock, release both the zone lock and the pcp lock
> between batches to yield to other threads.
>
> Note that because free_frozen_page_commit now drops and re-takes the pcp
> spinlock inside the function (and the trylock can fail), the function may
> now return with an unlocked pcp. To handle this, return true if the pcp is
> still locked on exit and false otherwise.
>
> In addition, since free_frozen_page_commit must now be aware of which UP
> flags were stored at the time the pcp lock was taken, and because we must be
> able to report new UP flags to the callers, add a new unsigned long *
> parameter, UP_flags, to keep track of this.
>
> The following are a few synthetic benchmarks, made on three machines. The
> first is a large machine with 754GiB memory and 316 processors.
> The second is a relatively smaller machine with 251GiB memory and 176
> processors. The third and final is the smallest of the three, which has 62GiB
> memory and 36 processors.
>
> On all machines, I kick off a kernel build with -j$(nproc).
> A negative delta is better (faster compilation).
>
> Large machine (754GiB memory, 316 processors)
> make -j$(nproc)
> +------------+---------------+-----------+
> | Metric (s) | Variation (%) | Delta(%)  |
> +------------+---------------+-----------+
> | real       |        0.8070 |   -1.4865 |
> | user       |        0.2823 |   +0.4081 |
> | sys        |        5.0267 |  -11.8737 |
> +------------+---------------+-----------+
>
> Medium machine (251GiB memory, 176 processors)
> make -j$(nproc)
> +------------+---------------+----------+
> | Metric (s) | Variation (%) | Delta(%) |
> +------------+---------------+----------+
> | real       |        0.2806 |  +0.0351 |
> | user       |        0.0994 |  +0.3170 |
> | sys        |        0.6229 |  -0.6277 |
> +------------+---------------+----------+
>
> Small machine (62GiB memory, 36 processors)
> make -j$(nproc)
> +------------+---------------+----------+
> | Metric (s) | Variation (%) | Delta(%) |
> +------------+---------------+----------+
> | real       |        0.1503 |  -2.6585 |
> | user       |        0.0431 |  -2.2984 |
> | sys        |        0.1870 |  -3.2013 |
> +------------+---------------+----------+
>
> Here, variation is the coefficient of variation, i.e. standard deviation / mean.
>
> Suggested-by: Chris Mason <clm@...com>
> Co-developed-by: Johannes Weiner <hannes@...xchg.org>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@...il.com>
> ---
> mm/page_alloc.c | 66 ++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 57 insertions(+), 9 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8ecd48be8bdd..e85770dd54bd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2818,12 +2818,22 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
> return high;
> }
>
> -static void free_frozen_page_commit(struct zone *zone,
> +/*
> + * Tune pcp alloc factor and adjust count & free_count. Free pages to bring the
> + * pcp's watermarks below high.
> + *
> + * May return a freed pcp, if during page freeing the pcp spinlock cannot be
> + * reacquired. Return true if pcp is locked, false otherwise.
> + */
> +static bool free_frozen_page_commit(struct zone *zone,
> struct per_cpu_pages *pcp, struct page *page, int migratetype,
> - unsigned int order, fpi_t fpi_flags)
> + unsigned int order, fpi_t fpi_flags, unsigned long *UP_flags)
> {
> int high, batch;
> + int to_free, to_free_batched;
> int pindex;
> + int cpu = smp_processor_id();
> + int ret = true;
> bool free_high = false;
>
> /*
> @@ -2861,15 +2871,47 @@ static void free_frozen_page_commit(struct zone *zone,
> * Do not attempt to take a zone lock. Let pcp->count get
> * over high mark temporarily.
> */
> - return;
> + return true;
> }
>
> high = nr_pcp_high(pcp, zone, batch, free_high);
> if (pcp->count < high)
> - return;
> + return true;
> +
> + to_free = nr_pcp_free(pcp, batch, high, free_high);
> + if (to_free == 0)
> + return true;
I think this is an unnecessary shortcut. The while() condition already covers
this case, and a to_free of 0 is likely rare enough that we don't gain anything
(if the goal was to skip the ZONE_BELOW_HIGH check below).
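Something like this perhaps (untested sketch, reusing the names from your
patch; only the early return is gone, the ZONE_BELOW_HIGH handling after the
loop still runs, and the loop condition itself is a separate point, see
below):

	high = nr_pcp_high(pcp, zone, batch, free_high);
	if (pcp->count < high)
		return true;

	to_free = nr_pcp_free(pcp, batch, high, free_high);

	/* a to_free of 0 simply means the loop body never runs */
	while (to_free > 0 && pcp->count >= high) {
		...
	}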
> +
> + while (to_free > 0 && pcp->count >= high) {
The "&& pcp->count >= high" is AFAICS still changing how much we free
compared to before the patch. I.e. we might terminate as soon as freeing
"to_free_batched" in some iteration gets us below "high", while previously
we would free the whole "to_free" and get way further below the "high".
It should be changed to "&& pcp->count > 0" intended only to prevent useless
iterations that decrement to_free by to_free_batched while
free_pcppages_bulk() does nothing.
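To make sure we mean the same thing, an untested sketch with only the loop
condition changed from v4 (the batching and the unlock/relock dance stay as
in your patch):

	/* keep freeing until the whole to_free is gone, as before the patch */
	while (to_free > 0 && pcp->count > 0) {
		to_free_batched = min(to_free, batch);
		free_pcppages_bulk(zone, to_free_batched, pcp, pindex);
		to_free -= to_free_batched;

		if (pcp->count >= high) {
			/* unlock/relock and CPU migration check as in v4 */
			...
		}
	}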
> + to_free_batched = min(to_free, batch);
> + free_pcppages_bulk(zone, to_free_batched, pcp, pindex);
> + to_free -= to_free_batched;
> + if (pcp->count >= high) {
> + pcp_spin_unlock(pcp);
> + pcp_trylock_finish(*UP_flags);
> +
> + pcp_trylock_prepare(*UP_flags);
> + pcp = pcp_spin_trylock(zone->per_cpu_pageset);
> + if (!pcp) {
> + pcp_trylock_finish(*UP_flags);
> + ret = false;
> + break;
> + }
> +
> + /*
> + * Check if this thread has been migrated to a different
> + * CPU. If that is the case, give up and indicate that
> + * the pcp is returned in an unlocked state.
> + */
> + if (smp_processor_id() != cpu) {
> + pcp_spin_unlock(pcp);
> + pcp_trylock_finish(*UP_flags);
> + ret = false;
> + break;
> + }
> + }
> + }
>
> - free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
> - pcp, pindex);
> if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
> zone_watermark_ok(zone, 0, high_wmark_pages(zone),
> ZONE_MOVABLE, 0)) {
> @@ -2887,6 +2929,7 @@ static void free_frozen_page_commit(struct zone *zone,
> next_memory_node(pgdat->node_id) < MAX_NUMNODES)
> atomic_set(&pgdat->kswapd_failures, 0);
> }
> + return ret;
> }
>
> /*
> @@ -2934,7 +2977,9 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> pcp_trylock_prepare(UP_flags);
> pcp = pcp_spin_trylock(zone->per_cpu_pageset);
> if (pcp) {
> - free_frozen_page_commit(zone, pcp, page, migratetype, order, fpi_flags);
> + if (!free_frozen_page_commit(zone, pcp, page, migratetype,
> + order, fpi_flags, &UP_flags))
> + return;
> pcp_spin_unlock(pcp);
> } else {
> free_one_page(zone, page, pfn, order, fpi_flags);
> @@ -3034,8 +3079,11 @@ void free_unref_folios(struct folio_batch *folios)
> migratetype = MIGRATE_MOVABLE;
>
> trace_mm_page_free_batched(&folio->page);
> - free_frozen_page_commit(zone, pcp, &folio->page, migratetype,
> - order, FPI_NONE);
> + if (!free_frozen_page_commit(zone, pcp, &folio->page,
> + migratetype, order, FPI_NONE, &UP_flags)) {
> + pcp = NULL;
> + locked_zone = NULL;
> + }
> }
>
> if (pcp) {