Message-Id: <20240705135911.4a6e38379ae95c3fc6bbe7e2@linux-foundation.org>
Date: Fri, 5 Jul 2024 13:59:11 -0700
From: Andrew Morton <akpm@...ux-foundation.org>
To: Saurabh Sengar <ssengar@...ux.microsoft.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, ssengar@...rosoft.com,
wei.liu@...nel.org
Subject: Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup

On Fri, 5 Jul 2024 01:48:21 -0700 Saurabh Sengar <ssengar@...ux.microsoft.com> wrote:

> The refresh_zone_stat_thresholds() function has two loops, which makes it
> expensive on systems with many CPUs and NUMA nodes.
>
> Below is a rough estimate of the total iterations done by these loops,
> based on the number of NUMA nodes and CPUs.
>
> Total number of iterations: nCPU * 2 * Numa * mCPU
> Where:
> nCPU = total number of CPUs
> Numa = total number of NUMA nodes
> mCPU = mean value of total CPUs (e.g., 512 for 1024 total CPUs)
>
> For the system under test with 16 NUMA nodes and 1024 CPUs, this
> results in a substantial increase in the number of loop iterations
> during boot-up when NUMA is enabled:
>
> No NUMA = 1024*2*1*512 = 1,048,576 : Here refresh_zone_stat_thresholds
> takes around 224 ms total for all the CPUs in the system under test.
> 16 NUMA = 1024*2*16*512 = 16,777,216 : Here refresh_zone_stat_thresholds
> takes around 4.5 seconds total for all the CPUs in the system under test.

Did you measure the overall before-and-after times? IOW, how much of
that 4.5s do we reclaim?

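
The quoted estimate is easy to re-check with a small userspace sketch.
This is not kernel code: the function names are made up for
illustration, and mCPU is taken as nCPU / 2, as in the example above.
The second helper assumes that a single deferred call costs
2 * Numa * nCPU iterations under the same model; that is an assumption
of this sketch, not something measured.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rough estimate from the patch description:
 *   total = nCPU * 2 * Numa * mCPU, with mCPU = nCPU / 2
 * Userspace model only. Integer division rounds mCPU down for odd nCPU.
 */
uint64_t boot_iterations_estimate(uint64_t ncpus, uint64_t nodes)
{
	assert(ncpus > 0 && nodes > 0);

	uint64_t mean_online = ncpus / 2;	/* mCPU in the formula */

	return ncpus * 2 * nodes * mean_online;
}

/*
 * Assumption of this sketch: one deferred call walks every CPU once in
 * each of the two loops, per node, i.e. 2 * Numa * nCPU iterations.
 */
uint64_t deferred_iterations_estimate(uint64_t ncpus, uint64_t nodes)
{
	assert(ncpus > 0 && nodes > 0);

	return 2 * nodes * ncpus;
}
```

With 1024 CPUs this gives 1,048,576 iterations for one node and
16,777,216 for 16 nodes, matching the figures quoted above. Under the
stated assumption, the deferred variant comes to 32,768 iterations for
16 nodes. Iteration counts say nothing about wall-clock time, so the
measurement question still stands.
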
> Calling this for each CPU is expensive when there is a large number
> of CPUs along with multiple NUMA nodes. Fix this by deferring
> refresh_zone_stat_thresholds() so that it is called once, later, when
> all the secondary CPUs are up. Also, register the DYN hooks to keep the
> existing hotplug functionality intact.
>

Seems risky - we'll now have online CPUs which have uninitialized data,
yes? What assurance do we have that this data won't be accessed?

Another approach might be to make the code a bit smarter - instead of
calculating thresholds for the whole world, we make incremental changes
to the existing thresholds on behalf of the new resource which just
became available?
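
One way to read that suggestion, sketched as a userspace model rather
than a patch: when a CPU comes online, initialise only that CPU's
threshold, and walk the other online CPUs only if the zone-wide value
actually changed. Everything named model_* below is hypothetical, and
model_threshold() is a placeholder policy, not the kernel's
calculation.

```c
#include <assert.h>

#define MODEL_MAX_CPUS 8

/* Hypothetical userspace stand-in for one zone's per-CPU thresholds. */
struct model_zone {
	int threshold;				/* last zone-wide value */
	int cpu_threshold[MODEL_MAX_CPUS];	/* per-CPU copy */
	int online[MODEL_MAX_CPUS];		/* 1 if the CPU is online */
	int nr_online;
};

/* Placeholder policy: changes its answer once, at the fourth CPU. */
static int model_threshold(int nr_online)
{
	return nr_online < 4 ? 10 : 20;
}

/*
 * Mark one CPU online and return how many per-CPU slots were written.
 * The zone must start out zero-initialised.
 */
int model_cpu_online(struct model_zone *z, int cpu)
{
	int writes = 0;
	int new_threshold;

	assert(cpu >= 0 && cpu < MODEL_MAX_CPUS && !z->online[cpu]);
	z->online[cpu] = 1;
	z->nr_online++;

	new_threshold = model_threshold(z->nr_online);
	if (new_threshold != z->threshold) {
		/* Zone-wide value moved: update every online CPU. */
		z->threshold = new_threshold;
		for (int i = 0; i < MODEL_MAX_CPUS; i++) {
			if (z->online[i]) {
				z->cpu_threshold[i] = new_threshold;
				writes++;
			}
		}
	} else {
		/* Unchanged: only the new CPU's slot needs initialising. */
		z->cpu_threshold[cpu] = new_threshold;
		writes = 1;
	}

	return writes;
}
```

In this model every CPU has a valid threshold from the moment it is
marked online, and the full walk over online CPUs happens only when the
placeholder policy changes its answer (here, when the fourth CPU comes
up). Whether the real threshold calculation changes rarely enough for
this to pay off would need to be checked against the kernel code.
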