Message-ID: <20240809054956.GA12044@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
Date: Thu, 8 Aug 2024 22:49:56 -0700
From: Saurabh Singh Sengar <ssengar@...ux.microsoft.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, ssengar@...rosoft.com,
wei.liu@...nel.org
Subject: Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after
all CPUs bringup
On Thu, Aug 08, 2024 at 10:20:06PM -0700, Andrew Morton wrote:
> On Fri, 5 Jul 2024 01:48:21 -0700 Saurabh Sengar <ssengar@...ux.microsoft.com> wrote:
>
> > The refresh_zone_stat_thresholds() function has two loops, which is
> > expensive for a higher number of CPUs and NUMA nodes.
> >
> > Below is a rough estimate of the total iterations performed by these
> > loops, based on the number of NUMA nodes and CPUs (a small cost-model
> > sketch follows the numbers below).
> >
> > Total number of iterations: nCPU * 2 * Numa * mCPU
> > Where:
> > nCPU = total number of CPUs
> > Numa = total number of NUMA nodes
> > mCPU = mean number of CPUs online during bringup, roughly nCPU/2
> >        (e.g., 512 for 1024 total CPUs)
> >
> > For the system under test with 16 NUMA nodes and 1024 CPUs, this
> > results in a substantial increase in the number of loop iterations
> > during boot-up when NUMA is enabled:
> >
> > No NUMA: 1024*2*1*512  =  1,048,576 iterations; here
> > refresh_zone_stat_thresholds takes around 224 ms total for all the
> > CPUs in the system under test.
> > 16 NUMA: 1024*2*16*512 = 16,777,216 iterations; here
> > refresh_zone_stat_thresholds takes around 4.5 seconds total for all
> > the CPUs in the system under test.
> >
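As a quick check of the arithmetic above, the same cost model can be
written as a throwaway C program. The function name and the nCPU/2
approximation for mCPU below are assumptions drawn from the description
above, not code from the kernel tree:

#include <stdio.h>

/*
 * Rough cost model from the commit message:
 *   total = nCPU * 2 * Numa * mCPU
 * where mCPU, the mean number of CPUs online while the CPUs are
 * brought up one at a time, is approximated as nCPU / 2.
 */
static unsigned long total_iterations(unsigned long ncpu, unsigned long numa)
{
	unsigned long mcpu = ncpu / 2;

	return ncpu * 2 * numa * mcpu;
}

int main(void)
{
	printf("no NUMA : %lu\n", total_iterations(1024, 1));  /* 1,048,576 */
	printf("16 NUMA : %lu\n", total_iterations(1024, 16)); /* 16,777,216 */
	return 0;
}
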
> > Calling this for each CPU is expensive when there is a large number
> > of CPUs along with multiple NUMA nodes. Fix this by deferring
> > refresh_zone_stat_thresholds() so it is called only once, after all
> > the secondary CPUs are up. Also, register the DYN hooks to keep the
> > existing hotplug functionality intact.
> >
> > ...
> >
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -31,6 +31,7 @@
> >
> > #include "internal.h"
> >
> > +static int vmstat_late_init_done;
> > #ifdef CONFIG_NUMA
> > int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
> >
> > @@ -2107,7 +2108,8 @@ static void __init init_cpu_node_state(void)
> >
> > static int vmstat_cpu_online(unsigned int cpu)
> > {
> > - refresh_zone_stat_thresholds();
> > + if (vmstat_late_init_done)
> > + refresh_zone_stat_thresholds();
> >
> > if (!node_state(cpu_to_node(cpu), N_CPU)) {
> > node_set_state(cpu_to_node(cpu), N_CPU);
> > @@ -2139,6 +2141,14 @@ static int vmstat_cpu_dead(unsigned int cpu)
> > return 0;
> > }
> >
> > +static int __init vmstat_late_init(void)
> > +{
> > + refresh_zone_stat_thresholds();
> > + vmstat_late_init_done = 1;
> > +
> > + return 0;
> > +}
> > +late_initcall(vmstat_late_init);
>
> OK, so what's happening here. Once all CPUs are online and running
> around doing heaven knows what, one of the CPUs sets up everyone's
> thresholds. So for a period, all the other CPUs are running with
> inappropriate threshold values.
>
> So what are all the other CPUs doing at this point in time, and why is
> it safe to leave their thresholds in an inappropriate state while they
> are doing it?
From what I understand, these threshold values are primarily used by
userspace tools, and this data will be useful only after late_initcall.
If there's a more effective approach to handle this, please let me know,
and I can investigate further.
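
For additional context, here is my simplified mental model of how the
threshold gets consumed, loosely based on __mod_zone_page_state() in
mm/vmstat.c (the toy helper below is illustrative, not the kernel code):
per-CPU deltas accumulate until their magnitude crosses stat_threshold
and are then folded into the global counter, so a threshold of 0, the
state before refresh_zone_stat_thresholds() runs, simply folds on every
update: slower, but the counts stay correct.

#include <stdio.h>
#include <stdlib.h>

/*
 * Toy model of the per-CPU vmstat fold: a delta accumulates in a
 * per-CPU diff until its magnitude exceeds the threshold, at which
 * point it is folded into the global counter.
 */
static void mod_state(long *global, long *pcpu_diff, long threshold,
		      long delta)
{
	long x = *pcpu_diff + delta;

	if (labs(x) > threshold) {
		*global += x;	/* fold the accumulated delta */
		x = 0;
	}
	*pcpu_diff = x;
}

int main(void)
{
	long global = 0, diff = 0;

	/* threshold 0 (before late_initcall): folds immediately */
	mod_state(&global, &diff, 0, 1);
	printf("global=%ld diff=%ld\n", global, diff);	/* global=1 diff=0 */

	global = diff = 0;
	/* threshold 32 (after refresh): the delta stays per-CPU */
	mod_state(&global, &diff, 32, 1);
	printf("global=%ld diff=%ld\n", global, diff);	/* global=0 diff=1 */
	return 0;
}
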
- Saurabh