[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZCGsJB36YYbgsgBW@chenyu5-mobl1>
Date: Mon, 27 Mar 2023 22:45:56 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Aaron Lu <aaron.lu@...el.com>
CC: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
"Mel Gorman" <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>,
Nitin Tekchandani <nitin.tekchandani@...el.com>,
Waiman Long <longman@...hat.com>,
<linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node
On 2023-03-27 at 13:39:55 +0800, Aaron Lu wrote:
> When using sysbench to benchmark Postgres in a single docker instance
> with sysbench's nr_threads set to nr_cpu, it is observed there are times
> update_cfs_group() and update_load_avg() shows noticeable overhead on
> cpus of one node of a 2sockets/112core/224cpu Intel Sapphire Rapids:
>
> 10.01% 9.86% [kernel.vmlinux] [k] update_cfs_group
> 7.84% 7.43% [kernel.vmlinux] [k] update_load_avg
>
> While cpus of the other node normally sees a lower cycle percent:
>
> 4.46% 4.36% [kernel.vmlinux] [k] update_cfs_group
> 4.02% 3.40% [kernel.vmlinux] [k] update_load_avg
>
> Annotate shows the cycles are mostly spent on accessing tg->load_avg
> with update_load_avg() being the write side and update_cfs_group() being
> the read side.
>
> The reason why only cpus of one node has bigger overhead is: task_group
> is allocated on demand from a slab and whichever cpu happens to do the
> allocation, the allocated tg will be located on that node and accessing
> to tg->load_avg will have a lower cost for cpus on the same node and
> a higer cost for cpus of the remote node.
>
> Tim Chen told me that PeterZ once mentioned a way to solve a similar
> problem by making a counter per node so do the same for tg->load_avg.
> After this change, the worst number I saw during a 5 minutes run from
> both nodes are:
>
> 2.77% 2.11% [kernel.vmlinux] [k] update_load_avg
> 2.72% 2.59% [kernel.vmlinux] [k] update_cfs_group
>
> Another observation of this workload is: it has a lot of wakeup time
> task migrations and that is the reason why update_load_avg() and
> update_cfs_group() shows noticeable cost. Running this workload in N
> instances setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> task migrations on wake up time are greatly reduced and the overhead from
> the two above mentioned functions also dropped a lot. It's not clear to
> me why running in multiple instances can reduce task migrations on
> wakeup path yet.
>
Looks interesting, when the sysbench is 1 instance and nr_threads = nr_cpu,
and when the launches more than 1 instance of sysbench, while nr_threads set
to 1/N * nr_cpu, do both cases have similar CPU utilization? Currently the
task wakeup inhibits migration wakeup if the system is overloaded.
[...]
> struct task_group *sched_create_group(struct task_group *parent)
> {
> + size_t size = sizeof(struct task_group);
> + int __maybe_unused i, nodes;
> struct task_group *tg;
>
> - tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> + nodes = num_possible_nodes();
> + size += nodes * sizeof(void *);
> + tg = kzalloc(size, GFP_KERNEL);
> + if (!tg)
> + return ERR_PTR(-ENOMEM);
> +
> + for_each_node(i) {
> + tg->node_info[i] = kzalloc_node(sizeof(struct tg_node_info), GFP_KERNEL, i);
> + if (!tg->node_info[i])
> + return ERR_PTR(-ENOMEM);
Do we need to free tg above in case of memory leak?
thanks,
Chenyu
Powered by blists - more mailing lists