linux-kernel - Re: [RFC PATCH] sched/fair: Make tg->load


Message-ID: <20230412141116.GB155769@ziqianlu-desk2>
Date:   Wed, 12 Apr 2023 22:11:16 +0800
From:   Aaron Lu <aaron.lu@...el.com>
To:     Peter Zijlstra <peterz@...radead.org>
CC:     Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        "Daniel Bristot de Oliveira" <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Tim Chen <tim.c.chen@...el.com>,
        Nitin Tekchandani <nitin.tekchandani@...el.com>,
        Waiman Long <longman@...hat.com>,
        Yu Chen <yu.c.chen@...el.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

On Wed, Apr 12, 2023 at 03:58:28PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 12, 2023 at 01:59:36PM +0200, Peter Zijlstra wrote:
> > On Mon, Mar 27, 2023 at 01:39:55PM +0800, Aaron Lu wrote:
> > > > When using sysbench to benchmark Postgres in a single docker instance
> > > > with sysbench's nr_threads set to nr_cpu, it is observed that at times
> > > > update_cfs_group() and update_load_avg() show noticeable overhead on
> > > > cpus of one node of a 2sockets/112core/224cpu Intel Sapphire Rapids:
> > > 
> > >     10.01%     9.86%  [kernel.vmlinux]        [k] update_cfs_group
> > >      7.84%     7.43%  [kernel.vmlinux]        [k] update_load_avg
> > > 
> > > > While cpus of the other node normally see a lower cycle percent:
> > > 
> > >      4.46%     4.36%  [kernel.vmlinux]        [k] update_cfs_group
> > >      4.02%     3.40%  [kernel.vmlinux]        [k] update_load_avg
> > > 
> > > Annotate shows the cycles are mostly spent on accessing tg->load_avg
> > > with update_load_avg() being the write side and update_cfs_group() being
> > > the read side.
> > > 
> > > > The reason why only cpus of one node have bigger overhead is: task_group
> > > > is allocated on demand from a slab and whichever cpu happens to do the
> > > > allocation, the allocated tg will be located on that node, so access
> > > > to tg->load_avg will have a lower cost for cpus on the same node and
> > > > a higher cost for cpus of the remote node.
> > > 
> > > Tim Chen told me that PeterZ once mentioned a way to solve a similar
> > > problem by making a counter per node so do the same for tg->load_avg.
> > 
> > > Yeah, I sent him a very similar patch (except horrible) some 5 years ago
> > for testing.
> > 
> > > > After this change, the worst numbers I saw during a 5 minute run from
> > > > both nodes are:
> > > 
> > >      2.77%     2.11%  [kernel.vmlinux]        [k] update_load_avg
> > >      2.72%     2.59%  [kernel.vmlinux]        [k] update_cfs_group
> > 
> > Nice!
> > 
> > > > Another observation of this workload is: it has a lot of wakeup time
> > > > task migrations and that is the reason why update_load_avg() and
> > > > update_cfs_group() show noticeable cost. Running this workload in an N
> > > > instance setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> > > > task migrations at wakeup time are greatly reduced and the overhead from
> > > > the two above mentioned functions also drops a lot. It's not clear to
> > > > me yet why running in multiple instances can reduce task migrations on
> > > > the wakeup path.
> > 
> > > If there is *any* idle time, we're rather aggressive at moving tasks to
> > idle CPUs in an attempt to avoid said idle time. If you're running at
> > about the number of CPUs there will be a fair amount of idle time and
> > hence significant migrations.
> > 
> > When you overload, there will no longer be idle time and hence no more
> > migrations.
> > 
> > > Reported-by: Nitin Tekchandani <nitin.tekchandani@...el.com>
> > > Signed-off-by: Aaron Lu <aaron.lu@...el.com>
> > 
> > If you want to make things more complicated you can check
> > num_possible_nodes()==1 on boot and then avoid the indirection, but
> 
> ... finishing emails is hard :-)
> 
> I think I meant to say we should check if there's measurable overhead on
> single-node systems before we go overboard or somesuch.

Got it, hopefully there is no measurable overhead :-)
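
For readers unfamiliar with the "counter per node" idea discussed above, here is a minimal userspace sketch in C11. It is not the patch under discussion: the names split_counter, node_counter, MAX_NODES and CACHELINE are hypothetical, and the caller passes the node index explicitly instead of deriving it from the running cpu. Writers touch only their own node's slot and readers sum all slots, so write traffic from one node does not bounce the cache line used by the other. A NUMA-aware version would also place each slot in its own node's memory; this sketch only keeps the slots on separate cache lines.

```c
/* Userspace sketch only; all names are hypothetical, not kernel code. */
#include <assert.h>
#include <stdatomic.h>

#define MAX_NODES 8   /* assumed upper bound on NUMA nodes for the sketch */
#define CACHELINE 64  /* assumed cache line size in bytes */

/* One slot per node, padded so that slots never share a cache line. */
struct node_counter {
    _Alignas(CACHELINE) atomic_long val;
};

struct split_counter {
    int nr_nodes;
    struct node_counter node[MAX_NODES];
};

static void split_counter_init(struct split_counter *c, int nr_nodes)
{
    assert(nr_nodes >= 1 && nr_nodes <= MAX_NODES);
    c->nr_nodes = nr_nodes;
    for (int i = 0; i < nr_nodes; i++)
        atomic_init(&c->node[i].val, 0);
}

/* Write side: update only the slot belonging to the caller's node. */
static void split_counter_add(struct split_counter *c, int node, long delta)
{
    assert(node >= 0 && node < c->nr_nodes);
    atomic_fetch_add_explicit(&c->node[node].val, delta,
                              memory_order_relaxed);
}

/* Read side: sum every node's slot to get the global value. */
static long split_counter_read(struct split_counter *c)
{
    long sum = 0;

    for (int i = 0; i < c->nr_nodes; i++)
        sum += atomic_load_explicit(&c->node[i].val,
                                    memory_order_relaxed);
    return sum;
}
```

With nr_nodes set to 1 the read loop reduces to a single load, but the access still goes through the array, which is the kind of single-node indirection cost the thread suggests measuring.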