Message-ID: <20230422060450.GA264770@ziqianlu-desk2>
Date: Sat, 22 Apr 2023 14:04:50 +0800
From: Aaron Lu <aaron.lu@...el.com>
To: Chen Yu <yu.c.chen@...el.com>
CC: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
"Mel Gorman" <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>,
Nitin Tekchandani <nitin.tekchandani@...el.com>,
Waiman Long <longman@...hat.com>,
<linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node
On Sat, Apr 22, 2023 at 12:01:59PM +0800, Chen Yu wrote:
> On 2023-03-27 at 13:39:55 +0800, Aaron Lu wrote:
> >
> > The reason why only cpus of one node have bigger overhead is: task_group
> > is allocated on demand from a slab and whichever cpu happens to do the
> > allocation, the allocated tg will be located on that node; accessing
> > tg->load_avg will then have a lower cost for cpus on the same node and
> > a higher cost for cpus on the remote node.
> [...]
> > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > {
> > long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > + int node = cpu_to_node(cfs_rq->rq->cpu);
> >
> > /*
> > * No need to update load_avg for root_task_group as it is not used.
> > @@ -3616,7 +3617,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > return;
> >
> > if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> > - atomic_long_add(delta, &cfs_rq->tg->load_avg);
> > + atomic_long_add(delta, &cfs_rq->tg->node_info[node]->load_avg);
> When entering enqueue_entity(cfs_rq, se) -> update_tg_load_avg(cfs_rq),
> the cfs_rq->rq->cpu is not necessarily the current cpu, so the node returned by
> cpu_to_node(cfs_rq->rq->cpu) is not necessarily the node of the current
> CPU. Would atomic_add still introduce cross-node overhead due to remote access
> to cfs_rq->tg->node_info[node]->load_avg in this case, or am I missing something?
That's a good point.
The chance of the cpu being different is high, but the chance of the node
being different is pretty low. The wakeup path will not do cross-LLC
activation with TTWU_QUEUE, but the load balance path does make it possible
to dequeue a task that is on a remote node and I think that's the only path
where the two cpus (current vs cfs_rq->rq->cpu) can be on different nodes.
A quick test shows that during a 5s window of running hackbench on a VM,
the two nodes derived from current and cfs_rq->rq->cpu differed fewer than
100 times, while they were equal 992548 times.
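For reference, the comparison in that test was essentially of this shape
(a hypothetical debug counter, not the exact instrumentation used):

	/* Hypothetical debug counters, bumped from update_tg_load_avg(). */
	static DEFINE_PER_CPU(unsigned long, tg_update_local_node);
	static DEFINE_PER_CPU(unsigned long, tg_update_remote_node);

	static inline void count_tg_update_node(struct cfs_rq *cfs_rq)
	{
		/* Compare the updater's node with the node of the cfs_rq's CPU. */
		if (cpu_to_node(smp_processor_id()) == cpu_to_node(cfs_rq->rq->cpu))
			this_cpu_inc(tg_update_local_node);
		else
			this_cpu_inc(tg_update_remote_node);
	}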
I'll switch to using smp_processor_id() in the next posting since it's the
right thing to do, thanks for spotting this!
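
Concretely, something like the below against the RFC patch (untested
sketch; node_info[] is the per-node array this patch introduces):

	static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
	{
		long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
		/*
		 * Use the node of the CPU doing the update rather than
		 * cfs_rq->rq->cpu, so the atomic_long_add() below stays node
		 * local even when a remote CPU's cfs_rq is updated from the
		 * load balance path.
		 */
		int node = cpu_to_node(smp_processor_id());

		/*
		 * No need to update load_avg for root_task_group as it is not used.
		 */
		if (cfs_rq->tg == &root_task_group)
			return;

		if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
			atomic_long_add(delta, &cfs_rq->tg->node_info[node]->load_avg);
			cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
		}
	}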