Message-ID: <ZMke3RuZzkyR64Lg@chenyu5-mobl2>
Date: Tue, 1 Aug 2023 23:03:57 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Aaron Lu <aaron.lu@...el.com>
CC: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Shrikanth Hegde <sshegde@...ux.vnet.ibm.com>,
<linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
"Vincent Guittot" <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Swapnil Sapkal <Swapnil.Sapkal@....com>, <x86@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Subject: Re: [RFC PATCH 1/1] sched: Extend cpu idle state for 1ms
On 2023-08-01 at 15:24:03 +0800, Aaron Lu wrote:
> On Wed, Jul 26, 2023 at 02:56:19PM -0400, Mathieu Desnoyers wrote:
>
> ... ...
>
> > The updated patch:
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a68d1276bab0..1c7d5bd2968b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7300,6 +7300,10 @@ int idle_cpu(int cpu)
> >  {
> >  	struct rq *rq = cpu_rq(cpu);
> >  
> > +	if (READ_ONCE(rq->nr_running) <= IDLE_CPU_DELAY_MAX_RUNNING &&
> > +	    sched_clock_cpu(cpu_of(rq)) < READ_ONCE(rq->clock_idle) + IDLE_CPU_DELAY_NS)
> > +		return 1;
> > +
> >  	if (rq->curr != rq->idle)
> >  		return 0;
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 81ac605b9cd5..57a49a5524f0 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -97,6 +97,9 @@
> >  # define SCHED_WARN_ON(x)	({ (void)(x), 0; })
> >  #endif
> >  
> > +#define IDLE_CPU_DELAY_NS		1000000	/* 1ms */
> > +#define IDLE_CPU_DELAY_MAX_RUNNING	4
> > +
> >  struct rq;
> >  struct cpuidle_state;
> >
>
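
As an aside for readers of the thread: the new check leans on
rq->clock_idle, which (if I'm reading the tree right) is stamped when a
CPU enters idle, roughly:

	/* kernel/sched/pelt.h, abridged sketch, not the exact code */
	static inline void _update_idle_rq_clock_pelt(struct rq *rq)
	{
		rq->clock_pelt = rq_clock_task(rq);
		u64_u32_store(rq->clock_idle, rq_clock(rq));
	}

So the added fast path reads as: this CPU went idle less than 1ms ago
and has at most IDLE_CPU_DELAY_MAX_RUNNING tasks queued.
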
> I gave this patch a run on Intel SPR (2 sockets/112 cores/224 CPUs) and
> I also noticed a huge improvement when running hackbench, especially
> for the group=32/fds=20 case:
>
> when group=10/fds=20 (400 tasks):
>              time   wakeups/migrations   tg->load_avg%
> base:         43s   27874246/13953871    25%
> this patch:   32s   33200766/244457       2%
> my patch:     37s   29186608/16307254     2%
>
> when group=20/fds=20 (800 tasks):
>              time   wakeups/migrations   tg->load_avg%
> base:         65s   27108751/16238701    27%
> this patch:   45s   35718552/1691220      3%
> my patch:     48s   37506974/24797284     2%
>
> when group=32/fds=20 (1280 tasks):
>              time   wakeups/migrations   tg->load_avg%
> base:        150s   36902527/16423914    36%
> this patch:   57s   30536830/6035346      6%
> my patch:     73s   45264605/21595791     3%
>
> One thing I noticed is that, after this patch, migrations on the wakeup
> path are dramatically reduced (see the wakeups/migrations numbers above;
> they were captured over 5s during the run). I think this makes sense
> because a cpu is now more likely to be considered idle, so a woken task
> will more likely stay on its prev_cpu. And when migrations are reduced,
> the cost of accessing tg->load_avg is also reduced (tg->load_avg% is the
> sum of update_cfs_group()% + update_load_avg()% as reported by perf). I
> think this is part of the reason why performance improved on this
> machine.
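
The reduced migrations also match the wakeup fast path: once idle_cpu()
keeps answering "idle" for up to 1ms, select_idle_sibling() will keep
handing the wakee its prev_cpu. Roughly (a simplified sketch of the
logic in kernel/sched/fair.c, not the exact code):

	/*
	 * select_idle_sibling(), abridged: if the task's previous CPU
	 * reports idle and shares cache with the target, place the
	 * wakee back on it and skip both the idle-CPU scan and the
	 * migration. available_idle_cpu() is built on idle_cpu(), so
	 * the patch above directly widens this fast path.
	 */
	if (prev != target && cpus_share_cache(prev, target) &&
	    available_idle_cpu(prev))
		return prev;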
>
> Since I've been working on reducing the cost of accessing tg->load_avg [1],
> I also gave my patch a run. According to the results, even though the
> cost of accessing tg->load_avg is smaller with my patch, Mathieu's patch
> is still faster. It's not clear to me why; maybe it has something to do
> with cache reuse, since my patch doesn't inhibit migration? I suppose
> IPC could reflect this?
>
Yeah, probably. I'm thinking that, since in hackbench the sender sends
data to the receiver and the receiver then reads it, performance could
improve if the cache is still hot for the wakee (the receiver) on every
wakeup. Maybe increasing the default transfer data size of 100 bytes
could evict the L1/L2 more often on each send/receive, which would help
figure out whether this is related to cache locality.
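
If it helps, one way to try that (assuming the rt-tests hackbench, where
-s sets the per-message size in bytes and defaults to 100; the group/fds
numbers just mirror your runs):

	# bigger messages push more data through L1/L2 per send/receive
	hackbench -g 32 -f 20 -s 1024

	# comparing IPC with and without the patch could hint at cache reuse
	perf stat -e cycles,instructions -- hackbench -g 32 -f 20
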
thanks,
Chenyu
> [1]: https://lore.kernel.org/lkml/20230718134120.81199-1-aaron.lu@intel.com/
>
> Thanks,
> Aaron