Message-ID: <becdb227-cd7c-7724-f79e-f5ee70692e42@efficios.com>
Date:   Thu, 3 Aug 2023 16:21:37 -0400
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Aaron Lu <aaron.lu@...el.com>
Cc:     Shrikanth Hegde <sshegde@...ux.vnet.ibm.com>,
        linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Swapnil Sapkal <Swapnil.Sapkal@....com>, x86@...nel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Subject: Re: [RFC PATCH 1/1] sched: Extend cpu idle state for 1ms

On 8/1/23 03:24, Aaron Lu wrote:
> On Wed, Jul 26, 2023 at 02:56:19PM -0400, Mathieu Desnoyers wrote:
> 
> ... ...
> 
>> The updated patch:
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index a68d1276bab0..1c7d5bd2968b 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7300,6 +7300,10 @@ int idle_cpu(int cpu)
>>   {
>>   	struct rq *rq = cpu_rq(cpu);
>> +	if (READ_ONCE(rq->nr_running) <= IDLE_CPU_DELAY_MAX_RUNNING &&
>> +	    sched_clock_cpu(cpu_of(rq)) < READ_ONCE(rq->clock_idle) + IDLE_CPU_DELAY_NS)
>> +		return 1;
>> +
>>   	if (rq->curr != rq->idle)
>>   		return 0;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 81ac605b9cd5..57a49a5524f0 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -97,6 +97,9 @@
>>   # define SCHED_WARN_ON(x)      ({ (void)(x), 0; })
>>   #endif
>> +#define IDLE_CPU_DELAY_NS		1000000		/* 1ms */
>> +#define IDLE_CPU_DELAY_MAX_RUNNING	4
>> +
>>   struct rq;
>>   struct cpuidle_state;
>>
> 
> I gave this patch a run on Intel SPR(2 sockets/112cores/224cpus) and I
> also noticed huge improvement when running hackbench, especially for
> group=32/fds=20 case:
> 
> when group=10/fds=20(400 tasks):
>              time   wakeups/migrations  tg->load_avg%
> base:        43s  27874246/13953871      25%
> this patch:  32s  33200766/244457         2%
> my patch:    37s  29186608/16307254       2%
> 
> when group=20/fds=20(800 tasks):
>              time   wakeups/migrations tg->load_avg%
> base:        65s  27108751/16238701      27%
> this patch:  45s  35718552/1691220        3%
> my patch:    48s  37506974/24797284       2%
> 
> when group=32/fds=20(1280 tasks):
>              time   wakeups/migrations tg->load_avg%
> base:       150s  36902527/16423914      36%
> this patch:  57s  30536830/6035346        6%
> my patch:    73s  45264605/21595791       3%
> 
> One thing I noticed is that, after this patch, migrations on the wakeup
> path have dropped dramatically (see the wakeups/migrations above; the
> numbers were captured over 5s during the run). I think this makes sense
> because a cpu is now more likely to be considered idle, so a waking task
> will more likely stay on its prev_cpu. And when migrations are reduced,
> the cost of accessing tg->load_avg is also reduced (tg->load_avg% is the
> sum of update_cfs_group()% + update_load_avg()% as reported by perf). I
> think this is part of the reason why performance improved on this machine.
> 
> Since I've been working on reducing the cost of accessing tg->load_avg[1],
> I also gave my patch a run. According to the results, even though the cost
> of accessing tg->load_avg is smaller with my patch, Mathieu's patch is
> still faster. It's not clear to me why; maybe it has something to do
> with cache reuse, since my patch doesn't inhibit migrations? I suppose
> IPC could reflect this?
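
For what it's worth, a quick way to check that hypothesis would be to
compare IPC between the two kernels over the steady state of the run
(perf stat already reports "insn per cycle"). Below is a minimal
self-contained sketch of the same measurement done in C with
perf_event_open(2), counting cycles and instructions for the calling
process; it is only an illustration, not part of either patch:

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Minimal wrapper around perf_event_open(2); pid=0, cpu=-1 measures the
 * calling process on whatever CPU it runs on. Kernel cycles are included,
 * so this may require perf_event_paranoid <= 1 for unprivileged users. */
static int open_counter(uint64_t config, int group_fd)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = config;
	attr.read_format = PERF_FORMAT_GROUP;
	attr.disabled = (group_fd == -1);	/* group leader starts disabled */

	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
}

int main(void)
{
	int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES, -1);
	int insns  = open_counter(PERF_COUNT_HW_INSTRUCTIONS, cycles);
	uint64_t val[3];	/* { nr_counters, cycles, instructions } */

	if (cycles < 0 || insns < 0)
		return 1;

	ioctl(cycles, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
	ioctl(cycles, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

	/* ... section of interest (stand-in busy loop here) ... */
	for (volatile long i = 0; i < 50000000; i++)
		;

	ioctl(cycles, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
	if (read(cycles, val, sizeof(val)) == (ssize_t)sizeof(val))
		printf("IPC = %.2f\n", (double)val[2] / (double)val[1]);
	return 0;
}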

I've also noticed a drastic reduction in the number of migrations with 
my patch. The behavior of select_task_rq changes drastically as well, 
but I have not figured out why yet.
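
For reference, the reduction is consistent with Aaron's observation
above: for up to IDLE_CPU_DELAY_NS after the CPU's rq->clock_idle
timestamp (taken when it last went idle, if I read it correctly), and
while it has at most IDLE_CPU_DELAY_MAX_RUNNING runnable tasks, the CPU
keeps being reported as idle, so a waking task's prev_cpu keeps looking
like a valid target. A tiny user-space model of that check, mirroring
the constants from the patch (only a sketch, not the kernel code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define IDLE_CPU_DELAY_NS		1000000ULL	/* 1ms, as in the patch */
#define IDLE_CPU_DELAY_MAX_RUNNING	4

struct cpu_model {
	uint64_t clock_idle;	/* time the CPU last entered idle (ns) */
	unsigned int nr_running;
	bool curr_is_idle_task;
};

/* Mirrors the modified idle_cpu(): a recently-idle CPU with few runnable
 * tasks is still reported as idle during a 1ms grace window. */
static bool model_idle_cpu(const struct cpu_model *c, uint64_t now_ns)
{
	if (c->nr_running <= IDLE_CPU_DELAY_MAX_RUNNING &&
	    now_ns < c->clock_idle + IDLE_CPU_DELAY_NS)
		return true;

	return c->curr_is_idle_task && c->nr_running == 0;
}

int main(void)
{
	struct cpu_model c = {
		.clock_idle = 1000, .nr_running = 1, .curr_is_idle_task = false,
	};

	/* 0.5ms after going idle: still reported idle despite a runnable task. */
	printf("%d\n", model_idle_cpu(&c, 1000 + 500000));
	/* 2ms after going idle: the grace window has expired. */
	printf("%d\n", model_idle_cpu(&c, 1000 + 2000000));
	return 0;
}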

I tried adding tons of schedstats counters within select_task_rq to 
compare the decisions taken by the baseline vs. the modified 
implementation of idle_cpu(). I also tried to count how many times the 
target task rq changes (which implies a migration), with a breakdown by 
cause (which branch within select_task_rq caused it). I have not found 
a clear culprit yet, though (and I am currently on vacation, so I am 
not working on this actively).
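
Roughly, the instrumentation was along these lines (just a sketch of
the approach; the dbg_* counters below are hypothetical fields added to
struct sched_statistics for the experiment, not existing schedstats,
and the real hook points in select_task_rq_fair() are more numerous):

/* Called once per wakeup after the target rq has been chosen, to break
 * down "chosen cpu != prev_cpu" (i.e. a migration) by decision path.
 * The dbg_* fields are illustrative only. */
static void dbg_account_wakeup(struct task_struct *p, int prev_cpu,
			       int new_cpu, bool via_wake_affine)
{
	if (new_cpu == prev_cpu) {
		schedstat_inc(p->stats.dbg_wakeup_stayed);
		return;
	}

	if (via_wake_affine)
		schedstat_inc(p->stats.dbg_wakeup_moved_wake_affine);
	else
		schedstat_inc(p->stats.dbg_wakeup_moved_other);
}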

Thanks,

Mathieu


> 
> [1]: https://lore.kernel.org/lkml/20230718134120.81199-1-aaron.lu@intel.com/
> 
> Thanks,
> Aaron

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
