linux-kernel - Re: [PATCH RFC] select_idle

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <570428B2.5020900@gmail.com>
Date:	Tue, 5 Apr 2016 17:05:54 -0400
From:	Bastien Philbert <bastienphilbert@...il.com>
To:	Matt Fleming <matt@...eblueprint.co.uk>, Chris Mason <clm@...com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	Mike Galbraith <mgalbraith@...e.de>,
	linux-kernel@...r.kernel.org
Cc:	Mel Gorman <mgorman@...hsingularity.net>
Subject: Re: [PATCH RFC] select_idle_sibling experiments



On 2016-04-05 04:03 PM, Matt Fleming wrote:
> On Tue, 05 Apr, at 02:08:22PM, Chris Mason wrote:
>>
>> I started with a small-ish program to benchmark wakeup latencies.  The
>> basic idea is a bunch of worker threads who sit around and burn CPU.
>> Every once and a while they send a message to a message thread.
>  
> This reminds me of something I've been looking at recently; a similar
> workload in Mel's mmtests based on pgbench with 1-client that also has
> this problem of cpu_idle() being false at an inconvenient time in
> select_idle_sibling(), so we move the task off the cpu and the cpu
> then immediately goes idle.
> 
> This leads to tasks bouncing around the socket as we search for idle
> cpus.
> 
>> It has knobs for cpu think time, and for how long the messenger thread
>> waits before replying.  Here's how I'm running it with my patch:
>  
> [...]
> 
> Cool, I'll go have a play with this.
> 
>> Now, on to the patch.  I pushed some code around and narrowed the
>> problem down to select_idle_sibling()   We have cores going into and out
>> of idle fast enough that even this cut our latencies in half:
>>
>> static int select_idle_sibling(struct task_struct *p, int target)
>>                                 goto next;
>>  
>>                         for_each_cpu(i, sched_group_cpus(sg)) {
>> -                               if (i == target || !idle_cpu(i))
>> +                               if (!idle_cpu(i))
>>                                         goto next;
>>                         }
>>  
>> IOW, by the time we get down to for_each_cpu(), the idle_cpu() check
>> done at the top of the function is no longer valid.
>  
> Yeah. The problem is that because we're racing with the cpu going in
> and out of idle, and since you're exploiting that race condition, this
> is highly tuned to your specific workload.
> 
> Which is a roundabout way of saying, this is probably going to
> negatively impact other workloads.
> 
>> I tried a few variations on select_idle_sibling() that preserved the
>> underlying goal of returning idle cores before idle SMT threads.  They
>> were all horrible in different ways, and none of them were fast.
>  
> I toyed with ignoring cpu_idle() in select_idle_sibling() for my
> workload. That actually was faster ;)
> 
>> The patch below just makes select_idle_sibling pick the first idle
>> thread it can find.  When I ran it through production workloads here, it
>> was faster than the patch we've been carrying around for the last few
>> years.
> 
> It would be really nice if we had a lightweight way to gauge the
> "idleness" of a cpu, and whether we expect it to be idle again soon.
> 
The best way to do this is either embed it in a already used structure to
allow us to check it quickly. Otherwise I am curious if writing a marco
may prove useful for this. Seems that idleness checking needs to accounted
for when scheduling, in order to make this lightweight enough to avoid using
it during a context switch, the challenge however is to make the reference
counting lightweight enough to out weight it being done during current scheduling
functions.
> Failing that, could we just force the task onto 'target' when it makes
> sense and skip the idle search (and the race) altogether?
> 
Doesn't this possibly cause a context switch or even a extensive move to another
CPU instruction(s) on certain architectures. Seems we need to add reference counting
or tracking of idle CPUS somewhere.
Bastien