Message-ID: <560C3507.3040906@odin.com>
Date:	Wed, 30 Sep 2015 22:16:23 +0300
From:	Kirill Tkhai <ktkhai@...n.com>
To:	Mike Galbraith <umgwanakikbuti@...il.com>
CC:	<linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings



On 29.09.2015 20:29, Mike Galbraith wrote:
> On Tue, 2015-09-29 at 19:00 +0300, Kirill Tkhai wrote:
>>
>> On 29.09.2015 17:55, Mike Galbraith wrote:
>>> On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:
>>>
>>>> ---
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 4df37a4..dfbe06b 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>>>  	int want_affine = 0;
>>>>  	int sync = wake_flags & WF_SYNC;
>>>>  
>>>> -	if (sd_flag & SD_BALANCE_WAKE)
>>>> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>>>> +	if (sd_flag & SD_BALANCE_WAKE) {
>>>> +		want_affine = 1;
>>>> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>>>> +			goto want_affine;
>>>> +		if (wake_wide(p))
>>>> +			goto want_affine;
>>>> +	}
>>>
>>> That blew wake_wide() right out of the water.
>>>
>>> It's not only about things like pgbench.  Drive multiple tasks in a Xen
>>> guest (single event channel dom0 -> domu, and no select_idle_sibling()
>>> to save the day) via network, and watch workers fail to be all they can
>>> be because they keep being stacked up on the irq source.  Load balancing
>>> yanks them apart, next irq stacks them right back up.  I met that in
>>> enterprise land, thought wake_wide() should cure it, and indeed it did.
>>
>> 1) Hm.. The patch makes select_task_rq_fair() prefer the old cpu over the
>> current one, doesn't it? We more often don't set affine_sd, so the part of
>> the patch skipped in the quote selects prev_cpu.
> 
> Not the way I read it..
> 
>>> -    if (affine_sd) {
>>> +want_affine:
>>> +    if (want_affine) {
>>>              sd = NULL; /* Prefer wake_affine over balance flags */
>>> -            if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>>> +            if (affine_sd && wake_affine(affine_sd, p, sync))
>>>                      new_cpu = cpu;
>>> -    }
>>> -
>>> -    if (!sd) {
>>> -            if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
>>> -                    new_cpu = select_idle_sibling(p, new_cpu);
>>> -
>>> +            new_cpu = select_idle_sibling(p, new_cpu);
> 
> ..it sets new_cpu = cpu if wake_affine() says Ok, wake_wide() has no say
> in the matter.
>  
>> 2) I thought about wakeups from irq handlers and was even going to ask why
>> we use affine logic for them. Device handlers usually aren't bound, and
>> timers may migrate now that the NO_HZ logic is present. The only explanation
>> I found is that unbound timers are a very unlikely case (I added a statistics
>> printk to my local sched_debug to check that). But if we have situations like
>> the one you described above, shouldn't we disable affine logic for the
>> in_interrupt() cases?
> 
> BTDT.  In my experience, the more you try to differentiate sources, the
> more corner cases you create.  I've tried doing special things for irq,
> locks, wake_all, wake_one, and it always turned into a can of worms.
> IMHO, the best policy for the fast path is KISS.
> 
>> 3) I ask just because (being outside of scheduler history) it seems a little
>> strange that we prefer smp_processor_id()'s sd_llc so much. The benefit for
>> sync wakeups is more or less clear: smp_processor_id()'s sd_llc may contain
>> data that is interesting for the wakee, and this minimizes cache misses.
>> But we do the same in other cases too, and at every migration we lose
>> itlb, dtlb... Of course, this requires more accurate patches than the one
>> posted (not such crude patches).
> 
> IMHO, the sync wakeup hint is more often a big fat lie than anything
> else, it really just gives us a bit more headroom for affine wakeups in
> cases where that's likely to be a very good thing (affine in the cache
> sense, not affine as in an individual CPU).  What it means is that waker
> is likely to schedule RSN, but if you measure even very fast/light
> things, there is an overlap win to be had by NOT waking CPU affine,
> rather waking cache affine, that's why we cross core schedule so often.
> A real network app doing a wakeup is not necessarily gonna schedule
> RSN, there is very often a latency win to be had by scheduling to a
> nearby core, ie a thread pool worker doing a "sync" wakeup may very
> instantly find that it has more work to do.  If a fast/light wakee can
> slip into an idle crack and get to CPU instantly, it can generate more
> work a little bit sooner.

Yeah, in most places where a sync wakeup is used, the task is not going to
reschedule soon.
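
Coming back to point 1: below is a rough user-space model of the control flow
in the quoted hunks, only to make the two readings easier to compare. It is a
sketch, not kernel code. struct wake_inputs, select_post() and
idle_sibling_stub() are made-up names, the domain walk elided from the quote is
assumed to behave like the existing one, and the wakeup is assumed to carry
SD_BALANCE_WAKE.

```c
#include <stdbool.h>

/* Hypothetical inputs standing in for kernel state (not kernel names). */
struct wake_inputs {
	int cpu;           /* waking CPU, i.e. smp_processor_id() */
	int prev_cpu;      /* CPU the wakee last ran on */
	bool cpu_allowed;  /* cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) */
	bool wide;         /* result of wake_wide(p) */
	bool domain_found; /* assumed: the walk finds an SD_WAKE_AFFINE domain */
	bool affine_ok;    /* result of wake_affine(affine_sd, p, sync) */
};

/* Stub: the real select_idle_sibling() may pick an idle CPU near target. */
int idle_sibling_stub(int target)
{
	return target;
}

/*
 * Model of the '+' lines in the quoted hunks, for a SD_BALANCE_WAKE wakeup
 * (so want_affine is always 1 and is not tracked separately here).
 * *consulted reports whether wake_affine() would have been asked at all.
 */
int select_post(const struct wake_inputs *in, bool *consulted)
{
	int new_cpu = in->prev_cpu;
	bool affine_sd = false;

	*consulted = false;

	/* First hunk: both gotos jump over the domain walk. */
	if (in->cpu == in->prev_cpu || !in->cpu_allowed)
		goto want_affine;
	if (in->wide)
		goto want_affine;

	/* Elided in the quote; assumed to behave like the existing walk. */
	affine_sd = in->domain_found;

want_affine:
	/* Second hunk: wake_affine() is asked only when affine_sd is set. */
	if (affine_sd) {
		*consulted = true;
		if (in->affine_ok)
			new_cpu = in->cpu;
	}
	return idle_sibling_stub(new_cpu);
}
```

In this model, with wide set the goto leaves affine_sd unset, so wake_affine()
is not consulted and the result stays at prev_cpu. With wide clear and a domain
found, wake_affine() decides. Whether the elided part of the patch changes that
is not something the model can show.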

Thanks for the explanation, Mike!