linux-kernel - Re: [RFC] sched/fair: Use wake_q length as a hint for wake

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8953fc87-20ca-a0b3-e631-650114ec8b04@oracle.com>
Date:   Wed, 20 Sep 2017 16:17:31 -0500
From:   Atish Patra <atish.patra@...cle.com>
To:     Joel Fernandes <joelaf@...gle.com>,
        Brendan Jackman <brendan.jackman@....com>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Andres Oportus <andresoportus@...gle.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Josef Bacik <josef@...icpanda.com>,
        Mike Galbraith <efault@....de>,
        Matt Fleming <matt@...eblueprint.co.uk>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <morten.rasmussen@....com>
Subject: Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide

On 09/20/2017 03:23 PM, Joel Fernandes wrote:
> On Wed, Sep 20, 2017 at 2:33 AM, Brendan Jackman
> <brendan.jackman@....com> wrote:
>>
>> On Wed, Sep 20 2017 at 05:06, Joel Fernandes wrote:
>>>> On Tue, Sep 19, 2017 at 3:05 AM, Brendan Jackman
>>>> <brendan.jackman@....com> wrote:
>>>>> On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:
>>> [..]
>>>>>>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>>>>>>> necessary on SMP systems, so it might be best guarded by the presence
>>>>>>
>>>>>> Actually wake_affine doesn't check for balance if previous/next cpu
>>>>>> are within the same shared cache domain. The difference is some time
>>>>>> ago it would return true for shared cache but now it returns false as
>>>>>> of 4.14-rc1:
>>>>>> http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466
>>>>>>
>>>>>> Since it would return false in the above wake up cases for task 1 and
>>>>>> 2, it would then run select_idle_sibling on the previous CPU which is
>>>>>> also within the big cluster, so I don't think it will make a
>>>>>> difference in this case... Infact what it returns probably doesn't
>>>>>> matter.
>>>>>
>>>>> So my paragraph here was making a leap in reasoning, let me try to fill
>>>>> the gap: On SMP these tasks never need to move around. If by some chance
>>>>> they did get coscheduled, the first load balance would spread them out and
>>>>> then every time they wake up from then on, prev_cpu is the sensible
>>>>> choice. So it will look something like:
>>>>>
>>>>>             v CPU v           ->time->
>>>>>
>>>>>                             -------------
>>>>>          {  0  (SAME)       11111111111
>>>>>   cache  {                  -------------
>>>>>          {  1  (SAME)       222222222222|
>>>>>                             -------------
>>>>>          {  2  (SAME)       33333333333
>>>>>   cache  {                  -------------
>>>>>          {  3  (SAME)       44444444444
>>>>>                             -------------
>>>>>
>>>>> So here, task 2 wakes up the other guys and when it's doing tasks 3 and
>>>>> 4, prev_cpu and smp_processor_id() don't share a cache, so IIUC its'
>>>>> basically wake_affine's job to decide between prev_cpu and
>>>>> smp_processor_id(). So "if wake_affine is behaving correctly" the
>>>>> problem that this patch aims to solve (i.e. the fact that we overload
>>>>> the waker's LLC domain because of bias towards prev_cpu) does not arise
>>>>> on SMP.
>>>>>
>>>>
>>>> Yes SMP, but your patch is for solving a problem for non-SMP. So your
>>>> original statement about wake_affine solving any problem for SMP is
>>>> not relevant I feel :-P. I guess you can just kill this para from the
>>>> commit message to prevent confusion.
>>>
>>> Ok I take that back, you were talking about guarding this feature by
>>> the SD_ASYM_CPUCAPACITY flag.
>>>
>>> I don't think that protection would be helpful because you can have
>>> the same issue if the tasks do different amount of work on SMP. So in
>>> that case some threads might still complete before the others and you
>>> run into the same thing.
>>
>> Well assuming we're still talking about one task per CPU, if you have
>> tasks doing different amount of work there's still no reason to move the
>> longer-running threads around. The only reason that happens in my
>> example is because of the asym capacity.
>
> Yes but you can very well have RT pressure and things that temporarily
> change the capacity equality. Also this is a simple benchmark and for
> any reason you have more than 1 task running on those other CPUs and
> then the idle CPUs run some of the tasks and you run into a similar
> situation that might need your patch..
>
The patch would be helpful only if it doesn't cross NUMA boundary. right ?

If NUMA comes into picture, not sure searching across NUMA may hurt more 
than help, especially in this case.
> Also one more note, the SD_ASYM_CPUCAPACITY protection is still not
> needed because SD_BALANCE_WAKE isn't turned on for
> !SD_ASYM_CPUCAPACITY from what I learnt from discussions with Mike, I
> believe its this piece of code in sd_init that actually enables it:
>
>         if (sd->flags & SD_ASYM_CPUCAPACITY) {
>                 struct sched_domain *t = sd;
>
>                 for_each_lower_domain(t)
>                         t->flags |= SD_BALANCE_WAKE;
>         }
>
>
But it will always try to search within LLC group if want_affine is set 
irrespective of the SD_BALANCE_WAKE flag.

if (affine_sd) {
		sd = NULL; /* Prefer wake_affine over balance flags */
		if (cpu == prev_cpu)
			goto pick_cpu;

		if (wake_affine(affine_sd, p, prev_cpu, sync))
			new_cpu = cpu;
	}

	if (!sd) {
  pick_cpu:
		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

SD_ASYM_CPUCAPACITY protection may still be required if it is a 
non-issue for big SMP servers.

Regards,
Atish
> thanks,
>
> - Joel
>