[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <24e216f8-fdf0-d1b1-9318-8542ce6f4312@oracle.com>
Date: Thu, 21 Sep 2017 11:51:48 -0500
From: Atish Patra <atish.patra@...cle.com>
To: Joel Fernandes <joelaf@...gle.com>
Cc: Brendan Jackman <brendan.jackman@....com>,
LKML <linux-kernel@...r.kernel.org>,
Andres Oportus <andresoportus@...gle.com>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Josef Bacik <josef@...icpanda.com>,
Mike Galbraith <efault@....de>,
Matt Fleming <matt@...eblueprint.co.uk>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Morten Rasmussen <morten.rasmussen@....com>
Subject: Re: [RFC] sched/fair: Use wake_q length as a hint for wake_wide
On 09/21/2017 12:50 AM, Joel Fernandes wrote:
> On Wed, Sep 20, 2017 at 2:17 PM, Atish Patra <atish.patra@...cle.com> wrote:
>> On 09/20/2017 03:23 PM, Joel Fernandes wrote:
>>>
>>> On Wed, Sep 20, 2017 at 2:33 AM, Brendan Jackman
>>> <brendan.jackman@....com> wrote:
>>>>
>>>>
>>>> On Wed, Sep 20 2017 at 05:06, Joel Fernandes wrote:
>>>>>>
>>>>>> On Tue, Sep 19, 2017 at 3:05 AM, Brendan Jackman
>>>>>> <brendan.jackman@....com> wrote:
>>>>>>>
>>>>>>> On Mon, Sep 18 2017 at 22:15, Joel Fernandes wrote:
>>>>>
>>>>> [..]
>>>>>>>>>
>>>>>>>>> IIUC, if wake_affine() behaves correctly this trick wouldn't be
>>>>>>>>> necessary on SMP systems, so it might be best guarded by the
>>>>>>>>> presence
>>>>>>>>
>>>>>>>>
>>>>>>>> Actually wake_affine doesn't check for balance if previous/next cpu
>>>>>>>> are within the same shared cache domain. The difference is some time
>>>>>>>> ago it would return true for shared cache but now it returns false
> as
>>>>>>>> of 4.14-rc1:
>>>>>>>>
>>>>>>>>
> http://elixir.free-electrons.com/linux/v4.14-rc1/source/kernel/sched/fair.c#L5466
>>>>>>>>
>>>>>>>> Since it would return false in the above wake up cases for task 1
> and
>>>>>>>> 2, it would then run select_idle_sibling on the previous CPU which
> is
>>>>>>>> also within the big cluster, so I don't think it will make a
>>>>>>>> difference in this case... Infact what it returns probably doesn't
>>>>>>>> matter.
>>>>>>>
>>>>>>>
>>>>>>> So my paragraph here was making a leap in reasoning, let me try to
>>>>>>> fill
>>>>>>> the gap: On SMP these tasks never need to move around. If by some
>>>>>>> chance
>>>>>>> they did get coscheduled, the first load balance would spread them
> out
>>>>>>> and
>>>>>>> then every time they wake up from then on, prev_cpu is the sensible
>>>>>>> choice. So it will look something like:
>>>>>>>
>>>>>>> v CPU v ->time->
>>>>>>>
>>>>>>> -------------
>>>>>>> { 0 (SAME) 11111111111
>>>>>>> cache { -------------
>>>>>>> { 1 (SAME) 222222222222|
>>>>>>> -------------
>>>>>>> { 2 (SAME) 33333333333
>>>>>>> cache { -------------
>>>>>>> { 3 (SAME) 44444444444
>>>>>>> -------------
>>>>>>>
>>>>>>> So here, task 2 wakes up the other guys and when it's doing tasks 3
>>>>>>> and
>>>>>>> 4, prev_cpu and smp_processor_id() don't share a cache, so IIUC its'
>>>>>>> basically wake_affine's job to decide between prev_cpu and
>>>>>>> smp_processor_id(). So "if wake_affine is behaving correctly" the
>>>>>>> problem that this patch aims to solve (i.e. the fact that we overload
>>>>>>> the waker's LLC domain because of bias towards prev_cpu) does not
>>>>>>> arise
>>>>>>> on SMP.
>>>>>>>
>>>>>>
>>>>>> Yes SMP, but your patch is for solving a problem for non-SMP. So your
>>>>>> original statement about wake_affine solving any problem for SMP is
>>>>>> not relevant I feel :-P. I guess you can just kill this para from the
>>>>>> commit message to prevent confusion.
>>>>>
>>>>>
>>>>> Ok I take that back, you were talking about guarding this feature by
>>>>> the SD_ASYM_CPUCAPACITY flag.
>>>>>
>>>>> I don't think that protection would be helpful because you can have
>>>>> the same issue if the tasks do different amount of work on SMP. So in
>>>>> that case some threads might still complete before the others and you
>>>>> run into the same thing.
>>>>
>>>>
>>>> Well assuming we're still talking about one task per CPU, if you have
>>>> tasks doing different amount of work there's still no reason to move the
>>>> longer-running threads around. The only reason that happens in my
>>>> example is because of the asym capacity.
>>>
>>>
>>> Yes but you can very well have RT pressure and things that temporarily
>>> change the capacity equality. Also this is a simple benchmark and for
>>> any reason you have more than 1 task running on those other CPUs and
>>> then the idle CPUs run some of the tasks and you run into a similar
>>> situation that might need your patch..
>>>
>> The patch would be helpful only if it doesn't cross NUMA boundary. right ?
>>
>> If NUMA comes into picture, not sure searching across NUMA may hurt more
>> than help, especially in this case.
>
> I don't understand what you mean by "searching across NUMA", :-(, do you
> mean the slow path?
>
> As I said, if the SD_BALANCE_WAKE flag for the sched domain flag is not
> set, then a full wide search isn't done anyway. You have this code that
> sets the sd variable:
>
> if (tmp->flags & sd_flag)
> sd = tmp;
>
> Since sd = NULL if the sched domain (tmp->flags) isn't set, you will
> always have select_idle_sibling running and not doing the full search
> if I understand correctly.
>
Correct. I was under the impression that the above logic is going
to changed to fix the possible issue raised by Brendan for big SMP
server as well. I guess that was not the case.
> Further adding the ASYM protection isn't sensible if capacities are
> affected by RT and IRQ time etc anyway. Does that make sense?
>
Yup. We won't need ASYM protection.
Regards,
Atish
> I am glad I understand the code a bit better now after staring at it
> for quite some time but I think some more staring is needed.
>
> - Joel
>
Powered by blists - more mailing lists