[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <560992AC.7020909@odin.com>
Date: Mon, 28 Sep 2015 22:19:08 +0300
From: Kirill Tkhai <ktkhai@...n.com>
To: Mike Galbraith <umgwanakikbuti@...il.com>
CC: <linux-kernel@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH] sched/fair: Skip wake_affine() for core siblings
On 28.09.2015 21:22, Mike Galbraith wrote:
> On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:
>
>> Mike, one more moment. wake_wide() and current logic confuses me a bit.
>> It makes us to decide if we want affine wakeup or not, but select_idle_sibling()
>> if a function is not for choosing this_cpu's llc domain only. We use it
>> for searching in prev_cpu llc domain too, and it seems we are not interested
>> in current flips in this case.
>
> We're always interested in "flips", as the point is to try to identify
> N:M load components, and when they may overload a socket. The hope is
> to get it more right than wrong, as making the tracking really accurate
> is too expensive for the fast path.
>
>> Imagine a situation, when we share a mutex
>> with a task on another NUMA node. When the task is realising the mutex
>> it is waking us, but we definitelly won't use affine logic in this case.
>
> Why not? A wakeup is a wakeup is a wakeup, they all do the same thing.
> If wake_wide() doesn't NAK an affine wakeup, we ask wake_affine() for
> its opinion, then look for an idle CPU near the waker's CPU if it says
> OK, or near wakee's previous CPU if it says go away.
But NUMA sd does not have SD_WAKE_AFFINE flag, so this case a new cpu won't
be choosen from previous node. There will be choosen the highest domain
of smp_processor_id(), which has SD_BALANCE_WAKE flag, and the cpu will
be choosen from the idlest group/cpu. And we don't have a deal with old
cache at all. This looks like a completely wrong behaviour...
>> We wake the wakee anywhere and loose hot cache.
>
> Yeah, sometimes we'll make tasks drag their data to them when we could
> have dragged the task to the data in the name of trying to crank up CPU
> utilization. At some point, _somebody_ has to drag their data across
> interconnect, but we really don't know if/when the data transport cost
> will pay off in better utilization.
>
> -Mike
>
> (I'll take a peek at below when damn futexes get done kicking my a$$)
This case, you'll be able to analyze new results below :)
ORIGIN 1 2 3 4 avg
C=1 8607,960451 8344,16111 8381,197569 7991,84102 8331,2900375
C=2 15397,152685 15438,913761 15771,512182 15648,368819 15563,98686175
C=4 30987,860844 31144,127431 31104,153461 30874,292825 31027,60864025
C=8 62447,47612 62179,788923 61534,482204 62787,894021 62237,410317
PATCHED 1 2 3 4 avg
C=1 8782,439938 8675,891877 8609,209537 8735,120895 8700,66556175
C=2 16526,31409 16491,650678 16149,594736 16365,630084 16383,297397
C=4 32286,341708 32313,536565 32538,285157 32299,427398 32359,397707
C=8 63860,310019 63187,152569 63400,930755 63210,460753 63414,713524
This is:
# pgbench -j 1 -S -c X -T 200 test
The test machine has no NUMA, and I suppose, NUMA machine will show much better results.
I'll test it tomorrow.
>> I changed the logic, and
>> tried pgbench 1:8. The results (I threw away 3 first iterations, because
>> they much differ with iter >= 4. Looks like, the reason is in uncached disk IO).
>>
>>
>> Before:
>>
>> trans. | tps (i) | tps (e)
>> --------------------------------------
>> 12098226 | 60491.067392 | 60500.886373
>> 12030184 | 60150.874285 | 60160.654295
>> 11882977 | 59414.829150 | 59424.830637
>> 12020125 | 60100.579023 | 60111.600176
>> 12161917 | 60809.547906 | 60827.321639
>> 12154660 | 60773.249254 | 60783.085165
>>
>> After:
>>
>> trans. | tps (i) | tps (e)
>> --------------------------------------
>> 12770407 | 63849.883578 | 63860.310019
>> 12635366 | 63176.399769 | 63187.152569
>> 12676890 | 63384.396440 | 63400.930755
>> 12639949 | 63199.526330 | 63210.460753
>> 12670626 | 63353.079951 | 63363.274143
>> 12647001 | 63209.613698 | 63219.812331
>>
>> I'm going to test other cases, but could you tell me (if you remember) are there reasons
>> we skip prev_cpu, like I described above? Some types of workloads etc.
>>
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4df37a4..dfbe06b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> int want_affine = 0;
>> int sync = wake_flags & WF_SYNC;
>>
>> - if (sd_flag & SD_BALANCE_WAKE)
>> - want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>> + if (sd_flag & SD_BALANCE_WAKE) {
>> + want_affine = 1;
>> + if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>> + goto want_affine;
>> + if (wake_wide(p))
>> + goto want_affine;
>> + }
>>
>> rcu_read_lock();
>> for_each_domain(cpu, tmp) {
>> @@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> break;
>> }
>>
>> - if (affine_sd) {
>> +want_affine:
>> + if (want_affine) {
>> sd = NULL; /* Prefer wake_affine over balance flags */
>> - if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>> + if (affine_sd && wake_affine(affine_sd, p, sync))
>> new_cpu = cpu;
>> - }
>> -
>> - if (!sd) {
>> - if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
>> - new_cpu = select_idle_sibling(p, new_cpu);
>> -
>> + new_cpu = select_idle_sibling(p, new_cpu);
>> } else while (sd) {
>> struct sched_group *group;
>> int weight;
>
>
Regards,
Kirill
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists