Message-ID: <7e5bdc95-a30a-58bc-fc67-98b03fe1fa22@amd.com>
Date:   Tue, 8 Mar 2022 17:18:16 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Cc:     peterz@...radead.org, aubrey.li@...ux.intel.com, efault@....de,
        gautham.shenoy@....com, linux-kernel@...r.kernel.org,
        mgorman@...hsingularity.net, mingo@...nel.org,
        song.bao.hua@...ilicon.com, valentin.schneider@....com,
        vincent.guittot@...aro.org
Subject: Re: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA
 imbalance in find_idlest_group

Hello Srikar,

On 3/8/2022 2:59 PM, Srikar Dronamraju wrote:
> [..snip..]
> If I recollect correctly, each stream thread has its own independent data set.
> However, if the threads were all to contend for the same resource (memory) or
> have a waker/wakee relationship, would we not end up spreading the waker/wakee
> apart?

We only spread tasks based on the number of allowed cpus in the domain.
For Stream with 8 threads, where it is beneficial for the threads to
spread across LLCs, the user might pin the tasks as follows:

	numactl -C 0,16,32,48,64,80,96,112 ./stream8

For communicating processes, binding as shown above would
indeed be bad for performance. Instead, the user may prefer a
different subset of cpus to pin the process to, such as:

	numactl -C 0-7,128-135 ./communicating     # Bind to single LLC

This will ensure the communicating threads are on the same LLC.
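
For illustration, the same single-LLC binding can also be done from C
with sched_setaffinity(2). Below is a minimal sketch (not part of the
patch); the CPU numbers assume, as the numactl line above implies, that
CPUs 0-7 and 128-135 share an LLC on this machine:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	int main(void)
	{
		cpu_set_t set;
		int cpu;

		CPU_ZERO(&set);
		for (cpu = 0; cpu <= 7; cpu++)
			CPU_SET(cpu, &set);
		for (cpu = 128; cpu <= 135; cpu++)
			CPU_SET(cpu, &set);

		/* pid 0 == calling process; threads created later inherit the mask */
		if (sched_setaffinity(0, sizeof(set), &set)) {
			perror("sched_setaffinity");
			return 1;
		}

		/* spawn the communicating threads / exec the workload here */
		return 0;
	}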

>> [..snip..]
>>
>> Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>>
>> 	           5.17.0-rc1               5.17.0-rc1                5.17.0-rc1
>>                tip sched/core           tip sched/core            tip sched/core
>>                  (no pinning)                 +pinning              + this-patch
>> 								       + pinning
>>
>>  Copy:    97699.28 (0.00 pct)     95933.60  (-1.80 pct)    156578.91 (60.26 pct)
>> Scale:   107754.15 (0.00 pct)     91869.88 (-14.74 pct)    149783.25 (39.00 pct)
>>   Add:   126383.29 (0.00 pct)    105730.86 (-16.34 pct)    186493.09 (47.56 pct)
>> Triad:   124896.78 (0.00 pct)    106394.38 (-14.81 pct)    184733.48 (47.90 pct)
>>
> Do we have numbers for the with-your-patch, non-pinned case?
Yes, we do.

	           5.17.0-rc1               5.17.0-rc1
               tip sched/core           tip sched/core
                 (no pinning)             + this-patch
                                          (no pinning)

 Copy:   97699.28  (0.00 pct)    106114.59  (8.61 pct)
Scale:   107754.15 (0.00 pct)    106814.48 (-0.87 pct)
  Add:   126383.29 (0.00 pct)    125323.78 (-0.83 pct)
Triad:   124896.78 (0.00 pct)    123622.62 (-1.02 pct)

These are the results on the same dual-socket 2 x 64C/128T machine.

Following is the output of the sched_wakeup_new tracepoint for Stream with
8 threads without any pinning:

- 5.17-rc1 tip/sched/core (no pinning)

stream-4570    [025] d..2.   115.786096: sched_wakeup_new: comm=stream pid=4572 prio=120 target_cpu=048  (LLC: 6)
stream-4570    [025] d..2.   115.786160: sched_wakeup_new: comm=stream pid=4573 prio=120 target_cpu=175  (LLC: 5)
stream-4570    [025] d..2.   115.786221: sched_wakeup_new: comm=stream pid=4574 prio=120 target_cpu=000  (LLC: 0)
stream-4570    [025] d..2.   115.786271: sched_wakeup_new: comm=stream pid=4575 prio=120 target_cpu=145  (LLC: 2)
stream-4570    [025] d..2.   115.786329: sched_wakeup_new: comm=stream pid=4576 prio=120 target_cpu=138  (LLC: 1)
stream-4570    [025] d..2.   115.786375: sched_wakeup_new: comm=stream pid=4577 prio=120 target_cpu=059  (LLC: 7)
stream-4570    [025] d..2.   115.786415: sched_wakeup_new: comm=stream pid=4578 prio=120 target_cpu=036  (LLC: 4)

- 5.17-rc1 tip/sched/core + this-patch (no pinning)

stream-4537    [191] d..2.   115.926113: sched_wakeup_new: comm=stream pid=4539 prio=120 target_cpu=162  (LLC: 4)
stream-4537    [191] d..2.   115.926181: sched_wakeup_new: comm=stream pid=4540 prio=120 target_cpu=000  (LLC: 0)
stream-4537    [191] d..2.   115.926235: sched_wakeup_new: comm=stream pid=4541 prio=120 target_cpu=144  (LLC: 2)
stream-4537    [191] d..2.   115.926284: sched_wakeup_new: comm=stream pid=4542 prio=120 target_cpu=025  (LLC: 3)
stream-4537    [191] d..2.   115.926332: sched_wakeup_new: comm=stream pid=4543 prio=120 target_cpu=048  (LLC: 6)
stream-4537    [191] d..2.   115.926376: sched_wakeup_new: comm=stream pid=4544 prio=120 target_cpu=137  (LLC: 1)
stream-4537    [191] d..2.   115.926436: sched_wakeup_new: comm=stream pid=4545 prio=120 target_cpu=041  (LLC: 5)

All the threads remain on the same socket, as imb_numa_nr in NPS1 mode
is 8, but each thread gets a separate LLC.
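
For context, the imb_numa_nr check being changed amounts to roughly the
following sketch (an illustrative paraphrase, since the hunk is truncated
in the quote further below; numa_imb_limit() is a made-up helper name
here, and a real kernel change would use a per-cpu scratch mask rather
than an on-stack cpumask):

	static inline int numa_imb_limit(struct task_struct *p,
					 struct sched_group *local,
					 struct sched_domain *sd)
	{
		struct cpumask allowed;
		int imb = sd->imb_numa_nr;

		/* Unpinned tasks keep the full per-domain imbalance threshold. */
		if (p->nr_cpus_allowed == num_online_cpus())
			return imb;

		/*
		 * Pinned tasks: never allow more imbalance than the number
		 * of allowed CPUs inside the local group, so a task pinned
		 * to one CPU per LLC is not piled onto the local group.
		 */
		cpumask_and(&allowed, sched_group_span(local), p->cpus_ptr);
		return min_t(int, imb, cpumask_weight(&allowed));
	}
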
> I would assume they would be the same as the 1st column since your change
> affects only the pinned case. But I am wondering if this problem happens in
> the unpinned case or not?

We see nearly identical results in the unpinned case, with and
without this patch.

> Also, Stream on powerpc seems to have some variation in results; did we take
> a mean of runs, or are these the results of just one run?
The results posted are the mean of 10 runs.
>> Pinning currently hurts the performance compared to the unbound case on
>> tip/sched/core. With the addition of this patch, we are able to
>> outperform tip/sched/core by a good margin with pinning.
>>
>> Following are the results from running 16 Stream threads with and
>> without pinning on a dual socket Skylake Machine (2 x 24C/48T):
>>
>> Pinning is done using: numactl -C 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 ./stream16
>>
>> 	           5.17.0-rc1               5.17.0-rc1                5.17.0-rc1
>>                tip sched/core           tip sched/core            tip sched/core
>>                  (no pinning)                 +pinning              + this-patch
>> 								       + pinning
>>
>>  Copy:   126620.67 (0.00 pct)     141062.10 (11.40 pct)    147615.44 (16.58 pct)
>> Scale:   91313.51 (0.00 pct)      112879.61 (23.61 pct)    122591.28 (34.25 pct)
>>   Add:   102035.43 (0.00 pct)     125889.98 (23.37 pct)    138179.01 (35.42 pct)
>> Triad:   102281.91 (0.00 pct)     123743.48 (20.98 pct)    138940.41 (35.84 pct)
>>
>> In case of the Skylake machine, with a single LLC per socket, we see a
>> good improvement brought about by pinning, which is further amplified by
>> this patch.
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
>> Acked-by: Mel Gorman <mgorman@...hsingularity.net>
>> ---
>> Changelog v5-->v6:
>>  -  Move the cpumask variable declaration to the block it is
>>     used in.
>>  -  Collect tags from v5.
>> Changelog v4-->v5:
>>  -  Only perform cpumask operations if nr_cpus_allowed is not
>>     equal to num_online_cpus based on Mel's suggestion.
>> ---
>>  kernel/sched/fair.c | 16 +++++++++++++---
>>  1 file changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 16874e112fe6..6cc90d76250f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9183,6 +9183,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>
>>  	case group_has_spare:
>>  		if (sd->flags & SD_NUMA) {
>> +			int imb;
>>  #ifdef CONFIG_NUMA_BALANCING
>>  			int idlest_cpu;
>>  			/*
>> @@ -9200,10 +9201,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>  			 * Otherwise, keep the task close to the wakeup source
>>  			 * and improve locality if the number of running tasks
>>  			 * would remain below threshold where an imbalance is
>> -			 * allowed. If there is a real need of migration,
>> -			 * periodic load balance will take care of it.
>> +			 * allowed while accounting for the possibility the
>> +			 * task is pinned to a subset of CPUs. If there is a
>> +			 * real need of migration, periodic load balance will
>> +			 * take care of it.
>>  			 */
>> -			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
>> +			imb = sd->imb_numa_nr;
>> +			if (p->nr_cpus_allowed != num_online_cpus()) {
> Again, repeating, is the problem only happening in the pinned case?
Yes. We've tested Stream with 8 and 16 threads on a Zen3 system
with 16 LLCs, and in both cases, with unbound runs, we've seen each
Stream thread get a separate LLC; we didn't observe any stacking.
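
As an aside, the "(LLC: n)" annotations in the traces above can be
reproduced by mapping each target_cpu to the CPUs that share its L3.
A minimal userspace sketch of that mapping, assuming sysfs cache index3
is the L3/LLC (true on the machines discussed here):

	#include <stdio.h>

	int main(int argc, char **argv)
	{
		char path[128], line[256];
		FILE *f;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <cpu>\n", argv[0]);
			return 1;
		}

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%s/cache/index3/shared_cpu_list",
			 argv[1]);

		f = fopen(path, "r");
		if (!f) {
			perror(path);
			return 1;
		}
		if (!fgets(line, sizeof(line), f)) {
			fclose(f);
			fprintf(stderr, "failed to read %s\n", path);
			return 1;
		}

		/* shared_cpu_list already ends in a newline */
		printf("cpu%s shares its LLC with CPUs %s", argv[1], line);
		fclose(f);
		return 0;
	}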
--
Thanks and Regards,
Prateek
