Message-ID: <bb443650-868a-49b7-e41e-c2a788781df5@amd.com>
Date: Wed, 9 Mar 2022 12:42:51 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Cc: peterz@...radead.org, aubrey.li@...ux.intel.com, efault@....de,
gautham.shenoy@....com, linux-kernel@...r.kernel.org,
mgorman@...hsingularity.net, mingo@...nel.org,
song.bao.hua@...ilicon.com, valentin.schneider@....com,
vincent.guittot@...aro.org
Subject: Re: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA
imbalance in find_idlest_group
Hello Srikar,
On 3/9/2022 10:55 AM, Srikar Dronamraju wrote:
> * K Prateek Nayak <kprateek.nayak@....com> [2022-03-08 17:18:16]:
> [..snip..]
>> Yes. We've tested stream with 8 and 16 stream threads on a Zen3 system
>> with 16 LLCs and in both cases, with unbound runs, we've seen each
>> Stream thread get a separate LLC and we didn't observe any stacking.
> If the problem is only happening with the pinned case, then it means that in
> the unpinned case, the load balancer is able to do the load balancing
> correctly and quickly but for some reason may not be able to do the same in
> the pinned case. Without the patch, even in the unpinned case, the initial CPU
> range is more or less the same number of LLCs as the pinned. However it's able
> to spread better.
The problem this patch is trying to solve is that of the initial placement
of tasks when they are pinned to a subset of CPUs in such a way that the
number of allowed CPUs in a NUMA domain is less than imb_numa_nr.
Consider the same example of Stream running 8 threads with pinning as follows
on the dual socket Zen3 system (0-63,128-191 in one socket, 64-127,192-255 in
another socket) with 8 LLCs per socket:
numactl -C 0,16,32,48,64,80,96,112 ./stream8
The number of CPUs available in each socket for this task is 4. However, each
socket consists of 8 LLCs. As far as the current scheduler is concerned, all
the LLCs are available for the task to use, but in reality it can only use 4.
Hence, all the stream threads are initially placed on the same socket and have
to share LLCs and the same set of CPUs. This is why we see stacking initially,
and this is what we are addressing.
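To make this concrete, below is a rough sketch (illustrative only, not the
actual patch; allowed_imb_numa_nr() is a made-up helper name) of the kind of
clamping being discussed: count the CPUs the task is actually allowed to use
within the local group instead of assuming every LLC in the socket is usable.

	/*
	 * Hypothetical helper: clamp the sched domain's imb_numa_nr by the
	 * number of CPUs in the group's span that the task's affinity
	 * actually permits.
	 */
	static unsigned int allowed_imb_numa_nr(struct task_struct *p,
						struct sched_group *local,
						unsigned int imb_numa_nr)
	{
		unsigned int allowed = 0;
		int cpu;

		/* Count CPUs of the local group that p is allowed to run on */
		for_each_cpu_and(cpu, sched_group_span(local), p->cpus_ptr)
			allowed++;

		/*
		 * With "numactl -C 0,16,32,48,64,80,96,112" on the example
		 * system this is 4 per socket, even though the socket has
		 * 8 LLCs, so the NUMA-imbalance threshold must not assume
		 * all 8 LLCs are usable.
		 */
		return min(allowed, imb_numa_nr);
	}

With a clamp of this sort in find_idlest_group(), the initial placement would
spread the 8 stream threads across both sockets instead of stacking them on
one.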
We've observed that the load balancer does kick in and is able to spread the
tasks apart, but this takes a while and quite a few migrations before a stable
state is reached.
Following is the output of the tracepoint sched_migrate_task on 5.17-rc1
tip sched/core for stream running 8 threads with the same pinning pattern:
(Output has been slightly modified for readability)
167.928338: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=32 dest_cpu=48 START - {8}{0}
... 9 migrations
168.595632: sched_migrate_task: comm=stream pid=5051 prio=120 orig_cpu=16 dest_cpu=64 * {7}{1}
168.595634: sched_migrate_task: comm=stream pid=5048 prio=120 orig_cpu=16 dest_cpu=64 * {6}{2}
... 3 migrations
168.625803: sched_migrate_task: comm=stream pid=5052 prio=120 orig_cpu=0 dest_cpu=64 * {5}{3}
168.626146: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=16 dest_cpu=0
168.650832: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=0 dest_cpu=48
168.651009: sched_migrate_task: comm=stream pid=5048 prio=120 orig_cpu=64 dest_cpu=32 * {6}{2}
168.677314: sched_migrate_task: comm=stream pid=5048 prio=120 orig_cpu=32 dest_cpu=80 * {5}{3}
168.677320: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=48 dest_cpu=64 * {4}{4}
168.735707: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=64 dest_cpu=96
168.775510: sched_migrate_task: comm=stream pid=5051 prio=120 orig_cpu=80 dest_cpu=0 * {5}{3}
... 39 migrations
170.232105: sched_migrate_task: comm=stream pid=5049 prio=120 orig_cpu=0 dest_cpu=64 END {4}{4}
As we can see, 63 migrations are recorded during the runtime of the
program, all of which could be avoided by correct initial placement.
As you highlight, there may be areas in the load balancer path that
can also be optimized for such a case.
> I believe the problem could be in can_migrate_task() checking for
> !cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)
>
> i.e dst_cpu is doing a load balance on behalf of the entire LLC, however it
> only will pull tasks that can be pulled into it.
Please correct me if I'm wrong, but don't we scan the entire sched group
and check whether there is a compatible CPU for pinned tasks in
can_migrate_task() when !cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)?
If a compatible CPU is found in the group, it is stored in the new_dst_cpu
member of lb_env to be used later in the load balancing path, if my
understanding is correct.
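For reference, the path I'm referring to looks roughly like this (paraphrased
and simplified from can_migrate_task() in kernel/sched/fair.c around v5.17,
not an exact quote):

	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
		int cpu;

		env->flags |= LBF_SOME_PINNED;

		/*
		 * dst_cpu itself cannot run the task, but another CPU in
		 * the destination group might. Skip the search when newly
		 * idle or when a candidate has already been found.
		 */
		if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
			return 0;

		/* Remember one allowed CPU in the group as the new dst_cpu */
		for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
			if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
				env->flags |= LBF_DST_PINNED;
				env->new_dst_cpu = cpu;
				break;
			}
		}

		return 0;
	}

So the group-wide scan does happen; the recorded new_dst_cpu is then used to
retry the pull with a different destination CPU later in the load-balancing
path.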
--
Thanks and Regards,
Prateek