Message-ID: <20220309094320.GL618915@linux.vnet.ibm.com>
Date: Wed, 9 Mar 2022 15:13:20 +0530
From: Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: peterz@...radead.org, aubrey.li@...ux.intel.com, efault@....de,
gautham.shenoy@....com, linux-kernel@...r.kernel.org,
mgorman@...hsingularity.net, mingo@...nel.org,
song.bao.hua@...ilicon.com, valentin.schneider@....com,
vincent.guittot@...aro.org
Subject: Re: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA
imbalance in find_idlest_group
* K Prateek Nayak <kprateek.nayak@....com> [2022-03-09 12:42:51]:
Hi Prateek,
> Hello Srikar,
>
> On 3/9/2022 10:55 AM, Srikar Dronamraju wrote:
> > * K Prateek Nayak <kprateek.nayak@....com> [2022-03-08 17:18:16]:
> > [..snip..]
> >> Yes. We've tested stream with 8 and 16 stream threads on a Zen3 system
> >> with 16 LLCs and in both cases, with unbound runs, we've seen each
> >> Stream thread get a separate LLC and we didn't observe any stacking.
> > If the problem is only happening in the pinned case, then it means that
> > in the unpinned case the load balancer is able to do the load balancing
> > correctly and quickly, but for some reason may not be able to do the same
> > in the pinned case. Without the patch, even in the unpinned case, the
> > initial CPU range spans more or less the same number of LLCs as in the
> > pinned case. However, it's able to spread better.
> The problem this patch is trying to solve is that of the initial placement
> of tasks when they are pinned to a subset of cpus in such a way that the
> number of allowed cpus in a NUMA domain is less than imb_numa_nr.
>
I completely understand your problem. The only missing piece is why this
initial placement is *not a problem for the unpinned case*. If we are able
to articulate how the current code works well for the unpinned case, I
would be fine.
> Consider the same example of Stream running 8 threads with pinning as follows
> on the dual socket Zen3 system (0-63,128-191 in one socket, 64-127,192-255 in
> another socket) with 8 LLCs per socket:
>
> numactl -C 0,16,32,48,64,80,96,112 ./stream8
>
> The number of cpus available in each socket for this task is 4. However,
> each socket consists of 8 LLCs. As far as the current scheduler is
> concerned, all the LLCs are available for the task to use, but in reality
> it can only use 4. Hence, all the stream threads are put on the same
> socket and they initially have to share LLCs and the same set of cpus.
> This is why we see stacking initially, and this is what we are addressing.
>
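Just so we are on the same page, my understanding of what the patch does is
roughly the below. This is a simplified sketch, not the verbatim diff, and
the helper/variable names are approximate:

	/*
	 * In find_idlest_group(), for the group_has_spare case on a
	 * NUMA domain: clamp the allowed imbalance by the number of
	 * cpus the task can actually run on in the local group.
	 */
	imb_numa_nr = sd->imb_numa_nr;
	if (p->nr_cpus_allowed != num_online_cpus()) {
		struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);

		/*
		 * For "numactl -C 0,16,32,48,64,80,96,112" on the Zen3
		 * system above, this intersection has weight 4 per
		 * socket, even though each socket has 8 LLCs.
		 */
		cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
		imb_numa_nr = min(cpumask_weight(cpus), sd->imb_numa_nr);
	}

	if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb_numa_nr))
		return NULL;

The effect is that tasks stop being packed on the local node once the
task's allowed cpus there are occupied, rather than once one task per LLC
has been placed.
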
> We've observed that the load balancer does kick in and is able to spread
> the tasks apart, but this takes a while and quite a few migrations before
> a stable state is reached.
>
> Following is the output of the sched_migrate_task tracepoint on 5.17-rc1
> tip sched/core for stream running 8 threads with the same pinning pattern
> (output slightly modified for readability; the {x}{y} annotations give the
> number of stream threads on each socket after the migration):
>
> 167.928338: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=32 dest_cpu=48 START - {8}{0}
> ... 9 migrations
> 168.595632: sched_migrate_task: comm=stream pid=5051 prio=120 orig_cpu=16 dest_cpu=64 * {7}{1}
> 168.595634: sched_migrate_task: comm=stream pid=5048 prio=120 orig_cpu=16 dest_cpu=64 * {6}{2}
> ... 3 migrations
> 168.625803: sched_migrate_task: comm=stream pid=5052 prio=120 orig_cpu=0 dest_cpu=64 * {5}{3}
> 168.626146: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=16 dest_cpu=0
> 168.650832: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=0 dest_cpu=48
> 168.651009: sched_migrate_task: comm=stream pid=5048 prio=120 orig_cpu=64 dest_cpu=32 * {6}{2}
> 168.677314: sched_migrate_task: comm=stream pid=5048 prio=120 orig_cpu=32 dest_cpu=80 * {5}{3}
> 168.677320: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=48 dest_cpu=64 * {4}{4}
> 168.735707: sched_migrate_task: comm=stream pid=5050 prio=120 orig_cpu=64 dest_cpu=96
> 168.775510: sched_migrate_task: comm=stream pid=5051 prio=120 orig_cpu=80 dest_cpu=0 * {5}{3}
> ... 39 migrations
> 170.232105: sched_migrate_task: comm=stream pid=5049 prio=120 orig_cpu=0 dest_cpu=64 END {4}{4}
>
> As we can see, 63 migrations are recorded during the runtime of the
> program, which could be avoided by the correct initial placement.
>
> As you highlight, there may be areas in the load balancer path that
> could also be optimized for such a case.
> > I believe the problem could be in can_migrate_task() checking for
> > !cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)
> >
> > i.e. dst_cpu is doing a load balance on behalf of the entire LLC; however,
> > it will only pull tasks that are allowed to run on it.
> Please correct me if I'm wrong, but don't we scan the entire sched group
> and check if there is a compatible cpu for pinned tasks in
> can_migrate_task() when !cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)?
>
> If a compatible cpu is found in the group, it is stored in the new_dst_cpu
> member of lb_env to be used later in the load-balancing path, if my
> understanding is correct.
Yes, this is the LBF_DST_PINNED logic. So I am wondering if that is kicking
in correctly, because this is the only difference I see between the pinned
and unpinned cases.
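
For completeness, the relevant path looks roughly like this (simplified
from kernel/sched/fair.c around the v5.17 timeframe; not a verbatim quote):

	/* In can_migrate_task(): */
	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
		int cpu;

		env->flags |= LBF_SOME_PINNED;

		/*
		 * Don't bother looking for an alternative dst_cpu when
		 * newly idle, if one was already picked in this
		 * iteration, or during an active balance.
		 */
		if (env->idle == CPU_NEWLY_IDLE ||
		    env->flags & (LBF_DST_PINNED | LBF_ACTIVE_LB))
			return 0;

		/*
		 * Remember a cpu in our sched_group that the task is
		 * allowed to run on, so load_balance() can retry the
		 * pull with dst_cpu set to it.
		 */
		for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
			if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
				env->flags |= LBF_DST_PINNED;
				env->new_dst_cpu = cpu;
				break;
			}
		}

		return 0;
	}

If nothing could be pulled and LBF_DST_PINNED is set, load_balance()
retries with env.dst_cpu = env.new_dst_cpu, so that is where I would look
first for the pinned case.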
> --
> Thanks and Regards,
> Prateek
--
Thanks and Regards
Srikar Dronamraju