Date:   Wed, 9 Mar 2022 17:00:33 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Cc:     peterz@...radead.org, aubrey.li@...ux.intel.com, efault@....de,
        gautham.shenoy@....com, linux-kernel@...r.kernel.org,
        mgorman@...hsingularity.net, mingo@...nel.org,
        song.bao.hua@...ilicon.com, valentin.schneider@....com,
        vincent.guittot@...aro.org
Subject: Re: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA
 imbalance in find_idlest_group

Hello Srikar,

On 3/9/2022 3:13 PM, Srikar Dronamraju wrote:
> [..snip..]
> I completely understand your problem. The only missing piece is why is this
> initial placement *not a problem for the unpinned case*. If we are able to
> articulate how the current code works well for the unpinned case, I would
> be fine.

From what I understand, the initial task placement happens as follows:

When a new task is created via fork or exec, for the initial wakeup
it takes the slow path in select_task_rq_fair() and goes to
find_idlest_cpu(). find_idlest_cpu() will explore the sched domain
hierarchy in a top-down fashion to find the idlest cpu for the task
to run on.
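
Here is a rough user-space sketch of that top-down walk (purely
illustrative: the structs, names and idle-cpu counts below are made
up, and this is not the kernel code):

#include <stdio.h>

/* One candidate group per entry; the walk picks the "idlest" one per level */
struct toy_group {
        const char *name;
        int idle_cpus;
};

static const struct toy_group *pick_idlest(const struct toy_group *g, int n)
{
        const struct toy_group *best = &g[0];

        for (int i = 1; i < n; i++)
                if (g[i].idle_cpus > best->idle_cpus)
                        best = &g[i];
        return best;
}

int main(void)
{
        /*
         * Socket -> Die (LLC) -> MC -> SMT, two candidate groups per level.
         * (At the NUMA level the kernel additionally applies the imbalance
         * check discussed further below.)
         */
        const struct toy_group socket[] = { { "socket0", 120 }, { "socket1", 128 } };
        const struct toy_group llc[]    = { { "llc0", 14 },     { "llc1", 16 } };
        const struct toy_group core[]   = { { "core0", 1 },     { "core1", 2 } };
        const struct toy_group smt[]    = { { "smt0", 0 },      { "smt1", 1 } };
        const struct toy_group *levels[] = { socket, llc, core, smt };

        for (int lvl = 0; lvl < 4; lvl++)
                printf("descend into %s\n", pick_idlest(levels[lvl], 2)->name);
        return 0;
}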

During this, it'll call find_idlest_group() to get the idlest group
within a particular domain to search for the target cpu. In our case,
the local group will have spare capacity to accommodate tasks.
We only do a cpumask_test_cpu(this_cpu, sched_group_span(group))
to check if the task can run on a particular group.
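
A tiny user-space illustration of that kind of yes/no check
(cpu_set_t standing in for the kernel's cpumask; the function name
is mine, not the kernel's):

#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * "Can the task run anywhere in this group?" -- a yes/no test that does
 * not account for *how many* of the group's cpus the task is allowed on.
 */
static bool task_allowed_in_group(cpu_set_t *task_mask, cpu_set_t *group_span)
{
        cpu_set_t tmp;

        CPU_AND(&tmp, task_mask, group_span);
        return CPU_COUNT(&tmp) > 0;
}

int main(void)
{
        cpu_set_t task_mask, group_span;
        int cpu;

        CPU_ZERO(&task_mask);
        CPU_ZERO(&group_span);

        for (cpu = 0; cpu < 4; cpu++)    /* task pinned to cpus 0-3 */
                CPU_SET(cpu, &task_mask);
        for (cpu = 0; cpu < 128; cpu++)  /* group spans a whole socket */
                CPU_SET(cpu, &group_span);

        printf("allowed in group: %d\n",
               task_allowed_in_group(&task_mask, &group_span));
        return 0;
}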

Hence we land on the case "group_has_spare". Here, if the sched
domain we are looking at has the SD_NUMA flag set and if
(local_sgs.sum_nr_running + 1) <= sd->imb_numa_nr
we decide to explore the local group itself in search of the target
cpu to place the task on.
If the number of tasks running in the local group is more than
sd->imb_numa_nr, or if the domain is not a NUMA domain, then the
group with the most idle cpus is chosen.
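
To make that condition concrete, here is a small stand-alone model of
just this decision (the struct layout and numbers are made up; only
the "(running + 1) <= imb_numa_nr" comparison mirrors the condition
above):

#include <stdbool.h>
#include <stdio.h>

struct toy_stats {
        int sum_nr_running;     /* tasks currently running in the group */
        int idle_cpus;          /* idle cpus in the group */
};

/* Mirrors "(local_sgs.sum_nr_running + 1) <= sd->imb_numa_nr" */
static bool allow_numa_imbalance(int running, int imb_numa_nr)
{
        return running <= imb_numa_nr;
}

/* true: keep the local group; false: take the group with more idle cpus */
static bool keep_local(bool numa_domain, const struct toy_stats *local,
                       const struct toy_stats *idlest, int imb_numa_nr)
{
        if (numa_domain &&
            allow_numa_imbalance(local->sum_nr_running + 1, imb_numa_nr))
                return true;

        return local->idle_cpus >= idlest->idle_cpus;
}

int main(void)
{
        const int imb_numa_nr = 8;
        const struct toy_stats remote = { .sum_nr_running = 0, .idle_cpus = 128 };

        /* unbound: the 8th stream thread (7 already local) stays local ... */
        const struct toy_stats local7 = { .sum_nr_running = 7, .idle_cpus = 121 };
        printf("8th thread keeps local group: %d\n",
               keep_local(true, &local7, &remote, imb_numa_nr));

        /* ... while the 9th (8 already local) spills to the other socket */
        const struct toy_stats local8 = { .sum_nr_running = 8, .idle_cpus = 120 };
        printf("9th thread keeps local group: %d\n",
               keep_local(true, &local8, &remote, imb_numa_nr));
        return 0;
}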

We traverse down the hierarchy, finding the idlest group of each
domain based on the number of idle cpus in the group, to find the
target cpu to place the task on.

This is just a basic picture I've painted with the details necessary
to explain correct placement in the unbound case.

Consider the unbound case:

- With 8 stream threads
The sched domain hierarchy will be traversed as follows:

Socket Domain ---------------> Die Domain ---------------> MC Domain -> SMT Domain -> Idle CPU
[NUMA](2 sockets)              (8 LLCs)                    (16 CPUs)    (2 CPUs)      (Target)

  ^                             ^
  |                             |

sd->imb_numa_nr = 8            Each stream thread gets
All 8 stream threads           its own LLC, as the
are placed on the same         choice of group at this
socket as up to 8              level is purely dependent
tasks can be running           on the number of idle
on the same node before        cpus in the group. Once a
exploring across NUMA.         stream thread is scheduled
                               on an LLC, it has fewer
                               idle cpus than the
                               other LLCs.

- With 16 stream threads
The sched domain hierarchy will be traversed as follows:

Socket Domain ---------------> Die Domain ---------------> MC Domain -> SMT Domain -> Idle CPU
[NUMA](2 sockets)              (8 LLCs)                    (16 CPUs)    (2 CPUs)      (Target)

  ^                             ^
  |                             |

sd->imb_numa_nr = 8            Each stream thread gets
First 8 stream threads         its own LLC, as the
are placed on the same         choice of group at this
socket as up to 8              level is purely dependent
tasks can be running           on the number of idle
on the same node before        cpus in the group.
exploring across NUMA.
                              
Starting from the 9th          The first 8 threads go to
thread, the placement          the 8 LLCs in the first
will happen based on           socket and the next 8
the number of idle cpus.       will take one LLC each
8 CPUs in one socket           in the second socket.
will be running the
8 stream threads
previously scheduled, so
the rest of the threads
will be scheduled on
the second socket.

If we bind 8 threads to 4 cpus on one socket and 4 cpus
on another, there is currently no mechanism that tells the
scheduler that it can explore the other socket after
scheduling the first four threads on one.
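
Here is a rough walk-through of why the above check never spills in
this pinned case (assuming sd->imb_numa_nr = 8; again just a toy
model, not kernel code):

#include <stdio.h>

int main(void)
{
        const int imb_numa_nr = 8;      /* allowed NUMA imbalance at fork time */
        int local_running = 0;          /* tasks already placed on the local socket */

        /* 8 threads forked, each allowed on only 4 cpus of this socket */
        for (int t = 1; t <= 8; t++) {
                if (local_running + 1 <= imb_numa_nr) {
                        /*
                         * The condition looks only at how many tasks run in
                         * the local group, never at how many of its cpus the
                         * task is actually allowed on, so every thread stays.
                         */
                        local_running++;
                        printf("thread %d -> local socket (now %d tasks on 4 allowed cpus)\n",
                               t, local_running);
                } else {
                        printf("thread %d -> spill to the other socket\n", t);
                }
        }
        return 0;
}
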
> [..snip..]
>> If a compatible cpu is found in the group, it is stored in new_dst_cpu
>> member of lb_env to be used later in the load balancing path if my
>> understanding is correct.
> Yes, this is the LBF_DST_PINNED logic, So I am wondering if that is kicking
> in correctly because this is the only difference I see between pinned and
> unpinned.
Ah! I see. But I do believe this problem of initial
placement lies along the wakeup path.
--
Thanks and Regards,
Prateek
