Message-ID: <9effd823-5375-fce0-cb92-6630e82d8b04@amd.com>
Date: Wed, 9 Mar 2022 17:00:33 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Cc: peterz@...radead.org, aubrey.li@...ux.intel.com, efault@....de,
gautham.shenoy@....com, linux-kernel@...r.kernel.org,
mgorman@...hsingularity.net, mingo@...nel.org,
song.bao.hua@...ilicon.com, valentin.schneider@....com,
vincent.guittot@...aro.org
Subject: Re: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA
imbalance in find_idlest_group
Hello Srikar,
On 3/9/2022 3:13 PM, Srikar Dronamraju wrote:
> [..snip..]
> I completely understand your problem. The only missing piece is why is this
> initial placement *not a problem for the unpinned case*. If we are able to
> articulate how the current code works well for the unpinned case, I would
> be fine.
From what I understand, the initial task placement happens as follows:

When a new task is created via fork or exec, the initial wakeup takes
the slow path in select_task_rq_fair() and ends up in
find_idlest_cpu(). find_idlest_cpu() explores the sched domain
hierarchy in a top-down fashion to find the idlest cpu for the task to
run on.
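For reference, here is a heavily abridged sketch of find_idlest_cpu()
from kernel/sched/fair.c, paraphrased from memory just to illustrate
the top-down walk (the real function has more detail, e.g. the sd_flag
checks and syncing the entity load average, so treat this as an
approximation):

static int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p,
                           int cpu, int prev_cpu, int sd_flag)
{
    int new_cpu = cpu;

    /* Bail out if the task cannot run anywhere in this domain. */
    if (!cpumask_intersects(sched_domain_span(sd), p->cpus_ptr))
        return prev_cpu;

    while (sd) {
        struct sched_group *group;
        struct sched_domain *tmp;
        int weight;

        /* Idlest group at this level; NULL means stay in the local group. */
        group = find_idlest_group(sd, p, cpu);
        if (!group) {
            sd = sd->child;
            continue;
        }

        /* Idlest cpu within the chosen group. */
        new_cpu = find_idlest_group_cpu(group, p, cpu);
        if (new_cpu == cpu) {
            sd = sd->child;
            continue;
        }

        /* Restart the descent from new_cpu's point of view. */
        cpu = new_cpu;
        weight = sd->span_weight;
        sd = NULL;
        for_each_domain(cpu, tmp) {
            if (weight <= tmp->span_weight)
                break;
            if (tmp->flags & sd_flag)
                sd = tmp;
        }
    }

    return new_cpu;
}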
During this, it'll call find_idlest_group() to get the idlest group
within a particular domain to search for the target cpu. In our case,
the local group will have spare capacity to accommodate tasks.
We only do a cpumask_test_cpu(this_cpu, sched_group_span(group))
to check whether the task can run on a particular group.
Hence we land on the case "group_has_spare". Here, if the sched
domain we are looking at has the SD_NUMA flag set and if
(local_sgs.sum_nr_running + 1) <= sd->imb_numa_nr
we decide to explore the local group itself in search of the target
cpu to place the task on.
If the number of tasks running in the local group is more than
sd->imb_numa_nr, or if the domain is not a NUMA domain, then the group
with the most idle cpus is chosen.
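To make the above concrete, here is a rough sketch of the
group_has_spare handling in find_idlest_group() for the NUMA case,
again paraphrased from memory (the actual code also has NUMA balancing
hooks and wraps the imbalance check in a helper, so this is only an
approximation):

    case group_has_spare:
        if (sd->flags & SD_NUMA) {
            /*
             * Keep the task close to the wakeup source as long as the
             * local group stays within the NUMA imbalance we tolerate.
             */
            if (local_sgs.sum_nr_running + 1 <= sd->imb_numa_nr)
                return NULL;    /* NULL => stick with the local group */
        }

        /* Otherwise prefer the group with the most idle cpus. */
        if (local_sgs.idle_cpus >= idlest_sgs.idle_cpus)
            return NULL;
        break;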
We keep traversing down the hierarchy, finding the idlest group of
each domain based on the number of idle cpus in the group, to find the
target cpu to place the task on.
This is just a basic picture I've painted with the details necessary
to explain correct placement in the unbound case.
Consider the unbound case:

- With 8 stream threads

  The sched domain hierarchy will be traversed as follows:

  Socket Domain [NUMA] (2 sockets) -> Die Domain (8 LLCs) ->
  MC Domain (16 CPUs) -> SMT Domain (2 CPUs) -> Idle CPU (Target)

  At the Socket (NUMA) domain, sd->imb_numa_nr = 8, so all 8 stream
  threads are placed on the same socket, as up to 8 tasks can be
  running on the same socket before exploring across NUMA.

  At the Die domain, each stream thread gets its own LLC, as the
  choice of group at this level is purely dependent on the number of
  idle cpus in the group. Once a stream thread is scheduled on an LLC,
  that LLC has fewer idle cpus than the other LLCs.
- With 16 stream threads

  The sched domain hierarchy will be traversed as follows:

  Socket Domain [NUMA] (2 sockets) -> Die Domain (8 LLCs) ->
  MC Domain (16 CPUs) -> SMT Domain (2 CPUs) -> Idle CPU (Target)

  At the Socket (NUMA) domain, sd->imb_numa_nr = 8, so the first 8
  stream threads are placed on the same socket, as up to 8 tasks can
  be running on the same socket before exploring across NUMA.
  Starting from the 9th thread, placement happens based on the number
  of idle cpus: 8 CPUs in the first socket are already running the
  previously scheduled stream threads, so the rest of the threads are
  scheduled on the second socket.

  At the Die domain, each stream thread gets its own LLC, as the
  choice of group at this level is purely dependent on the number of
  idle cpus in the group. The first 8 threads go to the 8 LLCs in the
  first socket and the next 8 take one LLC each in the second socket.

If we bind 8 threads to 4 cpus on one socket and 4 cpus
on another, there is currently no mechanism that tells the
scheduler that it can explore the other socket after
scheduling the first four threads on one.
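Just to illustrate the gap (this is only a sketch of the idea, not the
actual patch; cpumask_weight_and() is used here purely for brevity and
may not even be available on this kernel), the check above would need
to account for how many cpus of the local group the task is actually
allowed to use, something along the lines of:

        if (sd->flags & SD_NUMA) {
            /* cpus of the local group that p is allowed to run on */
            unsigned int allowed = cpumask_weight_and(p->cpus_ptr,
                                            sched_group_span(local));

            /* Only stack as many tasks as p can actually spread over. */
            if (local_sgs.sum_nr_running + 1 <= min(sd->imb_numa_nr, allowed))
                return NULL;
        }

With 8 threads bound 4+4 across the two sockets, the unmodified check
keeps returning the local group until 8 tasks are queued there, even
though the task can only use 4 of its cpus.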
> [..snip..]
>> If a compatible cpu is found in the group, it is stored in new_dst_cpu
>> member of lb_env to be used later in the load balancing path if my
>> understanding is correct.
> Yes, this is the LBF_DST_PINNED logic, So I am wondering if that is kicking
> in correctly because this is the only difference I see between pinned and
> unpinned.
Ah! I see. But I do believe this problem of initial
placement lies along the wakeup path.
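For completeness, the LBF_DST_PINNED handling in can_migrate_task()
looks roughly like this (abridged and paraphrased from memory), and it
only runs on the load balancing path, not at the initial wakeup:

    /* dst_cpu is not allowed by the task's affinity mask. */
    if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
        int cpu;

        env->flags |= LBF_SOME_PINNED;

        /*
         * Remember one cpu of the destination group that the task can
         * run on, so that the balancer can retry with that cpu as the
         * destination.
         */
        for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
            if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
                env->flags |= LBF_DST_PINNED;
                env->new_dst_cpu = cpu;
                break;
            }
        }

        return 0;
    }

So even when it kicks in, it can only correct the placement after the
fact rather than at fork/exec time.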
--
Thanks and Regards,
Prateek