Message-ID: <f38a9f7-b7fb-2aec-c3f9-b36a55d6636a@inria.fr>
Date: Sun, 2 Nov 2025 10:07:16 +0100 (CET)
From: Julia Lawall <julia.lawall@...ia.fr>
To: mingo@...hat.com, peterz@...radead.org, vincent.guittot@...aro.org, 
    dietmar.eggemann@....com
cc: jean-pierre.lozi@...ia.fr, rostedt@...dmis.org, clark.williams@...il.com, 
    lgoncalv@...hat.com, righi.andrea@...il.com, linux-kernel@...r.kernel.org
Subject: task_hot and idle balancing

Context: NVIDIA/ARM Grace A02 CPU, one socket, 72 CPUs, no hyperthreads,
Linux 6.18-rc3.

On a machine with 4ms ticks, given the default 2.8ms timeslice of EEVDF,
when two threads are placed on a single CPU they swap at every tick (see
the first page of the attached pdf; the vertical blue lines are ticks,
the horizontal lines are tasks, and the colors of those lines indicate
pids).  Likewise, an attempt to steal a task for an idle core can happen
at each tick (the blue boxes in that graph).  The final step in trying
to steal is the can_migrate_task check, which involves checking task_hot
(migrate_degrades_locality is not relevant since there is only one NUMA
node).  task_hot ends with:

        delta = rq_clock_task(env->src_rq) - p->se.exec_start;

        return delta < (s64)sysctl_sched_migration_cost;

That is, task_hot returns true if the exec_start field of the task has
very recently been updated.  exec_start is refreshed on each context
switch, both for the newly running task, in:

/*
 * We are picking a new current task - update its stats:
 */
static inline void
update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        /*
         * We are starting a new run period:
         */
        se->exec_start = rq_clock_task(rq_of(cfs_rq));
}

And also for the preempted task in update_se:

static s64 update_se(struct rq *rq, struct sched_entity *se)
{
        u64 now = rq_clock_task(rq);
        s64 delta_exec;

        delta_exec = now - se->exec_start;
        if (unlikely(delta_exec <= 0))
                return delta_exec;

        se->exec_start = now;
        ...

The attempt at load balancing always happens after the context switch
(see the second page of the attached file for one example), often very
shortly after, so the delta computed by task_hot for both the running
task and the preempted one is very small, and thus less than
sysctl_sched_migration_cost.  Load balancing therefore almost always
fails, even when the tasks have cumulatively been waiting on the CPU for
a long time.
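
As an illustration, with a made-up delay (the 100us figure below is
hypothetical, chosen only to be well under a tick): if the balancer runs
100us after a context switch, then for both tasks

        delta = 100,000 ns < 500,000 ns = sysctl_sched_migration_cost

so task_hot returns true and the steal is refused, even though the two
tasks have each been runnable for milliseconds in aggregate.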

For completeness, there are two cases on the right side of the first page
of the attached file where load balancing succeeds (light blue lines on
the upper right).  The third page is a zoom on the first of those cases.
So delta sometimes exceeds sysctl_sched_migration_cost, but not very
often.

Possible solutions:

* Reduce sysctl_sched_migration_cost.  By default it is 500,000 ns.
Maybe it should be 0 for an on-socket migration?

* Remove se->exec_start = now; in update_se.  For a task that is not
executing, this seems like it should be dead code.  delta would then be
the sum of the running time (at least 2.8ms) and the waiting time, which
is more than 500,000 ns.  But it might have other implications, and it's
not clear why the running time plus the waiting time is a good metric
for deciding whether to migrate.

* Keep track of all of the waiting time on the current CPU, not just the
time since the most recent context switch (a user-space sketch of this
option follows below).

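For concreteness, here is a minimal user-space model of the third
option.  This is not a kernel patch: the names (struct task_model,
wait_sum_ns, task_hot_model) are made up for illustration, and a real
implementation would have to decide where the sum is accounted and
reset.  The model accumulates waiting time across context switches and
lets the hot check pass once the accumulated wait exceeds the migration
cost, regardless of how recently exec_start was refreshed:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SCHED_MIGRATION_COST_NS 500000LL

struct task_model {
	int64_t exec_start;	/* last exec_start update, as today */
	int64_t wait_sum_ns;	/* cumulative wait on the current CPU */
};

/* The task is preempted: it now starts waiting (update_se refreshes
 * exec_start at this point today). */
static void on_preempt(struct task_model *t, int64_t now)
{
	t->exec_start = now;
}

/* The task is picked to run again: account the wait it just finished
 * (update_stats_curr_start refreshes exec_start at this point). */
static void on_pick(struct task_model *t, int64_t now)
{
	t->wait_sum_ns += now - t->exec_start;
	t->exec_start = now;
}

/* The task actually migrates: its history on this CPU is gone. */
static void on_migrate(struct task_model *t)
{
	t->wait_sum_ns = 0;
}

/* task_hot under this model: a task stops being hot once it has
 * accumulated more waiting on this CPU than the migration cost. */
static bool task_hot_model(const struct task_model *t, int64_t now)
{
	if (t->wait_sum_ns >= SCHED_MIGRATION_COST_NS)
		return false;
	return now - t->exec_start < SCHED_MIGRATION_COST_NS;
}

int main(void)
{
	struct task_model t = { .exec_start = 0, .wait_sum_ns = 0 };
	int64_t now = 0;

	/* Two tasks sharing a CPU with 4ms ticks: this task alternates
	 * between ~4ms of running and ~4ms of waiting.  The balancer is
	 * modeled as running 50us after each switch. */
	for (int tick = 0; tick < 4; tick++) {
		now += 4000000;			/* 4ms of running */
		on_preempt(&t, now);
		now += 4000000;			/* 4ms of waiting */
		on_pick(&t, now);
		printf("tick %d: wait_sum=%lld ns, hot=%d\n",
		       tick, (long long)t.wait_sum_ns,
		       (int)task_hot_model(&t, now + 50000));
	}
	on_migrate(&t);
	return 0;
}

With the cumulative sum, the check refuses migration only until the task
has waited 500,000 ns in total on this CPU; with the current code the
same task would still look hot at every balancing attempt.
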
julia
[Attachment: "lu2.pdf", application/pdf, 71687 bytes]
