linux-kernel - Re: [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <75322a53-dd50-4046-afc1-f59cfa6f7164@amd.com>
Date: Wed, 19 Mar 2025 12:21:41 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: "Chen, Yu C" <yu.c.chen@...el.com>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
	<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
	<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, David Vernet
	<void@...ifault.com>, "Gautham R. Shenoy" <gautham.shenoy@....com>, "Swapnil
 Sapkal" <swapnil.sapkal@....com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
	Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, "Juri
 Lelli" <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
	<linux-kernel@...r.kernel.org>, <yu.chen.surf@...mail.com>
Subject: Re: [RFC PATCH 6/8] sched/fair: Increase probability of lb stats
 being reused

Hello Chenyu,

On 3/17/2025 11:37 PM, Chen, Yu C wrote:
> On 3/13/2025 5:37 PM, K Prateek Nayak wrote:
>> The load balancer will start caching the sg_lb_stats during load
>> balancing and propagate it up the sched domain hierarchy in the
>> subsequent commits.
>>
>> Increase the probability of load balancing intervals across domains to
>> be aligned to improve the reuse efficiency of the propagated stats.
>> Go one step further and proactively explore balancing at a higher domain
>> if the next update time for a higher domain in before the next update
>> time for its children.
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
>> ---
>>   kernel/sched/fair.c | 18 +++++++-----------
>>   1 file changed, 7 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3b1ed14e4b5e..60517a732c10 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11956,15 +11956,6 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
>>       /* scale ms to jiffies */
>>       interval = msecs_to_jiffies(interval);
>> -
>> -    /*
>> -     * Reduce likelihood of busy balancing at higher domains racing with
>> -     * balancing at lower domains by preventing their balancing periods
>> -     * from being multiples of each other.
>> -     */
>> -    if (cpu_busy)
>> -        interval -= 1;
>> -
>>       interval = clamp(interval, 1UL, max_load_balance_interval);
>>       return interval;
>> @@ -12126,7 +12117,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>>       int continue_balancing = 1;
>>       int cpu = rq->cpu;
>>       int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
>> -    unsigned long interval;
>> +    unsigned long interval, prev_sd_next_balance = 0;
>>       struct sched_domain *sd;
>>       /* Earliest time when we have to do rebalance again */
>>       unsigned long next_balance = jiffies + 60*HZ;
>> @@ -12136,6 +12127,8 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>>       rcu_read_lock();
>>       for_each_domain(cpu, sd) {
>> +        unsigned long next_interval;
>> +
>>           /*
>>            * Decay the newidle max times here because this is a regular
>>            * visit to all the domains.
>> @@ -12162,7 +12155,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>>                   goto out;
>>           }
>> -        if (time_after_eq(jiffies, sd->last_balance + interval)) {
>> +        next_interval = sd->last_balance + interval;
>> +        if (time_after_eq(jiffies, next_interval) ||
>> +            (prev_sd_next_balance && time_after(prev_sd_next_balance, next_interval))) {
> 
> (prev_sd_next_balance && time_after(jiffies, prev_sd_next_balance))?

So the rationale here is to sync the balancing at different levels if the
load balancing interval at the parent is somewhere between now and the
next load balancing interval of the child domain:

                                            Move MC balance
                                             here for more
                                                 reuse
                                                   v
     jiffies <---------------------------------------------
                  ^                 ^              ^
             Next balance      Next balance   Current balance
             at SMT domain     at MC domain    at SMT domain


On some topology, it can mean slightly more aggressive load balancing at
higher domains but the goal is that cost savings of a stats reuse will
eventually hide this jitter of doing load balancing at multiple domains
at once.

I would like to fo one step further and modify the cpumask_first() is
should we balance to instead return the last CPU doing load balancing
for this tick but it became slightly harder to cover the case of delay
in SOFTIRQ handler being executed so I left it out of this prototype.
I'll try to add something in proper v2.

-- 
Thanks and Regards,
Prateek

> 
> thanks,
> Chenyu
> 
>>               if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
>>                   /*
>>                    * The LBF_DST_PINNED logic could have changed
>> @@ -12174,6 +12169,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>>               }
>>               sd->last_balance = jiffies;
>>               interval = get_sd_balance_interval(sd, busy);
>> +            prev_sd_next_balance = sd->last_balance + interval;
>>           }
>>           if (need_serialize)
>>               atomic_set_release(&sched_balance_running, 0);