Message-ID: <1e73a88a-0d49-4ac2-a64e-6d5ff2d7ba19@intel.com>
Date: Mon, 31 Mar 2025 13:25:35 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Abel Wu <wuyun.abel@...edance.com>, Peter Zijlstra <peterz@...radead.org>
CC: <juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
<dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
<mgorman@...e.de>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
<tim.c.chen@...ux.intel.com>, <tglx@...utronix.de>, <mingo@...nel.org>,
<gautham.shenoy@....com>, <kprateek.nayak@....com>,
<yu.chen.surf@...mail.com>
Subject: Re: [RFC][PATCH] sched: Cache aware load-balancing
On 3/30/2025 4:46 PM, Abel Wu wrote:
> On 3/29/25 11:06 PM, Chen, Yu C wrote:
>> On 3/28/2025 9:57 PM, Abel Wu wrote:
>>>
>>> IIUC this work updates stats for the whole mm and seems not
>>> necessary for each task of the process to repeat same thing.
>>> Hence would be better move this work to mm_struct.
>>>
>>
>> It seems that the per task cache_work is used to avoid
>> duplicated task_cache_work() in task->task_works queue, see
>> task_tick_cache()'s check
>> if (work->next == work)
>> task_work_add()
>
> Yes, this check avoids the task_works list being corrupted by double
> adding the work, no matter whether the work is task- or mm-specific.
> So if moving to mm_struct, this check becomes false once any task
> takes this work, and other tasks cannot do the same job until that
> task finishes by setting work->next to work.
I see. What I previously thought was that checking work->next == work
cannot prevent two tasks from doing the same statistics calculation
in task_cache_work(), because work->next = work is set at the
beginning of task_cache_work() - maybe we need to move this to the
end of task_cache_work()?
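Something like the rough sketch below (untested, only meant to show
where work->next = work would move in the quoted task_cache_work();
everything elided with ... stays as in the RFC):

static void task_cache_work(struct callback_head *work)
{
	struct task_struct *p = current;
	struct mm_struct *mm = p->mm;
	cpumask_var_t cpus;

	/*
	 * Do not reset work->next here: while the scan is running, the
	 * work->next == work check in task_tick_cache() remains false,
	 * so no other task of the process re-queues the same work.
	 */
	if (p->flags & PF_EXITING)
		goto out;

	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
		goto out;

	scoped_guard (cpus_read_lock) {
		/* per-LLC occupancy scan over mm->pcpu_sched, unchanged */
		...
	}

	free_cpumask_var(cpus);
out:
	/* scan finished, allow the work to be queued again */
	work->next = work;
}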
>
> BTW, moving to mm_struct will save some memory footprint?
Yes, this would help.
>>
>> To do exclusive task_cache_work() and only allow one task
>> in the process to do the calculation, maybe introduce a
>> mechanism similar to task_numa_work(), something like this:
>>
>> if (!try_cmpxchg(&mm->cache_next_scan, &calc, next_scan))
>> return;
>
> Yes, this looks good to me. While Peter used the epoch to regulate
> the frequency of this work, which is a lighter way but allows some
> inaccuracy that is further fixed by a double check after holding
> mm->mm_sched_lock.
>
> I plan to use trylock on mm->mm_sched_lock first. If trylock fails
> then someone is adding the work and we can skip early.
>
> static void task_tick_cache(struct rq *rq, struct task_struct *p)
> {
> ...
>
> if (mm->mm_sched_epoch == rq->cpu_epoch)
> return;
>
> guard(raw_spinlock)(&mm->mm_sched_lock); <-- trylock
>
> ...
> }
>
This lock is intended to ensure that no two tasks enqueue the same
per-mm_struct work at the same time, right? And for the task work
execution phase, maybe we also need to move work->next = work to the
end of task_cache_work()?
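For my own understanding, the tick side would then look roughly like
the sketch below (untested; the per-mm mm_sched_work field is only an
assumption for the moved-to-mm_struct case):

static void task_tick_cache(struct rq *rq, struct task_struct *p)
{
	struct mm_struct *mm = p->mm;
	struct callback_head *work;

	if (!mm || !mm->pcpu_sched)
		return;

	if (mm->mm_sched_epoch == rq->cpu_epoch)
		return;

	/* someone else is queueing (or running) the scan, skip early */
	if (!raw_spin_trylock(&mm->mm_sched_lock))
		return;

	work = &mm->mm_sched_work;	/* assumed per-mm, not per-task */
	if (work->next == work) {
		task_work_add(p, work, TWA_RESUME);
		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
	}

	raw_spin_unlock(&mm->mm_sched_lock);
}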
>>
>>>> +
>>>> +static inline
>>>> +void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>>>> +{
>>>> + struct mm_struct *mm = p->mm;
>>>> + struct mm_sched *pcpu_sched;
>>>> + unsigned long epoch;
>>>> +
>>>> + /*
>>>> + * init_task and kthreads don't be having no mm
>>>> + */
>>>> + if (!mm || !mm->pcpu_sched)
>>>> + return;
>>>> +
>>>> + pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
>>>> +
>>>> + scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
>>>> + __update_mm_sched(rq, pcpu_sched);
>>>> + pcpu_sched->runtime += delta_exec;
>>>> + rq->cpu_runtime += delta_exec;
>>>> + epoch = rq->cpu_epoch;
>>>> + }
>>>> +
>>>> + /*
>>>> + * If this task hasn't hit task_cache_work() for a while, invalidate
>>>> + * it's preferred state.
>>>> + */
>>>> + if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) {
>>>> + mm->mm_sched_cpu = -1;
>>>> + pcpu_sched->occ = -1;
>>>> + }
>>>
>>> This seems too late. account_mm_sched() is called when p is runnable,
>>> so if the whole process sleeps for a while before being woken up, ttwu
>>> will take the outdated value.
>>>
>>
>> Yup, there seems to be a problem. It would be better if we could reset
>> mm_sched_cpu to -1 after the last thread of the process falls
>> asleep. But considering that all threads are sleeping, even if
>> ttwu tries to enqueue the first newly-woken thread on an out-dated
>> idle mm_sched_cpu, does it matter? I guess it would not be a serious
>> problem, because all the cache of the process might have been evicted
>> during the long sleep.
>
> Yes, it seems not a big deal as long as mm_sched_cpu does not overwrite
> a better choice.
>
>>
>>>> +
>>>> +static void task_cache_work(struct callback_head *work)
>>>> +{
>>>> + struct task_struct *p = current;
>>>> + struct mm_struct *mm = p->mm;
>>>> + unsigned long m_a_occ = 0;
>>>> + int cpu, m_a_cpu = -1;
>>>> + cpumask_var_t cpus;
>>>> +
>>>> + WARN_ON_ONCE(work != &p->cache_work);
>>>> +
>>>> + work->next = work;
>>>> +
>>>> + if (p->flags & PF_EXITING)
>>>> + return;
>>>> +
>>>> + if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
>>>> + return;
>>>> +
>>>> + scoped_guard (cpus_read_lock) {
>>>> + cpumask_copy(cpus, cpu_online_mask);
>>>> +
>>>> + for_each_cpu(cpu, cpus) {
>>>> + /* XXX sched_cluster_active */
>>>> + struct sched_domain *sd = per_cpu(sd_llc, cpu);
>>>> + unsigned long occ, m_occ = 0, a_occ = 0;
>>>> + int m_cpu = -1, nr = 0, i;
>>>> +
>>>> + for_each_cpu(i, sched_domain_span(sd)) {
>>>> + occ = fraction_mm_sched(cpu_rq(i),
>>>> + per_cpu_ptr(mm->pcpu_sched, i));
>>>> + a_occ += occ;
>>>> + if (occ > m_occ) {
>>>> + m_occ = occ;
>>>> + m_cpu = i;
>>>> + }
>>>
>>> It would be possible to cause task stacking on this hint cpu
>>> since it is updated less frequently than wakeups happen.
>>>
>>
>> The SIS would overwrite the prev CPU with this hint (cached) CPU, and
>> use that cached CPU as a hint to search for an idle CPU, so ideally it
>> should not cause task stacking. But there is a race condition: multiple
>> wakeup paths might find the same cached "idle" CPU and queue wakees on
>> it. This usually happens when there are frequent context switches
>> (wakeups) plus short-duration tasks.
>
> Ideally mm_sched_cpu is updated every EPOCH_PERIOD, which defaults to
> 10ms, so I'm afraid the race window is not small.
>
OK, understood. My thought was that, if many wakers start searching for
an idle CPU from the same mm_sched_cpu (because mm_sched_cpu has not
changed for 10ms), and waker1 manages to enqueue a long-running wakee1
on that mm_sched_cpu, the other wakers would stop choosing that
mm_sched_cpu in SIS. But if all the wakees are short-duration ones doing
frequent context switches, many wakers would each think they have found
an "idle" mm_sched_cpu, while in fact it is heavily contended by all of
them.
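Purely as an illustration of that window (this helper is hypothetical,
not something from the RFC):

/*
 * Two wakers can both read the stale hint and both see it "idle" if
 * the previously queued wakee is short-lived and has already gone
 * back to sleep by the time the second waker checks.
 */
static int pick_cache_cpu_hint(struct mm_struct *mm, int prev_cpu)
{
	/* may be up to EPOCH_PERIOD (10ms) old */
	int cpu = READ_ONCE(mm->mm_sched_cpu);

	if (cpu < 0)
		return prev_cpu;

	if (available_idle_cpu(cpu))
		return cpu;

	return prev_cpu;
}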
thanks,
Chenyu
>>
>>
>>> And although the occupancy heuristic looks reasonable, IMHO it
>>> doesn't make much sense to compare between cpus as they share
>>> the LLC, and a non-hint cpu with a warmer L1/L2$ in the same LLC as
>>> the hint cpu seems preferable.
>>>
>>> Do you think it's appropriate or not to only hint on the hottest
>>> LLC? So the tasks can hopefully be woken up on the 'right' LLC on the
>>> premise that it wouldn't cause much imbalance between LLCs.
>>>
>> > I will do some tests and return with more feedback.
>> >
>>
>> Finding an idle CPU in the wakee's hottest LLC seems plausible.
>> The benchmark data might indicate a proper way.
>>
>> thanks,
>> Chenyu
>>
>>