Message-ID: <f5f056f1-a698-4660-87f0-63f5753c1679@bytedance.com>
Date: Sun, 30 Mar 2025 16:46:20 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>, Peter Zijlstra <peterz@...radead.org>
Cc: juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
tim.c.chen@...ux.intel.com, tglx@...utronix.de, mingo@...nel.org,
gautham.shenoy@....com, kprateek.nayak@....com
Subject: Re: Re: [RFC][PATCH] sched: Cache aware load-balancing
On 3/29/25 11:06 PM, Chen, Yu C wrote:
> On 3/28/2025 9:57 PM, Abel Wu wrote:
>>
>> IIUC this work updates stats for the whole mm, so it seems
>> unnecessary for each task of the process to repeat the same thing.
>> Hence it would be better to move this work to mm_struct.
>>
>
> It seems that the per-task cache_work is used to avoid duplicate
> task_cache_work() entries in the task->task_works queue, see
> task_tick_cache()'s check:
>	if (work->next == work)
>		task_work_add()
Yes, this check prevents the task_works list from being corrupted by
double-adding the work, no matter whether the work is task- or
mm-specific. So if we move it to mm_struct, the check becomes false
once any task takes the work, and the other tasks cannot queue the
same job until that task finishes and resets work->next to work.
BTW, moving it to mm_struct would also save some memory footprint,
right?
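Something like below, perhaps (just a sketch; the mm_struct::cache_work
field and this task_tick_cache() shape are my assumption, not taken
from the patch):

struct mm_struct {
	/* ... existing fields ... */
	struct callback_head	cache_work;	/* shared by all threads */
};

static void task_tick_cache(struct rq *rq, struct task_struct *p)
{
	struct mm_struct *mm = p->mm;
	struct callback_head *work;

	if (!mm || !mm->pcpu_sched)
		return;

	work = &mm->cache_work;

	/*
	 * work->next == work marks the work as idle: once any thread
	 * queues it, the others see work->next != work and skip, until
	 * task_cache_work() resets it. The window between this check
	 * and task_work_add() still needs serializing, see below.
	 */
	if (work->next == work)
		task_work_add(p, work, TWA_RESUME);
}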
>
> To make task_cache_work() exclusive and allow only one task in the
> process to do the calculation, maybe introduce a mechanism similar
> to task_numa_work(), something like this:
>
> if (!try_cmpxchg(&mm->cache_next_scan, &calc, next_scan))
> return;
Yes, this looks good to me. Peter instead used the epoch to regulate
the frequency of this work, which is a lighter way but allows some
inaccuracy, fixed up later by a double check after taking
mm->mm_sched_lock.
I plan to try trylock on mm->mm_sched_lock first: if the trylock
fails, someone else is already adding the work and we can skip early.
static void task_tick_cache(struct rq *rq, struct task_struct *p)
{
	...
	if (mm->mm_sched_epoch == rq->cpu_epoch)
		return;

	/* Lock held means someone else is already adding the work. */
	if (!raw_spin_trylock(&mm->mm_sched_lock))
		return;
	...
	raw_spin_unlock(&mm->mm_sched_lock);
}
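Or, to keep the guard() style, the conditional lock guard could
express the same thing (assuming the raw_spinlock_try guard class and
scoped_cond_guard() are available in this tree):

	scoped_cond_guard(raw_spinlock_try, return, &mm->mm_sched_lock) {
		/* re-check the epoch and queue the work */
	}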
>
>>> +
>>> +static inline
>>> +void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>>> +{
>>> + struct mm_struct *mm = p->mm;
>>> + struct mm_sched *pcpu_sched;
>>> + unsigned long epoch;
>>> +
>>> + /*
>>> + * init_task and kthreads don't be having no mm
>>> + */
>>> + if (!mm || !mm->pcpu_sched)
>>> + return;
>>> +
>>> + pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
>>> +
>>> + scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
>>> + __update_mm_sched(rq, pcpu_sched);
>>> + pcpu_sched->runtime += delta_exec;
>>> + rq->cpu_runtime += delta_exec;
>>> + epoch = rq->cpu_epoch;
>>> + }
>>> +
>>> + /*
>>> + * If this task hasn't hit task_cache_work() for a while, invalidate
>>> + * its preferred state.
>>> + */
>>> + if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) {
>>> + mm->mm_sched_cpu = -1;
>>> + pcpu_sched->occ = -1;
>>> + }
>>
>> This seems too late. account_mm_sched() is called when p is runnable,
>> so if the whole process sleeps for a while before being woken up, ttwu
>> will take the outdated value.
>>
>
> Yup, there seems to be a problem. It would be better if we could
> reset mm_sched_cpu to -1 after the last thread of the process falls
> asleep. But considering that all threads are sleeping, even if ttwu
> enqueues the first newly-woken thread on an outdated idle
> mm_sched_cpu, does it matter? I guess it would not be a serious
> problem, because all the cache of the process might have been
> evicted during the long sleep anyway.
Yes, it seems not a big deal, as long as a stale mm_sched_cpu does not
override any better choice.
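If we ever want the reset-on-last-sleep behavior, a rough sketch could
be hooking the dequeue path with a per-mm runnable counter. Both the
hook and mm->nr_running_threads below are hypothetical, not in the
patch:

static void mm_sched_dequeue(struct task_struct *p)
{
	struct mm_struct *mm = p->mm;

	if (!mm || !mm->pcpu_sched)
		return;

	/* The last runnable thread of the process goes to sleep. */
	if (atomic_dec_and_test(&mm->nr_running_threads))
		WRITE_ONCE(mm->mm_sched_cpu, -1);
}

with the counter incremented in the matching enqueue path.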
>
>>> +
>>> +static void task_cache_work(struct callback_head *work)
>>> +{
>>> + struct task_struct *p = current;
>>> + struct mm_struct *mm = p->mm;
>>> + unsigned long m_a_occ = 0;
>>> + int cpu, m_a_cpu = -1;
>>> + cpumask_var_t cpus;
>>> +
>>> + WARN_ON_ONCE(work != &p->cache_work);
>>> +
>>> + work->next = work;
>>> +
>>> + if (p->flags & PF_EXITING)
>>> + return;
>>> +
>>> + if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
>>> + return;
>>> +
>>> + scoped_guard (cpus_read_lock) {
>>> + cpumask_copy(cpus, cpu_online_mask);
>>> +
>>> + for_each_cpu(cpu, cpus) {
>>> + /* XXX sched_cluster_active */
>>> + struct sched_domain *sd = per_cpu(sd_llc, cpu);
>>> + unsigned long occ, m_occ = 0, a_occ = 0;
>>> + int m_cpu = -1, nr = 0, i;
>>> +
>>> + for_each_cpu(i, sched_domain_span(sd)) {
>>> + occ = fraction_mm_sched(cpu_rq(i),
>>> + per_cpu_ptr(mm->pcpu_sched, i));
>>> + a_occ += occ;
>>> + if (occ > m_occ) {
>>> + m_occ = occ;
>>> + m_cpu = i;
>>> + }
>>
>> It would be possible to cause task stacking on this hint cpu, since
>> it is updated much less frequently than wakeups happen.
>>
>
> The SIS would overwrite the prev CPU with this hint (cached) CPU, and
> use that cached CPU as a hint to search for an idle CPU, so ideally
> it should not cause task stacking. But there is a race condition
> where multiple wakeup paths might find the same cached "idle" CPU and
> queue wakees on it; this usually happens with frequent context
> switches (wakeups) plus short-duration tasks.
Ideally mm_sched_cpu is updated every EPOCH_PERIOD, which defaults to
10ms, so I'm afraid the race window is not small.
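To spell out the window, the consumer side presumably looks something
like this (simplified; select_cache_cpu() is a made-up name, not
verbatim from the patch):

static int select_cache_cpu(struct task_struct *p, int prev_cpu)
{
	struct mm_struct *mm = p->mm;
	int cpu;

	if (!mm)
		return prev_cpu;

	cpu = READ_ONCE(mm->mm_sched_cpu);
	if (cpu < 0)
		return prev_cpu;

	/*
	 * Nothing serializes concurrent wakeups here: they can all
	 * observe the same "idle" hint before any wakee actually lands
	 * on it, and the hint is only refreshed once per EPOCH_PERIOD.
	 */
	if (available_idle_cpu(cpu))
		return cpu;

	return prev_cpu;
}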
>
>
>> And although the occupancy heuristic looks reasonable, IMHO it
>> doesn't make much sense to compare between cpus since they share
>> the LLC, and a non-hint cpu with warmer L1/L2$ in the same LLC as
>> the hint cpu seems preferable.
>>
>> Do you think it's appropriate to only hint on the hottest LLC? Then
>> tasks can hopefully be woken up on the 'right' LLC, on the premise
>> that it wouldn't cause much imbalance between LLCs.
>>
>> I will do some tests and return with more feedback.
>>
>
> Finding an idle CPU in the wakee's hottest LLC seems plausible.
> The benchmark data might indicate a proper way.
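Agreed. FWIW, constraining the search to the hot LLC's span could look
roughly like below (a sketch only, reusing sd_llc the way
task_cache_work() does; select_hot_llc_cpu() is a made-up name):

static int select_hot_llc_cpu(struct mm_struct *mm, int prev_cpu)
{
	int hint = READ_ONCE(mm->mm_sched_cpu);
	struct sched_domain *sd;
	int i;

	if (hint < 0)
		return prev_cpu;

	/* mm_sched_cpu already lives in the hottest LLC. */
	sd = rcu_dereference(per_cpu(sd_llc, hint));	/* caller holds rcu_read_lock() */
	if (!sd)
		return prev_cpu;

	for_each_cpu_wrap(i, sched_domain_span(sd), prev_cpu) {
		if (available_idle_cpu(i))
			return i;
	}

	return prev_cpu;
}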
>
> thanks,
> Chenyu
>
>