Message-ID: <ec7bc197-8fde-4ccd-a02a-1effec5ac898@bytedance.com>
Date: Mon, 31 Mar 2025 16:04:26 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>, Peter Zijlstra <peterz@...radead.org>
Cc: juri.lelli@...hat.com, vincent.guittot@...aro.org,
 dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
 tim.c.chen@...ux.intel.com, tglx@...utronix.de, mingo@...nel.org,
 gautham.shenoy@....com, kprateek.nayak@....com, yu.chen.surf@...mail.com
Subject: Re: Re: [RFC][PATCH] sched: Cache aware load-balancing

On 3/31/25 1:25 PM, Chen, Yu C wrote:
> On 3/30/2025 4:46 PM, Abel Wu wrote:
>> On 3/29/25 11:06 PM, Chen, Yu C wrote:
>>> On 3/28/2025 9:57 PM, Abel Wu wrote:
>>>>
>>>> IIUC this work updates stats for the whole mm, so it seems
>>>> unnecessary for each task of the process to repeat the same thing.
>>>> Hence it would be better to move this work to mm_struct.
>>>>
>>>
>>> It seems that the per-task cache_work is used to avoid adding
>>> duplicated task_cache_work() entries to the task->task_works queue,
>>> see task_tick_cache()'s check:
>>>   if (work->next == work)
>>>      task_work_add()
>>
>> Yes, this check avoids task_works being corrupted by double-adding
>> the work, no matter whether the work is task- or mm-specific. So
>> if we move it to mm_struct, this check becomes false once any task
>> takes the work, and other tasks cannot do the same job until that
>> task finishes by setting work->next back to work.
> 
> I see. What I previously thought was that checking work->next == work
> cannot prevent two tasks from doing the same statistics calculation
> in task_cache_work(), because work->next = work is invoked
> at the beginning of task_cache_work() - maybe we need to put
> this at the end of task_cache_work()?

LGTM.
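
Something like the sketch below, I suppose (just an illustration; note
that the early-exit paths also need to re-arm, so the work is not lost):

static void task_cache_work(struct callback_head *work)
{
    struct task_struct *p = current;
    cpumask_var_t cpus;

    WARN_ON_ONCE(work != &p->cache_work);

    if (p->flags & PF_EXITING)
        goto rearm;

    if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
        goto rearm;

    /* ... the per-LLC occupancy scan over p->mm->pcpu_sched, as in the RFC ... */

    free_cpumask_var(cpus);
rearm:
    /*
     * Re-arm only after the scan is done, so task_tick_cache()'s
     * work->next == work check keeps the work from being queued
     * again while this instance is still running.
     */
    work->next = work;
}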

> 
>>
>> BTW, moving to mm_struct will save some memory footprint?
>
> Yes, this would help.
> 
>>>
>>> To do exclusive task_cache_work() and only allow one task
>>> in the process to do the calculation, maybe introduce a mechanism
>>> similar to task_numa_work(), something like this:
>>>
>>> if (!try_cmpxchg(&mm->cache_next_scan, &calc, next_scan))
>>>      return;
>>
>> Yes, this looks good to me. Peter instead used an epoch to regulate
>> the frequency of this work, which is a lighter-weight way but allows
>> some inaccuracy, which is then fixed by a double check after taking
>> mm->mm_sched_lock.
>>
>> I plan to use trylock on mm->mm_sched_lock first. If the trylock
>> fails, then someone else is adding the work and we can skip early.
>>
>> static void task_tick_cache(struct rq *rq, struct task_struct *p)
>> {
>>      ...
>>
>>      if (mm->mm_sched_epoch == rq->cpu_epoch)
>>          return;
>>
>>      guard(raw_spinlock)(&mm->mm_sched_lock);  <-- trylock
>>
>>      ...
>> }
>>
> 
> This lock is intended to ensure that no two tasks enqueue the same
> per-mm_struct work at the same time, right? And for the task work
> execution phase, maybe we also need to put work->next = work at the
> end of task_cache_work()?

The lock here is needed to protect adding the work; what I meant is to
return early if the trylock fails, since someone else is already the
owner doing the work.
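
That is, something like the rough sketch below. Here mm_sched_work is
just a placeholder name for the per-mm callback_head we discussed; the
guard() form above is meant to behave the same way:

static void task_tick_cache(struct rq *rq, struct task_struct *p)
{
    struct mm_struct *mm = p->mm;

    if (!mm || !mm->pcpu_sched)
        return;

    if (mm->mm_sched_epoch == rq->cpu_epoch)
        return;

    /* Someone else is already queueing/doing the work, skip early. */
    if (!raw_spin_trylock(&mm->mm_sched_lock))
        return;

    /*
     * Re-check the epoch under the lock, and only queue the per-mm
     * work if it is not already pending or running.
     */
    if (mm->mm_sched_epoch != rq->cpu_epoch &&
        mm->mm_sched_work.next == &mm->mm_sched_work) {
        task_work_add(p, &mm->mm_sched_work, TWA_RESUME);
        WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
    }

    raw_spin_unlock(&mm->mm_sched_lock);
}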

> 
>>>
>>>>> +
>>>>> +static inline
>>>>> +void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>>>>> +{
>>>>> +    struct mm_struct *mm = p->mm;
>>>>> +    struct mm_sched *pcpu_sched;
>>>>> +    unsigned long epoch;
>>>>> +
>>>>> +    /*
>>>>> +     * init_task and kthreads don't be having no mm
>>>>> +     */
>>>>> +    if (!mm || !mm->pcpu_sched)
>>>>> +        return;
>>>>> +
>>>>> +    pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
>>>>> +
>>>>> +    scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
>>>>> +        __update_mm_sched(rq, pcpu_sched);
>>>>> +        pcpu_sched->runtime += delta_exec;
>>>>> +        rq->cpu_runtime += delta_exec;
>>>>> +        epoch = rq->cpu_epoch;
>>>>> +    }
>>>>> +
>>>>> +    /*
>>>>> +     * If this task hasn't hit task_cache_work() for a while, invalidate
>>>>> +     * it's preferred state.
>>>>> +     */
>>>>> +    if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) {
>>>>> +        mm->mm_sched_cpu = -1;
>>>>> +        pcpu_sched->occ = -1;
>>>>> +    }
>>>>
>>>> This seems too late. account_mm_sched() is called when p is runnable,
>>>> so if the whole process sleeps for a while before being woken up,
>>>> ttwu will take the outdated value.
>>>>
>>>
>>> Yup, there seems to be a problem. It would be better if we could
>>> reset mm_sched_cpu to -1 after the last thread of the process falls
>>> asleep. But considering that all threads are sleeping, even if ttwu
>>> tries to enqueue the first newly-woken thread on an outdated idle
>>> mm_sched_cpu, does it matter? I guess it would not be a serious
>>> problem, because all the cache of the process might have been
>>> evicted during the long sleep.
>>
>> Yes, it seems not a big deal as long as mm_sched_cpu does not
>> overwrite any better choice.
>>
>>>
>>>>> +
>>>>> +static void task_cache_work(struct callback_head *work)
>>>>> +{
>>>>> +    struct task_struct *p = current;
>>>>> +    struct mm_struct *mm = p->mm;
>>>>> +    unsigned long m_a_occ = 0;
>>>>> +    int cpu, m_a_cpu = -1;
>>>>> +    cpumask_var_t cpus;
>>>>> +
>>>>> +    WARN_ON_ONCE(work != &p->cache_work);
>>>>> +
>>>>> +    work->next = work;
>>>>> +
>>>>> +    if (p->flags & PF_EXITING)
>>>>> +        return;
>>>>> +
>>>>> +    if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
>>>>> +        return;
>>>>> +
>>>>> +    scoped_guard (cpus_read_lock) {
>>>>> +        cpumask_copy(cpus, cpu_online_mask);
>>>>> +
>>>>> +        for_each_cpu(cpu, cpus) {
>>>>> +            /* XXX sched_cluster_active */
>>>>> +            struct sched_domain *sd = per_cpu(sd_llc, cpu);
>>>>> +            unsigned long occ, m_occ = 0, a_occ = 0;
>>>>> +            int m_cpu = -1, nr = 0, i;
>>>>> +
>>>>> +            for_each_cpu(i, sched_domain_span(sd)) {
>>>>> +                occ = fraction_mm_sched(cpu_rq(i),
>>>>> +                            per_cpu_ptr(mm->pcpu_sched, i));
>>>>> +                a_occ += occ;
>>>>> +                if (occ > m_occ) {
>>>>> +                    m_occ = occ;
>>>>> +                    m_cpu = i;
>>>>> +                }
>>>>
>>>> It would be possible to cause task stacking on this hint cpu,
>>>> since it is updated less frequently than wakeups happen.
>>>>
>>>
>>> The SIS would overwrite the prev CPU with this hint (cached) CPU,
>>> and use that cached CPU as a hint to search for an idle CPU, so
>>> ideally it should not cause task stacking. But there is a race
>>> condition where multiple wakeup paths might find the same cached
>>> "idle" CPU and queue wakees on it; this usually happens when there
>>> are frequent context switches (wakeups) plus short-duration tasks.
>>
>> Ideally mm_sched_cpu is updated every EPOCH_PERIOD, which defaults
>> to 10ms, so I'm afraid the race window is not small.
>>
> OK, understood. My thought was that if many wakers start searching
> for an idle CPU from the same mm_sched_cpu (because mm_sched_cpu has
> not changed for 10ms), and waker1 succeeds in enqueuing a
> long-running wakee1 on that mm_sched_cpu, the other wakers might stop
> choosing that mm_sched_cpu in SIS. But if all the wakees are
> short-duration ones doing frequent context switches, many wakers
> would think they have found an "idle" mm_sched_cpu while it is
> actually heavily contended by many wakers.

OK, it seems I misunderstood the race condition you previously mentioned.
Yes, it's not clear yet whether mm_sched_cpu would cause stacking or not;
I will do some more tests before further discussion.
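
For reference, the wakeup-side use of the hint we are discussing is
roughly the following (a simplified sketch based on the description
above, not the exact RFC code; the function name is illustrative):

static int select_cache_cpu(struct mm_struct *mm, int prev_cpu)
{
    int cpu = READ_ONCE(mm->mm_sched_cpu);

    /* No hint yet, or it was invalidated as stale: keep prev_cpu. */
    if (cpu < 0)
        return prev_cpu;

    /* prev_cpu already shares the LLC with the hint, so keep it. */
    if (cpus_share_cache(cpu, prev_cpu))
        return prev_cpu;

    /* Otherwise start the idle-CPU search from the cache-hot LLC. */
    return cpu;
}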

Thanks,
	Abel

