linux-kernel - Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <cde76de6-7c47-d5c4-9fa3-4c3707b15328@didichuxing.com>
Date:   Mon, 20 Feb 2023 15:26:06 +0800
From:   Honglei Wang <wanghonglei@...ichuxing.com>
To:     Chen Yu <yu.c.chen@...el.com>
CC:     Abel Wu <wuyun.abel@...edance.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Tim Chen <tim.c.chen@...el.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        K Prateek Nayak <kprateek.nayak@....com>,
        Yicong Yang <yangyicong@...ilicon.com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        Len Brown <len.brown@...el.com>,
        Chen Yu <yu.chen.surf@...il.com>,
        Tianchen Ding <dtcccc@...ux.alibaba.com>,
        Joel Fernandes <joel@...lfernandes.org>,
        Josh Don <joshdon@...gle.com>, Hillf Danton <hdanton@...a.com>,
        <linux-kernel@...r.kernel.org>,
        kernel test robot <yujie.liu@...el.com>
Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short
 task on current CPU



On 2023/2/20 12:58, Chen Yu wrote:
> On 2023-02-17 at 16:35:24 +0800, Honglei Wang wrote:
>>
>>
>> On 2023/2/16 20:55, Abel Wu wrote:
>>> Hi Chen,
>>>
>>> I've tested this patchset (with modification) on our Redis proxy
>>> servers, and the results seems promising.
>>>
>>> On 2/3/23 1:18 PM, Chen Yu wrote:
>>>> ...
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index aa16611c7263..d50097e5fcc1 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
>>>>        return 1;
>>>>    }
>>>> +/*
>>>> + * If a task switches in and then voluntarily relinquishes the
>>>> + * CPU quickly, it is regarded as a short duration task.
>>>> + *
>>>> + * SIS_SHORT tries to wake up the short wakee on current CPU. This
>>>> + * aims to avoid race condition among CPUs due to frequent context
>>>> + * switch.
>>>> + */
>>>> +static inline int is_short_task(struct task_struct *p)
>>>> +{
>>>> +    return sched_feat(SIS_SHORT) && p->se.dur_avg &&
>>>> +           ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
>>>> +}
>>>
>>> I changed the factor to fit into the shape of tasks in question.
>>>
>>>       static inline int is_short_task(struct task_struct *p)
>>>       {
>>>           u64 dur = sysctl_sched_min_granularity / 8;
>>>
>>>           if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
>>>               return false;
>>>
>>>           /*
>>>            * Bare tracepoint to allow dynamically changing
>>>            * the threshold.
>>>            */
>>>           trace_sched_short_task_tp(p, &dur);
>>>
>>>           return p->se.dur_avg < dur;
>>>       }
>>>
>>> I'm not sure it is the right way to provide such flexibility, but
>>> definition of 'short' can be workload specific.
>>>
>>>> +
>>>>    /*
>>>>     * The purpose of wake_affine() is to quickly determine on which
>>>> CPU we can run
>>>>     * soonest. For the purpose of speed we only consider the waking
>>>> and previous
>>>> @@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu,
>>>> int sync)
>>>>        if (available_idle_cpu(prev_cpu))
>>>>            return prev_cpu;
>>>> +    /* The only running task is a short duration one. */
>>>> +    if (cpu_rq(this_cpu)->nr_running == 1 &&
>>>> +        is_short_task(rcu_dereference(cpu_curr(this_cpu))))
>>>> +        return this_cpu;
>>>
>>> Since proxy server handles simple data delivery, the tasks are
>>> generally short running ones and hate task stacking which may
>>> introduce scheduling latency (even there are only 2 short tasks
>>> competing each other). So this part brings slight regression on
>>> the proxy case. But I still think this is good for most cases.
>>>
>>> Speaking of task stacking, I found wake_affine_weight() can be
>>> much more dangerous. It chooses the less loaded one between the
>>> prev & this cpu as a candidate, so 'small' tasks can be easily
>>> stacked on this cpu when wake up several tasks at one time if
>>> this cpu is unloaded. This really hurts if the 'small' tasks are
>>> latency-sensitive, although wake_affine_weight() does the right
>>> thing from the point of view of 'load'.
>>>
>>> The following change greatly reduced the p99lat of Redis service
>>> from 150ms to 0.9ms, at exactly the same throughput (QPS).
>>>
>>> @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
>>> task_struct *p,
>>>       s64 this_eff_load, prev_eff_load;
>>>       unsigned long task_load;
>>>
>>> +    if (is_short_task(p))
>>> +        return nr_cpumask_bits;
>>> +
>>>       this_eff_load = cpu_load(cpu_rq(this_cpu));
>>>
>>>       if (sync) {
>>>
>>> I know that 'short' tasks are not necessarily 'small' tasks, e.g.
>>> sleeping duration is small or have large weights, but this works
>>> really well for this case. This is partly because delivering data
>>> is memory bandwidth intensive hence prefer cache hot cpus. And I
>>> think this is also applicable to the general purposes: do NOT let
>>> the short running tasks suffering from cache misses caused by
>>> migration.
>>>
>>
>> Redis is a bit special. It runs quick and really sensitive on schedule
>> latency. The purpose of this 'short task' feature from Yu is to mitigate the
>> migration and tend to place the waking task on local cpu, this is somehow on
>> the opposite side of workload such as Redis. The changes you did remind me
>> of the latency-prio stuff. Maybe we can do something base on both the 'short
>> task' and 'latency-prio' to make your changes more general. thoughts?
>>
> Looks reasonable, I suppose you were refering to 'latency nice' proposed by
> Vincent. For now I'd like to keep this patch simple enough, later we can
> extend it.
> 

Yep, agree to keep this patch as is for now.

Thanks,
Honglei

> thanks,
> Chenyu