Message-ID: <cde76de6-7c47-d5c4-9fa3-4c3707b15328@didichuxing.com>
Date: Mon, 20 Feb 2023 15:26:06 +0800
From: Honglei Wang <wanghonglei@...ichuxing.com>
To: Chen Yu <yu.c.chen@...el.com>
CC: Abel Wu <wuyun.abel@...edance.com>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Tim Chen <tim.c.chen@...el.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Yicong Yang <yangyicong@...ilicon.com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Len Brown <len.brown@...el.com>,
Chen Yu <yu.chen.surf@...il.com>,
Tianchen Ding <dtcccc@...ux.alibaba.com>,
Joel Fernandes <joel@...lfernandes.org>,
Josh Don <joshdon@...gle.com>, Hillf Danton <hdanton@...a.com>,
<linux-kernel@...r.kernel.org>,
kernel test robot <yujie.liu@...el.com>
Subject: Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short
task on current CPU
On 2023/2/20 12:58, Chen Yu wrote:
> On 2023-02-17 at 16:35:24 +0800, Honglei Wang wrote:
>>
>>
>> On 2023/2/16 20:55, Abel Wu wrote:
>>> Hi Chen,
>>>
>>> I've tested this patchset (with modification) on our Redis proxy
>>> servers, and the results seem promising.
>>>
>>> On 2/3/23 1:18 PM, Chen Yu wrote:
>>>> ...
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index aa16611c7263..d50097e5fcc1 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
>>>> return 1;
>>>> }
>>>> +/*
>>>> + * If a task switches in and then voluntarily relinquishes the
>>>> + * CPU quickly, it is regarded as a short duration task.
>>>> + *
>>>> + * SIS_SHORT tries to wake up the short wakee on current CPU. This
>>>> + * aims to avoid race condition among CPUs due to frequent context
>>>> + * switch.
>>>> + */
>>>> +static inline int is_short_task(struct task_struct *p)
>>>> +{
>>>> +	return sched_feat(SIS_SHORT) && p->se.dur_avg &&
>>>> +		((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
>>>> +}
>>>
>>> I changed the factor to fit into the shape of tasks in question.
>>>
>>> static inline int is_short_task(struct task_struct *p)
>>> {
>>> 	u64 dur = sysctl_sched_min_granularity / 8;
>>>
>>> 	if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
>>> 		return false;
>>>
>>> 	/*
>>> 	 * Bare tracepoint to allow dynamically changing
>>> 	 * the threshold.
>>> 	 */
>>> 	trace_sched_short_task_tp(p, &dur);
>>>
>>> 	return p->se.dur_avg < dur;
>>> }
>>>
>>> I'm not sure it is the right way to provide such flexibility, but
>>> the definition of 'short' can be workload specific.
>>>
>>>> +
>>>> /*
>>>> * The purpose of wake_affine() is to quickly determine on which
>>>> CPU we can run
>>>> * soonest. For the purpose of speed we only consider the waking
>>>> and previous
>>>> @@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu,
>>>> int sync)
>>>> if (available_idle_cpu(prev_cpu))
>>>> return prev_cpu;
>>>> +	/* The only running task is a short duration one. */
>>>> +	if (cpu_rq(this_cpu)->nr_running == 1 &&
>>>> +	    is_short_task(rcu_dereference(cpu_curr(this_cpu))))
>>>> +		return this_cpu;
>>>
>>> Since the proxy server handles simple data delivery, the tasks are
>>> generally short running ones and hate task stacking, which may
>>> introduce scheduling latency (even if there are only 2 short tasks
>>> competing with each other). So this part brings a slight regression
>>> in the proxy case. But I still think this is good for most cases.
>>>
>>> Speaking of task stacking, I found wake_affine_weight() can be
>>> much more dangerous. It chooses the less loaded one between the
>>> prev & this cpu as a candidate, so 'small' tasks can be easily
>>> stacked on this cpu when waking up several tasks at one time if
>>> this cpu is unloaded. This really hurts if the 'small' tasks are
>>> latency-sensitive, although wake_affine_weight() does the right
>>> thing from the point of view of 'load'.
>>>
>>> The following change greatly reduced the p99lat of Redis service
>>> from 150ms to 0.9ms, at exactly the same throughput (QPS).
>>>
>>> @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
>>> task_struct *p,
>>>  	s64 this_eff_load, prev_eff_load;
>>>  	unsigned long task_load;
>>>
>>> +	if (is_short_task(p))
>>> +		return nr_cpumask_bits;
>>> +
>>>  	this_eff_load = cpu_load(cpu_rq(this_cpu));
>>>
>>>  	if (sync) {
>>>
>>> I know that 'short' tasks are not necessarily 'small' tasks, e.g.
>>> the sleeping duration may be small or they may have large weights, but
>>> this works really well for this case. This is partly because delivering
>>> data is memory bandwidth intensive and hence prefers cache hot cpus.
>>> And I think this is also applicable to the general case: do NOT let
>>> the short running tasks suffer from cache misses caused by
>>> migration.
>>>
>>
>> Redis is a bit special. It runs quickly and is really sensitive to scheduling
>> latency. The purpose of this 'short task' feature from Yu is to mitigate
>> migration and tend to place the waking task on the local cpu, which is somewhat
>> the opposite of what a workload such as Redis wants. The changes you made
>> remind me of the latency-prio stuff. Maybe we can do something based on both
>> 'short task' and 'latency-prio' to make your changes more general. Thoughts?
>>
> Looks reasonable, I suppose you were referring to 'latency nice' proposed by
> Vincent. For now I'd like to keep this patch simple enough, later we can
> extend it.
>
Yep, agree to keep this patch as is for now.
Thanks,
Honglei
> thanks,
> Chenyu
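
For readers following the thread, the threshold test in is_short_task()
quoted above can be modelled in userspace roughly as below. This is only a
sketch: min_granularity_ns and is_short_duration() are hypothetical names
made up for illustration, the 750000 ns value is an assumed example rather
than what any particular system reports, and the feature flag is passed in
as a plain bool instead of going through sched_feat().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for sysctl_sched_min_granularity.
 * 750000 ns is an assumed example value, not a measured one.
 */
static uint64_t min_granularity_ns = 750000ULL;

/*
 * Userspace model of the v5 check: a task counts as "short" when the
 * feature is on, it has a non-zero average run duration, and eight
 * times that duration is still below the minimum granularity.
 */
static bool is_short_duration(uint64_t dur_avg_ns, bool feature_enabled)
{
	if (!feature_enabled || dur_avg_ns == 0)
		return false;

	return dur_avg_ns * 8 < min_granularity_ns;
}
```

With the assumed 750000 ns granularity, a task averaging 50000 ns per run
is classified as short (400000 < 750000), while one averaging 100000 ns is
not (800000 >= 750000). Tuning the factor of 8, as Abel did for the Redis
proxy case, moves that cut-off.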