[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <04cfb7ee-72dc-0a53-deb7-eaee6f0e9590@bytedance.com>
Date: Mon, 15 Aug 2022 17:49:26 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: "Gautham R. Shenoy" <gautham.shenoy@....com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Mel Gorman <mgorman@...e.de>,
Vincent Guittot <vincent.guittot@...aro.org>,
Josh Don <joshdon@...gle.com>, Chen Yu <yu.c.chen@...el.com>,
Tim Chen <tim.c.chen@...ux.intel.com>,
K Prateek Nayak <kprateek.nayak@....com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 7/7] sched/fair: de-entropy for SIS filter
On 7/21/22 1:08 AM, Gautham R. Shenoy Wrote:
> Hello Abel,
>
> On Sun, Jun 19, 2022 at 08:04:51PM +0800, Abel Wu wrote:
>> Now when updating core state, there are two main problems that could
>> pollute the SIS filter:
>>
>> - The updating is before task migration, so if dst_cpu is
>> selected to be propagated which might be fed with tasks
>> soon, the efforts we paid is no more than setting a busy
>> cpu into the SIS filter. While on the other hand it is
>> important that we update as early as possible to keep the
>> filter fresh, so it's not wise to delay the update to the
>> end of load balancing.
>>
>> - False negative propagation hurts performance since some
>> idle cpus could be out of reach. So in general we will
>> aggressively propagate idle cpus but allow false positive
>> continue to exist for a while, which may lead to filter
>> being fully polluted.
>
> Ok, so the false positive case is being addressed in this patch.
>
>>
>> Pains can be relieved by a force correction when false positive
>> continuously detected.
>>
> [..snip..]
>
>> @@ -111,6 +117,7 @@ struct sched_group;
>> enum sd_state {
>> sd_has_icores,
>> sd_has_icpus,
>> + sd_may_idle,
>> sd_is_busy
>> };
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d55fdcedf2c0..9713d183d35e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>
> [...snip..]
>
>> @@ -9321,7 +9327,7 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>> {
>> struct sched_domain_shared *sd_smt_shared = env->sd->shared;
>> enum sd_state new = sds->sd_state;
>> - int this = env->dst_cpu;
>> + int icpu = sds->idle_cpu, this = env->dst_cpu;
>>
>> /*
>> * Parallel updating can hardly contribute accuracy to
>> @@ -9331,6 +9337,22 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>> if (cmpxchg(&sd_smt_shared->updating, 0, 1))
>> return;
>>
>> + /*
>> + * The dst_cpu is likely to be fed with tasks soon.
>> + * If it is the only unoccupied cpu in this domain,
>> + * we still handle it the same way as as_has_icpus
> ^^^^^^^^^^^^^
> Nit: sd_has_icpus
Will fix.
>
>> + * but turn the SMT into the unstable state, rather
>> + * than waiting to the end of load balancing since
>> + * it's also important that update the filter as
>> + * early as possible to keep it fresh.
>> + */
>> + if (new == sd_is_busy) {
>> + if (idle_cpu(this) || sched_idle_cpu(this)) {
>> + new = sd_may_idle;
>> + icpu = this;
>> + }
>> + }
>> +
>> /*
>> * There is at least one unoccupied cpu available, so
>> * propagate it to the filter to avoid false negative
>> @@ -9338,6 +9360,12 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>> * idle cpus thus throughupt downgraded.
>> */
>> if (new != sd_is_busy) {
>> + /*
>> + * The sd_may_idle state is taken into
>> + * consideration as well because from
>> + * here we couldn't actually know task
>> + * migrations would happen or not.
>> + */
>> if (!test_idle_cpus(this))
>> set_idle_cpus(this, true);
>> } else {
>> @@ -9347,9 +9375,26 @@ static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>> */
>> if (sd_smt_shared->state == sd_is_busy)
>> goto out;
>> +
>> + /*
>> + * Allow false positive to exist for some time
>> + * to make a tradeoff of accuracy of the filter
>> + * for relieving cache traffic.
>> + */
>
> I can understand allowing the false positive to exist when there are
> no other idle CPUs in this SMT domain other than this CPU, which is
> handled by the case where new != sd_is_busy in the current
> load-balance round and will be handled by the "else" clause in the
> subsequent round if env->dst_cpu ends up becoming busy.
>
Yes.
>
> However, when we know that new == sd_is_busy and the previous state of
> this SMT domain was sd_has_icpus, should we not immediately clear this
> core's cpumask from the LLCs icpus mask? What is the need for the
> intermediate sd_may_idle state transition between sd_has_icpus and
> sd_is_busy in this case ?
>
The thought was to make addition more aggressive than deletion to the
filter to try to avoid real idle cpus being out of reach. Take short
running tasks for example, the cpus in this newly-busy SMT domain can
become idle multiple times during next balancing period, which won't
be selected if update to sd_is_busy intermediately. While for the non-
short running workloads, the only downside is sacrificing the filter's
accuracy for a short while.
IOW, temporarily containing false positive cpus in the filter is more
acceptable than missing some real idle ones which would cause throughput
reduced. Besides, in this way the cache traffic can be relieved more
or less.
>
>
>> + if (sd_smt_shared->state == sd_has_icpus) {
>> + new = sd_may_idle;
>> + goto update;
>> + }
>> +
>> + /*
>> + * If the false positive issue has already been
>> + * there for a while, a correction of the filter
>> + * is needed.
>> + */
>> }
>>
>> sd_update_icpus(this, sds->idle_cpu);
> ^^^^^^^^^^^^^^
>
> I think you meant to use icpu here ? sds->idle_cpu == -1 in the case
> when new == sd_may_idle. Which will end up clearing this core's
> cpumask from this LLC's icpus mask. This defeats the
> "allow-false-positive" heuristic.
>
Nice catch, will fix.
Thanks & Best Regards,
Abel
Powered by blists - more mailing lists