Message-ID: <90e76dec-aacf-5619-1116-b4c5640a65a4@bytedance.com>
Date: Wed, 13 Jul 2022 18:25:58 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: Chen Yu <yu.c.chen@...el.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Mel Gorman <mgorman@...e.de>,
Vincent Guittot <vincent.guittot@...aro.org>,
Josh Don <joshdon@...gle.com>,
Tim Chen <tim.c.chen@...ux.intel.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 6/7] sched/fair: skip busy cores in SIS search
On 7/11/22 8:02 PM, Chen Yu wrote:
>>> ...
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index d3e2c5a7c1b7..452eb63ee6f6 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -5395,6 +5395,7 @@ void scheduler_tick(void)
>>>  	resched_latency = cpu_resched_latency(rq);
>>>  	calc_global_load_tick(rq);
>>>  	sched_core_tick(rq);
>>> +	update_overloaded_rq(rq);
>>
>> I didn't see this update in the idle path. Is this intentional?
>>
> It is intended to exclude the idle path. My thought was that, since
> avg_util already contains the historic activity, checking the cpu
> status on each idle entry doesn't seem to make much difference...
I presume the avg_util is used to decide how many cpus to scan, while
the update determines which cpus to scan.
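Roughly, the two compose in select_idle_cpu() like this (just a sketch to
illustrate the point, not the exact code of this series):

	/* SIS_UTIL: avg_util decides *how many* cpus may be scanned */
	nr = READ_ONCE(sd_share->nr_idle_scan) + 1;

	/* SIS_FILTER: the tick-time update decides *which* cpus to skip */
	cpumask_andnot(cpus, cpus, sdo_mask(sd_share));

	for_each_cpu_wrap(cpu, cpus, target + 1) {
		/* !has_idle_core path */
		if (!--nr)
			return -1;
		idle_cpu = __select_idle_cpu(cpu, p);
		if ((unsigned int)idle_cpu < nr_cpumask_bits)
			break;
	}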
>>>  	rq_unlock(rq, &rf);
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index f80ae86bb404..34b1650f85f6 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6323,6 +6323,50 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>>> #endif /* CONFIG_SCHED_SMT */
>>> +/* derived from group_is_overloaded() */
>>> +static inline bool rq_overloaded(struct rq *rq, int cpu, unsigned int imbalance_pct)
>>> +{
>>> +	if (rq->nr_running - rq->cfs.idle_h_nr_running <= 1)
>>> +		return false;
>>> +
>>> +	if ((SCHED_CAPACITY_SCALE * 100) <
>>> +	    (cpu_util_cfs(cpu) * imbalance_pct))
>>> +		return true;
>>> +
>>> +	if ((SCHED_CAPACITY_SCALE * imbalance_pct) <
>>> +	    (cpu_runnable(rq) * 100))
>>> +		return true;
>>
>> So the filter now contains cpus that are over-utilized or overloaded.
>> This goes a step further in making the filter reliable, but at the
>> cost of scan efficiency.
>>
> Right. Ideally, if there were a 'realtime' idle cpumask for SIS, the
> scan would be quite accurate. The issue is how to maintain this
> cpumask at low cost.
Yes indeed.
>> The idea behind my recent patches is to keep the filter radical,
>> but use it conservatively.
>>
> Do you mean updating the per-core idle filter frequently, but only
> propagating it to the LLC cpumask when the system is overloaded?
Not exactly. I want to update the filter (BTW there is only the LLC
filter, no per-core filters :)) whenever a core's state changes, and
apply it in the SIS domain scan only if the domain is busy enough.
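For the "once core state changes" part, one possible place (purely a
sketch, not what this patch does) is the idle-entry path where the
per-core idle state is already maintained:

void __update_idle_core(struct rq *rq)
{
	int core = cpu_of(rq);
	int cpu;

	rcu_read_lock();

	/* this cpu just went idle: refresh the LLC filter right away */
	update_overloaded_rq(rq);

	if (test_idle_cores(core, true))
		goto unlock;

	for_each_cpu(cpu, cpu_smt_mask(core)) {
		if (cpu == core)
			continue;

		if (available_idle_cpu(cpu)) {
			set_idle_cores(core, 1);
			break;
		}
	}
unlock:
	rcu_read_unlock();
}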
>>> +
>>> +	return false;
>>> +}
>>> +
>>> +void update_overloaded_rq(struct rq *rq)
>>> +{
>>> +	struct sched_domain_shared *sds;
>>> +	struct sched_domain *sd;
>>> +	int cpu;
>>> +
>>> +	if (!sched_feat(SIS_FILTER))
>>> +		return;
>>> +
>>> +	cpu = cpu_of(rq);
>>> +	sd = rcu_dereference(per_cpu(sd_llc, cpu));
>>> +	if (unlikely(!sd))
>>> +		return;
>>> +
>>> +	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>>> +	if (unlikely(!sds))
>>> +		return;
>>> +
>>> +	if (rq_overloaded(rq, cpu, sd->imbalance_pct)) {
>>
>> I'm not sure whether it is appropriate to use the LLC's imbalance_pct
>> here, because we are comparing within the LLC rather than between LLCs.
>>
> Right, the imbalance_pct should not be the LLC's; it could be the core
> domain's imbalance_pct instead.
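Something like this, perhaps (a sketch only: take the pct from the
lowest domain attached to the cpu; the fallback value is made up):

	struct sched_domain *sd_core;
	unsigned int pct;

	sd_core = rcu_dereference(rq->sd);		/* lowest attached domain */
	pct = sd_core ? sd_core->imbalance_pct : 117;	/* fallback, made up */

	if (rq_overloaded(rq, cpu, pct))
		cpumask_set_cpu(cpu, sdo_mask(sds));
	else
		cpumask_clear_cpu(cpu, sdo_mask(sds));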
>>> +		/* avoid duplicated write, mitigate cache contention */
>>> +		if (!cpumask_test_cpu(cpu, sdo_mask(sds)))
>>> +			cpumask_set_cpu(cpu, sdo_mask(sds));
>>> +	} else {
>>> +		if (cpumask_test_cpu(cpu, sdo_mask(sds)))
>>> +			cpumask_clear_cpu(cpu, sdo_mask(sds));
>>> +	}
>>> +}
>>> /*
>>> * Scan the LLC domain for idle CPUs; this is dynamically regulated by
>>> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
>>> @@ -6383,6 +6427,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>  		}
>>>  	}
>>> +	if (sched_feat(SIS_FILTER) && !has_idle_core && sd->shared)
>>> +		cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));
>>> +
>>>  	for_each_cpu_wrap(cpu, cpus, target + 1) {
>>>  		if (has_idle_core) {
>>>  			i = select_idle_core(p, cpu, cpus, &idle_cpu);
>>> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
>>> index ee7f23c76bd3..1bebdb87c2f4 100644
>>> --- a/kernel/sched/features.h
>>> +++ b/kernel/sched/features.h
>>> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>>> */
>>> SCHED_FEAT(SIS_PROP, false)
>>> SCHED_FEAT(SIS_UTIL, true)
>>> +SCHED_FEAT(SIS_FILTER, true)
>>
>> The filter should only be enabled when there is a need. If the system
>> is idle enough, I don't think it's a good idea to exclude the
>> overloaded cpus from the domain scan, and making the filter a
>> sched_feat won't help with that.
>>
>> My latest patch only applies the filter when nr is less than
>> the LLC size.
> Do you mean only updating the filter (idle cpu mask), or only using the
> filter in SIS when the system meets: nr_running < LLC size?
>
In the SIS domain search, apply the filter when nr < LLC_size. But I
haven't tested this with SIS_UTIL, and in the SIS_UTIL case this
condition seems to be always true.
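Roughly (untested), right after the SIS_UTIL block that computes nr:

	if (sched_feat(SIS_UTIL)) {
		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
		if (sd_share) {
			/* because !--nr is the condition to stop scan */
			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
			if (nr == 1)
				return -1;
		}
	}

	/*
	 * IIUC nr_idle_scan is derived from the LLC utilization and is
	 * bounded by the LLC weight, so nr < LLC size only fails when
	 * the LLC is close to idle, hence the condition being almost
	 * always true under SIS_UTIL.
	 */
	if (sched_feat(SIS_FILTER) && !has_idle_core && sd->shared &&
	    nr < per_cpu(sd_llc_size, target))
		cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));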
Thanks,
Abel