[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220225101625.GB4363@suse.de>
Date: Fri, 25 Feb 2022 10:16:25 +0000
From: Mel Gorman <mgorman@...e.de>
To: Abel Wu <wuyun.abel@...edance.com>
Cc: Ben Segall <bsegall@...gle.com>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Steven Rostedt <rostedt@...dmis.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/5] introduce sched-idle balancing
On Fri, Feb 25, 2022 at 04:15:06PM +0800, Abel Wu wrote:
> > As the overloaded mask may be updated on each idle, it could be a
> > significant source of cache misses between CPUs sharing the domain for
> > workloads that rapidly idle so there should be data on whether cache misses
> > are increased heavily. It also potentially delays the CPU reaching idle
> > but it may not be by much.
>
> Yes, that's why I cached overloaded status in rq->overloaded. So in
> this case of short running tasks, when cpus rapidly/frequently go
> idle, the cpumask/counter are not actually updated if the cached
> status is already 0 (not overloaded).
>
Which is a good idea in some respects. It tries to limit the number of
updates and treats it as a boolean but it's probably prone to races.
> > The filter may be out of date. It takes up to one tick to detect
> > overloaded and the filter to have a positive impact. As a CPU is not
> > guaranteed to enter idle if there is at least one CPU-bound task, it may
> > also be up to 1 tick before the mask is cleared. I'm not sure this is a
> > serious problem though as SIS would not pick the CPU with the CPU-bound
> > task anyway.
>
> Yes, it can be out of date, but increasing the accuracy means more
> frequent update which would introduce cache issues you mentioned
> above. Rate limit the updating to tick at the LLC basis might be an
> acceptable tradeoff I presume.
>
> >
> > At minimum, the filter should be split out and considered first as it
> > is the most likely reason why a performance difference was measured. It
> > has some oddities like why nr_overloaded is really a boolean and as
> > it's under rq lock, it's not clear why it's atomic. The changelog
> > would ideally contain some comment on the impact to cache misses
> > if any and some sort of proof that SIS search depth is reduced which
> > https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@techsingularity.net/
> > may be some help.
> >
> > At that point, compare the idle task balancing on top to isolate how
> > much it improves things if any and identify why existing balancing is
> > insufficient. Split out the can_migrate_task change beforehand in case it
> > is the main source of difference as opposed to the new balancing mechanism.
> >
>
> The nr_overloaded sits in shared domain structure thus shared in
> LLC domain and needs to be atomic_t, while rq->overloaded is a
> boolean updated under rq lock. Maybe the naming can cause some
> confusion, please lighten me up if you have better idea :)
>
The naming doesn't help because it's not really "the number of overloaded
rq's". atomic_t would be slightly safer against parallel updates but
it's race prone. I didn't think about it deeply but I suspect that two
separate rq's could disagree on what the boolean value should be if one rq
is overloaded, the other is not and they are updating via the idle path at
the same time. This probably can happen because the locking is rq based
and the cpumask is shared. On the flip side, making it an accurate count
would result in more updates and incur cache misses as well as probably
needing a cpumask check instead of a nr_overloaded comparison to determine
if the rq is already accounted for so it costs more. You are very likely
trading accuracy versus cost of update.
Whichever choice you make, add comments on the pros/cons and describe
the limitation of either approach. e.g. if overloaded is effectively a
boolean, describe in a comment the limitations.
> And yes, I agree it would be nice if test result on SIS search
> depth can be shown, and I actually did the test, but the result
> didn't show a reduction in depth due to sched-idle balancing
> will also consume sched-idle/idle cpus. I will apply your patch
> and make some further tests on that, thanks.
>
Just remember to use the patch to measure changes in SIS depth but
performance figures should not include the patch as the schedstat
overhead distorts results.
Also place the filter first and do any measurements of any change to
balancing versus the filter. I'm suggesting placing the filter first
because it's less controversial than a new balancer. Just be aware that
the filter alone is not a guarantee of merging as there have been a few
approaches to filtering and so far all of them had downsides on either cost
or accuracy. IIRC the only active approach to reducing search cost in SIS
is https://lore.kernel.org/all/20220207034013.599214-1-yu.c.chen@intel.com/
and it's likely to get a new version due to
https://lore.kernel.org/all/20220207135253.GF23216@worktop.programming.kicks-ass.net/.
It also updates sched_domain_shared but with a single boolean instead of
an atomic+cpumask.
--
Mel Gorman
SUSE Labs
Powered by blists - more mailing lists