Message-ID: <89ccb344-7a03-65e6-826d-4807e1ab2815@de.ibm.com>
Date: Wed, 28 Apr 2021 16:49:44 +0200
From: Christian Borntraeger <borntraeger@...ibm.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: bristot@...hat.com, bsegall@...gle.com, dietmar.eggemann@....com,
greg@...ah.com, gregkh@...uxfoundation.org, joshdon@...gle.com,
juri.lelli@...hat.com, linux-kernel@...r.kernel.org,
linux@...musvillemoes.dk, mgorman@...e.de, mingo@...nel.org,
rostedt@...dmis.org, valentin.schneider@....com,
vincent.guittot@...aro.org, linux-s390@...r.kernel.org,
kvm@...r.kernel.org
Subject: Re: sched: Move SCHED_DEBUG sysctl to debugfs
On 28.04.21 14:38, Peter Zijlstra wrote:
> On Wed, Apr 28, 2021 at 11:42:57AM +0200, Christian Borntraeger wrote:
>> On 28.04.21 10:46, Peter Zijlstra wrote:
>> [..]
>>> The right thing to do here is to analyze the situation and determine why
>>> migration_cost needs changing; is that an architectural thing, does s390
>>> benefit from less sticky tasks due to its cache setup (the book caches
>>> could be absorbing some of the penalties here for example). Or is it
>>> something that's workload related, does KVM intrinsically not care about
>>> migrating so much, or is it something else.
>>
>> So let's focus on the performance issue.
>>
>> One workload where we have seen this is transactional workload that is
>> triggered by external network requests. So every external request
>> triggered a wakeup of a guest and a wakeup of a process in the guest.
>> The end result was that KVM was 40% slower than z/VM (in terms of
>> transactions per second) while we had more idle time.
>> With smaller sched_migration_cost_ns (e.g. 100000) KVM was as fast
>> as z/VM.
>>
>> So to me it looks like the wakeup and reschedule to a free CPU
>> was just not fast enough. It might also depend where I/O interrupts
>> land. Not sure yet.
>
> So there's unfortunately three places where migration_cost is used; one
> is in {nohz_,}newidle_balance(), see below. Someone tried removing it
> before and that ran into some weird regressions somewhere. But it is
> worth checking if this is the thing that matters for your workload.
>
> The other (main) use is in task_hot(), where we try and prevent
> migrating tasks that have recently run on a CPU. We already have an
> exception for SMT there, because SMT siblings share all cache levels by
> definition, so moving it to the sibling should have no ill effect.
>
> It could be that the current measure is fundamentally too high for your
> machine -- it is basically a random number that was determined many
> years ago on some random x86 machine, so it not reflecting reality today
> on an entirely different platform is no surprise.
>
> Back in the day, we had some magic code that measured cache latency per
> sched_domain and we used that, but that suffered from boot-to-boot
> variance and made things rather non-deterministic, but the idea of
> having per-domain cost certainly makes sense.
>
> Over the years people have tried bringing parts of that back, but it
> never really had convincing numbers justifying the complexity. So that's
> another thing you could be looking at I suppose.
>
> And then finally we have an almost random use in rebalance_domains(),
> and I can't remember the story behind that one :/
>
>
> Anyway, TL;DR, try and figure out which of these three is responsible
> for your performance woes. If it's the first, the below patch might be a
> good candidate. If it's task_hot(), we might need to re-eval per domain
> costs. If it's that other thing, I'll have to dig to figure out wth that
> was supposed to accomplish ;-)
Thanks for the insight. I will try to find out which of these areas make
a difference here.
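For anyone following along: the knob being discussed can be inspected and changed at runtime. A rough sketch (exact path depends on kernel version, since this very thread concerns the sysctl moving to debugfs; the debugfs variant additionally needs CONFIG_SCHED_DEBUG and a mounted debugfs):

```shell
# Older kernels: sysctl interface under /proc
cat /proc/sys/kernel/sched_migration_cost_ns          # default is 500000 (0.5 ms)
echo 100000 > /proc/sys/kernel/sched_migration_cost_ns

# After the move to debugfs:
cat /sys/kernel/debug/sched/migration_cost_ns
echo 100000 > /sys/kernel/debug/sched/migration_cost_ns
```

Lowering the value (as in the 100000 ns experiment above) makes task_hot() consider tasks "cold" sooner, so the load balancer migrates them more eagerly; it also shortens the cutoff in newidle balancing.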
[..]