[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wjTAJwMJZ-6PPxvdtDmkL0=pfRF77nJ5qWw2vbiTzT4nQ@mail.gmail.com>
Date: Mon, 11 Oct 2021 12:54:49 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Mark Rutland <mark.rutland@....com>
Cc: Catalin Marinas <catalin.marinas@....com>,
Thomas Gleixner <tglx@...utronix.de>,
Will Deacon <will@...nel.org>, Marc Zyngier <maz@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux ARM <linux-arm-kernel@...ts.infradead.org>,
"Paul E. McKenney" <paulmck@...nel.org>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [GIT PULL] arm64 fixes for 5.15-rc5
On Mon, Oct 11, 2021 at 3:47 AM Mark Rutland <mark.rutland@....com> wrote:
>
> Sorry; I agree that commit messages don't explain this thoroughly. I can
> go and rework the commit messages to clarify things, if you'd like?
So I really hate that patch, even with explanation.
Having the Kconfig option for "do the right thing" is not how this
should be fixed.
Either the generic code is generic, or it isn't.
And if the generic code doesn't work for arm64, we shouldn't add a
random Kconfig option for "this architecture wants different
behavior".
> The TL;DR is that a bunch of constraints conspire such that we can't
> defer accounting things to the irqdomain or irqchip code without making
> this even more complicated/fragile, and many architectures get this
> subtly wrong today -- it's not that arm64 is necessarily much different
> from other architectures using this code, just that we're trying to
> solve this first.
So then I think the fix is to just say "accounting in this generic
function is wrong, and the accounting needs to be moved to the
callers".
That's particularly true if you think arm64 is only the only actually
_tested_ case that gets this wrong, and other architectures most
likely have the exact same issue, but you only fixed it for arm64.
So do it unconditionally - possibly even using a coccinelle script if
there are lots of callers.
Because generic code that just isn't generic, but randomly does
different things based on subtle Kconfig options that are different
for different architectures is bad, bad, bad.
And yes, I realize that we've had that kind of stuff before (and we
have odd Kconfig option testing in that irqdesc.c file elsewhere), but
the Kconfig options we currently have are mostly either
(a) actual real honest-to-goodness config options with semantic
meaning (ie things like CONFIG_SMP and CONFIG_NUMA)
(b) really ugly workarounds for odd special module exports (that
CONFIG_KVM_BOOK3S_64_HV_MODULE thing for irq_to_desc() that we *tried*
but failed to remove).
And so the reason I really hate that patch is that it introduces a new
"different architectures randomly and inexplicably do different
things, and the generic behavior is very different on arm64 than it is
elsewhere".
That's just the worst kind of hack to me.
And in this case, it's really *horribly* hard to see what the call
chain is. It all ends up being actively obfuscated and obscured
through that 'handle_arch_irq' function pointer, that is sometimes set
through set_handle_irq(), and sometimes set directly.
I really think that if the rule is "we can't do accounting in
handle_domain_irq(), because it's too late for arm64", then the fix
really should be to just not do that.
Move the irq_enter()/irq_exit() to the callers - quite possibly far up
the call chain to the root of it all, and just say "architecture code
needs to do this in the low-level code before calling
handle_arch_irq".
Or, if it turns out that 99% of callers do want it - and don't have
the issues arm64 has - maybe we can have a helper wrapper that does
the irq_enter/irq_exit, and another version that doesn't do it, and at
least it's clear to the callers which one it is. But it looks like
particularly with the odd indirection through that handle_arch_irq
function, it's best to just say "this needs to be done".
What architectures actually care? From what I can follow, it's really
GENERIC_IRQ_MULTI_HANDLER that ends up doing this all, and then arm64
has it's own special case for no obvious reason (why isn't it using
GENERIC_IRQ_MULTI_HANDLER that seems to do the EXACT same thing in
generic code except for a special "default != NULL" case?)
Anyway, it _looks_ to me like the pattern is very simple:
Step 1:
- remove irq_enter/irq_exit from handle_domain_irq(), move it to all callers.
This clearly doesn't change anything at all, but also doesn't fix the
problem you have. But it's easy to verify that the code is the same
before-and-after.
Step 2 is the pattern matching step:
- if the caller of handle_domain_irq() ends up being a function that
is registered with set_handle_irq(), then we
(a) remove the irq_enter/irq_exit from it
(b) add it to the architectures that use handle_arch_irq.
(c) make sure that if there are other callers of it (not through
handle_arch_irq) we move that irq_enter/irq_exit into them too
I _suspect_ - but didn't check - that Step 2(c) doesn't actually
exist. But who knows.
It really looks like there is a very tight connection between "uses
handle_domain_irq()" and "uses handle_arch_irq/set_handle_irq()". No?
Is this a bit more work? Yes.
Would this fix arm64? Yes - it ends up effectively doing what you did.
Would this fix _other_ architectures doing the same thing that you
suspect are broken? YES. Instead of leaving them randomly broken.
And most importantly, it would make the rules for "generic" code
clear, and actually generic.
Now, it may be that I'm wrong. I'm willing to be convinced, and if you
beat me over the head enough I guess I can take that pull request and
just be unhappy about it.
So I'm not trying to draw some line in the sand and be an arsehole and
say "you have to do it this way". But I look at that patch in that
pull request, and it really screams "this is horribly wrong" to me.
And I _think_ the above suggestion can be done almost mechanically.
And maybe it's easier to combine step1/step2 into one thing, if that
handle_domain_irq/handle_arch_irq" use is really as tightly bound as
it seems to be from a quick look.
So I'm not saying that they have to be separate steps, I wrote them
out that way mainly to kind of show the logic of the change - and in
case there are situations where you end up calling things without
going through that handle_arch_irq thing.
Linus
Powered by blists - more mailing lists