linux-kernel - Re: [GIT PULL] arm64 fixes for 5.15-rc5

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20211011104729.GB1421@C02TD0UTHF1T.local>
Date:   Mon, 11 Oct 2021 11:47:29 +0100
From:   Mark Rutland <mark.rutland@....com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Catalin Marinas <catalin.marinas@....com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Will Deacon <will@...nel.org>, Marc Zyngier <maz@...nel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Linux ARM <linux-arm-kernel@...ts.infradead.org>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [GIT PULL] arm64 fixes for 5.15-rc5

Hi Linus,

[adding Paul and Peter, just in case we need to discuss the RCU and
accounting bits further]

On Fri, Oct 08, 2021 at 01:25:51PM -0700, Linus Torvalds wrote:
> On Fri, Oct 8, 2021 at 11:37 AM Catalin Marinas <catalin.marinas@....com> wrote:
> >
> > Pingfan Liu (2):
> >       kernel/irq: make irq_{enter,exit}() in handle_domain_irq() arch optional
> >       arm64: entry: avoid double-accounting IRQ RCU entry
> 
> Ugh. This is *really* ugly. And it seems to be going exactly the wrong way.
> 
> I read the commit descriptions, and it still doesn't answer the
> fundamental question of why arm64 needs to do the accounting in
> arch-specific code, and disable the generic code.
> 
> It says
> 
>     To fix this, we must perform all the accounting from the architecture
>     code. We prevent the IRQ domain code from performing any accounting by
>     selecting HAVE_ARCH_IRQENTRY, and must call irq_enter_rcu() and
>     irq_exit_rcu() around invoking the root IRQ handler.
> 
> but at no point does it actually explain *why* all the accounting
> needs to be done by the architecture code.

Sorry; I agree that commit messages don't explain this thoroughly. I can
go and rework the commit messages to clarify things, if you'd like?

The TL;DR is that a bunch of constraints conspire such that we can't
defer accounting things to the irqdomain or irqchip code without making
this even more complicated/fragile, and many architectures get this
subtly wrong today -- it's not that arm64 is necessarily much different
from other architectures using this code, just that we're trying to
solve this first.

More specifically, the problem here is the combination of:

* During entry, rcu_irq_enter() must be called *before*
  trace_hardirqs_off_finish(), because the latter needs RCU to be
  watching, but we need to have informed lockdep before poking RCU or
  lockdep will complain withing the RCU code. To handle that,
  kernel/entry/common.c and arch/arm64/kernel/entry-common.c have the
  sequence:

  lockdep_hardirqs_off(CALLER_ADDR0);
  rcu_irq_enter(): // or rcu_irq_enter_check_tick();
  trace_hardirqs_off_finish();

  ... and since we don't want something to come in the middle of the
  sequence, we want those sandwiched together in a noinstr function
  called before we invoke the root irqchip's handler function.

  A bunch of architectures don't follow this sequence, and are
  potentially subtly broken today in some configurations.

  Note: the plan is to move arm64 over to the generic entry code, but
  that has no effect on this specific issue.

* rcu_irq_enter() must be called *precisely once* upon taking an
  interrupt exception, or we can end up mis-accounting quiescent periods
  as non-quiescent (since this maintains a nesting count in
  rcu_data::dynticks_nmi_nesting, checked by
  rcu_is_cpu_rrupt_from_idle()).

* Prior to these patches, we call rcu_irq_enter() at least twice on
  arm64, because the architectural entry code must call rcu_irq_enter()
  for the lockdep bits, then when we invoke the irqchip (e.g. GICv3), we
  do:

  gic_handle_irq()
    handle_domain_irq()
      irq_enter()
        rcu_irq_enter()
	irq_enter_rcu()

  ... and so if we take a sched clock IRQ and call
  rcu_sched_clock_irq(), rcu_is_cpu_rrupt_from_idle() may return false
  when it should return true(), and we don't decide to preempt the task,
  leading to stalls.

  Note that since irqchips can be chained (e.g. there can be a second
  interrupt controller fed into the GIC), handle_domain_irq() can be
  nested, so even if we somehow removed the accounting from arch code
  we'd need to handle the nested calls to handle_domain_irq() somehow.

  AFAICT it's far simpler to do that *once* outside of the irqchip code
  than it is to try to handle that nesting.

  Note that x86 doesn't use handle_domain_irq(), and just maps the raw
  irqnrs itself, and just calls irq_enter_rcu() when transitioning to
  the IRQ stack.

> Yes, yes, I read the previous paragraph. But why isn't the fix to just
> stop doing the double accounting in the arm64 specific code?

As above, that'd require moving *some* of the low-level entry logic into
the irqchip/irqdomain code, and handling nesting, which is *more*
fragile.

FWIW, we do need to fix the other architectures too, but that involves
more substantial rework to each of those (e.g. moving to generic entry),
since there are a bunch of problems today. The thinking was that this
gave a way to fix each of those on its own, then remove the old
behaviour.

Does that all make sense to you?

Thanks,
Mark.