Message-ID: <b1fdf037-9c11-9f47-f285-e9a0843d648a@amazon.com>
Date:   Wed, 26 Aug 2020 02:04:03 +0200
From:   Alexander Graf <graf@...zon.com>
To:     Andy Lutomirski <luto@...nel.org>
CC:     Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>,
        Andrew Cooper <andrew.cooper3@...rix.com>,
        X86 ML <x86@...nel.org>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Alexandre Chartre <alexandre.chartre@...cle.com>,
        Frederic Weisbecker <frederic@...nel.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Sean Christopherson <sean.j.christopherson@...el.com>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        Petr Mladek <pmladek@...e.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        "Joel Fernandes" <joel@...lfernandes.org>,
        Boris Ostrovsky <boris.ostrovsky@...cle.com>,
        Juergen Gross <jgross@...e.com>,
        Brian Gerst <brgerst@...il.com>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        Will Deacon <will@...nel.org>,
        "Tom Lendacky" <thomas.lendacky@....com>,
        Wei Liu <wei.liu@...nel.org>,
        "Michael Kelley" <mikelley@...rosoft.com>,
        Jason Chen CJ <jason.cj.chen@...el.com>,
        Zhao Yakui <yakui.zhao@...el.com>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>,
        Avi Kivity <avi@...lladb.com>,
        "Herrenschmidt, Benjamin" <benh@...zon.com>, <robketr@...zon.de>,
        <amos@...lladb.com>
Subject: Re: [patch V9 21/39] x86/irq: Convey vector as argument and not in
 ptregs

Hi Andy,

On 26.08.20 01:41, Andy Lutomirski wrote:
> 
> On Tue, Aug 25, 2020 at 4:18 PM Alexander Graf <graf@...zon.com> wrote:
>>
>> Hi Thomas,
>>
>> On 25.08.20 12:28, Thomas Gleixner wrote:
>>> Alex,
>>>
>>> On Mon, Aug 24 2020 at 19:29, Alexander Graf wrote:
>>>> I'm currently trying to understand a performance regression with
>>>> ScyllaDB on i3en.3xlarge (KVM based VM on Skylake) which we reliably
>>>> bisected down to this commit:
>>>>
>>>>      https://github.com/scylladb/scylla/issues/7036
>>>>
>>>> What we're seeing is that syscalls such as membarrier() take forever
>>>> (0-10 µs would be normal):
>>> ...
>>>> That again seems to stem from a vastly slowed down
>>>> smp_call_function_many_cond():
>>>>
>>>> Samples: 218K of event 'cpu-clock', 4000 Hz
>>>> Overhead  Shared Object        Symbol
>>>>      94.51%  [kernel]             [k] smp_call_function_many_cond
>>>>       0.76%  [kernel]             [k] __do_softirq
>>>>       0.32%  [kernel]             [k] native_queued_spin_lock_slowpath
>>>> [...]
>>>>
>>>> which is stuck in
>>>>
>>>>           │     csd_lock_wait():
>>>>           │             smp_cond_load_acquire(&csd->flags, !(VAL &
>>>>      0.00 │       mov    0x8(%rcx),%edx
>>>>      0.00 │       and    $0x1,%edx
>>>>           │     ↓ je     2b9
>>>>           │     rep_nop():
>>>>      0.70 │2af:   pause
>>>>           │     csd_lock_wait():
>>>>     92.82 │       mov    0x8(%rcx),%edx
>>>>      6.48 │       and    $0x1,%edx
>>>>      0.00 │     ↑ jne    2af
>>>>      0.00 │2b9: ↑ jmp    282
>>>>
>>>>
>>>> Given the patch at hand I was expecting lost IPIs, but I can't quite see
>>>> anything getting lost.
>>>
>>> I have no idea how that patch should be related to IPI and smp function
>>> calls. It's changing the way how regular device interrupts and their
>>> spurious counterpart are handled and not the way how IPIs are
>>> handled. They are handled by direct vectors and do not go through
>>> do_IRQ() at all.
>>>
>>> Aside of that the commit just changes the way how the interrupt vector
>>> of a regular device interrupt is stored and conveyed. The extra read and
>>> write on the cache hot stack is hardly related to anything IPI.
>>
>> I am as puzzled as you are, but the bisect is very clear: 79b9c183021e
>> works fast and 633260fa1 (as well as mainline) shows the weird behavior
>> above.
>>
>> It gets even better. This small (demonstrative only, mangled) patch on
>> top of 633260fa1 also resolves the performance issue:
>>
>> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
>> index c766936..7e91e9a 100644
>> --- a/arch/x86/kernel/irq.c
>> +++ b/arch/x86/kernel/irq.c
>> @@ -239,6 +239,7 @@ __visible void __irq_entry do_IRQ(struct pt_regs *regs, unsigned long vector)
>>           * lower 8 bits.
>>           */
>>          vector &= 0xFF;
>> +       regs->orig_ax = ~vector;
>>
>>          /* entering_irq() tells RCU that we're not quiescent.  Check it. */
>>          RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
>>
>>
>> To me that sounds like some irq exit code somewhere must still be
>> looking at orig_ax to decide on something - and that something is wrong
>> now that we removed the negation of the vector. It also seems to have an
>> impact on remote function calls.
>>
>> I'll have a deeper look tomorrow again if I can find any such place, but
>> I wouldn't mind if anyone could point me into the right direction
>> earlier :).
> 
> How about this:
> 
> void irq_complete_move(struct irq_cfg *cfg)
> {
>          __irq_complete_move(cfg, ~get_irq_regs()->orig_ax);
> }
> 
> in arch/x86/kernel/apic/vector.c.
> 

Thanks a lot! I stumbled over the same thing just after I sent the email 
and have been trying to see whether I can quickly patch it up before I 
fall asleep :).

The code path above is used by the APIC vector move (irqbalance) logic, 
which explains why not everyone was seeing issues.

So with 633260fa1 applied, we never get out of the moving state for our 
IRQ, because orig_ax is now always -1 and ~orig_ax therefore never 
matches the vector being moved. That means we send an IPI to the cleanup 
vector on every single device interrupt, completely occupying the poor 
CPU that we moved the IRQ from.

I've confirmed that the patch below fixes the issue and will send a 
proper, more complete patch on top of mainline with fancy description 
and stable tag tomorrow.


Alex



diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e7434cd..a474e6e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -734,7 +734,6 @@ SYM_CODE_START_LOCAL(common_spurious)
  	call	interrupt_entry
  	UNWIND_HINT_REGS indirect=1
  	movq	ORIG_RAX(%rdi), %rsi		/* get vector from stack */
-	movq	$-1, ORIG_RAX(%rdi)		/* no syscall to restart */
  	call	smp_spurious_interrupt		/* rdi points to pt_regs */
  	jmp	ret_from_intr
  SYM_CODE_END(common_spurious)
@@ -746,7 +745,6 @@ SYM_CODE_START_LOCAL(common_interrupt)
  	call	interrupt_entry
  	UNWIND_HINT_REGS indirect=1
  	movq	ORIG_RAX(%rdi), %rsi		/* get vector from stack */
-	movq	$-1, ORIG_RAX(%rdi)		/* no syscall to restart */
  	call	do_IRQ				/* rdi points to pt_regs */
  	/* 0(%rsp): old RSP */
  ret_from_intr:
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 67768e5443..5b6f74e 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -934,7 +934,7 @@ static void __irq_complete_move(struct irq_cfg *cfg, unsigned vector)

  void irq_complete_move(struct irq_cfg *cfg)
  {
-	__irq_complete_move(cfg, ~get_irq_regs()->orig_ax);
+	__irq_complete_move(cfg, get_irq_regs()->orig_ax);
  }

  /*



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879

