Message-ID: <CAMzpN2i3AL3cED-XAo-YmaAD5PhjxfwPs9e0JPPNZOkOpu=9HQ@mail.gmail.com>
Date:   Tue, 25 Aug 2020 21:03:05 -0400
From:   Brian Gerst <brgerst@...il.com>
To:     Alexander Graf <graf@...zon.com>
Cc:     Andy Lutomirski <luto@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>,
        Andrew Cooper <andrew.cooper3@...rix.com>,
        X86 ML <x86@...nel.org>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Alexandre Chartre <alexandre.chartre@...cle.com>,
        Frederic Weisbecker <frederic@...nel.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Sean Christopherson <sean.j.christopherson@...el.com>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        Petr Mladek <pmladek@...e.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Joel Fernandes <joel@...lfernandes.org>,
        Boris Ostrovsky <boris.ostrovsky@...cle.com>,
        Juergen Gross <jgross@...e.com>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        Will Deacon <will@...nel.org>,
        Tom Lendacky <thomas.lendacky@....com>,
        Wei Liu <wei.liu@...nel.org>,
        Michael Kelley <mikelley@...rosoft.com>,
        Jason Chen CJ <jason.cj.chen@...el.com>,
        Zhao Yakui <yakui.zhao@...el.com>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>,
        Avi Kivity <avi@...lladb.com>,
        "Herrenschmidt, Benjamin" <benh@...zon.com>, robketr@...zon.de,
        amos@...lladb.com
Subject: Re: [patch V9 21/39] x86/irq: Convey vector as argument and not in ptregs

On Tue, Aug 25, 2020 at 8:04 PM Alexander Graf <graf@...zon.com> wrote:
>
> Hi Andy,
>
> On 26.08.20 01:41, Andy Lutomirski wrote:
> >
> > On Tue, Aug 25, 2020 at 4:18 PM Alexander Graf <graf@...zon.com> wrote:
> >>
> >> Hi Thomas,
> >>
> >> On 25.08.20 12:28, Thomas Gleixner wrote:
> >>> Alex,
> >>>
> >>> On Mon, Aug 24 2020 at 19:29, Alexander Graf wrote:
> >>>> I'm currently trying to understand a performance regression with
> >>>> ScyllaDB on i3en.3xlarge (KVM based VM on Skylake) which we reliably
> >>>> bisected down to this commit:
> >>>>
> >>>>      https://github.com/scylladb/scylla/issues/7036
> >>>>
> >>>> What we're seeing is that syscalls such as membarrier() take forever
> >>>> (0-10 µs would be normal):
> >>> ...
> >>>> That again seems to stem from a vastly slowed down
> >>>> smp_call_function_many_cond():
> >>>>
> >>>> Samples: 218K of event 'cpu-clock', 4000 Hz
> >>>> Overhead  Shared Object        Symbol
> >>>>      94.51%  [kernel]             [k] smp_call_function_many_cond
> >>>>       0.76%  [kernel]             [k] __do_softirq
> >>>>       0.32%  [kernel]             [k] native_queued_spin_lock_slowpath
> >>>> [...]
> >>>>
> >>>> which is stuck in
> >>>>
> >>>>           │     csd_lock_wait():
> >>>>           │             smp_cond_load_acquire(&csd->flags, !(VAL &
> >>>>      0.00 │       mov    0x8(%rcx),%edx
> >>>>      0.00 │       and    $0x1,%edx
> >>>>           │     ↓ je     2b9
> >>>>           │     rep_nop():
> >>>>      0.70 │2af:   pause
> >>>>           │     csd_lock_wait():
> >>>>     92.82 │       mov    0x8(%rcx),%edx
> >>>>      6.48 │       and    $0x1,%edx
> >>>>      0.00 │     ↑ jne    2af
> >>>>      0.00 │2b9: ↑ jmp    282
> >>>>
> >>>>
> >>>> Given the patch at hand I was expecting lost IPIs, but I can't quite see
> >>>> anything getting lost.
> >>>
> >>> I have no idea how that patch should be related to IPI and smp function
> >>> calls. It's changing the way how regular device interrupts and their
> >>> spurious counterpart are handled and not the way how IPIs are
> >>> handled. They are handled by direct vectors and do not go through
> >>> do_IRQ() at all.
> >>>
> >>> Aside of that the commit just changes the way how the interrupt vector
> >>> of a regular device interrupt is stored and conveyed. The extra read and
> >>> write on the cache hot stack is hardly related to anything IPI.
> >>
> >> I am as puzzled as you are, but the bisect is very clear: 79b9c183021e
> >> works fast and 633260fa1 (as well as mainline) shows the weird behavior
> >> above.
> >>
> >> It gets even better. This small (demonstrative only, mangled) patch on
> >> top of 633260fa1 also resolves the performance issue:
> >>
> >> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> >> index c766936..7e91e9a 100644
> >> --- a/arch/x86/kernel/irq.c
> >> +++ b/arch/x86/kernel/irq.c
> >> @@ -239,6 +239,7 @@ __visible void __irq_entry do_IRQ(struct pt_regs *regs, unsigned long vector)
> >>           * lower 8 bits.
> >>           */
> >>          vector &= 0xFF;
> >> +       regs->orig_ax = ~vector;
> >>
> >>          /* entering_irq() tells RCU that we're not quiescent.  Check it. */
> >>          RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
> >>
> >>
> >> To me that sounds like some irq exit code somewhere must still be
> >> looking at orig_ax to decide on something - and that something is wrong
> >> now that we removed the negation of the vector. It also seems to have an
> >> impact on remote function calls.
> >>
> >> I'll have a deeper look tomorrow again if I can find any such place, but
> >> I wouldn't mind if anyone could point me into the right direction
> >> earlier :).
> >
> > How about this:
> >
> > void irq_complete_move(struct irq_cfg *cfg)
> > {
> >          __irq_complete_move(cfg, ~get_irq_regs()->orig_ax);
> > }
> >
> > in arch/x86/kernel/apic/vector.c.
> >
>
> Thanks a lot, I stumbled over the same thing just after I sent the email
> as well and had been trying to see if I can quickly patch it up before I
> fall asleep :).
>
> The code path above is used by the APIC vector move (irqbalance) logic,
> which explains why not everyone was seeing issues.
>
> So with 633260fa1 applied, we never get out of moving state for our IRQ
> because orig_ax is always -1. That means we send an IPI to the cleanup
> vector on every single device interrupt, completely occupying the poor
> CPU that we moved the IRQ from.
>
> I've confirmed that the patch below fixes the issue and will send a
> proper, more complete patch on top of mainline with fancy description
> and stable tag tomorrow.
>
>
> Alex
>
>
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index e7434cd..a474e6e 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -734,7 +734,6 @@ SYM_CODE_START_LOCAL(common_spurious)
>         call    interrupt_entry
>         UNWIND_HINT_REGS indirect=1
>         movq    ORIG_RAX(%rdi), %rsi            /* get vector from stack */
> -       movq    $-1, ORIG_RAX(%rdi)             /* no syscall to restart */
>         call    smp_spurious_interrupt          /* rdi points to pt_regs */
>         jmp     ret_from_intr
>   SYM_CODE_END(common_spurious)
> @@ -746,7 +745,6 @@ SYM_CODE_START_LOCAL(common_interrupt)
>         call    interrupt_entry
>         UNWIND_HINT_REGS indirect=1
>         movq    ORIG_RAX(%rdi), %rsi            /* get vector from stack */
> -       movq    $-1, ORIG_RAX(%rdi)             /* no syscall to restart */
>         call    do_IRQ                          /* rdi points to pt_regs */
>         /* 0(%rsp): old RSP */
>   ret_from_intr:
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 67768e5443..5b6f74e 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -934,7 +934,7 @@ static void __irq_complete_move(struct irq_cfg *cfg, unsigned vector)
>
>   void irq_complete_move(struct irq_cfg *cfg)
>   {
> -       __irq_complete_move(cfg, ~get_irq_regs()->orig_ax);
> +       __irq_complete_move(cfg, get_irq_regs()->orig_ax);
>   }

I think you also need to truncate the vector to 8 bits, since it now
gets sign-extended when pushed into the orig_ax slot.

--
Brian Gerst
