Message-ID: <86ilizmi40.wl-maz@kernel.org>
Date:   Mon, 28 Nov 2022 11:30:23 +0000
From:   Marc Zyngier <maz@...nel.org>
To:     Anup Patel <apatel@...tanamicro.com>
Cc:     Palmer Dabbelt <palmer@...belt.com>,
        Paul Walmsley <paul.walmsley@...ive.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Daniel Lezcano <daniel.lezcano@...aro.org>,
        Atish Patra <atishp@...shpatra.org>,
        Alistair Francis <Alistair.Francis@....com>,
        Anup Patel <anup@...infault.org>,
        linux-riscv@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v12 3/7] genirq: Add mechanism to multiplex a single HW IPI

On Mon, 28 Nov 2022 11:13:30 +0000,
Anup Patel <apatel@...tanamicro.com> wrote:
> 
> On Mon, Nov 28, 2022 at 4:04 PM Marc Zyngier <maz@...nel.org> wrote:
> >
> > On Sat, 26 Nov 2022 17:34:49 +0000,
> > Anup Patel <apatel@...tanamicro.com> wrote:
> > >
> > > +static void ipi_mux_send_mask(struct irq_data *d, const struct cpumask *mask)
> > > +{
> > > +     u32 ibit = BIT(irqd_to_hwirq(d));
> > > +     struct ipi_mux_cpu *icpu = this_cpu_ptr(ipi_mux_pcpu);
> > > +     struct cpumask *send_mask = &icpu->send_mask;
> > > +     unsigned long flags;
> > > +     int cpu;
> > > +
> > > +     /*
> > > +      * We use send_mask as a per-CPU variable so disable local
> > > +      * interrupts to avoid being preempted.
> > > +      */
> > > +     local_irq_save(flags);
> >
> > The correct way to avoid preemption is to use preempt_disable(), which
> > is a lot cheaper than disabling interrupts on most architectures.
> 
> Okay, I will update.
> 
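For the avoidance of doubt, this is roughly the shape I have in mind
(sketch only, the rest of the function unchanged):

	struct ipi_mux_cpu *icpu;
	struct cpumask *send_mask;

	/*
	 * preempt_disable() pins us to this CPU, which is all that is
	 * needed to safely use the per-CPU send_mask; masking
	 * interrupts is overkill for that.
	 */
	preempt_disable();
	icpu = this_cpu_ptr(ipi_mux_pcpu);
	send_mask = &icpu->send_mask;

	[...]

	preempt_enable();
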
> >
> > > +
> > > +     cpumask_clear(send_mask);
> >
> > This thing is likely to be unnecessarily expensive on very large
> > systems, as it is proportional to the number of CPUs.
> >
> > > +
> > > +     for_each_cpu(cpu, mask) {
> > > +             icpu = per_cpu_ptr(ipi_mux_pcpu, cpu);
> > > +             atomic_or(ibit, &icpu->bits);
> >
> > The original code had an atomic_fetch_or_release() to allow eliding
> > the IPI if the target interrupt was already pending. Why is that code
> > gone? This is a pretty cheap and efficient optimisation.
> 
> That optimization is causing RCU stalls on the QEMU RISC-V virt
> machine with a large number of CPUs.

Then there is a bug somewhere, either in the implementation of the
atomic operations or in QEMU. Or maybe even in the original code
(though this looks unlikely given how heavily this is used on actual
HW - I'm typing this email from one of these machines, and I'd be
pretty annoyed if I was missing IPIs).

In any case, please don't paper over this.
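
For reference, the elided version looks roughly like this ("pending"
standing in for a local in the earlier revision), paired with the same
smp_mb__after_atomic() as in the hunk above:

		pending = atomic_fetch_or_release(ibit, &icpu->bits);

		/*
		 * Order the RMW above against the enable read below,
		 * as in the current code, to avoid racing with
		 * ipi_mux_unmask().
		 */
		smp_mb__after_atomic();

		/*
		 * If the bit was already pending, an IPI is already in
		 * flight and the target CPU will consume this source
		 * too: no need to send another one.
		 */
		if (!(pending & ibit) && (atomic_read(&icpu->enable) & ibit))
			cpumask_set_cpu(cpu, send_mask);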

> 
> >
> > > +
> > > +             /*
> > > +              * The atomic_or() above must complete before
> > > +              * the atomic_read() below to avoid racing with
> > > +              * ipi_mux_unmask().
> > > +              */
> > > +             smp_mb__after_atomic();
> > > +
> > > +             if (atomic_read(&icpu->enable) & ibit)
> > > +                     cpumask_set_cpu(cpu, send_mask);
> > > +     }
> > > +
> > > +     /* Trigger the parent IPI */
> > > +     ipi_mux_send(send_mask);
> >
> > IPIs are very rarely made pending on more than a single CPU at a
> > time. The overwhelming majority of them target a single CPU, so
> > accumulating bits to avoid doing two or more "send" actions only
> > penalises the common case.
> >
> > My conclusion is that this "send_mask" can probably be removed,
> > together with the preemption fiddling.
> 
> So, we should call ipi_mux_send() for one target CPU at a time?

I think so, as it matches my measurements from a few years ago. It
also simplifies things significantly, leading to better performance
for the common case. Add some instrumentation and see whether this is
still the case though.
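
Something along these lines (sketch only, assuming ipi_mux_send() is
changed to take a single target CPU instead of a mask):

static void ipi_mux_send_mask(struct irq_data *d, const struct cpumask *mask)
{
	u32 ibit = BIT(irqd_to_hwirq(d));
	struct ipi_mux_cpu *icpu;
	int cpu;

	for_each_cpu(cpu, mask) {
		icpu = per_cpu_ptr(ipi_mux_pcpu, cpu);
		atomic_or(ibit, &icpu->bits);

		/* Order the RMW above against the enable read below */
		smp_mb__after_atomic();

		if (atomic_read(&icpu->enable) & ibit)
			ipi_mux_send(cpu);
	}
}

No send_mask, no preemption games, and the single-CPU case does a
single send.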

> 
> >
> > > +
> > > +     local_irq_restore(flags);
> > > +}
> > > +
> > > +static const struct irq_chip ipi_mux_chip = {
> > > +     .name           = "IPI Mux",
> > > +     .irq_mask       = ipi_mux_mask,
> > > +     .irq_unmask     = ipi_mux_unmask,
> > > +     .ipi_send_mask  = ipi_mux_send_mask,
> > > +};
> >
> > OK, you have now dropped the superfluous pre/post handlers. But the
> > need still exists. Case in point, the aic_handle_ipi() prologue and
> > epilogue to the interrupt handling. I have suggested last time that
> > the driver could provide the actual struct irq_chip in order to
> > provide the callbacks it requires.
> 
> aic_handle_ipi() can simply call ipi_mux_process() between
> the prologue and epilogue.

Hmm. OK. That's not what I had in mind, but fair enough.
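
Something like this on the AIC side, then (sketch; aic_ipi_ack() and
aic_ipi_enable() stand in for whatever the driver actually does in its
prologue and epilogue):

static void aic_handle_ipi(struct pt_regs *regs)
{
	/* Prologue: ack the HW IPI so that it can fire again */
	aic_ipi_ack();

	/* Demux and handle all pending muxed IPIs */
	ipi_mux_process();

	/* Epilogue: re-enable the HW IPI */
	aic_ipi_enable();
}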

	M.

-- 
Without deviation from the norm, progress is not possible.
