lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20240820155338.GB28750@willie-the-truck>
Date: Tue, 20 Aug 2024 16:53:39 +0100
From: Will Deacon <will@...nel.org>
To: Doug Anderson <dianders@...omium.org>
Cc: Catalin Marinas <catalin.marinas@....com>,
	Mark Rutland <mark.rutland@....com>,
	Sumit Garg <sumit.garg@...aro.org>, Yu Zhao <yuzhao@...gle.com>,
	Misono Tomohiro <misono.tomohiro@...itsu.com>,
	Stephen Boyd <swboyd@...omium.org>, Chen-Yu Tsai <wens@...e.org>,
	Marc Zyngier <maz@...nel.org>,
	Daniel Thompson <daniel.thompson@...aro.org>,
	D Scott Phillips <scott@...amperecomputing.com>,
	Frederic Weisbecker <frederic@...nel.org>,
	"Guilherme G. Piccoli" <gpiccoli@...lia.com>,
	James Morse <james.morse@....com>, Kees Cook <kees@...nel.org>,
	Tony Luck <tony.luck@...el.com>,
	linux-arm-kernel@...ts.infradead.org,
	linux-hardening@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] arm64: smp: smp_send_stop() and crash_smp_send_stop()
 should try non-NMI first

On Mon, Aug 05, 2024 at 01:14:12PM -0700, Doug Anderson wrote:
> On Mon, Aug 5, 2024 at 10:24 AM Will Deacon <will@...nel.org> wrote:
> > On Mon, Aug 05, 2024 at 10:13:11AM -0700, Doug Anderson wrote:
> > > On Mon, Aug 5, 2024 at 9:53 AM Will Deacon <will@...nel.org> wrote:
> > > > On Tue, Jun 25, 2024 at 04:07:22PM -0700, Douglas Anderson wrote:
> > > > > @@ -1084,79 +1088,87 @@ static inline unsigned int num_other_online_cpus(void)
> > > > >
> > > > >  void smp_send_stop(void)
> > > > >  {
> > > > > +     static unsigned long stop_in_progress;
> > > > > +     cpumask_t mask;
> > > > >       unsigned long timeout;
> > > > >
> > > > > -     if (num_other_online_cpus()) {
> > > > > -             cpumask_t mask;
> > > > > +     /*
> > > > > +      * If this cpu is the only one alive at this point in time, online or
> > > > > +      * not, there are no stop messages to be sent around, so just back out.
> > > > > +      */
> > > > > +     if (num_other_online_cpus() == 0)
> > > > > +             goto skip_ipi;
> > > > >
> > > > > -             cpumask_copy(&mask, cpu_online_mask);
> > > > > -             cpumask_clear_cpu(smp_processor_id(), &mask);
> > > > > +     /* Only proceed if this is the first CPU to reach this code */
> > > > > +     if (test_and_set_bit(0, &stop_in_progress))
> > > > > +             return;
> > > > >
> > > > > -             if (system_state <= SYSTEM_RUNNING)
> > > > > -                     pr_crit("SMP: stopping secondary CPUs\n");
> > > > > -             smp_cross_call(&mask, IPI_CPU_STOP);
> > > > > -     }
> > > > > +     cpumask_copy(&mask, cpu_online_mask);
> > > > > +     cpumask_clear_cpu(smp_processor_id(), &mask);
> > > > >
> > > > > -     /* Wait up to one second for other CPUs to stop */
> > > > > +     if (system_state <= SYSTEM_RUNNING)
> > > > > +             pr_crit("SMP: stopping secondary CPUs\n");
> > > > > +
> > > > > +     /*
> > > > > +      * Start with a normal IPI and wait up to one second for other CPUs to
> > > > > +      * stop. We do this first because it gives other processors a chance
> > > > > +      * to exit critical sections / drop locks and makes the rest of the
> > > > > +      * stop process (especially console flush) more robust.
> > > > > +      */
> > > > > +     smp_cross_call(&mask, IPI_CPU_STOP);
> > > >
> > > > I realise you've moved this out of crash_smp_send_stop() and it looks
> > > > like we inherited the code from x86, but do you know how this serialise
> > > > against CPU hotplug operations? I've spent the last 20m looking at the
> > > > code and I can't see what prevents other CPUs from coming and going
> > > > while we're trying to IPI a non-atomic copy of 'cpu_online_mask'.
> > >
> > > I don't think there is anything. ...and it's not just this code
> > > either. It sure looks like nmi_trigger_cpumask_backtrace() has the
> > > same problem.
> > >
> > > I guess maybe in the case of nmi_trigger_cpumask_backtrace() it's not
> > > such a big deal because:
> > > 1. If a CPU goes away then we'll just time out
> > > 2. If a CPU shows up then we'll skip backtracing it, but we were
> > > sending backtraces at an instant in time anyway.
> > >
> > > In the case of smp_send_stop() it's probably fine if a CPU goes away
> > > because, again, we'll just timeout. ...but if a CPU shows up then
> > > that's not super ideal. Maybe it doesn't cause problems in practice
> > > but it does feel like it should be fixed.
> >
> > On the other hand, I really don't fancy taking a CPU hotplug mutex on
> > the panic() path, so it's hard to know what's best.
> >
> > I suppose one thing we could do is recompute the mask before sending the
> > NMI, which should make it a little more robust in that case. It still
> > feels fragile, but it's no worse than the existing code, I suppose.
> 
> Happy to do this and send a new version if you want. Just let me know.
> Basically it's just replacing `cpumask_and(&mask, &mask,
> cpu_online_mask)`  with `cpumask_copy(&mask, cpu_online_mask);
> cpumask_clear_cpu(smp_processor_id(), &mask);` in the case where we
> fallback to NMI. If you're happy with the patch I'm also happy if you
> make this change while applying.

Please send a v3 with that along with a little comment summarising this
discussion so I don't confuse myself again when I look at next time :)

Thanks,

Will

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ