Message-ID: <20240415183616.GDZh1zoFsBzvAEduRo@fat_crate.local>
Date: Mon, 15 Apr 2024 20:36:16 +0200
From: Borislav Petkov <bp@...en8.de>
To: Serge Semin <fancer.lancer@...il.com>
Cc: Michal Simek <michal.simek@....com>,
Alexander Stein <alexander.stein@...tq-group.com>,
Tony Luck <tony.luck@...el.com>, James Morse <james.morse@....com>,
Mauro Carvalho Chehab <mchehab@...nel.org>,
Robert Richter <rric@...nel.org>, Dinh Nguyen <dinguyen@...nel.org>,
Punnaiah Choudary Kalluri <punnaiah.choudary.kalluri@...inx.com>,
Arnd Bergmann <arnd@...db.de>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
linux-arm-kernel@...ts.infradead.org, linux-edac@...r.kernel.org,
linux-kernel@...r.kernel.org, Sherry Sun <sherry.sun@....com>,
Borislav Petkov <bp@...e.de>
Subject: Re: [PATCH v5 01/20] EDAC/synopsys: Fix ECC status data and IRQ
disable race condition
On Thu, Feb 22, 2024 at 09:12:46PM +0300, Serge Semin wrote:
> The race condition around the ECCCLR register access happens in the IRQ
> disable method called in the device remove() procedure and in the ECC IRQ
> handler:
> 1. Enable IRQ:
>    a. ECCCLR = EN_CE | EN_UE
> 2. Disable IRQ:
>    a. ECCCLR = 0
> 3. IRQ handler:
>    a. ECCCLR = CLR_UE | CLR_UE_CNT | CLR_CE | CLR_CE_CNT
>    b. ECCCLR = 0
>    c. ECCCLR = EN_CE | EN_UE
> So if the IRQ disabling procedure is called concurrently with the IRQ
> handler method the IRQ might be actually left enabled due to the
> statement 3c.
>
> The root cause of the problem is that the ECCCLR register (which since
> v3.10a has been called ECCCTL) has intermixed ECC status data clear flags and
> the IRQ enable/disable flags. Thus the IRQ disabling (clear EN flags) and
> handling (write 1 to clear ECC status data) procedures must be serialised
> around the ECCCTL register modification to prevent the race.
>
> So fix the problem described above by adding the spin-lock around the
> ECCCLR modifications and preventing the IRQ-handler from modifying the
> IRQs enable flags (there is no point in disabling the IRQ and then
> re-enabling it again within a single IRQ handler call, see the statements
> 3a/3b and 3c above).
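
The interleaving in the quoted description can be replayed deterministically
in user space. This is only a sketch: the bit positions, the ecc_clr variable
and all function names below are hypothetical stand-ins for the numbered steps,
not the driver's code or the real register map.

```c
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define CLR_CE      (1u << 0)
#define CLR_UE      (1u << 1)
#define CLR_CE_CNT  (1u << 2)
#define CLR_UE_CNT  (1u << 3)
#define EN_CE       (1u << 8)
#define EN_UE       (1u << 9)
#define EN_MASK     (EN_CE | EN_UE)

static uint32_t ecc_clr;	/* stands in for the ECCCLR register */

static void step_1a_enable(void)  { ecc_clr = EN_CE | EN_UE; }
static void step_2a_disable(void) { ecc_clr = 0; }
static void step_3a_handler(void) { ecc_clr = CLR_UE | CLR_UE_CNT | CLR_CE | CLR_CE_CNT; }
static void step_3b_handler(void) { ecc_clr = 0; }
static void step_3c_handler(void) { ecc_clr = EN_CE | EN_UE; }

/* remove() lands between 3b and 3c: returns 1 if the IRQ is left enabled. */
static int replay_race(void)
{
	step_1a_enable();
	step_3a_handler();	/* handler starts */
	step_3b_handler();
	step_2a_disable();	/* disable runs concurrently */
	step_3c_handler();	/* handler finishes and re-enables */
	return (ecc_clr & EN_MASK) != 0;
}

/* Handler runs to completion before the disable: returns 0. */
static int replay_serialised(void)
{
	step_1a_enable();
	step_3a_handler();
	step_3b_handler();
	step_3c_handler();
	step_2a_disable();
	return (ecc_clr & EN_MASK) != 0;
}
```

replay_race() ends with the enable flags set even though the disable step ran,
which is the outcome attributed to statement 3c above.
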

So I'm looking at the code and am looking at this and wondering how we
even ended up in this mess?!

An interrupt handler should not *enable* the interrupt again - that's
just crazy. And I should've seen that in

  4bcffe941758 ("EDAC/synopsys: Re-enable the error interrupts on v3 hw")

and stopped it right there. But well, it is what it is...

So I'd like to see the following flow:

* on init, the interrupt is enabled with enable_intr() *after*
  registering the interrupt handler.

* on exit, the interrupt is disabled with disable_intr() and then no
  interrupts are coming in anymore.

And then I don't think you'll need the spinlock and it'll be sane
design.

Right?
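
A minimal user-space sketch of that flow, under stated assumptions: only
enable_intr() and disable_intr() are names from this thread; the bit layout,
reg_write(), model_probe(), model_remove() and the self-clearing behaviour of
the clear flags are hypothetical. It is a single-threaded model, so it does not
settle whether the handler's register write needs serialising against
disable_intr().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define CLR_MASK 0x00fu	/* write-1-to-clear ECC status flags */
#define EN_MASK  0x300u	/* CE/UE interrupt enable flags */

static uint32_t ecc_ctl;		/* stands in for ECCCTL */
static bool handler_registered;		/* stands in for the registered IRQ handler */
static unsigned int pending_errors;	/* stands in for latched ECC status */

/* Assumption: clear flags take effect on write and read back as zero. */
static void reg_write(uint32_t val)
{
	if (val & CLR_MASK)
		pending_errors = 0;
	ecc_ctl = val & EN_MASK;
}

static void enable_intr(void)  { reg_write(ecc_ctl | EN_MASK); }
static void disable_intr(void) { reg_write(ecc_ctl & ~EN_MASK); }

static bool irq_enabled(void)  { return (ecc_ctl & EN_MASK) != 0; }

/* The handler only clears status; it leaves the enable flags as found. */
static void intr_handler(void)
{
	assert(handler_registered);
	reg_write(ecc_ctl | CLR_MASK);
}

/* init: register the handler first, then enable the interrupt. */
static void model_probe(void)
{
	handler_registered = true;
	enable_intr();
}

/* exit: disable the interrupt first, then drop the handler. */
static void model_remove(void)
{
	disable_intr();
	handler_registered = false;
}
```

In this model the handler never touches the enable flags on its own initiative,
so once model_remove() has called disable_intr() nothing sets them again.
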
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette