netdev - Re: [PATCH 2/2] can: m_can: fix missed interrupts with m_can_pci

Message-ID: <0ebdf87729fba276b1ff4a06a5f4dad4a3768e8a.camel@ew.tq-group.com>
Date: Thu, 19 Sep 2024 10:58:46 +0200
From: Matthias Schiffer <matthias.schiffer@...tq-group.com>
To: Marc Kleine-Budde <mkl@...gutronix.de>
Cc: Chandrasekar Ramakrishnan <rcsekar@...sung.com>, Vincent Mailhol
 <mailhol.vincent@...adoo.fr>, "David S. Miller" <davem@...emloft.net>, Eric
 Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo
 Abeni <pabeni@...hat.com>,  Martin Hundebøll
 <martin@...nix.com>, Markus Schneider-Pargmann <msp@...libre.com>, "Felipe
 Balbi (Intel)" <balbi@...nel.org>, Raymond Tan <raymond.tan@...el.com>,
 Jarkko Nikula <jarkko.nikula@...ux.intel.com>, linux-can@...r.kernel.org, 
 netdev@...r.kernel.org, linux-kernel@...r.kernel.org, linux@...tq-group.com
Subject: Re: [PATCH 2/2] can: m_can: fix missed interrupts with m_can_pci

On Thu, 2024-09-19 at 10:47 +0200, Marc Kleine-Budde wrote:
> On 18.09.2024 16:21:54, Matthias Schiffer wrote:
> > The interrupt line of PCI devices is interpreted as edge-triggered,
> > however the interrupt signal of the m_can controller integrated in Intel
> > Elkhart Lake CPUs appears to be generated level-triggered.
> > 
> > Consider the following sequence of events:
> > 
> > - IR register is read, interrupt X is set
> > - A new interrupt Y is triggered in the m_can controller
> > - IR register is written to acknowledge interrupt X. Y remains set in IR
> > 
> > As IR has at least one interrupt flag set at every point in this sequence, the
> > m_can interrupt line will never become deasserted, and no edge will ever
> > be observed to trigger another run of the ISR. This was observed to
> > result in the TX queue of the EHL m_can to get stuck under high load,
> > because frames were queued to the hardware in m_can_start_xmit(), but
> > m_can_finish_tx() was never run to account for their successful
> > transmission.
> > 
> > To fix the issue, repeatedly read and acknowledge interrupts at the
> > start of the ISR until no interrupt flags are set, so the next incoming
> > interrupt will also result in an edge on the interrupt line.
> > 
> > Fixes: cab7ffc0324f ("can: m_can: add PCI glue driver for Intel Elkhart Lake")
> > Signed-off-by: Matthias Schiffer <matthias.schiffer@...tq-group.com>
> > ---
> >  drivers/net/can/m_can/m_can.c | 18 +++++++++++++-----
> >  1 file changed, 13 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/net/can/m_can/m_can.c b/drivers/net/can/m_can/m_can.c
> > index 47481afb9add3..363732517c3c5 100644
> > --- a/drivers/net/can/m_can/m_can.c
> > +++ b/drivers/net/can/m_can/m_can.c
> > @@ -1207,20 +1207,28 @@ static void m_can_coalescing_update(struct m_can_classdev *cdev, u32 ir)
> >  static int m_can_interrupt_handler(struct m_can_classdev *cdev)
> >  {
> >  	struct net_device *dev = cdev->net;
> > -	u32 ir;
> > +	u32 ir = 0, ir_read;
> >  	int ret;
> >  
> >  	if (pm_runtime_suspended(cdev->dev))
> >  		return IRQ_NONE;
> >  
> > -	ir = m_can_read(cdev, M_CAN_IR);
> > +	/* For m_can_pci, the interrupt line is interpreted as edge-triggered,
> > +	 * but the m_can controller generates them as level-triggered. We must
> > +	 * observe that IR is 0 at least once to be sure that the next
> > +	 * interrupt will generate an edge.
> > +	 */
> > +	while ((ir_read = m_can_read(cdev, M_CAN_IR)) != 0) {
> > +		ir |= ir_read;
> > +
> > +		/* ACK all irqs */
> > +		m_can_write(cdev, M_CAN_IR, ir);
> > +	}
> 
> This probably causes a measurable overhead on peripheral devices, think
> about limiting this to !peripheral devices or introduce a new quirk that
> is only set for the PCI devices.
> 
> Marc

Hi Marc,

I did consider introducing a flag like that, but is the overhead really significant? In the regular
case (where no new interrupt comes in between reading, writing and re-reading IR), the only added
overhead is one additional register read. On m_can_pci, I've seen the race condition that causes a
second loop iteration to be taken only once in several hundred thousand frames on average.
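
To make the cost argument concrete, here is a minimal user-space sketch of the read/ack loop, not driver code. Everything in it is an assumption made for illustration: struct fake_dev, fake_read_ir(), fake_write_ir() and drain_ir() are hypothetical stand-ins for the M_CAN IR register and the m_can_read()/m_can_write() accessors, and the late_flag field merely simulates an interrupt that is raised right after IR has been read.

```c
#include <stdint.h>

/* Hypothetical model of an IR register with write-1-to-clear semantics. */
struct fake_dev {
	uint32_t ir;         /* currently pending interrupt flags */
	uint32_t late_flag;  /* flag that becomes pending right after the next read */
	unsigned int reads;  /* number of IR reads performed */
	unsigned int writes; /* number of IR writes (acknowledgements) performed */
};

static uint32_t fake_read_ir(struct fake_dev *d)
{
	uint32_t val = d->ir;

	d->reads++;
	/* Simulate a new interrupt arriving between this read and the ack. */
	if (d->late_flag) {
		d->ir |= d->late_flag;
		d->late_flag = 0;
	}
	return val;
}

static void fake_write_ir(struct fake_dev *d, uint32_t val)
{
	d->writes++;
	d->ir &= ~val; /* writing a 1 clears the corresponding flag */
}

/* Same loop shape as in the patch: read and ack until IR reads back as 0. */
static uint32_t drain_ir(struct fake_dev *d)
{
	uint32_t ir = 0, ir_read;

	while ((ir_read = fake_read_ir(d)) != 0) {
		ir |= ir_read;
		fake_write_ir(d, ir);
	}
	return ir;
}
```

With one flag pending and nothing arriving in between, this model performs two reads and one write, i.e. one read more than a single read-and-ack. Only when a flag arrives between the read and the ack does it take a second iteration (three reads, two writes).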

Or are register reads and writes so much slower on peripheral devices that a new interrupt is more
likely to arrive in between? If that is the case, it would indeed make sense to limit this to
instances with edge-triggered IRQ.
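
For discussion, limiting the loop to such instances could look roughly like the following user-space sketch. It is only an illustration under stated assumptions: the irq_edge_triggered field is a hypothetical flag that does not exist in struct m_can_classdev, and sketch_read_ir()/sketch_write_ir() are stand-ins for the real register accessors.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device model; irq_edge_triggered is an assumed quirk flag. */
struct sketch_dev {
	uint32_t ir;             /* pending interrupt flags (write 1 to clear) */
	bool irq_edge_triggered; /* set only where the IRQ line is edge-triggered */
	unsigned int reads;      /* number of IR reads performed */
};

static uint32_t sketch_read_ir(struct sketch_dev *d)
{
	d->reads++;
	return d->ir;
}

static void sketch_write_ir(struct sketch_dev *d, uint32_t val)
{
	d->ir &= ~val;
}

static uint32_t sketch_ack_irqs(struct sketch_dev *d)
{
	uint32_t ir = 0, ir_read;

	if (!d->irq_edge_triggered) {
		/* Level-triggered line: a single read and ack is enough. */
		ir = sketch_read_ir(d);
		if (ir)
			sketch_write_ir(d, ir);
		return ir;
	}

	/* Edge-triggered line: IR must be seen as 0 once before returning. */
	while ((ir_read = sketch_read_ir(d)) != 0) {
		ir |= ir_read;
		sketch_write_ir(d, ir);
	}
	return ir;
}
```

In this model, a device without the flag keeps the single register read of the current code, while a device with the flag pays the extra read needed to observe IR as 0.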

Matthias



> 
> > +
> >  	m_can_coalescing_update(cdev, ir);
> >  	if (!ir)
> >  		return IRQ_NONE;
> >  
> > -	/* ACK all irqs */
> > -	m_can_write(cdev, M_CAN_IR, ir);
> > -
> >  	if (cdev->ops->clear_interrupts)
> >  		cdev->ops->clear_interrupts(cdev);
> >  
> > -- 
> > TQ-Systems GmbH | Mühlstraße 2, Gut Delling | 82229 Seefeld, Germany
> > Amtsgericht München, HRB 105018
> > Geschäftsführer: Detlef Schneider, Rüdiger Stahl, Stefan Schneider
> > https://www.tq-group.com/
> > 
> > 
> > 
> 

-- 
TQ-Systems GmbH | Mühlstraße 2, Gut Delling | 82229 Seefeld, Germany
Amtsgericht München, HRB 105018
Geschäftsführer: Detlef Schneider, Rüdiger Stahl, Stefan Schneider
https://www.tq-group.com/