netdev - Re: [PATCH] can: mcp251xfd: Always stop on BUS_OFF and call netif_stop

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <cd752cb5-c244-82d1-9fbe-bb63cca222d4@topic.nl>
Date: Wed, 12 Jul 2023 07:43:05 +0200
From: Mike Looijmans <mike.looijmans@...ic.nl>
To: Marc Kleine-Budde <mkl@...gutronix.de>
CC: linux-can@...r.kernel.org, "David S. Miller" <davem@...emloft.net>,
 Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
 Manivannan Sadhasivam <mani@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Thomas Kopp <thomas.kopp@...rochip.com>,
 Wolfgang Grandegger <wg@...ndegger.com>, linux-kernel@...r.kernel.org,
 netdev@...r.kernel.org
Subject: Re: [PATCH] can: mcp251xfd: Always stop on BUS_OFF and call
 netif_stop_queue

Met vriendelijke groet / kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products B.V.
Materiaalweg 4, 5681 RJ Best
The Netherlands

T: +31 (0) 499 33 69 69
E: mike.looijmans@...icproducts.com
W: www.topic.nl

Please consider the environment before printing this e-mail
On 11-07-2023 19:05, Marc Kleine-Budde wrote:
> On 11.07.2023 17:26:47, Mike Looijmans wrote:
>> When there's an error attempting to store the BER counter, don't abort
>> but continue shutting down the chip as required.
> 
> If you refer to an error by __mcp251xfd_get_berr_counter(), it's not a
> store error, but a failure of regmap_read(priv->map_reg,
> MCP251XFD_REG_TREC, &trec). By default the SPI transfers are CRC enabled
> and no/wrong data from the chip will be detected and return an error
> here (after 3 retires). In out of memory conditions or other kernel
> errors might be possible here, too.
> 
> Have you seen a problem here? But as we shut down the chip here anyways,
> we can ignore the error here.

Excactly my point, the important task here is to report the bus off state.

I haven't seen the error trigger, so it wasn't the cause of our problems.

In retrospect, maybe I should have split this in two patches?

>> After disabling communications, also stop the TX queue with a call to
>> netif_stop_queue.
> 
> can_bus_off() will call netif_carrier_off(), isn't this sufficient? Have
> you enabled automatic restart in case of bus off? I think the netdev
> watchdog will kick you, if the interface has a stooped queue for too
> long (IIRC 5 seconds).

Apparently not. The problem also occurred in the field, where the CAN 
communication got stuck for something like a full day.

Automatic restart was enabled, with a 100ms timeout. Setting a shorter timeout 
like 50 ms seemed to make the problem more likely to occur

What I also noticed, but didn't accurately measure, is that the time it took 
to recover used to be somewhat random, it could take up to about a second for 
packets to start going out again, as observed by watching LEDs. After adding 
netif_stop_queue the recovery was almost instantaneous.

> 
>> When the interface restarts, mcp251xfd_set_mode will
>> call netif_wake_queue and resume.
>>
>> This fixes a hangup in either send() or poll() from userspace. To
>> reproduce: I ran "cansequence can0 -p" to flood the system with packets.
>> While running that, I shorted the CAN signals, causing a bus error.
>> Usually communications would resume after the CAN bus restarted, but
>> sometimes the system got stuck and refused to send any more packets.
>> The sending process would be in either poll() or send(), waiting for
>> the queue to resume. To "unstuck" the process from send() it was
>> sufficient to send any packet on the can interface from another
>> process. To get it out of the poll() hang, only bringing the can
>> interface down (and up) would work.
>>
>> After adding the netif_stop_queue call, I was unable to reproduce the
>> problem.
> 
> The newly added netif_stop_queue() will cause the netif_wake_queue() in
> the mcp251xfd_set_mode() to actually wake the queue. If you observe a
> problem, I think it's a general problem, so all drivers would be
> effected.

After two days digging, my conclusion was that "something" wasn't getting a 
proper "kick" when the CAN interface came back online.

One could argue the netif_stop_queue should be put into can_bus_off. But the 
netif_wake_queue call is in the driver, so that would break symmetry.

> 
>> Signed-off-by: Mike Looijmans <mike.looijmans@...ic.nl>
> 
> Marc
>