[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <cd752cb5-c244-82d1-9fbe-bb63cca222d4@topic.nl>
Date:   Wed, 12 Jul 2023 07:43:05 +0200
From:   Mike Looijmans <mike.looijmans@...ic.nl>
To:     Marc Kleine-Budde <mkl@...gutronix.de>
CC:     linux-can@...r.kernel.org, "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Manivannan Sadhasivam <mani@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Thomas Kopp <thomas.kopp@...rochip.com>,
        Wolfgang Grandegger <wg@...ndegger.com>,
        linux-kernel@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH] can: mcp251xfd: Always stop on BUS_OFF and call
 netif_stop_queue
Met vriendelijke groet / kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products B.V.
Materiaalweg 4, 5681 RJ Best
The Netherlands
T: +31 (0) 499 33 69 69
E: mike.looijmans@...icproducts.com
W: www.topic.nl
Please consider the environment before printing this e-mail
On 11-07-2023 19:05, Marc Kleine-Budde wrote:
> On 11.07.2023 17:26:47, Mike Looijmans wrote:
>> When there's an error attempting to store the BER counter, don't abort
>> but continue shutting down the chip as required.
> 
> If you refer to an error by __mcp251xfd_get_berr_counter(), it's not a
> store error, but a failure of regmap_read(priv->map_reg,
> MCP251XFD_REG_TREC, &trec). By default the SPI transfers are CRC enabled
> and no/wrong data from the chip will be detected and return an error
> here (after 3 retires). In out of memory conditions or other kernel
> errors might be possible here, too.
> 
> Have you seen a problem here? But as we shut down the chip here anyways,
> we can ignore the error here.
Excactly my point, the important task here is to report the bus off state.
I haven't seen the error trigger, so it wasn't the cause of our problems.
In retrospect, maybe I should have split this in two patches?
>> After disabling communications, also stop the TX queue with a call to
>> netif_stop_queue.
> 
> can_bus_off() will call netif_carrier_off(), isn't this sufficient? Have
> you enabled automatic restart in case of bus off? I think the netdev
> watchdog will kick you, if the interface has a stooped queue for too
> long (IIRC 5 seconds).
Apparently not. The problem also occurred in the field, where the CAN 
communication got stuck for something like a full day.
Automatic restart was enabled, with a 100ms timeout. Setting a shorter timeout 
like 50 ms seemed to make the problem more likely to occur
What I also noticed, but didn't accurately measure, is that the time it took 
to recover used to be somewhat random, it could take up to about a second for 
packets to start going out again, as observed by watching LEDs. After adding 
netif_stop_queue the recovery was almost instantaneous.
> 
>> When the interface restarts, mcp251xfd_set_mode will
>> call netif_wake_queue and resume.
>>
>> This fixes a hangup in either send() or poll() from userspace. To
>> reproduce: I ran "cansequence can0 -p" to flood the system with packets.
>> While running that, I shorted the CAN signals, causing a bus error.
>> Usually communications would resume after the CAN bus restarted, but
>> sometimes the system got stuck and refused to send any more packets.
>> The sending process would be in either poll() or send(), waiting for
>> the queue to resume. To "unstuck" the process from send() it was
>> sufficient to send any packet on the can interface from another
>> process. To get it out of the poll() hang, only bringing the can
>> interface down (and up) would work.
>>
>> After adding the netif_stop_queue call, I was unable to reproduce the
>> problem.
> 
> The newly added netif_stop_queue() will cause the netif_wake_queue() in
> the mcp251xfd_set_mode() to actually wake the queue. If you observe a
> problem, I think it's a general problem, so all drivers would be
> effected.
After two days digging, my conclusion was that "something" wasn't getting a 
proper "kick" when the CAN interface came back online.
One could argue the netif_stop_queue should be put into can_bus_off. But the 
netif_wake_queue call is in the driver, so that would break symmetry.
> 
>> Signed-off-by: Mike Looijmans <mike.looijmans@...ic.nl>
> 
> Marc
> 
Powered by blists - more mailing lists
 
