linux-kernel - Re: [PATCH 1/7] can: rx-offload: continue on error

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <f9da9ffb-f4c0-fcc2-17e6-7d11d7aec14f@victronenergy.com>
Date:   Sun, 13 Oct 2019 21:26:50 +0000
From:   Jeroen Hofstee <jhofstee@...tronenergy.com>
To:     Marc Kleine-Budde <mkl@...gutronix.de>,
        "linux-can@...r.kernel.org" <linux-can@...r.kernel.org>
CC:     Wolfgang Grandegger <wg@...ndegger.com>,
        "David S. Miller" <davem@...emloft.net>,
        "open list:NETWORKING DRIVERS" <netdev@...r.kernel.org>,
        open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/7] can: rx-offload: continue on error

Hello Marc,

On 10/9/19 3:18 PM, Marc Kleine-Budde wrote:
> Hello Jeroen,
>
> I'm currently looking at the rx-offload error handling in detail.
>
> TLDR: I've added the patch to linux-can.
>
> Thanks,
> Marc
>
> For the record, the details:
>
> On 9/24/19 8:45 PM, Jeroen Hofstee wrote:
>> While can_rx_offload_offload_one will call mailbox_read to discard
>> the mailbox in case of an error,
> Yes.
>
> can_rx_offload_offload_one() will read into the discard mailbox in case
> of an error.
>
> Currently there are two kinds of errors:
> 1) the rx-offoad skb queue (between the IRQ handler and the NAPI)
>     is full
> 2) alloc_can_skb() fails to allocate a skb, due to OOM
>
>> can_rx_offload_irq_offload_timestamp bails out in the error case.
> Yes, in case of an error both can_rx_offload_irq_offload_timestamp() and
> can_rx_offload_irq_offload_fifo() will stop reading mailboxes, add the
> already filled skbs to the queue and schedule NAPI if needed.
>
> Currently both can_rx_offload_irq_offload_timestamp() and
> can_rx_offload_irq_offload_fifo() will return the number of queued
> mailboxes.
>
> This means in case of queue overflow or OOM, only one mailbox is
> discaded, before can_rx_offload_irq_offload_*() returning the number of
> successfully queued mailboxes so far.
>
> Looking at the flexcan driver:
>
> https://elixir.bootlin.com/linux/latest/source/drivers/net/can/flexcan.c#L867
>
>> 		while ((reg_iflag = flexcan_read_reg_iflag_rx(priv))) {
>> 			handled = IRQ_HANDLED;
>> 			ret = can_rx_offload_irq_offload_timestamp(&priv->offload,
>> 								   reg_iflag);
>> 			if (!ret)
>> 				break;
>> 		}
> [...]
>> 		if (reg_iflag1 & FLEXCAN_IFLAG_RX_FIFO_AVAILABLE) {
>> 			handled = IRQ_HANDLED;
>> 			can_rx_offload_irq_offload_fifo(&priv->offload);
>> 		}
> This means for the timestamp mode, at least one mailbox is discarded or
> if the error occurred after reading one or more mailboxes the while loop
> will try again. If the error persists a second mailbox is discarded.
>
> For the fifo mode, only one mailbox is discarded.
>
> Then the flexcan's IRQ is exited. If there are messages in mailboxes are
> pending another IRQ is triggered.... I doubt that this is a good idea.
>
> On the other hand the ti_hecc driver:
>
>> 		/* offload RX mailboxes and let NAPI deliver them */
>> 		while ((rx_pending = hecc_read(priv, HECC_CANRMP))) {
>> 			can_rx_offload_irq_offload_timestamp(&priv->offload,
>> 							     rx_pending);
>> 			hecc_write(priv, HECC_CANRMP, rx_pending);
>> 		}
> The error is ignored and the _all_ mailboxes are discarded (given the
> error persists).
>
>> Since it is typically called from a while loop, all message will
>> eventually be discarded. So lets continue on error instead to discard
>> them directly.
> After reading my own code and writing it up, your patch totally makes sense.
>
> If there is a shortage of resources, queue overrun or OOM, it makes no
> sense to return from the IRQ handler, if a mailbox is still active as it
> will trigger the IRQ again. Entering the IRQ handler again probably
> doesn't give the system time to recover from the resource problem.


Indeed, I have nothing to comment on that, besides thanks for
being willing to reconsider your own code.

With kind regards,

Jeroen