netdev - Re: [PATCH v3] net: macb: restart tx after tx used bit read

Message-ID: <14152644-c44f-a011-7f26-331868831e4f@microchip.com>
Date:   Fri, 25 Mar 2022 08:13:08 +0000
From:   <Claudiu.Beznea@...rochip.com>
To:     <robert.hancock@...ian.com>, <kuba@...nel.org>,
        <tomas.melin@...sala.com>
CC:     <Nicolas.Ferre@...rochip.com>, <davem@...emloft.net>,
        <netdev@...r.kernel.org>
Subject: Re: [PATCH v3] net: macb: restart tx after tx used bit read

Hi,

On 23.03.2022 18:42, Robert Hancock wrote:
> On Wed, 2022-03-23 at 08:43 -0700, Jakub Kicinski wrote:
>> On Wed, 23 Mar 2022 10:08:20 +0200 Tomas Melin wrote:
>>>> From: <Claudiu.Beznea@...rochip.com>
>>>> To: <Nicolas.Ferre@...rochip.com>, <davem@...emloft.net>
>>>> Cc: <netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
>>>>   <Claudiu.Beznea@...rochip.com>
>>>> Subject: [PATCH v3] net: macb: restart tx after tx used bit read
>>>> Date: Mon, 17 Dec 2018 10:02:42 +0000
>>>> Message-ID: <1545040937-6583-1-git-send-email-claudiu.beznea@...rochip.com>
>>>>
>>>> From: Claudiu Beznea <claudiu.beznea@...rochip.com>
>>>>
>>>> On some platforms (currently detected only on SAMA5D4) TX might get
>>>> stuck even though packets are still present in DMA memory and TX
>>>> start was issued for them. This happens due to a race condition
>>>> between the MACB driver updating the next TX buffer descriptor to be
>>>> used and the IP reading the same descriptor. In such a case, the
>>>> "TX USED BIT READ" interrupt is asserted. The GEM/MACB user guide
>>>> specifies that if a "TX USED BIT READ" interrupt is asserted TX must
>>>> be restarted. Restart TX if used bit is read and packets are present
>>>> in the software TX queue. Packets are removed from the software TX
>>>> queue if TX was successful for them (see macb_tx_interrupt()).
>>>>
>>>> Signed-off-by: Claudiu Beznea <claudiu.beznea@...rochip.com>
>>>
>>> On Xilinx Zynq the above change can cause infinite interrupt loop
>>> leading to CPU stall. Seems timing/load needs to be appropriate for
>>> this to happen, and currently with 1G ethernet this can be triggered
>>> normally within minutes when running stress tests on the network
>>> interface.
>>>
>>> The events leading up to the interrupt looping are similar as the
>>> issue described in the commit message. However in our case,
>>> restarting TX does not help at all. Instead the controller is stuck
>>> on the queue end descriptor generating endless TX_USED interrupts,
>>> never breaking out of interrupt routine.
>>>
>>> Any chance you remember more details about in which situation
>>> restarting TX helped for your use case? was tx_qbar at the end of
>>> frame or stopped in middle of frame?

I looked through my emails for this particular issue but didn't find
everything I need about the problem that led to this fix. What I can tell
from memory and from some emails still in my inbox is that the issue was
reproduced at that time only with a particular web server running on
SAMA5D4: at some point a packet stopped being transmitted although
TX_START had been issued for it. In that case the controller fired the
TX Used Bit Read interrupt.

The GEM datasheet specifies this: "Transmit is halted when a buffer
descriptor with its used bit set is read, a transmit error occurs, or by
writing to the transmit halt bit of the network control register".

Also, at that point I had a support case open with Cadence and they
confirmed that restarting TX is the right approach.
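
For illustration only, the restart rule can be modeled in plain user-space
C as below. This is a sketch, not the driver code: struct model_txq,
model_tx_used_bit_read() and the MODEL_NCR_TSTART bit value are
hypothetical names made up for this example. It assumes a ring where
head == tail means nothing is pending, and it requests a restart only
when packets are still waiting in the software TX queue.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit position only; check the datasheet for the real one. */
#define MODEL_NCR_TSTART (1u << 9)

/* Hypothetical model of one TX queue, not the driver's structure. */
struct model_txq {
	unsigned int tx_head;	/* next slot the driver will fill */
	unsigned int tx_tail;	/* oldest slot not yet completed */
	uint32_t ncr;		/* stand-in for the network control register */
};

/*
 * Model of the "TX used bit read" handling: if the software queue still
 * holds packets, request a TX restart. Returns true if a restart was
 * requested, false if the queue was empty and nothing was done.
 */
static bool model_tx_used_bit_read(struct model_txq *q)
{
	if (q->tx_head == q->tx_tail)
		return false;	/* nothing pending, halting here is expected */

	q->ncr |= MODEL_NCR_TSTART;
	return true;
}
```

With tx_head == tx_tail the function leaves ncr untouched; once tx_head
has moved ahead of tx_tail it sets the start bit. Note that this model
cannot tell a genuinely stalled queue from one that is merely busy, which
may be relevant to the interrupt loop reported above.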

At the time of investigating the issue I found it reproducible on only
one SoC (SAMA5D4) out of 4 (the others being SAMA5D2, SAMA5D3 and one
ARM926 based SoC). All of these are probably slower than ZynqMP.

Though this IP is today also present on SAMA7G5, whose CPU can run at
1 GHz with the MAC IP clocked at 200 MHz. Even in this last setup I
haven't seen the used bit read interrupt being fired too often.

By any chance, on your setup do you have small packets inserted in the
MACB queues at a high rate?

>>
>> Which kernel version are you using? Robert has been working on macb +
>> Zynq recently, adding him to CC.
> 
> We have been working with ZynqMP and haven't seen such issues in the past, but
> I'm not sure we've tried the same type of stress test on those interfaces. If
> by Zynq, Tomas means the Zynq-7000 series, that might be a different
> version/revision of the IP core than we have as well.
> 
> I haven't looked at the TX ring descriptor and register setup on this core in
> that much detail, but the fact the controller gets into this "TX used bit read"
> state in the first place seems unusual. I'm wondering if something is being
> done in the wrong order or if we are missing a memory barrier etc?

That might be possible, especially on the descriptor update path.
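
To make the ordering concern concrete, here is a minimal user-space
sketch using C11 atomics. It is not the driver code: struct model_desc,
model_publish_desc() and MODEL_TX_USED are hypothetical names, and the
release store only stands in for the kind of write barrier a real driver
would need (in the kernel that would be something like wmb() or
dma_wmb(), since the reader is a DMA engine and not another CPU). The
idea is that the buffer address must be visible before the used bit is
cleared, otherwise the reader could pick up a descriptor that looks
ready but still carries a stale address.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative "used" flag in the control word. */
#define MODEL_TX_USED (1u << 31)

/* Hypothetical two-word descriptor, not the driver's layout. */
struct model_desc {
	uint32_t addr;		/* buffer address */
	_Atomic uint32_t ctrl;	/* length plus the used flag */
};

/*
 * Hand a descriptor over to the reader: write the address first, then
 * publish the control word with the used bit cleared. The release store
 * keeps the address write from being reordered after the hand-over.
 */
static void model_publish_desc(struct model_desc *d, uint32_t addr,
			       uint32_t len)
{
	d->addr = addr;
	atomic_store_explicit(&d->ctrl, len & ~MODEL_TX_USED,
			      memory_order_release);
}
```

If the two writes were swapped, or done with no ordering at all, a
reader that observes the cleared used bit has no guarantee about the
address it sees.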

> 
> --
> Robert Hancock
> Senior Hardware Designer, Calian Advanced Technologies
> www.calian.com