netdev - Re: [PATCH v3] net: macb: restart tx after tx used bit read


Message-ID: <2f6bbbfc-0776-edb4-2eeb-ed73a8312d63@vaisala.com>
Date:   Fri, 25 Mar 2022 09:10:35 +0200
From:   Tomas Melin <tomas.melin@...sala.com>
To:     Robert Hancock <robert.hancock@...ian.com>,
        "kuba@...nel.org" <kuba@...nel.org>
Cc:     "Nicolas.Ferre@...rochip.com" <Nicolas.Ferre@...rochip.com>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "claudiu.beznea@...rochip.com" <claudiu.beznea@...rochip.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH v3] net: macb: restart tx after tx used bit read

Hi,

On 23/03/2022 18:42, Robert Hancock wrote:
> On Wed, 2022-03-23 at 08:43 -0700, Jakub Kicinski wrote:
>> On Wed, 23 Mar 2022 10:08:20 +0200 Tomas Melin wrote:
>>>> From: <Claudiu.Beznea@...rochip.com>
>>>> To: <Nicolas.Ferre@...rochip.com>, <davem@...emloft.net>
>>>> Cc: <netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
>>>> 	<Claudiu.Beznea@...rochip.com>
>>>> Subject: [PATCH v3] net: macb: restart tx after tx used bit read
>>>> Date: Mon, 17 Dec 2018 10:02:42 +0000
>>>> Message-ID: <1545040937-6583-1-git-send-email-claudiu.beznea@...rochip.com>
>>>>
>>>> From: Claudiu Beznea <claudiu.beznea@...rochip.com>
>>>>
>>>> On some platforms (currently detected only on SAMA5D4) TX might get stuck
>>>> even though packets are still present in DMA memory and TX start was
>>>> issued for them. This happens due to a race condition between the MACB
>>>> driver updating the next TX buffer descriptor to be used and the IP
>>>> reading the same descriptor. In such a case, the "TX USED BIT READ"
>>>> interrupt is asserted. The GEM/MACB user guide specifies that if a
>>>> "TX USED BIT READ" interrupt is asserted, TX must be restarted. Restart
>>>> TX if the used bit is read and packets are present in the software TX
>>>> queue. Packets are removed from the software TX queue if TX was
>>>> successful for them (see macb_tx_interrupt()).
>>>>
>>>> Signed-off-by: Claudiu Beznea <claudiu.beznea@...rochip.com>
>>>
>>> On Xilinx Zynq the above change can cause an infinite interrupt loop
>>> leading to a CPU stall. It seems timing/load needs to be appropriate for
>>> this to happen, and currently with 1G ethernet this can normally be
>>> triggered within minutes when running stress tests on the network
>>> interface.
>>>
>>> The events leading up to the interrupt looping are similar to the issue
>>> described in the commit message. However, in our case, restarting TX
>>> does not help at all. Instead, the controller is stuck on the queue end
>>> descriptor, generating endless TX_USED interrupts and never breaking out
>>> of the interrupt routine.
>>>
>>> Any chance you remember more details about the situation in which
>>> restarting TX helped for your use case? Was tx_qbar at the end of a frame
>>> or stopped in the middle of a frame?
>>
>> Which kernel version are you using? Robert has been working on macb +
>> Zynq recently, adding him to CC.

This was originally seen on a 4.19.x series kernel, but the same issue is
also present with 5.10.x series kernels. I have looked, but did not see
any changes that would seem related to this particular issue in either
newer mainline kernels or the Xilinx tree.

These stall issues have surfaced as CPU load and timing have changed,
even with the same kernel version. So it seems rather likely that the
issue needs the right timing to be triggered.

> 
> We have been working with ZynqMP and haven't seen such issues in the past, but
> I'm not sure we've tried the same type of stress test on those interfaces. If
> by Zynq, Tomas means the Zynq-7000 series, that might be a different
> version/revision of the IP core than we have as well.
Indeed, this is the Zynq-7000 series.

> 
> I haven't looked at the TX ring descriptor and register setup on this core in
> that much detail, but the fact the controller gets into this "TX used bit read"
> state in the first place seems unusual. I'm wondering if something is being
> done in the wrong order or if we are missing a memory barrier etc?
> 
I agree that it could be something like that, or else the controller
gets into some unknown state that makes the transmission halt until more
data gets pushed into the buffer.

I have a proposal for improving on the original tx restart approach, and
recently posted it to the list. With that patch applied, I have not been
able to cause any stall situations during stress testing.
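
For reference, the original restart rule from the quoted commit message
(restart TX on "TX used bit read" only while packets remain in the software
TX queue) can be sketched as a small standalone C model. This is only an
illustration under stated assumptions: struct txq_model and all function
names below are hypothetical, no hardware registers are touched, and it is
neither the macb driver code nor the patch mentioned above.

```c
#include <stdbool.h>

/*
 * Hypothetical model of a software TX queue (not the macb driver's
 * structures). head advances when a packet is queued, tail advances
 * when transmission of a packet has completed successfully.
 */
struct txq_model {
    unsigned int head;
    unsigned int tail;
    unsigned int restarts; /* number of times TX start was re-issued */
};

/* Packets are pending while head and tail differ. */
static bool txq_has_pending(const struct txq_model *q)
{
    return q->head != q->tail;
}

/* Queue one packet for transmission. */
static void txq_queue_packet(struct txq_model *q)
{
    q->head++;
}

/* Mark the oldest pending packet as successfully transmitted. */
static void txq_complete_packet(struct txq_model *q)
{
    if (txq_has_pending(q))
        q->tail++;
}

/*
 * Model of handling a "TX used bit read" event: re-issue TX start only
 * if the software queue still holds packets. Returns true if a restart
 * was issued. A real driver would write a start bit to the controller
 * here; the model only counts restarts.
 */
static bool on_tx_used_bit_read(struct txq_model *q)
{
    if (!txq_has_pending(q))
        return false;
    q->restarts++;
    return true;
}
```

In this model, a restart is issued on every event for as long as head and
tail differ. If the controller keeps raising the event without completing a
packet, the handler keeps restarting, which matches the looping behaviour
described earlier in this thread.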

Thanks,
Tomas