netdev - Re: [PATCH v2] net: macb: Restart tx only if queue pointer is lagging

Message-ID: <adf4ce47-142e-711c-bde1-cda1fc0a196c@vaisala.com>
Date:   Fri, 8 Apr 2022 12:57:24 +0300
From:   Tomas Melin <tomas.melin@...sala.com>
To:     Harini Katakam <harinik@...inx.com>,
        "Claudiu.Beznea@...rochip.com" <Claudiu.Beznea@...rochip.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Cc:     "Nicolas.Ferre@...rochip.com" <Nicolas.Ferre@...rochip.com>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "kuba@...nel.org" <kuba@...nel.org>,
        "pabeni@...hat.com" <pabeni@...hat.com>,
        Shubhrajyoti Datta <shubhraj@...inx.com>,
        Michal Simek <michals@...inx.com>,
        "pthombar@...ence.com" <pthombar@...ence.com>,
        "mparab@...ence.com" <mparab@...ence.com>,
        "rafalo@...ence.com" <rafalo@...ence.com>
Subject: Re: [PATCH v2] net: macb: Restart tx only if queue pointer is lagging

Hi Claudiu, Harini,

On 08/04/2022 11:47, Harini Katakam wrote:
> Hi Claudiu, Tomas,
> 
>> -----Original Message-----
>> From: Claudiu.Beznea@...rochip.com <Claudiu.Beznea@...rochip.com>
>> Sent: Friday, April 8, 2022 1:13 PM
>> To: tomas.melin@...sala.com; netdev@...r.kernel.org
>> Cc: Nicolas.Ferre@...rochip.com; davem@...emloft.net; kuba@...nel.org;
>> pabeni@...hat.com; Harini Katakam <harinik@...inx.com>; Shubhrajyoti
>> Datta <shubhraj@...inx.com>; Michal Simek <michals@...inx.com>;
>> pthombar@...ence.com; mparab@...ence.com; rafalo@...ence.com
>> Subject: Re: [PATCH v2] net: macb: Restart tx only if queue pointer is lagging
>>
>> Hi, Tomas,
>>
>> I'm returning to this new thread.
>>
>> Sorry for the long delay. I looked through my emails for the steps to
>> reproduce the bug that led to the introduction of macb_tx_restart() but
>> haven't found them.
>> That said, the code in this patch should not affect SAMA5D4 at all.
>>
>> I tested SAMA5D4 anyway, with and without your code, and saw no
>> issues.
>> In case Dave or Jakub want to merge it, you can add my
>> Tested-by: Claudiu Beznea <claudiu.beznea@...rochip.com>
>> Reviewed-by: Claudiu Beznea <claudiu.beznea@...rochip.com>

Thank you for the effort to review and test this! Also thanks for the
discussions around this issue to provide further insights.


>>
>> The only concern with this patch, as mentioned earlier, is that freeing of
>> packet N may depend on sending packet N+1, and if packet N+1 blocks the HW
>> again, then the freeing of packets N and N+1 may depend on packet N+2, etc.
>> But from your investigation it seems the hardware has some bugs.

Indeed, this is not behaviour I have encountered in any testing. If we
were ever to encounter such an issue, it would need to be handled
separately, perhaps by calling tx_interrupt() to progress the queue. But
then again, this does not seem to happen.

>>
>> FYI, I looked through the Xilinx GitHub repository and saw no patches on
>> macb that may be related to this issue.
>>
>> Anyway, it would be good if there would be some replies from Xilinx or at
>> least Cadence people on this (previous thread at [1]).
> 
> Sorry for the delayed response.
> I saw the condition you described and I'm not able to reproduce it.
> But I agree with your assessment that restarting TX will not help in this case.
> Also, the issue addressed by the original patch restarting TX was not easily
> reproduced on a Zynq board either. We've had some users report it after more
> than 1 hr of traffic, but that was on a 4.xx kernel, and I'm afraid I don’t have
> a case where I can reproduce the original issue Claudiu described on any 5.xx
> kernel.
> 
> Based on the thread, one possible HW bug is that the controller fails to
> generate TCOMP when TXUBR and restart conditions occur, because these
> interrupts are edge triggered on Zynq.

This is an interesting hypothesis, and it would indeed lead to this situation.


> 
> I'm going to check the errata and let you know if I find anything relevant, and
> also ask the Cadence folks to comment.
> I'm sorry to ask, but is this condition reproducible on any later variants of the
> IP in Xilinx or non-Xilinx devices?

I have not seen this issue on MPSoC (at least not yet). Indeed, this issue
seems to require just the right timing conditions to trigger.

So any additional information we might get about possible issues in the
IP is welcome. However, the hardware on the boards we have at hand will
still be the same, so the patch as such remains relevant.

BR,
Tomas



> Zynq US+ MPSoC has the r1p07 while Zynq has the older version IP r1p23 (old versioning)
> 
> Regards,
> Harini
> 
>>
>> Thank you,
>> Claudiu Beznea
>>
>> [1]
>> https://lore.kernel.org/netdev/82276bf7-72a5-6a2e-ff33-f8fe0c5e4a90@...rochip.com/T/#m644c84a8709a65c40b8fc15a589e83b24e48ccfd
>>
>> On 07.04.2022 19:16, Tomas Melin wrote:
>>> commit 4298388574da ("net: macb: restart tx after tx used bit read")
>>> added support for restarting transmission. Restarting tx does not work
>>> in case controller asserts TXUBR interrupt and TBQP is already at the
>>> end of the tx queue. In that situation, restarting tx will immediately
>>> cause assertion of another TXUBR interrupt. The driver will end up in
>>> an infinite interrupt loop which it cannot break out of.
>>>
>>> For cases where TBQP is at the end of the tx queue, instead only clear
>>> TX_USED interrupt. As more data gets pushed to the queue, transmission
>>> will resume.
>>>
>>> This issue was observed on a Xilinx Zynq-7000 based board.
>>> During stress test of the network interface, driver would get stuck on
>>> interrupt loop within seconds or minutes causing CPU to stall.
>>>
>>> Signed-off-by: Tomas Melin <tomas.melin@...sala.com>
>>> ---
>>> Changes v2:
>>> - change referenced commit to use original commit ID instead of stable
>>> branch ID
>>>
>>>  drivers/net/ethernet/cadence/macb_main.c | 8 ++++++++
>>>  1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>>> index 800d5ced5800..e475be29845c 100644
>>> --- a/drivers/net/ethernet/cadence/macb_main.c
>>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>>> @@ -1658,6 +1658,7 @@ static void macb_tx_restart(struct macb_queue *queue)
>>>         unsigned int head = queue->tx_head;
>>>         unsigned int tail = queue->tx_tail;
>>>         struct macb *bp = queue->bp;
>>> +       unsigned int head_idx, tbqp;
>>>
>>>         if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
>>>                 queue_writel(queue, ISR, MACB_BIT(TXUBR));
>>> @@ -1665,6 +1666,13 @@ static void macb_tx_restart(struct macb_queue *queue)
>>>         if (head == tail)
>>>                 return;
>>>
>>> +       tbqp = queue_readl(queue, TBQP) / macb_dma_desc_get_size(bp);
>>> +       tbqp = macb_adj_dma_desc_idx(bp, macb_tx_ring_wrap(bp, tbqp));
>>> +       head_idx = macb_adj_dma_desc_idx(bp, macb_tx_ring_wrap(bp, head));
>>> +
>>> +       if (tbqp == head_idx)
>>> +               return;
>>> +
>>>         macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
>>> }
>>>
>>> --
>>> 2.35.1
>>>
>
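
For readers following the thread, the check that the quoted patch adds to
macb_tx_restart() can be modelled in user space roughly as below. This is
only a sketch under stated assumptions, not the driver code: RING_SIZE,
DESC_SIZE, ring_wrap(), tbqp_to_index() and should_restart_tx() are
hypothetical names made up for illustration, the ring base address is
assumed to be aligned to RING_SIZE * DESC_SIZE so that dividing and masking
yields the descriptor index, and the index adjustment done by
macb_adj_dma_desc_idx() in the patch is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model parameters; not values taken from the driver. */
#define RING_SIZE 8u /* number of tx descriptors, must be a power of two */
#define DESC_SIZE 8u /* size of one descriptor in bytes */

/* Map a free-running software index onto a position in the ring. */
static unsigned int ring_wrap(unsigned int index)
{
	return index & (RING_SIZE - 1u);
}

/*
 * Turn the byte address read from the hardware queue pointer into a ring
 * position. Assumes the ring base is aligned to RING_SIZE * DESC_SIZE.
 */
static unsigned int tbqp_to_index(uint32_t tbqp_reg)
{
	return ring_wrap(tbqp_reg / DESC_SIZE);
}

/*
 * Decide whether setting TSTART makes sense:
 *  - nothing is queued (head == tail): no restart
 *  - hardware pointer already at the software head: no restart, since
 *    there is nothing new to send and, per the patch description, a
 *    restart would only raise another TXUBR interrupt
 *  - otherwise the hardware pointer is lagging: restart
 */
static bool should_restart_tx(unsigned int head, unsigned int tail,
			      uint32_t tbqp_reg)
{
	if (head == tail)
		return false;

	if (tbqp_to_index(tbqp_reg) == ring_wrap(head))
		return false;

	return true;
}
```

With these model values, a queue pointer of 0x1028 corresponds to ring
position 5, so should_restart_tx(13, 10, 0x1028) would skip the restart
(head 13 also wraps to 5), whereas a pointer of 0x1018 (position 3) is
lagging behind the head and would trigger it.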