lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMeyCbhG7-dCr4bVWP=kNuwLa6CNB9h=SwN_kK7VbJ7YFCY2Ow@mail.gmail.com>
Date:   Thu, 12 Nov 2020 12:56:20 +0100
From:   Kegl Rohit <keglrohit@...il.com>
To:     David Laight <David.Laight@...lab.com>
Cc:     Eric Dumazet <eric.dumazet@...il.com>,
        Fabio Estevam <festevam@...il.com>,
        netdev <netdev@...r.kernel.org>
Subject: Re: net: fec: rx descriptor ring out of order

On Thu, Nov 12, 2020 at 12:10 PM David Laight <David.Laight@...lab.com> wrote:
>
> From: Eric Dumazet
> > Sent: 12 November 2020 10:42
> >
> > On 11/12/20 7:52 AM, Kegl Rohit wrote:
> > > On Wed, Nov 11, 2020 at 11:18 PM Fabio Estevam <festevam@...il.com> wrote:
> > >>
> > >> On Wed, Nov 11, 2020 at 11:27 AM Kegl Rohit <keglrohit@...il.com> wrote:
> > >>>
> > >>> Hello!
> > >>>
> > >>> We are using a imx6q platform.
> > >>> The fec interface is used to receive a continuous stream of custom /
> > >>> raw ethernet packets. The packet size is fixed ~132 bytes and they get
> > >>> sent every 250µs.
> > >>>
> > >>> While testing I observed spontaneous packet delays from time to time.
> > >>> After digging down deeper I think that the fec peripheral does not
> > >>> update the rx descriptor status correctly.
> > >>
> > >> What is the kernel version that you are using?
> > >
> > > Sadly stuck at 3.10.108.
>
> If you build a newer kernel it should work with your
> existing userspace.
Not so easily possible because there are custom drivers and some
kernel modifications in the mix.
I have a dirty ported system with a 5.4 kernel ready. I will also try it there.
But I am afraid the error will not happen but still exist.


> > > https://github.com/gregkh/linux/blob/v3.10.108/drivers/net/ethernet/freescale/fec_main.c
> > > The rx queue status handling did not change much compared to 5.x. Only
> > > the NAPI handling / clearing IRQs was changed more than once.
> > > I also backported the newer NAPI handling style / clearing irqs not in
> > > the irq handler but in napi_poll() => same issue.
> > > The issue is pretty rare => To reproduce i have to reboot the system
> > > every 3 min. Sometimes after 1~2min on the first, sometimes on the
> > > ~10th reboot it will happen.
> > >
> >
> > Is seems some rmb() & wmb() are missing.
>
> They are unlikely to make any difference since the 'bad'
> rx status persists between calls to the receive function.

Our kernel already has some patches like the wmb() for the rx path and
the rmb() for the tx path applied.
I tried the rmb() at the rx path, because this is not in master
https://github.com/gregkh/linux/blob/master/drivers/net/ethernet/freescale/fec_main.c#L1434.
=> Still the same issue, no change

I extended the debugging:
descriptor index, current, empty, desc.status, desc.buffer (mapped
skb->data), desc.length
[  137.758009 <    0.000015>] 409 0xa09d5320 C E 0x8840 0x2c6f0780    0

I also reset the desc.length field to 0 after the packet was received
and before the descriptor was set to empty again.
So I could observe that the length is also not set like the status.

Because i know the content and size of my rx packets, i used
dma_sync_single(mapped skb->data) to get the data even if the status
is empty.
Each packet contains a counter, so i verified that the data is already
there and not lost.
Only the descriptor status and length is not updated.

[  137.757966 <    0.000021>] cnt: 2341          .... counter of
current ("empty") packet; index 409  in example
[  137.757984 <    0.000018>] nxcnt: 2342      .... counter of next
not empty packet; index 410 in example
=> content is there but status is not. As next step i will also check
if all bytes are correct, not only the two counter bytes.

[   40.888181 <    0.000344>] --- start test application ---
[  137.757945 <   96.869764>] ring error, next is ready
[  137.757966 <    0.000021>] cnt: 2341
[  137.757984 <    0.000018>] nxcnt: 2342
[  137.757994 <    0.000010>] RX ahead
[  137.758009 <    0.000015>] 409 0xa09d5320 C E 0x8840 0x2c6f0780    0
[  137.758024 <    0.000015>] 410 0xa09d5340     0x0840 0x2c6f0ec0  132
[  137.758038 <    0.000014>] 411 0xa09d5360   E 0x8840 0x2c6f1600    0
[  137.758051 <    0.000013>] 412 0xa09d5380   E 0x8840 0x2c6f1d40    0
[  137.758064 <    0.000013>] 413 0xa09d53a0   E 0x8840 0x2c6f2480    0
[  137.758076 <    0.000012>] 414 0xa09d53c0   E 0x8840 0x2c6f2bc0    0
[  137.758089 <    0.000013>] 415 0xa09d53e0   E 0x8840 0x2c6f3300    0
[  137.758102 <    0.000013>] 416 0xa09d5400   E 0x8840 0x2c6f3a40    0
[  137.758115 <    0.000013>] 417 0xa09d5420   E 0x8840 0x2c6f4180    0
[  137.758127 <    0.000012>] 418 0xa09d5440   E 0x8840 0x2c6f48c0    0
[  137.758140 <    0.000013>] 419 0xa09d5460   E 0x8840 0x2c6f5000    0
[  137.758152 <    0.000012>] 420 0xa09d5480   E 0x8840 0x2c6f5740    0
[  137.758165 <    0.000013>] 421 0xa09d54a0   E 0x8840 0x2c6f5e80    0
[  137.758414 <    0.000025>] ring error, next is ready
[  137.758426 <    0.000012>] cnt: 2341
[  137.758439 <    0.000013>] nxcnt: 2342
[  137.758448 <    0.000009>] RX ahead
[  137.758485 <    0.000037>] 409 0xa09d5320 C E 0x8840 0x2c6f0780    0
[  137.758500 <    0.000015>] 410 0xa09d5340     0x0840 0x2c6f0ec0  132
[  137.758515 <    0.000015>] 411 0xa09d5360     0x0840 0x2c6f1600  132
[  137.758529 <    0.000014>] 412 0xa09d5380     0x0840 0x2c6f1d40  132
[  137.758542 <    0.000013>] 413 0xa09d53a0   E 0x8840 0x2c6f2480    0
[  137.758556 <    0.000014>] 414 0xa09d53c0   E 0x8840 0x2c6f2bc0    0
[  137.758569 <    0.000013>] 415 0xa09d53e0   E 0x8840 0x2c6f3300    0
[  137.758582 <    0.000013>] 416 0xa09d5400   E 0x8840 0x2c6f3a40    0
[  137.758596 <    0.000014>] 417 0xa09d5420   E 0x8840 0x2c6f4180    0
[  137.758609 <    0.000013>] 418 0xa09d5440   E 0x8840 0x2c6f48c0    0
[  137.758622 <    0.000013>] 419 0xa09d5460   E 0x8840 0x2c6f5000    0
[  137.758905 <    0.000031>] ring error, next is ready
[  137.758917 <    0.000012>] cnt: 2341
[  137.758930 <    0.000013>] nxcnt: 2342
[  137.758938 <    0.000008>] RX ahead
[  137.758951 <    0.000013>] 409 0xa09d5320 C E 0x8840 0x2c6f0780    0
[  137.758965 <    0.000014>] 410 0xa09d5340     0x0840 0x2c6f0ec0  132
[  137.758978 <    0.000013>] 411 0xa09d5360     0x0840 0x2c6f1600  132
[  137.758991 <    0.000013>] 412 0xa09d5380     0x0840 0x2c6f1d40  132
[  137.759005 <    0.000014>] 413 0xa09d53a0     0x0840 0x2c6f2480  132
[  137.759018 <    0.000013>] 414 0xa09d53c0     0x0840 0x2c6f2bc0  132
[  137.759031 <    0.000013>] 415 0xa09d53e0   E 0x8840 0x2c6f3300    0
[  137.759044 <    0.000013>] 416 0xa09d5400   E 0x8840 0x2c6f3a40    0
[  137.759057 <    0.000013>] 417 0xa09d5420   E 0x8840 0x2c6f4180    0
[  137.759071 <    0.000014>] 418 0xa09d5440   E 0x8840 0x2c6f48c0    0
[  137.759084 <    0.000013>] 419 0xa09d5460   E 0x8840 0x2c6f5000    0

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ