Message-ID: <20231017131901.5ae65e4d@xps-13>
Date: Tue, 17 Oct 2023 13:19:01 +0200
From: Miquel Raynal <miquel.raynal@...tlin.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "Russell King (Oracle)" <linux@...linux.org.uk>, Wei Fang
 <wei.fang@....com>, Shenwei Wang <shenwei.wang@....com>, Clark Wang
 <xiaoning.wang@....com>, davem@...emloft.net, kuba@...nel.org,
 pabeni@...hat.com, linux-imx@....com, netdev@...r.kernel.org, Thomas
 Petazzoni <thomas.petazzoni@...tlin.com>, Alexandre Belloni
 <alexandre.belloni@...tlin.com>, Maxime Chevallier
 <maxime.chevallier@...tlin.com>, Andrew Lunn <andrew@...n.ch>, Stephen
 Hemminger <stephen@...workplumber.org>, Alexander Stein
 <alexander.stein@...tq-group.com>
Subject: Re: Ethernet issue on imx6

Hi Eric,

edumazet@...gle.com wrote on Mon, 16 Oct 2023 21:37:58 +0200:

> On Mon, Oct 16, 2023 at 5:37 PM Miquel Raynal <miquel.raynal@...tlin.com> wrote:
> >
> > Hello again,
> >  
> > > > > # iperf3 -c 192.168.1.1
> > > > > Connecting to host 192.168.1.1, port 5201
> > > > > [  5] local 192.168.1.2 port 37948 connected to 192.168.1.1 port 5201
> > > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > > [  5]   0.00-1.00   sec  11.3 MBytes  94.5 Mbits/sec   43   32.5 KBytes
> > > > > [  5]   1.00-2.00   sec  3.29 MBytes  27.6 Mbits/sec   26   1.41 KBytes
> > > > > [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > > [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    5   1.41 KBytes
> > > > > [  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > > [  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > > [  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    1   1.41 KBytes
> > > > > [  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > > [  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec    0   1.41 KBytes
> > > > >
> > > > > Thanks,
> > > > > Miquèl  
> > > >
> > > > Can you experiment with :
> > > >
> > > > - Disabling TSO on your NIC (ethtool -K eth0 tso off)
> > > > - Reducing max GSO size (ip link set dev eth0 gso_max_size 16384)
> > > >
> > > > I suspect some kind of issue with fec TX completion vs TSO emulation.  
> > >
> > > Wow, that appears to have a significant effect. I am using Busybox's
> > > iproute implementation, which does not support gso_max_size, so I
> > > hacked the value directly into netdevice.h just to see whether it
> > > would have an effect. I'm adding iproute2 to the image for further
> > > testing.
> > >
> > > Here is the diff:
> > >
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -2364,7 +2364,7 @@ struct net_device {
> > >  /* TCP minimal MSS is 8 (TCP_MIN_GSO_SIZE),
> > >   * and shinfo->gso_segs is a 16bit field.
> > >   */
> > > -#define GSO_MAX_SIZE           (8 * GSO_MAX_SEGS)
> > > +#define GSO_MAX_SIZE           16384u
> > >
> > >         unsigned int            gso_max_size;
> > >  #define TSO_LEGACY_MAX_SIZE    65536
> > >
> > > And here are the results:
> > >
> > > # ethtool -K eth0 tso off
> > > # iperf3 -c 192.168.1.1 -u -b1M
> > > Connecting to host 192.168.1.1, port 5201
> > > [  5] local 192.168.1.2 port 50490 connected to 192.168.1.1 port 5201
> > > [ ID] Interval           Transfer     Bitrate         Total Datagrams
> > > [  5]   0.00-1.00   sec   123 KBytes  1.01 Mbits/sec  87
> > > [  5]   1.00-2.00   sec   122 KBytes   996 Kbits/sec  86
> > > [  5]   2.00-3.00   sec   122 KBytes   996 Kbits/sec  86
> > > [  5]   3.00-4.00   sec   123 KBytes  1.01 Mbits/sec  87
> > > [  5]   4.00-5.00   sec   122 KBytes   996 Kbits/sec  86
> > > [  5]   5.00-6.00   sec   122 KBytes   996 Kbits/sec  86
> > > [  5]   6.00-7.00   sec   123 KBytes  1.01 Mbits/sec  87
> > > [  5]   7.00-8.00   sec   122 KBytes   996 Kbits/sec  86
> > > [  5]   8.00-9.00   sec   122 KBytes   996 Kbits/sec  86
> > > [  5]   9.00-10.00  sec   123 KBytes  1.01 Mbits/sec  87
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
> > > [  5]   0.00-10.00  sec  1.19 MBytes  1.00 Mbits/sec  0.000 ms  0/864 (0%)  sender
> > > [  5]   0.00-10.05  sec  1.11 MBytes   925 Kbits/sec  0.045 ms  62/864 (7.2%)  receiver
> > > iperf Done.
> > > # iperf3 -c 192.168.1.1
> > > Connecting to host 192.168.1.1, port 5201
> > > [  5] local 192.168.1.2 port 34792 connected to 192.168.1.1 port 5201
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  1.63 MBytes  13.7 Mbits/sec   30   1.41 KBytes
> > > [  5]   1.00-2.00   sec  7.40 MBytes  62.1 Mbits/sec   65   14.1 KBytes
> > > [  5]   2.00-3.00   sec  7.83 MBytes  65.7 Mbits/sec  109   2.83 KBytes
> > > [  5]   3.00-4.00   sec  2.49 MBytes  20.9 Mbits/sec   46   19.8 KBytes
> > > [  5]   4.00-5.00   sec  7.89 MBytes  66.2 Mbits/sec  109   2.83 KBytes
> > > [  5]   5.00-6.00   sec   255 KBytes  2.09 Mbits/sec   22   2.83 KBytes
> > > [  5]   6.00-7.00   sec  4.35 MBytes  36.5 Mbits/sec   74   41.0 KBytes
> > > [  5]   7.00-8.00   sec  10.9 MBytes  91.8 Mbits/sec   34   45.2 KBytes
> > > [  5]   8.00-9.00   sec  5.35 MBytes  44.9 Mbits/sec   82   1.41 KBytes
> > > [  5]   9.00-10.00  sec  1.37 MBytes  11.5 Mbits/sec   73   1.41 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  49.5 MBytes  41.5 Mbits/sec  644             sender
> > > [  5]   0.00-10.05  sec  49.3 MBytes  41.1 Mbits/sec                  receiver
> > > iperf Done.
> > >
> > > There is still a noticeable number of drops/retries, but overall
> > > the results are significantly better. What is the rationale behind
> > > the choice of 16384 in particular? Could this be further improved?  
> >
> > Apparently I was too enthusiastic. After sending this e-mail I
> > re-generated an image with iproute2 and dd'ed the whole image onto an
> > SD card (until now I was just updating the kernel/DT manually), and I
> > got the same performance as above without the gso size trick. I need
> > to clarify this further.
> >  
> 
> Looking a bit at fec, I think fec_enet_txq_put_hdr_tso() is bogus...
> 
> txq->tso_hdrs should be properly aligned by definition.
> 
> If FEC_QUIRK_SWAP_FRAME is requested, better copy the right thing, not
> original skb->data ???

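If I read your remark correctly, here is a rough sketch of the logic I
understand you mean (not your actual change): the per-segment header is
built into txq->tso_hdrs, which dma_alloc_coherent() already aligned,
so only the byte-swap quirk should force a bounce copy, and that copy
should source the built header rather than the original skb->data:

    /*
     * Hypothetical reshuffle of the bounce path in
     * fec_enet_txq_put_hdr_tso(): bufaddr points into txq->tso_hdrs
     * here, so the alignment test is moot and the swap must act on
     * the built header, not on skb->data. The dma_map_single() of
     * bufaddr would then follow as in the existing code.
     */
    if (fep->quirks & FEC_QUIRK_SWAP_FRAME) {
        memcpy(txq->tx_bounce[index], bufaddr, hdr_len);
        swap_buffer(txq->tx_bounce[index], hdr_len);
        bufaddr = txq->tx_bounce[index];
    }
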
I've clarified the situation after looking at the build artifacts and
going through (way) longer testing sessions, as successive 10-second
tests can yield very different results.

On a 4.14.322 kernel (still maintained) I really get extremely crappy
throughput.

On a mainline 6.5 kernel I thought I had a similar issue, but it was
due to wrong RGMII-ID timings being used (I ported the board from 4.14
to 6.5 and made a mistake). With the right timings I get much better
throughput, but it is still significantly lower than what I would
expect.

So I tested Eric's fixes:
- TCP fix:
https://lore.kernel.org/netdev/CANn89iJUBujG2AOBYsr0V7qyC5WTgzx0GucO=2ES69tTDJRziw@mail.gmail.com/
- FEC fix:
https://lore.kernel.org/netdev/CANn89iLxKQOY5ZA5o3d1y=v4MEAsAQnzmVDjmLY0_bJPG93tKQ@mail.gmail.com/
As well as different CPUfreq/CPUidle parameters, as pointed out by
Alexander:
https://lore.kernel.org/netdev/2245614.iZASKD2KPV@steina-w/
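
(For reference, these can also be toggled at runtime with something
like:

# for f in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do echo 1 > $f; done
# echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

instead of rebuilding the kernel with the support disabled.)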

Here are the results of 100-second iperf uplink TCP tests, as reported
by the receiver. The first value is the mean; the raw per-run results
follow in parentheses. Unit: Mbps
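
(i.e. each run being something like:

# iperf3 -c 192.168.1.1 -t 100

with the bitrate taken from the receiver line of the final summary.)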

Default setup:
CPUidle yes, CPUfreq yes, TCP fix no, FEC fix no: 30.2 (23.8, 28.4, 38.4)

CPU power management tests (with TCP fix and FEC fix):
CPUidle yes, CPUfreq yes: 26.5 (24.5, 28.5)
CPUidle  no, CPUfreq yes: 50.3 (44.8, 55.7)
CPUidle yes, CPUfreq  no: 80.2 (75.8, 79.5, 80.8, 81.8, 83.1)
CPUidle  no, CPUfreq  no: 85.4 (80.6, 81.1, 86.2, 87.5, 91.8)

Eric's fixes tests (No CPUidle, no CPUfreq):
TCP fix yes, FEC fix yes: 85.4 (80.6, 81.1, 86.2, 87.5, 91.8) (same as above)
TCP fix  no, FEC fix yes: 82.0 (74.5, 75.9, 82.2, 87.5, 90.2)
TCP fix yes, FEC fix  no: 81.4 (77.5, 77.7, 82.8, 83.7, 85.4)
TCP fix  no, FEC fix  no: 79.6 (68.2, 77.6, 78.9, 86.4, 87.1)

So indeed the TCP and FEC patches don't seem to have a real impact (or
only a small one; I can't tell, given how scattered the results are).
However, there is definitely something wrong with the low-power
settings, and I believe the erratum pointed out by Alexander may play a
real role there (ERR006687 "ENET: Only the ENET wake-up interrupt
request can wake the system from Wait mode" [i.MX 6Dual/6Quad Only]);
my hardware probably lacks the hardware workaround.
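
For what it's worth, mainline does seem to carry software handling of
this erratum; from memory (worth double-checking in
drivers/net/ethernet/freescale/fec_main.c) it looks something like:

    /*
     * ERR006687 handling: tell the imx6q cpuidle driver that the FEC
     * interrupts are in use, so it avoids the Wait mode they cannot
     * wake the system from.
     */
    if (fep->quirks & FEC_QUIRK_ERR006687)
        imx6q_cpuidle_fec_irqs_used();

so maybe the quirk is simply not set (or not wired up) on my platform.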

I believe the remaining fluctuations are due to the RGMII-ID timings
not being totally optimal; I think I would need to extend them slightly
more on the Tx path, but they are already set to the maximum value.
Anyhow, I no longer see any difference in the drop rate between -b1M
and -b0 (<1%), so I believe it is acceptable as it is.

Now I might try to track down what is missing in 4.14.322 and perhaps
ask for a backport if it's relevant.
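
(Probably starting with something like

  git log --oneline v4.14.322..v6.5 -- drivers/net/ethernet/freescale/fec_main.c

on the host, to spot the fec changes that never reached 4.14.)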

Thanks a lot for all your feedback,
Miquèl
