Message-ID: <CALs4sv2mq5=FehCcxPveCbCMNT1aw=8LhqZ4g3=GXhKw8hsrmA@mail.gmail.com>
Date: Mon, 1 Nov 2021 12:36:34 +0530
From: Pavan Chebbi <pavan.chebbi@...adcom.com>
To: Vitaly Bursov <vitaly@...sov.com>
Cc: Siva Reddy Kallam <siva.kallam@...adcom.com>,
Prashant Sreedharan <prashant@...adcom.com>,
Michael Chan <mchan@...adcom.com>,
Linux Netdev List <netdev@...r.kernel.org>
Subject: Re: tg3 RX packet re-order in queue 0 with RSS
On Fri, Oct 29, 2021 at 9:15 PM Vitaly Bursov <vitaly@...sov.com> wrote:
>
>
>
> 29.10.2021 08:04, Pavan Chebbi wrote:
> > On Thu, Oct 28, 2021 at 9:11 PM Vitaly Bursov <vitaly@...sov.com> wrote:
> >>
> >>
> >> 28.10.2021 10:33, Pavan Chebbi wrote:
> >>> On Wed, Oct 27, 2021 at 4:02 PM Vitaly Bursov <vitaly@...sov.com> wrote:
> >>>>
> >>>>
> >>>> 27.10.2021 12:30, Pavan Chebbi wrote:
> >>>>> On Wed, Sep 22, 2021 at 12:10 PM Siva Reddy Kallam
> >>>>> <siva.kallam@...adcom.com> wrote:
> >>>>>>
> >>>>>> Thank you for reporting this. Pavan(cc'd) from Broadcom looking into this issue.
> >>>>>> We will provide our feedback very soon on this.
> >>>>>>
> >>>>>> On Mon, Sep 20, 2021 at 6:59 PM Vitaly Bursov <vitaly@...sov.com> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> We found an occasional, random (sometimes happens, sometimes not)
> >>>>>>> packet re-order when the NIC receives UDP multicast traffic that
> >>>>>>> is sensitive to re-ordering. A network capture with tcpdump
> >>>>>>> sometimes shows the re-order, sometimes not (e.g. no re-order on
> >>>>>>> the host while there is re-ordering in a container at the same
> >>>>>>> time). In a pcap file the re-ordered packets carry correct
> >>>>>>> timestamps - the delayed packet has an earlier timestamp than the
> >>>>>>> packet before it:
> >>>>>>> 1.00s packet1
> >>>>>>> 1.20s packet3
> >>>>>>> 1.10s packet2
> >>>>>>> 1.30s packet4
> >>>>>>>
> >>>>>>> There's about 300 Mbps of traffic on this NIC, and the server is
> >>>>>>> busy (hyper-threading enabled, about 50% idle overall) with its
> >>>>>>> computational application work.
> >>>>>>>
> >>>>>>> NIC is HPE's 4-port 331i adapter - BCM5719, in a default ring and
> >>>>>>> coalescing configuration, 1 TX queue, 4 RX queues.
> >>>>>>>
> >>>>>>> After further investigation, I believe there are two separate
> >>>>>>> issues in the tg3.c driver. Both can be reproduced with iperf3 and
> >>>>>>> unicast UDP.
> >>>>>>>
> >>>>>>> Here are the details of how I understand this behavior.
> >>>>>>>
> >>>>>>> 1. Packet re-order.
> >>>>>>>
> >>>>>>> The driver calls napi_schedule(&tnapi->napi) when handling the
> >>>>>>> interrupt; however, it sometimes also calls
> >>>>>>> napi_schedule(&tp->napi[1].napi), which handles RX queue 0 as well:
> >>>>>>>
> >>>>>>> https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/tg3.c#L6802-L7007
> >>>>>>>
> >>>>>>> static int tg3_rx(struct tg3_napi *tnapi, int budget)
> >>>>>>> {
> >>>>>>>         struct tg3 *tp = tnapi->tp;
> >>>>>>>
> >>>>>>>         ...
> >>>>>>>
> >>>>>>>         /* Refill RX ring(s). */
> >>>>>>>         if (!tg3_flag(tp, ENABLE_RSS)) {
> >>>>>>>                 ...
> >>>>>>>         } else if (work_mask) {
> >>>>>>>                 ...
> >>>>>>>
> >>>>>>>                 if (tnapi != &tp->napi[1]) {
> >>>>>>>                         tp->rx_refill = true;
> >>>>>>>                         napi_schedule(&tp->napi[1].napi);
> >>>>>>>                 }
> >>>>>>>         }
> >>>>>>>         ...
> >>>>>>> }
> >>>>>>>
> >>>>>>> From the napi_schedule() code, this schedules RX queue 0 handling
> >>>>>>> on the current CPU, which is busy handling one of queues RX 1-3 at
> >>>>>>> that moment.
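> >>>>>>>
> >>>>>>> For reference, napi_schedule() is roughly the following (a
> >>>>>>> simplified sketch of the include/linux/netdevice.h code as I read
> >>>>>>> it, comments mine):
> >>>>>>>
> >>>>>>> static inline void napi_schedule(struct napi_struct *n)
> >>>>>>> {
> >>>>>>>         if (napi_schedule_prep(n))
> >>>>>>>                 /* __napi_schedule() puts n on the poll list of
> >>>>>>>                  * the *calling* CPU's softnet_data and raises
> >>>>>>>                  * NET_RX_SOFTIRQ there, so napi[1] can end up
> >>>>>>>                  * being polled on the CPU that services
> >>>>>>>                  * napi[2..4]
> >>>>>>>                  */
> >>>>>>>                 __napi_schedule(n);
> >>>>>>> }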
> >>>>>>>
> >>>>>>> At least two traffic flows are required - one on RX queue 0 and the
> >>>>>>> other on any other queue (1-3). Re-ordering may happen only on the
> >>>>>>> flow on queue 0; the second flow works fine.
> >>>>>>>
> >>>>>>> No idea how to fix this.
> >>>>>
> >>>>> In the case of RSS, the actual RX rings are 1 to 4, and the napis
> >>>>> of those rings are indeed processing the packets. The explicit
> >>>>> napi_schedule() of napi[1] only re-fills the rx BD producer ring,
> >>>>> because that ring is shared with return rings 1-4.
> >>>>> I tried to reproduce this but I am not seeing the issue. If you are
> >>>>> receiving packets on RX 0 then RSS must have been disabled.
> >>>>> Can you please check?
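> >>>>>
> >>>>> To restate the topology as I understand it (a schematic sketch in a
> >>>>> comment, not driver code):
> >>>>>
> >>>>> /*
> >>>>>  * single rx BD producer ring (buffers; re-filled from napi[1]'s poll)
> >>>>>  *    |
> >>>>>  *    +--> rx return ring 1 -- tp->napi[1]
> >>>>>  *    +--> rx return ring 2 -- tp->napi[2]
> >>>>>  *    +--> rx return ring 3 -- tp->napi[3]
> >>>>>  *    +--> rx return ring 4 -- tp->napi[4]
> >>>>>  */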
> >>>>>
> >>>>
> >>>> # ethtool -i enp2s0f0
> >>>> driver: tg3
> >>>> version: 3.137
> >>>> firmware-version: 5719-v1.46 NCSI v1.5.18.0
> >>>> expansion-rom-version:
> >>>> bus-info: 0000:02:00.0
> >>>> supports-statistics: yes
> >>>> supports-test: yes
> >>>> supports-eeprom-access: yes
> >>>> supports-register-dump: yes
> >>>> supports-priv-flags: no
> >>>>
> >>>> # ethtool -l enp2s0f0
> >>>> Channel parameters for enp2s0f0:
> >>>> Pre-set maximums:
> >>>> RX: 4
> >>>> TX: 4
> >>>> Other: 0
> >>>> Combined: 0
> >>>> Current hardware settings:
> >>>> RX: 4
> >>>> TX: 1
> >>>> Other: 0
> >>>> Combined: 0
> >>>>
> >>>> # ethtool -x enp2s0f0
> >>>> RX flow hash indirection table for enp2s0f0 with 4 RX ring(s):
> >>>> 0: 0 1 2 3 0 1 2 3
> >>>> 8: 0 1 2 3 0 1 2 3
> >>>> 16: 0 1 2 3 0 1 2 3
> >>>> 24: 0 1 2 3 0 1 2 3
> >>>> 32: 0 1 2 3 0 1 2 3
> >>>> 40: 0 1 2 3 0 1 2 3
> >>>> 48: 0 1 2 3 0 1 2 3
> >>>> 56: 0 1 2 3 0 1 2 3
> >>>> 64: 0 1 2 3 0 1 2 3
> >>>> 72: 0 1 2 3 0 1 2 3
> >>>> 80: 0 1 2 3 0 1 2 3
> >>>> 88: 0 1 2 3 0 1 2 3
> >>>> 96: 0 1 2 3 0 1 2 3
> >>>> 104: 0 1 2 3 0 1 2 3
> >>>> 112: 0 1 2 3 0 1 2 3
> >>>> 120: 0 1 2 3 0 1 2 3
> >>>> RSS hash key:
> >>>> Operation not supported
> >>>> RSS hash function:
> >>>> toeplitz: on
> >>>> xor: off
> >>>> crc32: off
> >>>>
> >>>> In /proc/interrupts there are enp2s0f0-tx-0, enp2s0f0-rx-1,
> >>>> enp2s0f0-rx-2, enp2s0f0-rx-3, enp2s0f0-rx-4 interrupts, all on
> >>>> different CPU cores. The kernel also has "threadirqs" enabled on the
> >>>> command line; I didn't check whether this parameter affects the issue.
> >>>>
> >>>> Yes, some things are numbered from 0 and others from 1 - sorry for
> >>>> the confusion in terminology. What I meant:
> >>>> - There are 4 RX rings/queues, I counted starting from 0, so: 0..3.
> >>>> RX0 is the first queue/ring that actually receives the traffic.
> >>>> RX0 is handled by enp2s0f0-rx-1 interrupt.
> >>>> - These are related to (tp->napi[i]), but i is in 1..4, so the first
> >>>> receiving queue relates to tp->napi[1], the second relates to
> >>>> tp->napi[2], and so on. Correct?
> >>>>
> >>>> Suppose tg3_rx() is called for tp->napi[2]; this function most likely
> >>>> calls napi_gro_receive(&tnapi->napi, skb) to further process packets
> >>>> in tp->napi[2]. And, under some conditions (RSS and work_mask), it
> >>>> calls napi_schedule(&tp->napi[1].napi), which schedules tp->napi[1]
> >>>> work on the current CPU - the one designated for tp->napi[2], not for
> >>>> tp->napi[1]. Correct?
> >>>>
> >>>> I don't understand what napi_schedule(&tp->napi[1].napi) does for the
> >>>> NIC or driver; "re-filling the rx BD producer ring" sounds important.
> >>>> I suspect something will break badly if I simply remove it without
> >>>> replacing it with something more elaborate. I guess that along with
> >>>> re-filling the rx BD producer ring it can also process incoming
> >>>> packets. Is that possible?
> >>>>
> >>>
> >>> Yes, napi[1] work may run on napi[2]'s CPU, but it generally won't
> >>> process any rx packets because the producer index of napi[1] has not
> >>> changed. If the producer index did change, we get a poll from the ISR
> >>> for napi[1] to process packets. So the explicit schedule is mostly
> >>> used to re-fill rx buffers. However, there is a small window where the
> >>> producer index has been incremented but the ISR has not fired yet; in
> >>> that window it may process a small number of packets. But I don't
> >>> think this should lead to a reorder problem.
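> >>>
> >>> For reference, the gate is just the producer-index comparison in
> >>> tg3_poll_work() (the if-statement is from the driver, the comment is
> >>> mine):
> >>>
> >>>         /* rx packets are processed whenever the hw return-ring
> >>>          * producer index has moved past our consumer index, no
> >>>          * matter which CPU the poll happens to run on
> >>>          */
> >>>         if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
> >>>                 work_done += tg3_rx(tnapi, budget - work_done);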
> >>>
> >>
> >> I tried to reproduce without using bridge and veth interfaces, and it seems
> >> like it's not reproducible, so traffic forwarding via a bridge interface may
> >> be necessary. It also does not happen if traffic load is low, but moderate
> >> load is enough - e.g. two 100 Mbps streams with 130-byte packets. It's easier
> >> to reproduce with a higher load.
> >>
> >> With about the same setup as in the original message (bridge + veth, 2
> >> network namespaces) and the irqbalance daemon stopped, if traffic flows
> >> via enp2s0f0-rx-2 and enp2s0f0-rx-4, there is no re-ordering.
> >> enp2s0f0-rx-1 still gets some interrupts, but at a much lower rate
> >> compared to 2 and 4.
> >>
> >> namespace 1:
> >> # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >> - - - - - - - - - - - - - - - - - - - - - - - - -
> >> [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
> >> [ 4] 0.00-300.00 sec 6.72 GBytes 192 Mbits/sec 0.008 ms 3805/55508325 (0.0069%)
> >> [ 4] Sent 55508325 datagrams
> >>
> >> iperf Done.
> >>
> >> namespace 2:
> >> # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >> - - - - - - - - - - - - - - - - - - - - - - - - -
> >> [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
> >> [ 4] 0.00-300.00 sec 6.83 GBytes 196 Mbits/sec 0.005 ms 3873/56414001 (0.0069%)
> >> [ 4] Sent 56414001 datagrams
> >>
> >> iperf Done.
> >>
> >>
> >> With the same configuration but a different IP address, so that
> >> enp2s0f0-rx-1 is used instead of enp2s0f0-rx-4, there is re-ordering.
> >>
> >>
> >> namespace 1 (client IP was changed):
> >> # iperf3 -u -c server_ip -p 5000 -R -b 300M -t 300 -l 130
> >> - - - - - - - - - - - - - - - - - - - - - - - - -
> >> [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
> >> [ 4] 0.00-300.00 sec 6.32 GBytes 181 Mbits/sec 0.007 ms 8506/52172059 (0.016%)
> >> [ 4] Sent 52172059 datagrams
> >> [SUM] 0.0-300.0 sec 2452 datagrams received out-of-order
> >>
> >> iperf Done.
> >>
> >> namespace 2:
> >> # iperf3 -u -c server_ip -p 5001 -R -b 300M -t 300 -l 130
> >> - - - - - - - - - - - - - - - - - - - - - - - - -
> >> [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
> >> [ 4] 0.00-300.00 sec 6.59 GBytes 189 Mbits/sec 0.006 ms 6302/54463973 (0.012%)
> >> [ 4] Sent 54463973 datagrams
> >>
> >> iperf Done.
> >>
> >> Swapping the IP addresses of these namespaces also changes which
> >> namespace exhibits the issue - it follows the IP address.
> >>
> >>
> >> Is there something I could check to confirm whether this behavior is
> >> related to the napi_schedule(&tp->napi[1].napi) call?
> >
> > In tg3_msi_1shot() you could store the CPU assigned to napi[1] (in a
> > new field inside struct tg3_napi), and then in tg3_poll_work() you can
> > add another check after
> >
> > if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr)
> >
> > something like
> >
> > if (tnapi == &tp->napi[1] && tnapi->assigned_cpu == smp_processor_id())
> >
> > and only then execute tg3_rx(). This may stop napi[1] from reading rx
> > packets on the (foreign) CPU from which the refill is scheduled.
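> >
> > A rough sketch of the idea (field name and exact placement are
> > illustrative):
> >
> > /* new member in struct tg3_napi */
> >         int assigned_cpu;
> >
> > /* in tg3_msi_1shot(): remember the CPU servicing this vector */
> >         tnapi->assigned_cpu = smp_processor_id();
> >
> > /* in tg3_poll_work(): run the rx pass for napi[1] only on its own CPU */
> >         if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
> >                 if (tnapi != &tp->napi[1] ||
> >                     tnapi->assigned_cpu == smp_processor_id())
> >                         work_done += tg3_rx(tnapi, budget - work_done);
> >         }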
> >
>
> Didn't work for me - perhaps I did something wrong. If tg3_rx() is not
> called, there's an infinite loop, and after I added "work_done = budget;"
> it still doesn't work - traffic does not flow.
>
I think the easiest way is to modify the tg3_rx() calling condition
inside tg3_poll_work() like below:

        if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
                if (tnapi != &tp->napi[1] ||
                    (tnapi == &tp->napi[1] && !tp->rx_refill)) {
                        work_done += tg3_rx(tnapi, budget - work_done);
                }
        }

This will prevent reading rx packets when napi[1] is scheduled only for
refill. Can you see if this works?
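
Equivalently (same logic - the first clause already passes every ring
other than napi[1], so the second comparison is implied):

        if (*(tnapi->rx_rcb_prod_idx) != tnapi->rx_rcb_ptr) {
                /* skip the rx pass only when napi[1] was scheduled by
                 * another ring purely to re-fill the rx BD producer
                 * ring (tp->rx_refill set there); the intent is that a
                 * poll triggered by napi[1]'s own interrupt still
                 * processes packets
                 */
                if (tnapi != &tp->napi[1] || !tp->rx_refill)
                        work_done += tg3_rx(tnapi, budget - work_done);
        }
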
> I added logging instead:
>
> +        if (tnapi->assigned_cpu != smp_processor_id())
> +                net_dbg_ratelimited("tg3 napi %ld cpu %d %d",
> +                        tnapi - tp->napi, tnapi->assigned_cpu, smp_processor_id());
>          napi_gro_receive(&tnapi->napi, skb);
>
> And with two iperf3 streams, there's a lot of messages:
> [ 3242.007898] tg3 napi 1 cpu 10 48
> [ 3242.007899] tg3 napi 1 cpu 10 48
> [ 3242.007911] tg3 napi 1 cpu 10 48
> [ 3242.007913] tg3 napi 1 cpu 10 48
> [ 3247.011898] net_ratelimit: 546560 callbacks suppressed
> [ 3247.011900] tg3 napi 1 cpu 10 48
> [ 3247.011902] tg3 napi 1 cpu 10 48
> [ 3247.011904] tg3 napi 1 cpu 10 48
> [ 3247.011905] tg3 napi 1 cpu 10 48
> [ 3247.011906] tg3 napi 1 cpu 10 48
> [ 3247.011928] tg3 napi 1 cpu 10 48
> [ 3247.011929] tg3 napi 1 cpu 10 48
> [ 3247.011931] tg3 napi 1 cpu 10 48
> [ 3247.011932] tg3 napi 1 cpu 10 48
> [ 3247.011933] tg3 napi 1 cpu 10 48
> [ 3252.015885] net_ratelimit: 539574 callbacks suppressed
> [ 3252.015888] tg3 napi 1 cpu 10 48
> [ 3252.015889] tg3 napi 1 cpu 10 48
> [ 3252.015891] tg3 napi 1 cpu 10 48
> [ 3252.015892] tg3 napi 1 cpu 10 48
>
> cpu 10, enp2s0f0-rx-1
> # cat /proc/irq/106/effective_affinity
> 00000000,00000000,00000400
>
> cpu 48, enp2s0f0-rx-4
> # cat /proc/irq/109/effective_affinity
> 00000000,00010000,00000000
>
> Among all the printed messages, only "napi 1" appears.
>
> There's also a difference in the interrupt threads' CPU usage:
> 201570 root -51 0 0 0 0 R 64.3 0.0 1:46.91 irq/109-enp2s0f
> 204687 root 20 0 9628 2084 1976 R 37.5 0.0 1:04.74 iperf3
> 205354 root 20 0 9628 2060 1948 R 36.7 0.0 1:01.06 iperf3
> 201567 root -51 0 0 0 0 R 23.3 0.0 0:44.45 irq/106-enp2s0f
>
> The sender is CPU-bound, so there's no overload on the RX side with tg3.
>
> --
> Thanks
> Vitalii
>