Message-Id: <1523244055.3920989.1331005248.4F27E86C@webmail.messagingengine.com>
Date: Sun, 08 Apr 2018 23:20:55 -0400
From: "Jacob S. Moroni" <mail@...emoroni.com>
To: madalin.bucur@....com
Cc: netdev@...r.kernel.org
Subject: Re: DPAA TX Issues
On Sun, Apr 8, 2018, at 7:46 PM, Jacob S. Moroni wrote:
> Hello Madalin,
>
> I've been experiencing some issues with the DPAA Ethernet driver,
> specifically related to frame transmission. Hopefully you can point
> me in the right direction.
>
> TLDR: Attempting to transmit faster than a few frames per second causes
> the TX FQ CGR to enter the congested state and remain there forever,
> even after transmission stops.
>
> The hardware is a T2080RDB, running from the tip of net-next, using
> the standard t2080rdb device tree and corenet64_smp_defconfig kernel
> config. No changes were made to any of the files. The issue occurs
> with 4.16.1 stable as well. In fact, the only time I've been able
> to achieve reliable frame transmission was with the SDK 4.1 kernel.
>
> For my tests, I'm running iperf3 both with and without the -R
> option (send/receive). When using a USB Ethernet adapter, there
> are no issues.
>
> The issue is that the TX frame queues appear to get "stuck" when
> transmitting at rates greater than a few frames per second. Ping works
> fine, but anything that could potentially enqueue multiple TX frames at
> once seems to trigger the problem.
>
> If I run iperf3 in reverse mode (with the T2080RDB receiving), then
> I can achieve ~940 Mbps, but this is also somewhat unreliable.
>
> If I run it with the T2080RDB transmitting, the test will never
> complete. Sometimes it starts transmitting for a few seconds then stops,
> and other times it never even starts. This also seems to force the
> interface into a bad state.
>
> The ethtool stats show that the interface has entered
> congestion a few times, and that it's currently congested. The fact
> that it's currently congested even after stopping transmission
> indicates that the FQ somehow stopped being drained. I've also
> noticed that whenever this issue occurs, the TX confirmation
> counters are always less than the TX packet counters.
>
> When it gets into this state, I can see the memory usage climbing until
> it reaches roughly the CGR threshold (about 100 MB); a sketch of the
> congestion-state behavior this implies follows the quoted message.
>
> Any idea what could prevent the TX FQ from being drained? My first
> guess was flow control, but it's completely disabled.
>
> I tried messing with the egress congestion threshold, workqueue
> assignments, etc., but nothing seemed to have any effect.
>
> If you need any more information or want me to run any tests,
> please let me know.
>
> Thanks,
> --
> Jacob S. Moroni
> mail@...emoroni.com
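For reference, here's a rough sketch of the congestion-state behavior
described in the quoted report. The names (tx_cg_state_change,
TX_CG_THRESHOLD_BYTES) are made up for illustration and are not the actual
dpaa_eth/qman symbols; the point is just that the "currently congested"
stat only clears when a congestion-exit notification fires, which can't
happen if the TX FQs stop being drained:

/* Illustrative only -- hypothetical names, not real dpaa_eth/qman code. */
#include <linux/atomic.h>
#include <linux/types.h>

#define TX_CG_THRESHOLD_BYTES	(100 * 1024 * 1024)	/* roughly the ~100 MB observed */

static atomic_t cg_congested;		/* what "currently congested" would report */
static atomic_t cg_congestion_count;	/* "entered congestion a few times" */

/*
 * Hypothetical state-change callback, invoked by the queue manager when
 * the congestion group crosses its entry or exit threshold.  If the TX
 * FQs stop being drained, the exit notification never arrives, so the
 * group remains "congested" even after transmission stops.
 */
static void tx_cg_state_change(bool congested)
{
	if (congested) {
		atomic_inc(&cg_congestion_count);
		atomic_set(&cg_congested, 1);
	} else {
		atomic_set(&cg_congested, 0);
	}
}
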
It turns out that irqbalance was causing all of the issues. After
disabling it and rebooting, the interfaces worked perfectly.
Perhaps there's an issue with how the qman/bman portals are defined
as per-cpu variables.
During the portal's probe, the CPUs are assigned one by one, and each
CPU's per-cpu portal is passed into request_irq() as the dev_id argument.
However, it seems like if the IRQ affinity changes, the ISR could end up
running with a reference to a per-cpu variable belonging to another CPU.
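To illustrate the pattern I mean, here's a minimal sketch (the names are
made up -- this is not the actual qman/bman portal code):

/* Minimal sketch of the per-cpu portal + request_irq pattern described
 * above -- hypothetical names, not the real qman/bman portal code.
 */
#include <linux/interrupt.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>

struct my_portal {
	unsigned int cpu;	/* CPU this portal was assigned to at probe */
	void __iomem *regs;	/* portal register window */
};

static DEFINE_PER_CPU(struct my_portal, my_portals);

static irqreturn_t my_portal_isr(int irq, void *dev_id)
{
	/*
	 * dev_id is the portal of the CPU chosen at probe time.  If
	 * irqbalance later moves this IRQ, the handler runs on a different
	 * CPU: dev_id still points at the original CPU's portal, while
	 * anything using this_cpu_ptr(&my_portals) now resolves to a
	 * portal the ISR was never meant to service.
	 */
	struct my_portal *p = dev_id;

	/* ... dequeue ring / TX confirmation processing on p->regs ... */
	return IRQ_HANDLED;
}

static int my_portal_probe_one(unsigned int cpu, unsigned int irq)
{
	/* CPUs are assigned one by one; each gets its own portal and IRQ. */
	struct my_portal *p = per_cpu_ptr(&my_portals, cpu);

	p->cpu = cpu;

	/*
	 * Passing IRQF_NOBALANCING here, or pinning the IRQ with
	 * irq_set_affinity(irq, cpumask_of(cpu)), would keep irqbalance
	 * from moving the interrupt off the owning CPU.
	 */
	return request_irq(irq, my_portal_isr, 0, "my-portal", p);
}

If that's what's going on, pinning the portal IRQs to their owning CPUs
(or marking them IRQF_NOBALANCING) should make the interfaces immune to
irqbalance, but I haven't confirmed that yet.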
At least I know where to look now.
- Jake