Message-ID: <CAA93jw78pvJENnj5ob7WX8GV676CZprYADWeQjaciasopNTv5A@mail.gmail.com>
Date:   Fri, 13 Jan 2023 08:57:48 -0800
From:   Dave Taht <dave.taht@...il.com>
To:     Alexander Duyck <alexander.duyck@...il.com>
Cc:     Daniele Palmas <dnlplm@...il.com>,
        David Miller <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Eric Dumazet <edumazet@...gle.com>,
        Subash Abhinov Kasiviswanathan <quic_subashab@...cinc.com>,
        Sean Tranchetti <quic_stranche@...cinc.com>,
        Jonathan Corbet <corbet@....net>,
        Alexander Lobakin <alexandr.lobakin@...el.com>,
        Gal Pressman <gal@...dia.com>,
        Bjørn Mork <bjorn@...k.no>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        netdev@...r.kernel.org
Subject: Re: [PATCH net-next v4 0/3] add tx packets aggregation to ethtool and rmnet

Dear Alexander:

Thank you for taking a look at this thread. In earlier tests, the
aggregation was indeed an improvement over the default behavior, but
the amount of data that ended up living in the device was, best case,
150ms and, worst case, 25 seconds in the flent tcp_nup case. A better
test would be the staggered-start tcp_4up_squarewave test, which would
show that, with the current overall structure of this portion of the
stack, only one TCP flow out of the four is able to start once that
buffer gets filled.

The shiny new Rust-based Crusader test (
https://github.com/Zoxc/crusader ) now also has a nice staggered-start
test that shows this problem in more detail.

I was so peeved about 4g/5g behaviors like this in general that I was
going to dedicate a blog entry to it; this recent rant only touches
upon the real problem with engineering to the test, here:

https://blog.cerowrt.org/post/speedtests/

Unfortunately your post is more about leveraging xmit_more - which is
good - but the BQL structure for Ethernet itself leverages the concept
of a completion interrupt (fired when the packets actually hit the air
or wire), which I dearly wish existed as some previously unknown,
single-line hack to the driver, or a message in the USB API, for 4g/5g
etc.
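
For reference, since BQL keeps coming up: a minimal sketch, assuming a
hypothetical "foo" driver that does get a real Tx completion interrupt,
of the two hooks involved. Only netdev_get_tx_queue() and the
netdev_tx_* calls are the actual kernel API; everything named foo_* is
invented for illustration.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);
        unsigned int len = skb->len;

        /* ... hand the skb to the hardware Tx ring here ... */

        netdev_tx_sent_queue(txq, len); /* BQL: these bytes are now in flight */
        return NETDEV_TX_OK;
}

/*
 * Called from the completion interrupt, i.e. once the packets have
 * actually left the device ("hit the air or wire"). This is exactly
 * the signal I wish the USB path for 4g/5g devices gave us.
 */
static void foo_tx_complete(struct net_device *dev, unsigned int pkts,
                            unsigned int bytes)
{
        struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

        netdev_tx_completed_queue(txq, pkts, bytes); /* BQL: release budget */
}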

On Fri, Jan 13, 2023 at 8:17 AM Alexander Duyck
<alexander.duyck@...il.com> wrote:
>
> On Fri, Jan 13, 2023 at 7:50 AM Daniele Palmas <dnlplm@...il.com> wrote:
> >
> > Hello Alexander,
> >
> > On Fri, 13 Jan 2023 at 00:00, Alexander H Duyck
> > <alexander.duyck@...il.com> wrote:
> > >
> > > On Wed, 2023-01-11 at 14:05 +0100, Daniele Palmas wrote:
> > > > Hello maintainers and all,
> > > >
> > > > this patchset implements tx qmap packet aggregation in rmnet and generic
> > > > ethtool support for it.
> > > >
> > > > Some low-category ThreadX-based modems are not capable of reaching the maximum
> > > > allowed throughput in both tx and rx during a bidirectional test if tx packet
> > > > aggregation is not enabled.
> > >
> > > One question I would have about this is whether you are making use of Byte
> > > Queue Limits and netdev_xmit_more at all? I know for high speed devices
> > > most of us added support for xmit_more because PCIe bandwidth was
> > > limiting when we were writing the Tx ring indexes/doorbells with every
> > > packet. To overcome that we added netdev_xmit_more which told us when
> > > the Qdisc had more packets to give us. This allowed us to better
> > > utilize the PCIe bus by bursting packets through without adding
> > > additional latency.
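
(As a rough illustration of the xmit_more pattern described above: this
is only a sketch, the bar_* ring helpers are invented, while
netdev_xmit_more() and netif_xmit_stopped() are the real helpers.)

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical driver internals, declared only so the sketch is complete: */
static void bar_post_descriptor(struct net_device *dev, struct sk_buff *skb);
static void bar_ring_doorbell(struct net_device *dev);

static netdev_tx_t bar_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct netdev_queue *txq =
                netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));

        bar_post_descriptor(dev, skb); /* fill a Tx descriptor, no doorbell yet */

        /*
         * Only perform the expensive PCIe doorbell write when the qdisc has
         * no further packets queued for us, or the queue is being stopped.
         */
        if (!netdev_xmit_more() || netif_xmit_stopped(txq))
                bar_ring_doorbell(dev);

        return NETDEV_TX_OK;
}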
> > >
> >
> > no, I was not aware of BQL: this development has been basically
> > modelled on what other mobile broadband drivers do (e.g.
> > cdc_mbim/cdc_ncm, Qualcomm downstream rmnet implementation), that are
> > not using BQL.
> >
> > If I understand the documentation properly,
> >
> > rmnet0/queues/tx-0/byte_queue_limits/limit_max
> >
> > would be the candidate for replacing ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES.
>
> Yes, the general idea is that you end up targeting the upper limit for
> how many frames can be sent in a single burst.
>
> > But I can't find anything for ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES
> > and ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS, something that should work
> > in combination with the bytes limit: at least the first one is
> > mandatory, since the modem can't receive more than a certain number of frames
> > (this is a variable value depending on the modem model and is
> > collected through userspace tools).
>
> In terms of MAX_FRAMES there isn't necessarily anything like that, but
> at the same time it is something that is already controlled by the
> netdev itself, which uses netif_stop_queue or netif_stop_subqueue when
> there isn't space to store another frame. As such, most devices
> control this by just manipulating their descriptor ring size via
> "ethtool -G <dev> tx <N>".
>
> As far as TIME_USECS goes, that is something I tried to propose a
> decade ago and was essentially given a hard "NAK" on before xmit_more
> was introduced. We shouldn't be adding latency when we don't need to.
> Between GSO and xmit_more you should find that the network stack
> itself will naturally want to give you larger bursts of frames with
> xmit_more set. In addition, adding latency can mess with certain TCP
> algorithms and cause problematic behaviors.
>
> > ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES also works as a way to determine
> > whether tx aggregation has been enabled by the userspace tool managing
> > the qmi requests; otherwise, no aggregation should be performed.
>
> Is there a specific reason why you wouldn't want to take advantage of
> aggregation that is already provided by the stack in the form of
> things such as GSO and the qdisc layer? I know most of the high speed
> NICs are always making use of xmit_more since things like GSO can take
> advantage of it to increase the throughput. Enabling BQL is a way of
> taking that one step further and enabling the non-GSO cases.
>
> > > > I verified this problem with rmnet + qmi_wwan by using an MDM9207 Cat. 4 based modem
> > > > (50Mbps/150Mbps max throughput). What is actually happening is pictured at
> > > > https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view
> > > >
> > > > Testing with iperf TCP, when the rx and tx flows are tested individually there's no
> > > > issue in tx and only minor issues in rx (not able to reach max throughput). When there
> > > > are concurrent tx and rx flows, tx throughput has a huge drop; rx has a minor one, but
> > > > it is still present.
> > > >
> > > > The same scenario with tx aggregation enabled is pictured at
> > > > https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view
> > > > showing a regular graph.
> > > >
> > > > This issue does not happen with high-cat modems (e.g. SDX20), or at least it
> > > > does not happen at the throughputs I'm able to test currently: maybe the same
> > > > could happen when moving close to the maximum rates supported by those modems.
> > > > Anyway, having the tx aggregation enabled should not hurt.
> > > >
> > > > The first attempt to solve this issue was in the qmi_wwan qmap implementation;
> > > > see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/
> > > >
> > > > However, it turned out that rmnet was a better candidate for the implementation.
> > > >
> > > > Moreover, Greg and Jakub also suggested using ethtool for the configuration:
> > > > not sure if I got their advice right, but this patchset also adds generic ethtool
> > > > support for tx aggregation.
> > >
> > > I have concerns about this essentially moving queueing disciplines down
> > > into the device. The idea of doing Tx aggregation seems like something
> > > that should be done with the qdisc rather than the device driver.
> > > Otherwise we are looking at having multiple implementations of this
> > > aggregation floating around. It seems like it would make more sense to
> > > have this batching happen at the qdisc layer, and then the qdisc layer
> > > would pass down a batch of frames w/ xmit_more set to indicate it is
> > > flushing the batch.
> >
> > Honestly, I'm not expert enough to give a reliable opinion about this.
> >
> > I feel like these settings are more related to the hardware, requiring
> > also a configuration on the hardware itself done by the user, so
> > ethtool would seem to me a good choice, but I may be biased since I
> > did this development :-)
>
> Yeah, I get that. I went through something similar when I submitted a
> patch to defer Tx tail writes in order to try and improve PCIe
> throughput. I would be open to revisiting this if we give xmit_more
> and BQL a try and they don't take care of it, but from my past
> experience, odds are the combination will resolve most of what you are
> seeing without adding additional latency. At a minimum, xmit_more
> should show a significant improvement in Tx throughput just from the
> benefit of GSO sending bursts of frames through that can easily be
> batched.



-- 
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC
