netdev - Re: Throughput regression with `tcp: refine TSO autosizing`

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+BoTQn3KRym0qCJ+fLav5LQgjfiN_-Eqz0xPSc2eXJ-ijNjQw@mail.gmail.com>
Date:	Mon, 9 Feb 2015 14:47:56 +0100
From:	Michal Kazior <michal.kazior@...to.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	Neal Cardwell <ncardwell@...gle.com>,
	linux-wireless <linux-wireless@...r.kernel.org>,
	Network Development <netdev@...r.kernel.org>,
	Eyal Perry <eyalpe@....mellanox.co.il>
Subject: Re: Throughput regression with `tcp: refine TSO autosizing`

On 6 February 2015 at 15:09, Michal Kazior <michal.kazior@...to.com> wrote:
> On 6 February 2015 at 14:53, Eric Dumazet <eric.dumazet@...il.com> wrote:
>> On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:
>>
>>> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
>>> of TX completion delay. But this would require yet another expensive
>>> call to ktime_get() if HZ < 1000.
>>>
>>> Then tcp_write_xmit() could use it to adjust :
>>>
>>>    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>>>
>>> to
>>>
>>>    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate
>>>
>>>    limit = max(2 * skb->truesize, amount / 1000);
>>>
>>> I'll cook a patch.
>>
>> Hmm... doing this in all protocols would be too expensive,
>> and we do not want to include time spent in qdiscs.
>>
>> wifi could eventually do that, providing in skb->tx_completion_delay_us
>> the time spent in wifi driver.
>>
>> This way, we would have no penalty for network devices doing normal skb
>> orphaning (loopback interface, ethernet, ...)
>
> I'll play around with this idea and report back later.

I'm able to get 600mbps with 5 flows and 250mbps with 1 flow, i.e.
same as before the regression. I'm attaching the patch at the end of
my mail - is this approach viable?

I wonder if there's anything that can be done to allow 600mbps (line
rate) on 1 flow with ath10k without tweaking tcp_limit_output_bytes
(you can't expect end-users to tweak this).

Perhaps tcp_limit_output_bytes should also consider tx_completion_delay, e.g.:

  amount = sk->sk_tx_completion_delay_us;
  amount *= sk->sk_pacing_rate >> 10;
  limit = max(2 * skb->truesize, amount >> 10);
  max_limit = sysctl_tcp_limit_output_bytes;
  max_limit *= 1 + (sk->sk_tx_completion_delay_us / USEC_PER_MSEC);
  limit = min(u32, limit, max_limit);

With this I get ~400mbps on 1 flow. If I add the original 1ms extra
delay from your formula to tx_completion_delay I fill in ath10k I get
nearly line rate in 1 flow (almost 600mbps; it hops between 570-620).
Decreasing tcp_limit_output_bytes decreases throughput (e.g. 64K gives
300mbps, 32K gives 180mbps, 16K gives 110mbps). Multiple flows in
iperf seem unbalanced with 128K limit, but look okay with 32K).


Michał


diff --git a/drivers/net/wireless/ath/ath10k/core.h
b/drivers/net/wireless/ath/ath10k/core.h
index 3be3a59..4ff0ae8 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -82,6 +82,7 @@ struct ath10k_skb_cb {
        dma_addr_t paddr;
        u8 eid;
        u8 vdev_id;
+       ktime_t stamp;

        struct {
                u8 tid;
diff --git a/drivers/net/wireless/ath/ath10k/mac.c
b/drivers/net/wireless/ath/ath10k/mac.c
index 15e47f4..5efb2a7 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2620,6 +2620,7 @@ static void ath10k_tx(struct ieee80211_hw *hw,
        if (info->flags & IEEE80211_TX_CTL_NO_CCK_RATE)
                ath10k_dbg(ar, ATH10K_DBG_MAC,
"IEEE80211_TX_CTL_NO_CCK_RATE\n");

+       ATH10K_SKB_CB(skb)->stamp = ktime_get();
        ATH10K_SKB_CB(skb)->htt.is_offchan = false;
        ATH10K_SKB_CB(skb)->htt.tid = ath10k_tx_h_get_tid(hdr);
        ATH10K_SKB_CB(skb)->vdev_id = ath10k_tx_h_get_vdev_id(ar, vif);
diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
b/drivers/net/wireless/ath/ath10k/txrx.c
index 3f00cec..0d5539b 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -15,6 +15,7 @@
  * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
  */

+#include <net/sock.h>
 #include "core.h"
 #include "txrx.h"
 #include "htt.h"
@@ -82,6 +83,13 @@ void ath10k_txrx_tx_unref(struct ath10k_htt *htt,

        ath10k_report_offchan_tx(htt->ar, msdu);

+       if (msdu->sk) {
+               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_us) =
+                               ktime_to_ns(ktime_sub(ktime_get(),
+                                                     skb_cb->stamp)) /
+                               NSEC_PER_USEC;
+       }
+
        info = IEEE80211_SKB_CB(msdu);
        memset(&info->status, 0, sizeof(info->status));
        trace_ath10k_txrx_tx_unref(ar, tx_done->msdu_id);
diff --git a/include/net/sock.h b/include/net/sock.h
index 2210fec..6b15d71 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -390,6 +390,7 @@ struct sock {
        int                     sk_wmem_queued;
        gfp_t                   sk_allocation;
        u32                     sk_pacing_rate; /* bytes per second */
+       u32                     sk_tx_completion_delay_us;
        u32                     sk_max_pacing_rate;
        netdev_features_t       sk_route_caps;
        netdev_features_t       sk_route_nocaps;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b..5e249bf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1996,6 +1996,7 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
        max_segs = tcp_tso_autosize(sk, mss_now);
        while ((skb = tcp_send_head(sk))) {
                unsigned int limit;
+               unsigned int amount;

                tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
                BUG_ON(!tso_segs);
@@ -2053,7 +2054,9 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
                 * of queued bytes to ensure line rate.
                 * One example is wifi aggregation (802.11 AMPDU)
                 */
-               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+               amount = sk->sk_tx_completion_delay_us *
+                        (sk->sk_pacing_rate >> 10);
+               limit = max(2 * skb->truesize, amount >> 10);
                limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

                if (atomic_read(&sk->sk_wmem_alloc) > limit) {
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html