netdev - Potential bug in linux TCP pacing implementation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAFYr1XM_UGejZdnUYYBQomq0jBDMpV+HWCd1ZDorD=xOGXq4CQ@mail.gmail.com>
Date: Tue, 14 Nov 2023 13:59:15 -0500
From: Anup Agarwal <anupa@...rew.cmu.edu>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org
Subject: Potential bug in linux TCP pacing implementation

Hi Eric,

Sorry for the duplicate email. I am sending again as I got an
automated response from netdev@...r.kernel.org that my email was
rejected because of using HTML formatting. I have resent a plain text
version. Hopefully this email goes through.

I saw that you are the maintainer for the linux networking TCP stack
(https://www.kernel.org/doc/linux/MAINTAINERS), and wanted to report a
potential bug. Please let me know if I should be reaching out to
someone else or using some other method to report.

Based on my understanding there is a bug in the kernel's pacing
implementation. It does not faithfully follow the pacing rate set by
the congestion control algorithm (CCA).

Description:
For enforcing pacing, I think the kernel computes "tcp_wstamp_ns" or
the time to deliver the next packet. This computation is only done
after transmission of packets "tcp_update_skb_after_send" in
"net/ipv4/tcp_output.c". However, the rate, i.e., "sk_pacing_rate" can
be updated when packets are received (e.g., when the CCA gets a
"rate_sample" for an ACK). As a result if the rate is changed by the
CCA frequently, then the kernel uses a stale pacing value.

Example:
For a concrete example, say the pacing rate is 1 pkt per second at
t=0, and a packet was just transmitted at t=0, and the tcp_wstamp_ns
is then set to  t=1 sec. Now say an ACK arrived at t=1us and caused
the CCA to update rate to 100 pkts per second. The next packet could
then be sent at 1us + 0.01s. But since tcp_wstamp_ns is set to 1 sec.
So roughly 100 pkts worth of transmission opportunity is lost.

Thoughts:
I guess the goal of the kernel pacing is to enforce an upper bound on
transmission rate (or lower bound on inter-send time), rather than
follow the "sk_pacing_rate" as a transmission rate directly. In that
sense it is not a bug, i.e., the time between sends is never shorter
than inverse sk_pacing_rate. But if sk_pacing_rate is changed
frequently by large enough magnitudes, the time between sends can be
much longer than the inverse pacing rate. Due to not incorporating all
updates to "sk_pacing_rate", the kernel is very conservative and
misses many send opportunities.

Why this matters:
I was implementing a rate based CCA that does not use cwnd at all.
There were cases when I had to restrict inflight and would temporarily
set sk_pacing_rate close to zero. When I reset the sk_pacing_rate, the
kernel does not start using this rate for a long time as it has cached
the time to next send using the "close to zero" rate. Rate based CCAs
are more robust to jitter in the network. To me it seems useful to
actually use pacing rate as transmission rate instead of just an upper
bound on transmission rate. Fundamentally by setting a rate, a CCA can
implement any tx behavior, whereas cwnd limits the possible behaviors.
Even if folks disagree with this and want to interpret pacing rate as
an upper bound on tx rate rather than tx rate directly, I think the
enforcement can still be modified to avoid this bug and follow
sk_pacing_rate more closely.

Potential fix:
// Update credit whenever (1) sk_pacing_rate is changed, and (2)
before checking if transmission is allowed by pacing.
credit_in_bytes = last_sk_pacing_rate * (now - last_credit_update)
last_credit_update = now
last_sk_pacing_rate = sk_pacing_rate
// The idea is that last_sk_pacing_rate was set by the CCA for the
time interval [last_credit_update, now). And we integrate (sum up)
this rate over the interval to computing credits.
// I think this is also robust to OS jitter as credits increase even
for any intervals missed due to scheduling delays.

// To check if it is ok to send pkt due to pacing, one can just check
if (sent_till_now + pkt_size <= credit_in_bytes)

Please let me know if you have additional thoughts/feedback.

Thanks,
Anup

PhD Student in CS at CMU
https://www.cs.cmu.edu/~anupa/