Date: Tue, 14 Nov 2023 20:09:01 +0100
From: Eric Dumazet <>
To: Anup Agarwal <>
Subject: Re: Potential bug in linux TCP pacing implementation

On Tue, Nov 14, 2023 at 7:59 PM Anup Agarwal <> wrote:
> Hi Eric,
> Sorry for the duplicate email. I am sending again as I got an
> automated response from that my email was
> rejected because of using HTML formatting. I have resent a plain text
> version. Hopefully this email goes through.
> I saw that you are the maintainer for the linux networking TCP stack
> (, and wanted to report a
> potential bug. Please let me know if I should be reaching out to
> someone else or using some other method to report.
> Based on my understanding there is a bug in the kernel's pacing
> implementation. It does not faithfully follow the pacing rate set by
> the congestion control algorithm (CCA).
> Description:
> For enforcing pacing, I think the kernel computes "tcp_wstamp_ns" or
> the time to deliver the next packet. This computation is only done
> after transmission of packets "tcp_update_skb_after_send" in
> "net/ipv4/tcp_output.c". However, the rate, i.e., "sk_pacing_rate" can
> be updated when packets are received (e.g., when the CCA gets a
> "rate_sample" for an ACK). As a result, if the CCA changes the rate
> frequently, the kernel paces using a stale value.
> Example:
> For a concrete example, say the pacing rate is 1 pkt per second at
> t=0, and a packet was just transmitted at t=0, so tcp_wstamp_ns is
> set to t=1 sec. Now say an ACK arrives at t=1us and causes the CCA
> to update the rate to 100 pkts per second. The next packet could
> then be sent at 1us + 0.01s, but since tcp_wstamp_ns remains at
> t=1 sec, roughly 100 pkts worth of transmission opportunity is lost.
> Thoughts:
> I guess the goal of the kernel pacing is to enforce an upper bound on
> transmission rate (or lower bound on inter-send time), rather than
> follow the "sk_pacing_rate" as a transmission rate directly. In that
> sense it is not a bug, i.e., the time between sends is never shorter
> than inverse sk_pacing_rate. But if sk_pacing_rate is changed
> frequently by large enough magnitudes, the time between sends can be
> much longer than the inverse pacing rate. Due to not incorporating all
> updates to "sk_pacing_rate", the kernel is very conservative and
> misses many send opportunities.
> Why this matters:
> I was implementing a rate based CCA that does not use cwnd at all.
> There were cases when I had to restrict inflight and would temporarily
> set sk_pacing_rate close to zero. When I reset the sk_pacing_rate, the
> kernel does not start using this rate for a long time as it has cached
> the time to next send using the "close to zero" rate. Rate based CCAs
> are more robust to jitter in the network. To me it seems useful to
> actually use pacing rate as transmission rate instead of just an upper
> bound on transmission rate. Fundamentally by setting a rate, a CCA can
> implement any tx behavior, whereas cwnd limits the possible behaviors.
> Even if folks disagree with this and want to interpret pacing rate as
> an upper bound on tx rate rather than tx rate directly, I think the
> enforcement can still be modified to avoid this bug and follow
> sk_pacing_rate more closely.
> Potential fix:
> // Update the credit whenever (1) sk_pacing_rate is changed, and
> // (2) before checking whether transmission is allowed by pacing:
> credit_in_bytes += last_sk_pacing_rate * (now - last_credit_update)
> last_credit_update = now
> last_sk_pacing_rate = sk_pacing_rate
> // The idea is that last_sk_pacing_rate was set by the CCA for the
> // time interval [last_credit_update, now), and we integrate (sum
> // up) this rate over the interval to compute credits.
> // I think this is also robust to OS jitter, as credits accrue even
> // across intervals missed due to scheduling delays.
> // To check whether pacing allows sending a packet, one can check:
> if (sent_till_now + pkt_size <= credit_in_bytes)
> Please let me know if you have additional thoughts/feedback.

This was a conscious choice, really. More state in the TCP socket (or
in FQ) means higher costs.

For conventional CC, the typical difference between pacing rates P1
and P2, together with usual packet sizes (typically 1 ms of airtime),
makes the error pretty small in practice.

Also, pacing is offloaded (either in FQ qdisc, or in timer wheel on
some NIC hardware).

With the EDT model, you can probably implement whatever scheme you
prefer in your CC module, storing extra state in the CC private state.
