Message-ID: <CAFYr1XMRLx_ZsJDxjZh5cv5Nx3gSWiiY76VZE0610gw284-Wcg@mail.gmail.com>
Date: Tue, 14 Nov 2023 15:39:09 -0500
From: Anup Agarwal <anupa@...rew.cmu.edu>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org
Subject: Re: Potential bug in linux TCP pacing implementation

Thanks for your response.

Yeah, I think the current pacing implementation works fine for the
currently deployed CCAs.

I wanted to clarify: the issue is caused by a temporal change in
sk_pacing_rate and is independent of pkt sizes or network parameters
(bandwidth, rtt, etc.). If the sk_pacing_rate goes from r1 = 0.1 pkts per
ms (~1.2 Mbps for ~1500B MTU) to r2 = 10 pkts per ms (~120 Mbps), then
the opportunity to send 99 pkts (= r2/r1 - 1) is missed. This is because
tcp_wstamp_ns was computed as now + 10ms using r1, even though
sk_pacing_rate changed to r2 (immediately after the tcp_wstamp_ns
computation) and the next pkt could have been sent as early as 0.1ms later.

What matters is the ratio of the new and old rates, not the pkt sizes or
other network params. Typical CCAs perhaps change the rate by only ~2x, so
only 1 pkt (= r2/r1 - 1 = 2 - 1) worth of sending opportunity is lost. This
is why, I guess, the issue has not been observed in practice.
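
To make that arithmetic concrete, here is a tiny standalone userspace
computation (illustration only, not kernel code; the numbers are the ones
from above):

#include <stdio.h>

/* Toy illustration: the next-send time was fixed using the old rate r1,
 * so over that gap the new rate r2 could have sent r2/r1 pkts, of which
 * only one is actually sent.
 */
int main(void)
{
	double r1 = 0.1;  /* old rate, pkts per ms (~1.2 Mbps at ~1500B MTU) */
	double r2 = 10.0; /* new rate, pkts per ms (~120 Mbps) */

	double gap_ms = 1.0 / r1;        /* next send scheduled 10ms out */
	double could_send = gap_ms * r2; /* pkts r2 could send in that gap */
	double missed = could_send - 1;  /* one pkt is actually sent */

	printf("gap = %.1f ms, missed = %.0f pkts (= r2/r1 - 1)\n",
	       gap_ms, missed);
	return 0;
}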

Yeah, I did see there is an option to specify "skb_mstamp_ns", which might
allow CCAs to enforce rates better. I don't know how easy or difficult it
would be for CCAs to set skb_mstamp_ns, because CCAs may not look at
individual skbuffs, and tcp_congestion_ops only has callbacks on ACK
events, not on pkt send events. I guess BPF is to be used?
(https://netdevconf.info//0x14/pub/papers/55/0x14-paper55-talk-paper.pdf)
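
Just to sketch what I mean by enforcing the rate per skb under the EDT
model (this is a hypothetical helper, not an existing kernel function; I
only assume div64_u64() and NSEC_PER_SEC):

/* Hypothetical: if something (the CCA, or a BPF program) could pick an
 * earliest departure time per skb, it would just be the previous departure
 * plus the skb's serialization time at the current rate; fq then releases
 * the skb no earlier than that timestamp.
 */
static u64 next_departure_ns(u64 prev_departure_ns, u32 skb_len,
			     u64 rate_bytes_per_sec)
{
	if (!rate_bytes_per_sec)
		return prev_departure_ns;

	return prev_departure_ns +
	       div64_u64((u64)skb_len * NSEC_PER_SEC, rate_bytes_per_sec);
}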

Also, to clarify: is the reason for the conscious choice that the fix
would require more state in the TCP socket? Or are there more reasons, any
pointers? I imagine that, for the fix, the state would increase by ~2-3
u64 values (e.g., credits in units of bytes, the time the credits were
last updated, and the time the credits expire). Is this too much? Or would
the fix require more state than this?
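
To be concrete about the state I mean, here is a rough kernel-style sketch
(not a patch, not tested; struct/field/function names are made up). It is
the deduct-on-send variant of the pseudocode quoted below, which avoids
keeping a cumulative sent_till_now counter:

/* Roughly the ~3 extra u64 values per socket. */
struct pacing_credit {
	u64 credit_bytes;    /* accumulated sending credit, in bytes */
	u64 last_update_ns;  /* when the credit was last integrated */
	u64 expiry_ns;       /* after this, unused credit is dropped;
			      * refreshed by the CCA when it sets a rate */
};

/* Integrate the rate that applied over [last_update_ns, now_ns).
 * Called whenever sk_pacing_rate changes and before the pacing check.
 * Real code would use the math64 helpers to avoid the overflow in the
 * multiply below.
 */
static void pacing_credit_update(struct pacing_credit *pc, u64 now_ns,
				 u64 prev_rate_bytes_per_sec)
{
	u64 delta_ns = now_ns - pc->last_update_ns;

	if (now_ns > pc->expiry_ns)
		pc->credit_bytes = 0; /* don't carry over stale credit */
	else
		pc->credit_bytes += prev_rate_bytes_per_sec * delta_ns /
				    NSEC_PER_SEC;

	pc->last_update_ns = now_ns;
}

/* Pacing check: the skb may be sent if it fits in the accumulated credit. */
static bool pacing_credit_allows(const struct pacing_credit *pc, u32 skb_len)
{
	return skb_len <= pc->credit_bytes;
}

/* On actual transmission, consume the credit used by the skb. */
static void pacing_credit_consume(struct pacing_credit *pc, u32 skb_len)
{
	pc->credit_bytes -= skb_len;
}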


On Tue, Nov 14, 2023 at 2:09 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Tue, Nov 14, 2023 at 7:59 PM Anup Agarwal <anupa@...rew.cmu.edu> wrote:
> >
> > Hi Eric,
> >
> > Sorry for the duplicate email. I am sending again as I got an
> > automated response from netdev@...r.kernel.org that my email was
> > rejected because of using HTML formatting. I have resent a plain text
> > version. Hopefully this email goes through.
> >
> > I saw that you are the maintainer for the linux networking TCP stack
> > (https://www.kernel.org/doc/linux/MAINTAINERS), and wanted to report a
> > potential bug. Please let me know if I should be reaching out to
> > someone else or using some other method to report.
> >
> > Based on my understanding there is a bug in the kernel's pacing
> > implementation. It does not faithfully follow the pacing rate set by
> > the congestion control algorithm (CCA).
> >
> > Description:
> > For enforcing pacing, I think the kernel computes "tcp_wstamp_ns", i.e.,
> > the time to deliver the next packet. This computation is only done
> > after transmission of packets, in "tcp_update_skb_after_send" in
> > "net/ipv4/tcp_output.c". However, the rate, i.e., "sk_pacing_rate", can
> > be updated when packets are received (e.g., when the CCA gets a
> > "rate_sample" for an ACK). As a result, if the rate is changed
> > frequently by the CCA, the kernel uses a stale pacing rate.
> >
> > Example:
> > For a concrete example, say the pacing rate is 1 pkt per second at
> > t=0, and a packet was just transmitted at t=0, so tcp_wstamp_ns
> > is then set to t=1 sec. Now say an ACK arrived at t=1us and caused
> > the CCA to update the rate to 100 pkts per second. The next packet
> > could then be sent at 1us + 0.01s. But since tcp_wstamp_ns is already
> > set to 1 sec, roughly 100 pkts worth of transmission opportunity is lost.
> >
> > Thoughts:
> > I guess the goal of the kernel pacing is to enforce an upper bound on
> > transmission rate (or lower bound on inter-send time), rather than
> > follow the "sk_pacing_rate" as a transmission rate directly. In that
> > sense it is not a bug, i.e., the time between sends is never shorter
> > than the inverse of sk_pacing_rate. But if sk_pacing_rate is changed
> > frequently and by large enough magnitudes, the time between sends can be
> > much longer than the inverse of the pacing rate. Because it does not
> > incorporate all updates to "sk_pacing_rate", the kernel is very
> > conservative and misses many send opportunities.
> >
> > Why this matters:
> > I was implementing a rate-based CCA that does not use cwnd at all.
> > There were cases when I had to restrict inflight and would temporarily
> > set sk_pacing_rate close to zero. When I later reset sk_pacing_rate, the
> > kernel did not start using the new rate for a long time, as it had cached
> > the time of the next send using the "close to zero" rate. Rate-based CCAs
> > are more robust to jitter in the network. To me it seems useful to
> > actually use the pacing rate as the transmission rate instead of just an
> > upper bound on it. Fundamentally, by setting a rate, a CCA can implement
> > any tx behavior, whereas cwnd limits the possible behaviors.
> > Even if folks disagree with this and want to interpret pacing rate as
> > an upper bound on tx rate rather than tx rate directly, I think the
> > enforcement can still be modified to avoid this bug and follow
> > sk_pacing_rate more closely.
> >
> > Potential fix:
> > // Update the credit whenever (1) sk_pacing_rate is changed, and (2)
> > // before checking whether transmission is allowed by pacing.
> > credit_in_bytes += last_sk_pacing_rate * (now - last_credit_update)
> > last_credit_update = now
> > last_sk_pacing_rate = sk_pacing_rate
> > // The idea is that last_sk_pacing_rate was set by the CCA for the
> > // time interval [last_credit_update, now), and we integrate (sum up)
> > // this rate over the interval to compute the credits.
> > // I think this is also robust to OS jitter, as credits accrue even
> > // for intervals missed due to scheduling delays.
> >
> > // To check whether it is OK to send a pkt under pacing, one can just check:
> > if (sent_till_now + pkt_size <= credit_in_bytes)
> >
> > Please let me know if you have additional thoughts/feedback.
>
> This was a conscious choice really. More state in the TCP socket (or in
> FQ) means higher costs.
>
> For conventional CCs, the gap between pacing rates P1 and P2, together
> with the usual packet sizes (typically 1ms of airtime), makes the
> difference pretty small in practice.
>
> Also, pacing is offloaded (either in FQ qdisc, or in timer wheel on
> some NIC hardware).
>
> With the EDT model, you can probably implement whatever scheme you prefer
> in your CC module, storing extra state in the CC private state.
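
For what it's worth, storing that bookkeeping in the CC private state, as
suggested above, might look roughly like this (sketch only, not tested;
only inet_csk_ca(), tcp_clock_ns() and ICSK_CA_PRIV_SIZE are existing
kernel names, everything else is made up):

struct my_ca {
	u64 credit_bytes;       /* as in the sketch earlier in this mail */
	u64 last_update_ns;
	u64 rate_bytes_per_sec; /* the rate the module actually wants */
	/* ... the module's usual CC state; the whole struct has to fit
	 * within ICSK_CA_PRIV_SIZE ...
	 */
};

static void my_ca_init(struct sock *sk)
{
	struct my_ca *ca = inet_csk_ca(sk);

	memset(ca, 0, sizeof(*ca));
	ca->last_update_ns = tcp_clock_ns();
}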
