Message-ID: <CADVnQy=Z697P_gtkXMgPiASS6YwJ4PLDkqei3NvGJ5csKE8nhw@mail.gmail.com>
Date: Mon, 26 Aug 2024 09:25:46 -0400
From: Neal Cardwell <ncardwell@...gle.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org, eric.dumazet@...il.com, 
	Mingrui Zhang <mrzhang97@...il.com>, Lisong Xu <xu@....edu>, Yuchung Cheng <ycheng@...gle.com>
Subject: Re: [PATCH net] tcp_cubic: switch ca->last_time to usec resolution

On Mon, Aug 26, 2024 at 5:27 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> bictcp_update() uses ca->last_time as a timestamp
> to decide on several heuristics.
>
> Historically this timestamp has been fed with jiffies,
> which has too coarse a resolution; some distros are
> still using CONFIG_HZ_250=y.
>
> It is time to switch to usec resolution, now that the
> TCP stack already caches the high-resolution time in
> tp->tcp_mstamp.
>
> Also remove the 'inline' qualifier; this helper is used
> once and compilers are smart.
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> Link: https://lore.kernel.org/netdev/20240817163400.2616134-1-mrzhang97@gmail.com/T/#mb6a64c9e2309eb98eaeeeb4b085c4a2270b6789d
> Cc: Mingrui Zhang <mrzhang97@...il.com>
> Cc: Lisong Xu <xu@....edu>
> ---
>  net/ipv4/tcp_cubic.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> index 5dbed91c6178257df8d2ccd1c8690a10bdbaf56a..3b1845103ee1866a316926a130c212e6f5e78ef0 100644
> --- a/net/ipv4/tcp_cubic.c
> +++ b/net/ipv4/tcp_cubic.c
> @@ -87,7 +87,7 @@ struct bictcp {
>         u32     cnt;            /* increase cwnd by 1 after ACKs */
>         u32     last_max_cwnd;  /* last maximum snd_cwnd */
>         u32     last_cwnd;      /* the last snd_cwnd */
> -       u32     last_time;      /* time when updated last_cwnd */
> +       u32     last_time;      /* time when updated last_cwnd (usec) */
>         u32     bic_origin_point;/* origin point of bic function */
>         u32     bic_K;          /* time to origin point
>                                    from the beginning of the current epoch */
> @@ -211,26 +211,28 @@ static u32 cubic_root(u64 a)
>  /*
>   * Compute congestion window to use.
>   */
> -static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
> +static void bictcp_update(struct sock *sk, u32 cwnd, u32 acked)
>  {
> +       const struct tcp_sock *tp = tcp_sk(sk);
> +       struct bictcp *ca = inet_csk_ca(sk);
>         u32 delta, bic_target, max_cnt;
>         u64 offs, t;
>
>         ca->ack_cnt += acked;   /* count the number of ACKed packets */
>
> -       if (ca->last_cwnd == cwnd &&
> -           (s32)(tcp_jiffies32 - ca->last_time) <= HZ / 32)
> +       delta = tp->tcp_mstamp - ca->last_time;
> +       if (ca->last_cwnd == cwnd && delta <= USEC_PER_SEC / 32)
>                 return;
>
> -       /* The CUBIC function can update ca->cnt at most once per jiffy.
> +       /* The CUBIC function can update ca->cnt at most once per ms.
>          * On all cwnd reduction events, ca->epoch_start is set to 0,
>          * which will force a recalculation of ca->cnt.
>          */
> -       if (ca->epoch_start && tcp_jiffies32 == ca->last_time)
> +       if (ca->epoch_start && delta < USEC_PER_MSEC)
>                 goto tcp_friendliness;

AFAICT there is a problem here. The patch switches this line of code to
microsecond resolution without also changing the core CUBIC slope
(ca->cnt) calculation to use microseconds. AFAICT that means we would
be re-introducing the bug that was fixed in 2015 in
d6b1a8a92a1417f8859a6937d2e6ffe2dfab4e6d (see below). Basically, if
the CUBIC slope (ca->cnt) calculation uses jiffies, then we should
only run that code once per jiffy, to avoid getting the wrong answer
for the slope (see also the small standalone sketch after the quoted
commit below):

commit d6b1a8a92a1417f8859a6937d2e6ffe2dfab4e6d
Author: Neal Cardwell <ncardwell@...gle.com>
Date:   Wed Jan 28 20:01:39 2015 -0500

    tcp: fix timing issue in CUBIC slope calculation

    This patch fixes a bug in CUBIC that causes cwnd to increase slightly
    too slowly when multiple ACKs arrive in the same jiffy.

    If cwnd is supposed to increase at a rate of more than once per jiffy,
    then CUBIC was sometimes too slow. Because the bic_target is
    calculated for a future point in time, calculated with time in
    jiffies, the cwnd can increase over the course of the jiffy while the
    bic_target calculated as the proper CUBIC cwnd at time
    t=tcp_time_stamp+rtt does not increase, because tcp_time_stamp only
    increases on jiffy tick boundaries.

    So since the cnt is set to:
            ca->cnt = cwnd / (bic_target - cwnd);
    as cwnd increases but bic_target does not increase due to jiffy
    granularity, the cnt becomes too large, causing cwnd to increase
    too slowly.

    For example:
    - suppose at the beginning of a jiffy, cwnd=40, bic_target=44
    - so CUBIC sets:
       ca->cnt =  cwnd / (bic_target - cwnd) = 40 / (44 - 40) = 40/4 = 10
    - suppose we get 10 acks, each for 1 segment, so tcp_cong_avoid_ai()
       increases cwnd to 41
    - so CUBIC sets:
       ca->cnt =  cwnd / (bic_target - cwnd) = 41 / (44 - 41) = 41 / 3 = 13

    So now CUBIC will wait for 13 packets to be ACKed before increasing
    cwnd to 42, instead of 10 as it should.

    The fix is to avoid adjusting the slope (determined by ca->cnt)
    multiple times within a jiffy, and instead skip to compute the Reno
    cwnd, the "TCP friendliness" code path.

    Reported-by: Eyal Perry <eyalpe@...lanox.com>
    Signed-off-by: Neal Cardwell <ncardwell@...gle.com>
    Signed-off-by: Yuchung Cheng <ycheng@...gle.com>
    Signed-off-by: Eric Dumazet <edumazet@...gle.com>
    Signed-off-by: David S. Miller <davem@...emloft.net>

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index ffc045da2fd5..4b276d1ed980 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -213,6 +213,13 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
            (s32)(tcp_time_stamp - ca->last_time) <= HZ / 32)
                return;

+       /* The CUBIC function can update ca->cnt at most once per jiffy.
+        * On all cwnd reduction events, ca->epoch_start is set to 0,
+        * which will force a recalculation of ca->cnt.
+        */
+       if (ca->epoch_start && tcp_time_stamp == ca->last_time)
+               goto tcp_friendliness;
+
        ca->last_cwnd = cwnd;
        ca->last_time = tcp_time_stamp;

@@ -280,6 +287,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
        if (ca->last_max_cwnd == 0 && ca->cnt > 20)
                ca->cnt = 20;   /* increase cwnd 5% per RTT */

+tcp_friendliness:
        /* TCP Friendly */
        if (tcp_friendliness) {
                u32 scale = beta_scale;
---
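
To make that concrete, here is a tiny standalone sketch (plain
userspace C, not kernel code; the numbers are the ones from the
commit message above) of how recomputing ca->cnt against a bic_target
that is frozen at jiffy granularity inflates the slope:

#include <stdio.h>

int main(void)
{
        unsigned int cwnd = 40;
        unsigned int bic_target = 44;   /* frozen: the jiffy clock has not ticked */
        unsigned int cnt;

        /* Start of the jiffy: the slope is correct. */
        cnt = cwnd / (bic_target - cwnd);
        printf("cwnd=%u bic_target=%u -> cnt=%u\n", cwnd, bic_target, cnt);

        /* A few ACKs later, still inside the same jiffy: cwnd has grown
         * but bic_target has not, so recomputing cnt inflates it.
         */
        cwnd = 41;
        cnt = cwnd / (bic_target - cwnd);
        printf("cwnd=%u bic_target=%u -> cnt=%u (should still be 10)\n",
               cwnd, bic_target, cnt);
        return 0;
}

With the new gate allowing bictcp_update() to recompute ca->cnt up to
once per ms, a HZ=250 box can hit this up to 4 times within a single
jiffy while (AFAICT) bic_target is still computed from jiffies.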

neal
