Message-ID: <CANn89iLO_xpjAacnMB2H4ozPHnNfGO9_OhB87A_3mgQEYP+81A@mail.gmail.com>
Date: Fri, 19 Apr 2024 17:21:26 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Jonathan Heathcote <jonathan.heathcote@....co.uk>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, 
	"regressions@...ts.linux.dev" <regressions@...ts.linux.dev>
Subject: Re: [REGRESSION] sk_memory_allocated counter leaking on aarch64

On Fri, Apr 19, 2024 at 4:46 PM Jonathan Heathcote
<jonathan.heathcote@....co.uk> wrote:
>
> Since Linux v6.0.0-rc1 (up to and including v6.8.7), there appears to be
> a leak in the counter used to monitor TCP memory consumption, which
> leads to spurious memory pressure and, eventually, unrecoverable OOM
> behaviour on aarch64.
>
> I am running an nginx web server on aarch64 serving a media-CDN-style
> workload at ~350 GBit/s over ~100k HTTPS sessions. Over the course of a
> few hours, the memory reported as consumed by TCP in /proc/net/sockstat
> grows steadily until it eventually hits the hard limit configured in
> /proc/sys/net/ipv4/tcp_mem (see plot [0] -- the slight knee at about
> 18:25 coincides with the memory pressure threshold being reached).
>
> [0] https://www.dropbox.com/scl/fi/xsh8a2of9pluj5hspc41p/oom.png?rlkey=7dzfx36z5tnkf5wlqulzqufdl&st=yk887z0e&dl=1
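> 
> For reference, the TCP figure in these plots is the "mem" field of the
> TCP line in /proc/net/sockstat (counted in pages), compared against the
> third, hard-limit value of /proc/sys/net/ipv4/tcp_mem. A minimal reader
> in that spirit (a sketch only, not the exact tooling behind the plots)
> converts both to bytes:
> 
> /* sockstat_mem.c: TCP memory charged vs. the tcp_mem hard limit, in bytes. */
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> 
> int main(void)
> {
>         long page = sysconf(_SC_PAGESIZE);
>         long mem_pages = -1, lo, pressure, hi = -1;
>         char line[256];
>         FILE *f;
> 
>         f = fopen("/proc/net/sockstat", "r");
>         if (!f)
>                 return 1;
>         while (fgets(line, sizeof(line), f)) {
>                 /* e.g. "TCP: inuse 5 orphan 0 tw 2 alloc 7 mem 3" -- mem is in pages */
>                 char *p = strstr(line, "mem ");
>                 if (!strncmp(line, "TCP:", 4) && p)
>                         sscanf(p, "mem %ld", &mem_pages);
>         }
>         fclose(f);
> 
>         f = fopen("/proc/sys/net/ipv4/tcp_mem", "r");
>         if (!f)
>                 return 1;
>         if (fscanf(f, "%ld %ld %ld", &lo, &pressure, &hi) != 3)
>                 return 1;
>         fclose(f);
> 
>         printf("TCP memory: %ld bytes (hard limit %ld bytes)\n",
>                mem_pages * page, hi * page);
>         return 0;
> }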
>
> If the load is removed (all connections cleanly closed and nginx shut
> down), the reported memory consumption does not decrease. Plot [1] shows
> a test where all connections are closed and nginx is terminated around
> 10:22 without memory returning to the levels seen before the test. A
> reboot appears necessary to bring the counter back to zero.
>
> [1] https://www.dropbox.com/scl/fi/36ainpx7mbwe5q3o2zfry/nrz.png?rlkey=01a2bw2lyj9dih9fwws81tchi&st=83aqxzwj&dl=1
>
> (NB: All plots show the reported memory in bytes rather than pages.
> Initial peaks coincide with the initial opening of tens of thousands
> of connections.)
>
> Prior to Linux v6.0.0-rc1, this issue does not occur. Plot [2] shows a
> similar test running on v5.19.17. No unbounded growth in memory
> consumption is observed and usage drops back to zero when all
> connections are closed at 15:10.
>
> [2] https://www.dropbox.com/scl/fi/dz2nqs8p6ogl7yqwn8cmw/expected.png?rlkey=co77565mr4tq4pvvimtju1xnx&st=zu9j2id7&dl=1
>
> After some investigation, I noticed that the memory reported as consumed
> did not match actual system memory usage. Following the implementation
> of /proc/net/sockstat down to the underlying counter,
> sk_memory_allocated, I put together a crude bpftrace script [3] to
> monitor the places where this counter is updated in the TCP
> implementation and maintain my own count. The bpftrace-based counts can
> be seen to diverge from the value reported by /proc/net/sockstat in
> plot [4], suggesting that the 'leak' might be an intermittent failure to
> update the counter.
>
> [3] https://www.dropbox.com/scl/fi/17cgytnte3odh3ovo9psw/counts.bt?rlkey=ry90zdyk0qwrhdf4xnzhkfevq&st=bj9jmovt&dl=1
> [4] https://www.dropbox.com/scl/fi/ynlvbooqvz9e38emsd9n7/bpftrace.png?rlkey=dae6s68lekct1605z9vq7h7an&st=ykmeb4du&dl=1
>
> After a bit of manual looking around, I've come to suspect that commit
> 3cd3399 (which introduces the use of per-CPU counters with intermittent
> updating of the global counter) might be at least partly responsible for
> this regression. In any case, manually reverting this change in 6.x
> kernels appears to fix the issue.
>
> Unfortunately, whilst I have binary-searched the kernel releases to find
> the regressing release, I have not had the time to bisect the commits
> between 5.19 and 6.0. As such, I cannot confirm that the commit above is
> definitively the source of the regression, only that undoing it appears
> to fix it. My apologies if this proves a red herring!
>
> For completeness, a more thorough description of the system under test
> is given below:
>
> * CPU: Ampere Altra Max M128-30 (128 64-bit ARM cores)
> * Distribution: Rocky Linux 9
> * Linux kernel: (compiled from kernel.org sources)
>   * Exhibits bug:
>     * 6.8.7 (latest release at time of writing)
>     * ... and a few others tested in between ...
>     * 6.0.0-rc1 (first release containing bug)
>   * Does not exhibit bug:
>     * 5.19.17 (latest version not to exhibit bug)
>     * ... and a few others back to 5.14.0 ...
> * Linux kernel config: the Rocky Linux 9 config, configured to use
>   64 KiB pages. Specifically, I'm using the config from the kernel-64k
>   package version 5.14.0-284.30.1.el9_2.aarch64+64k, updated as
>   necessary for building newer kernels using `make olddefconfig`. The
>   resulting configuration used for v6.8.7 can be found here: [5].
> * Workload: nginx 1.20.1 serving an in-RAM dataset to ~100k synthetic
>   HTTPS clients at ~350 GBit/s. (Non-hardware-accelerated) kTLS is used.
>
> [5] https://www.dropbox.com/scl/fi/x0t2jufmnlcul9vbvn48p/config-6.8.7?rlkey=hwu0al2p6k7f92o1ks40deci9&st=9ol3cc45&dl=1
>
> I have also spotted an Ubuntu/AWS bug report [6] in which another person
> seems to be running into (what might be) this bug in a different
> environment and distribution. The symptoms there are very similar:
> aarch64, a high-connection-count server workload, memory not reclaimed
> when connections close, fixed by migrating from a 6.x kernel back to a
> 5.x kernel. I'm mentioning it here in case that report adds any useful
> information.
>
> [6] https://bugs.launchpad.net/ubuntu/+source/linux-signed-aws-6.2/+bug/2045560
>
> Thanks very much for your help!
>
> Jonathan Heathcote
>
> #regzbot introduced: v5.19.17..v6.0.0-rc1

Ouch.

Hi Jonathan, thanks a lot for this detailed and awesome report!

Could you try the following patch?

Thanks again!

diff --git a/include/net/sock.h b/include/net/sock.h
index f57bfd8a2ad2deaedf3f351325ab9336ae040504..1811c51d180d2de8bf8fc45e06caa073b429f524 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1416,9 +1416,9 @@ sk_memory_allocated_add(struct sock *sk, int amt)
        int local_reserve;

        preempt_disable();
-       local_reserve = __this_cpu_add_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
+       local_reserve = this_cpu_add_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
        if (local_reserve >= READ_ONCE(sysctl_mem_pcpu_rsv)) {
-               __this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
+               local_reserve = this_cpu_xchg(*sk->sk_prot->per_cpu_fw_alloc, 0);
                atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
        }
        preempt_enable();
@@ -1430,9 +1430,9 @@ sk_memory_allocated_sub(struct sock *sk, int amt)
        int local_reserve;

        preempt_disable();
-       local_reserve = __this_cpu_sub_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
+       local_reserve = this_cpu_sub_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
        if (local_reserve <= -READ_ONCE(sysctl_mem_pcpu_rsv)) {
-               __this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
+               local_reserve = this_cpu_xchg(*sk->sk_prot->per_cpu_fw_alloc, 0);
                atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
        }
        preempt_enable();
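
For background on why this should matter: the __this_cpu_*() ops give no
protection against interrupts. On x86 they compile to a single
instruction, but on arm64 they fall back to a plain load/modify/store
sequence, so a softirq updating the same per-cpu forward-alloc counter
in the middle of that sequence has its contribution silently
overwritten, and the global sk_memory_allocated estimate drifts. The
this_cpu_*() variants used above are IRQ-safe, and the xchg also closes
the window between reading local_reserve and zeroing the per-cpu
counter, where an interrupt could otherwise slip in an update that the
subsequent subtraction would wipe out. A rough userspace analogy of the
lost-update pattern (purely illustrative and hypothetical, not the
kernel code -- a periodic SIGALRM stands in for the softirq):

/*
 * lost_update.c -- hypothetical userspace analogy, NOT the kernel code.
 *
 * main() performs a non-atomic read-modify-write on `counter` (like
 * __this_cpu_add()) while a SIGALRM handler, standing in for a softirq,
 * also adds to it.  Any signal landing between main()'s load and store
 * gets its addition overwritten, so the final value typically ends up
 * lower than the sum of everything that was added.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define ITERS    200000000L
#define TICK_ADD 100L

static volatile long counter;
static volatile long ticks;

static void on_alarm(int sig)
{
        (void)sig;
        counter += TICK_ADD;    /* RMW from "interrupt" context */
        ticks++;
}

int main(void)
{
        struct sigaction sa;
        struct itimerval it;
        long i, expected;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_alarm;
        sigaction(SIGALRM, &sa, NULL);

        memset(&it, 0, sizeof(it));
        it.it_interval.tv_usec = 1000;  /* fire every 1 ms */
        it.it_value.tv_usec = 1000;
        setitimer(ITIMER_REAL, &it, NULL);

        for (i = 0; i < ITERS; i++)
                counter = counter + 1;  /* non-atomic RMW in "process" context */

        memset(&it, 0, sizeof(it));     /* stop the timer before reporting */
        setitimer(ITIMER_REAL, &it, NULL);

        expected = ITERS + ticks * TICK_ADD;
        printf("expected %ld, observed %ld, lost %ld\n",
               expected, counter, expected - counter);
        return 0;
}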
