Message-ID: <VI1PR01MB42407D7947B2EA448F1E04EFD10D2@VI1PR01MB4240.eurprd01.prod.exchangelabs.com>
Date: Fri, 19 Apr 2024 14:46:01 +0000
From: Jonathan Heathcote <jonathan.heathcote@....co.uk>
To: "edumazet@...gle.com" <edumazet@...gle.com>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "regressions@...ts.linux.dev" <regressions@...ts.linux.dev>
Subject: [REGRESSION] sk_memory_allocated counter leaking on aarch64

Since Linux 6.0.0-rc1 (and still present in v6.8.7), there appears to be
a leak in the counter used to monitor TCP memory consumption, which
leads to spurious memory pressure and, eventually, unrecoverable OOM
behaviour on aarch64.

I am running an nginx web server on aarch64 serving a media-CDN-style
workload at ~350 GBit/s over ~100k HTTPS sessions. Over the course
of a few hours, the memory reported as consumed by TCP in
/proc/net/sockstat grows steadily until it eventually hits the hard
limit configured in /proc/sys/net/ipv4/tcp_mem (see plot [0] -- the
slight knee at about 18:25 coincides with the memory pressure threshold
being reached).

[0] https://www.dropbox.com/scl/fi/xsh8a2of9pluj5hspc41p/oom.png?rlkey=7dzfx36z5tnkf5wlqulzqufdl&st=yk887z0e&dl=1

If the load is removed (all connections cleanly closed and nginx shut
down), the reported memory consumption does not fall. Plot [1] shows a
test in which all connections were closed and nginx terminated at around
10:22 without the reported memory returning to the levels seen before
the test. A reboot appears necessary to bring the counter back to zero.

[1] https://www.dropbox.com/scl/fi/36ainpx7mbwe5q3o2zfry/nrz.png?rlkey=01a2bw2lyj9dih9fwws81tchi&st=83aqxzwj&dl=1

(NB: All plots show the reported memory in bytes rather than pages. The
initial peaks coincide with the opening of tens of thousands of
connections.)
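
For reference, both the "mem" field on the TCP line of /proc/net/sockstat
and the tcp_mem thresholds are kept in pages, so the byte values in the
plots are simply pages multiplied by the page size (64 KiB with this
config). Something along the following lines is enough to sample them
(standard procfs paths; a rough sketch rather than the exact tooling
behind the plots):

/* sockstat_mem.c -- print the TCP memory counter from /proc/net/sockstat
 * in bytes, alongside the tcp_mem pressure and hard-limit thresholds.
 * Both are maintained in pages, so they are scaled by the running
 * kernel's page size.  Rough sketch only; error handling kept minimal. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long tcp_mem_pages(void)
{
    /* The "TCP:" line ends in "mem N", where N is the page count
     * behind the plots referenced above. */
    char line[256];
    long pages = -1;
    FILE *f = fopen("/proc/net/sockstat", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        char *m;

        if (strncmp(line, "TCP:", 4) == 0 && (m = strstr(line, "mem ")))
            sscanf(m, "mem %ld", &pages);
    }
    fclose(f);
    return pages;
}

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    long pages = tcp_mem_pages();
    long low, pressure, high;
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_mem", "r");

    if (pages < 0 || !f ||
        fscanf(f, "%ld %ld %ld", &low, &pressure, &high) != 3) {
        fprintf(stderr, "failed to read counters\n");
        return 1;
    }
    fclose(f);

    printf("TCP memory:  %ld pages = %ld bytes\n", pages, pages * page_size);
    printf("tcp_mem:     pressure %ld bytes, hard limit %ld bytes\n",
           pressure * page_size, high * page_size);
    return 0;
}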

Prior to Linux v6.0.0-rc1, this issue does not occur. Plot [2] shows a
similar test running on v5.19.17. No unbounded growth in memory
consumption is observed and usage drops back to zero when all
connections are closed at 15:10.

[2] https://www.dropbox.com/scl/fi/dz2nqs8p6ogl7yqwn8cmw/expected.png?rlkey=co77565mr4tq4pvvimtju1xnx&st=zu9j2id7&dl=1

After some investigation, I noticed that the memory reported as consumed
did not match actual system memory usage. Following the implementation of
/proc/net/sockstat down to the underlying counter, sk_memory_allocated, I
put together a crude bpftrace script [3] to monitor the places where this
counter is updated in the TCP implementation and maintain an independent
count. The bpftrace-based counts can be seen to diverge from the value
reported by /proc/net/sockstat in plot [4], suggesting that the 'leak'
might be an intermittent failure to update the counter.

[3] https://www.dropbox.com/scl/fi/17cgytnte3odh3ovo9psw/counts.bt?rlkey=ry90zdyk0qwrhdf4xnzhkfevq&st=bj9jmovt&dl=1
[4] https://www.dropbox.com/scl/fi/ynlvbooqvz9e38emsd9n7/bpftrace.png?rlkey=dae6s68lekct1605z9vq7h7an&st=ykmeb4du&dl=1

After a bit of manual looking around, I have come to suspect that commit
3cd3399 (which introduces the use of per-CPU counters with intermittent
updating of the system counter) might be at least partly responsible for
this regression. Manually reverting this change in 6.x kernels appears to
fix the issue in any case.
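
To make the suspicion concrete: as I understand that commit, each CPU
accumulates small charge/uncharge amounts locally and only folds them
into the shared counter once they cross a reserve threshold. The toy
model below (single-threaded, with made-up names and threshold -- it is
not the kernel code) shows how a reader of the shared counter alone can
see a value that differs from the true total by the un-folded per-CPU
residues. In this model the difference stays bounded, so unbounded
growth like that in the plots would need updates that intermittently
never reach the shared counter at all.

/* pcpu_batch.c -- toy, single-threaded model of per-CPU batching with
 * intermittent updates of a shared counter.  Names, the threshold and
 * the structure are made up for illustration; this is not the kernel
 * implementation. */
#include <stdio.h>

#define NCPUS        2
#define PCPU_RESERVE 64   /* hypothetical flush threshold, in pages */

static long global_allocated;    /* analogue of the counter sockstat reports */
static long pcpu_delta[NCPUS];   /* per-CPU pages not yet folded into it */

static void mem_charge(int cpu, long pages)
{
    pcpu_delta[cpu] += pages;

    /* Only touch the shared counter once the local delta is large
     * enough in either direction. */
    if (pcpu_delta[cpu] >= PCPU_RESERVE || pcpu_delta[cpu] <= -PCPU_RESERVE) {
        global_allocated += pcpu_delta[cpu];
        pcpu_delta[cpu] = 0;
    }
}

int main(void)
{
    long true_total = 0, residue = 0;

    /* 100 one-page charges on each of two CPUs. */
    for (int cpu = 0; cpu < NCPUS; cpu++)
        for (int i = 0; i < 100; i++) {
            mem_charge(cpu, 1);
            true_total += 1;
        }

    for (int cpu = 0; cpu < NCPUS; cpu++)
        residue += pcpu_delta[cpu];

    /* The shared counter reads 128 pages here against a true total of
     * 200; the other 72 pages sit in the per-CPU residues, bounded by
     * NCPUS * PCPU_RESERVE.  A counter that instead grows without bound
     * would require folds that are intermittently lost altogether. */
    printf("true total           : %ld pages\n", true_total);
    printf("shared counter       : %ld pages\n", global_allocated);
    printf("unfolded per-CPU sum : %ld pages\n", residue);
    return 0;
}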

Unfortunately, whilst I have binary-searched the kernel releases to find
the regressing release, I have not had the time to bisect between 5.19
and 6.0. As such, I cannot confirm that the commit above was definitively
the source of the regression, only that undoing it appears to fix it! My
apologies if this proves a red herring!

For completeness, a more thorough description of the system under test
is given below:

* CPU: Ampere Altra Max M128-30 (128 64-bit ARM cores)
* Distribution: Rocky Linux 9
* Linux kernel: (compiled from kernel.org sources)
  * Exhibits bug:
    * 6.8.7 (latest release at time of writing)
    * ... and a few others tested in between ...
    * 6.0.0-rc1 (first release containing bug)
  * Does not exhibit bug:
    * 5.19.17 (latest version not to exhibit bug)
    * ... and a few others back to 5.14.0 ...
* Linux kernel config consists of the config from Rocky Linux 9
  configured to use 64 KiB pages. Specifically, I'm using the config from
  the kernel-64k package version 5.14.0-284.30.1.el9_2.aarch64+64k,
  updated as necessary for building newer kernels using `make
  olddefconfig`. The resulting configuration used for v6.8.7 can be
  found here: [5].
* Workload: nginx 1.20.1 serving an in-RAM dataset to ~100k synthetic
  HTTPS clients at ~350 GBit/s. kTLS (without hardware offload) is used.

[5] https://www.dropbox.com/scl/fi/x0t2jufmnlcul9vbvn48p/config-6.8.7?rlkey=hwu0al2p6k7f92o1ks40deci9&st=9ol3cc45&dl=1

I have also spotted an Ubuntu/AWS bug report [6] in which another person
seems to be running into (what might be) this bug in a different
environment and distribution. The symptoms there are very similar:
aarch64, a high-connection-count server workload, memory not reclaimed
when connections close, and fixed by migrating from a 6.x kernel back to
a 5.x kernel. I'm mentioning it here in case that report adds any useful
information.

[6] https://bugs.launchpad.net/ubuntu/+source/linux-signed-aws-6.2/+bug/2045560

Thanks very much for your help!

Jonathan Heathcote

#regzbot introduced: v5.19.17..v6.0.0-rc1
