Message-ID: <20230509171910.yka3hucbwfnnq5fq@google.com>
Date: Tue, 9 May 2023 17:19:10 +0000
From: Shakeel Butt <shakeelb@...gle.com>
To: Cathy Zhang <cathy.zhang@...el.com>
Cc: edumazet@...gle.com, davem@...emloft.net, kuba@...nel.org, 
	pabeni@...hat.com, jesse.brandeburg@...el.com, suresh.srinivas@...el.com, 
	tim.c.chen@...el.com, lizhen.you@...el.com, eric.dumazet@...il.com, 
	netdev@...r.kernel.org, linux-mm@...ck.org, cgroups@...r.kernel.org
Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as a proper size

On Sun, May 07, 2023 at 07:08:00PM -0700, Cathy Zhang wrote:
> Before commit 4890b686f408 ("net: keep sk->sk_forward_alloc as small as
> possible"), each TCP socket could forward-allocate up to 2 MB of memory,
> and tcp_memory_allocated might hit the tcp memory limit quite soon.

It is not just the system-level tcp memory limit: we have actually seen
unneeded and unexpected memcg OOMs in production, and commit 4890b686f408
fixes those OOMs as well.

> To
> reduce the memory pressure, that commit keeps sk->sk_forward_alloc as
> small as possible, which will be less than one page if SO_RESERVE_MEM
> is not specified.
> 
> However, with commit 4890b686f408 ("net: keep sk->sk_forward_alloc as
> small as possible"), memcg charge hot paths are observed while the
> system is stressed with a large number of connections. That is because
> sk->sk_forward_alloc is too small, often less than skb->truesize, so
> network handlers like tcp_rcv_established() must take the slow path
> more frequently to increase sk->sk_forward_alloc. Each such memory
> allocation triggers a memcg charge, and perf top shows the following
> contention paths on the busy system.
> 
>     16.77%  [kernel]            [k] page_counter_try_charge
>     16.56%  [kernel]            [k] page_counter_cancel
>     15.65%  [kernel]            [k] try_charge_memcg
> 
> In order to avoid the memcg overhead and performance penalty,

IMO this is not the right place to fix the memcg performance overhead,
specifically because it will re-introduce the memcg OOM issue. Please
fix the memcg overhead in the memcg code.

Please share a detailed profile of the memcg code. I can help in
brainstorming and reviewing the fix.
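
To make the contention concrete, here is a simplified user-space model of
the behavior the quoted profile points at (hypothetical names and page
math, a sketch of the described policy, not the kernel's actual code):
when an incoming skb's truesize exceeds the socket's reserve, whole pages
must be charged before the data is accepted.

```c
#include <assert.h>

/* Toy model of per-socket forward allocation. All names are
 * illustrative; the real kernel logic lives in sk_mem_charge() and
 * friends and is considerably more involved. */
#define PAGE_SZ 4096L

struct sock_model {
	long sk_forward_alloc; /* bytes already charged for this socket */
	long charge_calls;     /* number of slow-path charge operations */
};

static void rcv_skb(struct sock_model *sk, long truesize, long reserve)
{
	if (sk->sk_forward_alloc < truesize) {
		long need = truesize - sk->sk_forward_alloc;

		if (need < reserve)	/* charge a batch up front */
			need = reserve;
		/* round up to whole pages and charge them (this is
		 * the contended memcg path in the quoted profile) */
		sk->sk_forward_alloc += (need + PAGE_SZ - 1) / PAGE_SZ * PAGE_SZ;
		sk->charge_calls++;
	}
	sk->sk_forward_alloc -= truesize;
}
```

In this model, with a zero reserve (roughly the post-4890b686f408
behavior) 100 back-to-back 4 KB skbs trigger 100 charge operations,
while a 64 KB reserve cuts that to 7.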

> sk->sk_forward_alloc should be kept at a proper size instead of as
> small as possible. Keep up to 64KB of memory back from reclaim when
> uncharging sk_buff memory, which is close to the maximum size of an
> sk_buff. This helps reduce the frequency of memory allocation during a
> TCP connection. The original per-socket reclaim threshold for reserved
> memory was 2MB, so the extraneous memory reserved now is about 32 times
> smaller than before commit 4890b686f408 ("net: keep sk->sk_forward_alloc
> as small as possible").
> 
> Run memcached with memtier_benchmark to verify the optimization fix. 8
> server-client pairs are created with bridge network on localhost; the
> server and client of the same pair share 28 logical CPUs.
> 
> Results (average of 5 runs)
> RPS (with/without patch)	+2.07x
> 

Do you have regression data from any production workload? Please keep in
mind that we (the MM subsystem) often accept microbenchmark regressions
rather than complicated optimizations. So, if there is a real production
regression, please be very explicit about it.
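
For reference, the uncharge-side policy the quoted patch describes can be
sketched as follows (again with hypothetical names; this models the
described thresholds, not the actual patch):

```c
#include <assert.h>

/* Model of the reclaim-threshold change: on uncharge, keep up to 64 KB
 * of forward-alloc for the socket instead of the pre-4890b686f408 2 MB,
 * and release the rest back to the memcg/global counters. */
#define SK_KEEP_OLD (2L << 20)	/* 2 MB, before 4890b686f408 */
#define SK_KEEP_NEW (64L << 10)	/* 64 KB, proposed in this patch */

/* Returns the number of bytes released back to the counters. */
static long sk_uncharge_release(long fwd_alloc, long keep)
{
	return fwd_alloc > keep ? fwd_alloc - keep : 0;
}
```

The quoted "about 32 times smaller" is just SK_KEEP_OLD / SK_KEEP_NEW =
2 MB / 64 KB = 32.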
