Date: Mon, 29 May 2023 19:58:45 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: Shakeel Butt <shakeelb@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>,
 Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>, Glauber Costa <glommer@...allels.com>,
 netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 3/4] sock: Consider memcg pressure when raising sockmem

Hi Shakeel, thanks for reviewing! And sorry for replying so late,
I was on vacation :)

On 5/25/23 9:22 AM, Shakeel Butt wrote:
> On Mon, May 22, 2023 at 03:01:21PM +0800, Abel Wu wrote:
>> For now __sk_mem_raise_allocated() mainly considers global socket
>> memory pressure and allows raising when no global pressure is
>> observed, even for sockets whose memcgs are under pressure, which
>> might result in longer memcg memstalls.
>>
>> So take net-memcg's pressure into consideration when allocating
>> socket memory to alleviate long tail latencies.
>>
>> Signed-off-by: Abel Wu <wuyun.abel@...edance.com>
> 
> Hi Abel,
> 
> Have you seen any real world production issue which is fixed by this
> patch or is it more of a fix after reading code?

The latter. But we do observe one common case in the production env:
a p2p service, which mainly downloads container images, running
inside a container with a tight memory limit can easily be throttled,
stay memstalled for long periods of time, and sometimes even get
OOM-killed. This service shows bursty usage of TCP memory, and I think
it indeed needs sockmem allocation suppressed if the memcg is already
under pressure. The memcg pressure is usually caused by too many page
caches, with the dirty ones being written back to slow backends. So it
makes little sense to keep receiving net data and consume even more
memory.

> 
> This code is quite subtle and small changes can cause unintended
> behavior changes. At the moment the tcp memory accounting and memcg
> accounting are intermingled and I think we should decouple them.

My original intention in posting this patchset was to clarify that:

   - proto pressure only considers sysctl_mem[] (patch 2)
   - memcg pressure only indicates the pressure inside itself
   - consider both whenever allocation or reclaim is needed (patches 1,3)

In this way, the two kinds of pressure keep purer semantics, and the
socket core can react to both of them properly and consistently.
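
If it helps, here is a minimal sketch of what "consider both" means at
the allocation site (illustrative pseudo-C only, not the actual patch;
proto_under_pressure() is a hypothetical stand-in for the sysctl_mem[]
check, while mem_cgroup_under_socket_pressure() is the real callback
used in the diff below):

	/* Hypothetical helper combining both pressure signals. */
	static bool sock_under_pressure(const struct sock *sk)
	{
		/* Proto level: driven purely by sysctl_mem[] thresholds. */
		if (proto_under_pressure(sk->sk_prot))
			return true;

		/* Memcg level: pressure inside the socket's own memcg tree. */
		if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
		    mem_cgroup_under_socket_pressure(sk->sk_memcg))
			return true;

		return false;
	}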

> 
>> ---
>>   net/core/sock.c | 23 ++++++++++++++++-------
>>   1 file changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 801df091e37a..7641d64293af 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -2977,21 +2977,30 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>>   {
>>   	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
>>   	struct proto *prot = sk->sk_prot;
>> -	bool charged = true;
>> +	bool charged = true, pressured = false;
>>   	long allocated;
>>   
>>   	sk_memory_allocated_add(sk, amt);
>>   	allocated = sk_memory_allocated(sk);
>> -	if (memcg_charge &&
>> -	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
>> -						gfp_memcg_charge())))
>> -		goto suppress_allocation;
>> +
>> +	if (memcg_charge) {
>> +		charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
>> +						  gfp_memcg_charge());
>> +		if (!charged)
>> +			goto suppress_allocation;
>> +		if (mem_cgroup_under_socket_pressure(sk->sk_memcg))
> 
> The memcg under pressure callback does an upward memcg tree walk, so
> please make sure you have tested the performance impact of this.
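
Right. For reference, my understanding is that the check is roughly an
upward walk over the ancestor memcgs, like the sketch below (paraphrased
for illustration, not verbatim kernel code; the details differ between
cgroup v1 and v2 and across kernel versions):

	static bool memcg_tree_under_socket_pressure(struct mem_cgroup *memcg)
	{
		/* Walk from the socket's memcg up to the root memcg. */
		do {
			if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
				return true;
		} while ((memcg = parent_mem_cgroup(memcg)));

		return false;
	}

So the cost grows with the depth of the cgroup tree, hence the deeper
hierarchy re-test further below.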

Yes, I have run several benchmarks on a dual-socket Intel Xeon(R)
Platinum 8260 machine with SNC disabled, that is 2 NUMA nodes with
24C/48T each. All the benchmarks were run inside a separate cgroup on
an otherwise clean host. Below are the results of tbench4 and netperf:

tbench4 Throughput (misleading but traditional)
                             baseline               patchset
Hmean     1        377.62 (   0.00%)      375.06 *  -0.68%*
Hmean     2        753.99 (   0.00%)      753.21 *  -0.10%*
Hmean     4       1503.50 (   0.00%)     1493.07 *  -0.69%*
Hmean     8       2941.43 (   0.00%)     2925.18 *  -0.55%*
Hmean     16      5637.59 (   0.00%)     5603.64 *  -0.60%*
Hmean     32      9042.90 (   0.00%)     9022.53 *  -0.23%*
Hmean     64     10530.55 (   0.00%)    10554.89 *   0.23%*
Hmean     128    24230.20 (   0.00%)    24424.74 *   0.80%*
Hmean     256    23798.21 (   0.00%)    23941.24 *   0.60%*
Hmean     384    23620.63 (   0.00%)    23569.54 *  -0.22%*

netperf-udp
                                    baseline               patchset
Hmean     send-64         281.99 (   0.00%)      274.50 *  -2.65%*
Hmean     send-128        556.70 (   0.00%)      545.82 *  -1.96%*
Hmean     send-256       1102.60 (   0.00%)     1091.21 *  -1.03%*
Hmean     send-1024      4180.48 (   0.00%)     4073.87 *  -2.55%*
Hmean     send-2048      7837.61 (   0.00%)     7707.12 *  -1.66%*
Hmean     send-3312     12157.49 (   0.00%)    11845.03 *  -2.57%*
Hmean     send-4096     14512.64 (   0.00%)    14156.45 *  -2.45%*
Hmean     send-8192     24015.40 (   0.00%)    23920.94 (  -0.39%)
Hmean     send-16384    39875.21 (   0.00%)    39696.67 (  -0.45%)
Hmean     recv-64         281.99 (   0.00%)      274.50 *  -2.65%*
Hmean     recv-128        556.70 (   0.00%)      545.82 *  -1.96%*
Hmean     recv-256       1102.60 (   0.00%)     1091.21 *  -1.03%*
Hmean     recv-1024      4180.48 (   0.00%)     4073.76 *  -2.55%*
Hmean     recv-2048      7837.61 (   0.00%)     7707.11 *  -1.67%*
Hmean     recv-3312     12157.49 (   0.00%)    11845.03 *  -2.57%*
Hmean     recv-4096     14512.62 (   0.00%)    14156.45 *  -2.45%*
Hmean     recv-8192     24015.29 (   0.00%)    23920.88 (  -0.39%)
Hmean     recv-16384    39873.93 (   0.00%)    39696.02 (  -0.45%)

netperf-tcp
                               baseline               patchset
Hmean     64        1777.05 (   0.00%)     1793.04 (   0.90%)
Hmean     128       3364.25 (   0.00%)     3451.05 *   2.58%*
Hmean     256       6309.21 (   0.00%)     6506.84 *   3.13%*
Hmean     1024     19571.52 (   0.00%)    19606.65 (   0.18%)
Hmean     2048     26467.00 (   0.00%)    26658.12 (   0.72%)
Hmean     3312     31312.36 (   0.00%)    31403.54 (   0.29%)
Hmean     4096     33263.37 (   0.00%)    33278.77 (   0.05%)
Hmean     8192     39961.82 (   0.00%)    40149.77 (   0.47%)
Hmean     16384    46065.33 (   0.00%)    46683.67 (   1.34%)

Except for a slight regression in netperf-udp, there is no obvious
performance win or loss. But since you reminded me of the cost of the
hierarchical behavior, I re-tested the cases in a cgroup at 5-level
depth (originally 2-level), and the results are:

tbench4 Throughput (misleading but traditional)
                             baseline               patchset
Hmean     1        361.93 (   0.00%)      367.58 *   1.56%*
Hmean     2        734.39 (   0.00%)      730.33 *  -0.55%*
Hmean     4       1426.82 (   0.00%)     1440.81 *   0.98%*
Hmean     8       2848.86 (   0.00%)     2860.87 *   0.42%*
Hmean     16      5436.72 (   0.00%)     5491.72 *   1.01%*
Hmean     32      8743.34 (   0.00%)     8913.27 *   1.94%*
Hmean     64     10345.41 (   0.00%)    10436.92 *   0.88%*
Hmean     128    23390.36 (   0.00%)    23353.09 *  -0.16%*
Hmean     256    23823.20 (   0.00%)    23509.79 *  -1.32%*
Hmean     384    23268.09 (   0.00%)    23178.10 *  -0.39%*

netperf-udp
                                    baseline               patchset
Hmean     send-64         278.31 (   0.00%)      275.68 *  -0.94%*
Hmean     send-128        554.52 (   0.00%)      547.46 (  -1.27%)
Hmean     send-256       1106.64 (   0.00%)     1103.01 (  -0.33%)
Hmean     send-1024      4135.84 (   0.00%)     4057.47 *  -1.89%*
Hmean     send-2048      7816.13 (   0.00%)     7732.71 *  -1.07%*
Hmean     send-3312     12068.32 (   0.00%)    11895.94 *  -1.43%*
Hmean     send-4096     14358.02 (   0.00%)    14304.06 (  -0.38%)
Hmean     send-8192     24041.57 (   0.00%)    24061.70 (   0.08%)
Hmean     send-16384    39996.09 (   0.00%)    39936.08 (  -0.15%)
Hmean     recv-64         278.31 (   0.00%)      275.68 *  -0.94%*
Hmean     recv-128        554.52 (   0.00%)      547.46 (  -1.27%)
Hmean     recv-256       1106.64 (   0.00%)     1103.01 (  -0.33%)
Hmean     recv-1024      4135.84 (   0.00%)     4057.47 *  -1.89%*
Hmean     recv-2048      7816.13 (   0.00%)     7732.71 *  -1.07%*
Hmean     recv-3312     12068.32 (   0.00%)    11895.94 *  -1.43%*
Hmean     recv-4096     14357.99 (   0.00%)    14304.04 (  -0.38%)
Hmean     recv-8192     24041.43 (   0.00%)    24061.58 (   0.08%)
Hmean     recv-16384    39995.72 (   0.00%)    39935.68 (  -0.15%)

netperf-tcp
                               baseline               patchset
Hmean     64        1779.93 (   0.00%)     1784.75 (   0.27%)
Hmean     128       3380.32 (   0.00%)     3424.14 (   1.30%)
Hmean     256       6383.37 (   0.00%)     6504.97 *   1.90%*
Hmean     1024     19345.07 (   0.00%)    19604.06 *   1.34%*
Hmean     2048     26547.60 (   0.00%)    26743.94 *   0.74%*
Hmean     3312     30948.40 (   0.00%)    31419.11 *   1.52%*
Hmean     4096     32888.83 (   0.00%)    33125.01 *   0.72%*
Hmean     8192     40020.38 (   0.00%)    39949.53 (  -0.18%)
Hmean     16384    46084.48 (   0.00%)    46300.43 (   0.47%)

Still no obvious difference, and even the udp regression is reduced.

Thanks & Best,
	Abel
