netdev - Re: [PATCH net-next 1/2] net: Keep sk->sk_forward

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230517162447.dztfzmx3hhetfs2q@google.com>
Date: Wed, 17 May 2023 16:24:47 +0000
From: Shakeel Butt <shakeelb@...gle.com>
To: Oliver Sang <oliver.sang@...el.com>
Cc: Zhang Cathy <cathy.zhang@...el.com>, Yin Fengwei <fengwei.yin@...el.com>, 
	Feng Tang <feng.tang@...el.com>, Eric Dumazet <edumazet@...gle.com>, Linux MM <linux-mm@...ck.org>, 
	Cgroups <cgroups@...r.kernel.org>, Paolo Abeni <pabeni@...hat.com>, 
	"davem@...emloft.net" <davem@...emloft.net>, "kuba@...nel.org" <kuba@...nel.org>, 
	Brandeburg Jesse <jesse.brandeburg@...el.com>, Srinivas Suresh <suresh.srinivas@...el.com>, 
	Chen Tim C <tim.c.chen@...el.com>, You Lizhen <lizhen.you@...el.com>, 
	"eric.dumazet@...il.com" <eric.dumazet@...il.com>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>, philip.li@...el.com, 
	yujie.liu@...el.com
Subject: Re: [PATCH net-next 1/2] net: Keep sk->sk_forward_alloc as a proper size

On Tue, May 16, 2023 at 01:46:55PM +0800, Oliver Sang wrote:
> hi Shakeel,
> 
> On Mon, May 15, 2023 at 12:50:31PM -0700, Shakeel Butt wrote:
> > +Feng, Yin and Oliver
> > 
> > >
> > > > Thanks a lot Cathy for testing. Do you see any performance improvement for
> > > > the memcached benchmark with the patch?
> > >
> > > Yep, absolutely :- ) RPS (with/without patch) = +1.74
> > 
> > Thanks a lot Cathy.
> > 
> > Feng/Yin/Oliver, can you please test the patch at [1] with other
> > workloads used by the test robot? Basically I wanted to know if it has
> > any positive or negative impact on other perf benchmarks.
> 
> is it possible for you to resend patch with Signed-off-by?
> without it, test robot will regard the patch as informal, then it cannot feed
> into auto test process.
> and could you tell us the base of this patch? it will help us apply it
> correctly.
> 
> on the other hand, due to resource restraint, we normally cannot support
> this type of on-demand test upon a single patch, patch set, or a branch.
> instead, we try to merge them into so-called hourly-kernels, then distribute
> tests and auto-bisects to various platforms.
> after we applying your patch and merging it to hourly-kernels sccussfully,
> if it really causes some performance changes, the test robot could spot out
> this patch as 'fbc' and we will send report to you. this could happen within
> several weeks after applying.
> but due to the complexity of whole process (also limited resourse, such like
> we cannot run all tests on all platforms), we cannot guanrantee capture all
> possible performance impacts of this patch. and it's hard for us to provide
> a big picture like what's the general performance impact of this patch.
> this maybe is not exactly what you want. is it ok for you?
> 
> 

Yes, that is fine and thanks for the help. The patch is below:


>From 93b3b4c5f356a5090551519522cfd5740ae7e774 Mon Sep 17 00:00:00 2001
From: Shakeel Butt <shakeelb@...gle.com>
Date: Tue, 16 May 2023 20:30:26 +0000
Subject: [PATCH] memcg: skip stock refill in irq context

The linux kernel processes incoming packets in softirq on a given CPU
and those packets may belong to different jobs. This is very normal on
large systems running multiple workloads. With memcg enabled, network
memory for such packets is charged to the corresponding memcgs of the
jobs.

Memcg charging can be a costly operation and the memcg code implements
a per-cpu memcg charge caching optimization to reduce the cost of
charging. More specifically, the kernel charges the given memcg for more
memory than requested and keep the remaining charge in a local per-cpu
cache. The insight behind this heuristic is that there will be more
charge requests for that memcg in near future. This optimization works
well when a specific job runs on a CPU for long time and majority of the
charging requests happen in process context. However the kernel's
incoming packet processing does not work well with this optimization.

Recently Cathy Zhang has shown [1] that memcg charge flushing within the
memcg charge path can become a performance bottleneck for the memcg
charging of network traffic.

Perf profile:

8.98%  mc-worker        [kernel.vmlinux]          [k] page_counter_cancel
    |
     --8.97%--page_counter_cancel
	       |
		--8.97%--page_counter_uncharge
			  drain_stock
			  __refill_stock
			  refill_stock
			  |
			   --8.91%--try_charge_memcg
				     mem_cgroup_charge_skmem
				     |
				      --8.91%--__sk_mem_raise_allocated
						__sk_mem_schedule
						|
						|--5.41%--tcp_try_rmem_schedule
						|          tcp_data_queue
						|          tcp_rcv_established
						|          tcp_v4_do_rcv
						|          tcp_v4_rcv

The simplest way to solve this issue is to not refill the memcg charge
stock in the irq context. Since networking is the main source of memcg
charging in the irq context, other users will not be impacted. In
addition, this will preseve the memcg charge cache of the application
running on that CPU.

There are also potential side effects. What if all the packets belong to
the same application and memcg? More specifically, users can use Receive
Flow Steering (RFS) to make sure the kernel process the packets of the
application on the CPU where the application is running. This change may
cause the kernel to do slowpath memcg charging more often in irq
context.

Link: https://lore.kernel.org/all/IA0PR11MB73557DEAB912737FD61D2873FC749@IA0PR11MB7355.namprd11.prod.outlook.com [1]
Signed-off-by: Shakeel Butt <shakeelb@...gle.com>
---
 mm/memcontrol.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5abffe6f8389..2635aae82b3e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2652,6 +2652,14 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	bool raised_max_event = false;
 	unsigned long pflags;
 
+	/*
+	 * Skip the refill in irq context as it may flush the charge cache of
+	 * the process running on the CPUs or the kernel may have to process
+	 * incoming packets for different memcgs.
+	 */
+	if (!in_task())
+		batch = nr_pages;
+
 retry:
 	if (consume_stock(memcg, nr_pages))
 		return 0;
-- 
2.40.1.606.ga4b1b128d6-goog