Message-ID: <20201210080844.23741-1-sjpark@amazon.com>
Date: Thu, 10 Dec 2020 09:08:43 +0100
From: SeongJae Park <sjpark@...zon.com>
To: <davem@...emloft.net>
CC: SeongJae Park <sjpark@...zon.de>, <kuba@...nel.org>,
<kuznet@....inr.ac.ru>, <edumazet@...gle.com>, <fw@...len.de>,
<paulmck@...nel.org>, <netdev@...r.kernel.org>,
<rcu@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: [PATCH v2 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)'
From: SeongJae Park <sjpark@...zon.de>
On a few of our systems, I found that frequent 'unshare(CLONE_NEWNET)'
calls make the number of active slab objects, including those of the
'sock_inode_cache' type, increase rapidly and continuously. As a
result, the systems come under memory pressure.
In more detail, I made an artificial reproducer that resembles the
workload in which we found the problem but reproduces it faster. It
merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop, which
takes about 2 minutes. On a machine with 40 CPU cores and 70GB of
DRAM, it reduced the available memory by about 15GB in total. Note
that the issue does not reproduce on every machine; on my 6 CPU core
machine, the problem did not show up.
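For reference, the reproducer is essentially the program below (a
minimal sketch; the iteration count, file name, and error handling are
illustrative only):

/* repro.c: repeatedly create new network namespaces.
 * Build: gcc -o repro repro.c
 * Run:   sudo ./repro    (CLONE_NEWNET needs CAP_SYS_ADMIN)
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        int i;

        for (i = 0; i < 50000; i++) {
                /* Each call moves this task into a fresh netns; the
                 * previous one is released and torn down
                 * asynchronously.
                 */
                if (unshare(CLONE_NEWNET) != 0) {
                        perror("unshare");
                        exit(1);
                }
        }
        return 0;
}
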
'cleanup_net()' and 'fqdir_work_fn()' are the functions that deallocate
the relevant memory objects. They are invoked asynchronously from
workqueues and internally use 'rcu_barrier()' to ensure safe
destruction. 'cleanup_net()' works in a batched manner on a single
threaded workqueue, while 'fqdir_work_fn()' is queued on 'system_wq'
once per 'fqdir_exit()' call.
Therefore, 'fqdir_work_fn()' was invoked frequently under the workload
and made the contention on 'rcu_barrier()' high. In more detail, the
global mutex 'rcu_state.barrier_mutex' became the bottleneck.
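To illustrate, the per-fqdir destroy path before this change looks
roughly like below (a simplified sketch in the context of
net/ipv4/inet_fragment.c; the rhashtable teardown and refcount
handling are omitted):

static void fqdir_work_fn(struct work_struct *work)
{
        struct fqdir *fqdir = container_of(work, struct fqdir,
                                           destroy_work);

        /* Wait for in-flight call_rcu() callbacks that may still
         * dereference this fqdir.  This takes the global
         * rcu_state.barrier_mutex, once per dying fqdir.
         */
        rcu_barrier();

        kfree(fqdir);
}

void fqdir_exit(struct fqdir *fqdir)
{
        /* One work item, and thus one rcu_barrier(), per fqdir. */
        INIT_WORK(&fqdir->destroy_work, fqdir_work_fn);
        queue_work(system_wq, &fqdir->destroy_work);
}
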
I tried making 'fqdir_work_fn()' batched and confirmed that it works.
The following patch implements the change, and a rough sketch of the
idea is shown further below. I think this is the right point fix for
this issue, but one might instead blame different parts.
1. User: Frequent 'unshare()' calls
From some point of view, such frequent 'unshare()' calls might simply
seem insane.
2. Global mutex in 'rcu_barrier()'
Because of the global mutex, a 'rcu_barrier()' caller can be kept
waiting for a long time just to take the mutex, even if the callbacks
it waits for have already been invoked. Therefore, similar issues
could occur in other 'rcu_barrier()' usages. Maybe we could use some
wait-queue like mechanism to notify the waiters once the condition
they wait for has been met.
I personally believe it makes sense to apply the point fix for now and
to improve 'rcu_barrier()' in the long term. If I'm missing something
or you have a different opinion, please feel free to let me know.
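For reference, the batching idea is roughly as below (a rough sketch
only; the 'free_list' field and the helper names are illustrative, and
the rhashtable teardown and refcount handling are again omitted.
Please see the patch itself for the real code):

static LLIST_HEAD(fqdir_free_list);

static void fqdir_free_fn(struct work_struct *work)
{
        struct llist_node *kill_list;
        struct fqdir *fqdir, *tmp;

        /* Atomically take every fqdir queued for destruction so far. */
        kill_list = llist_del_all(&fqdir_free_list);

        /* A single rcu_barrier() now covers the whole batch. */
        rcu_barrier();

        llist_for_each_entry_safe(fqdir, tmp, kill_list, free_list)
                kfree(fqdir);
}

static DECLARE_WORK(fqdir_free_work, fqdir_free_fn);

void fqdir_exit(struct fqdir *fqdir)
{
        /* llist_add() returns true only when the list was empty, so
         * only the first entry of a batch kicks the worker.
         */
        if (llist_add(&fqdir->free_list, &fqdir_free_list))
                queue_work(system_wq, &fqdir_free_work);
}
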
Patch History
-------------
Changes from v1
(https://lore.kernel.org/netdev/20201208094529.23266-1-sjpark@amazon.com/)
- Keep xmas tree variable ordering (Jakub Kicinski)
- Add more numbers (Eric Dumazet)
- Use 'llist_for_each_entry_safe()' (Eric Dumazet)
SeongJae Park (1):
net/ipv4/inet_fragment: Batch fqdir destroy works
include/net/inet_frag.h | 2 +-
net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
2 files changed, 21 insertions(+), 9 deletions(-)
--
2.17.1