netdev - [PATCH bpf-next v1 0/3] bpf, sockmap: Improve performance with CPU affinity

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250428081744.52375-1-jiayuan.chen@linux.dev>
Date: Mon, 28 Apr 2025 16:16:51 +0800
From: Jiayuan Chen <jiayuan.chen@...ux.dev>
To: bpf@...r.kernel.org
Cc: mrpre@....com,
	Jiayuan Chen <jiayuan.chen@...ux.dev>,
	Alexei Starovoitov <ast@...nel.org>,
	Daniel Borkmann <daniel@...earbox.net>,
	Andrii Nakryiko <andrii@...nel.org>,
	Martin KaFai Lau <martin.lau@...ux.dev>,
	Eduard Zingerman <eddyz87@...il.com>,
	Song Liu <song@...nel.org>,
	Yonghong Song <yonghong.song@...ux.dev>,
	John Fastabend <john.fastabend@...il.com>,
	KP Singh <kpsingh@...nel.org>,
	Stanislav Fomichev <sdf@...ichev.me>,
	Hao Luo <haoluo@...gle.com>,
	Jiri Olsa <jolsa@...nel.org>,
	Jonathan Corbet <corbet@....net>,
	Jakub Sitnicki <jakub@...udflare.com>,
	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>,
	Paolo Abeni <pabeni@...hat.com>,
	Simon Horman <horms@...nel.org>,
	Kuniyuki Iwashima <kuniyu@...zon.com>,
	Willem de Bruijn <willemb@...gle.com>,
	Mykola Lysenko <mykolal@...com>,
	Shuah Khan <shuah@...nel.org>,
	Jiapeng Chong <jiapeng.chong@...ux.alibaba.com>,
	linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	netdev@...r.kernel.org,
	linux-kselftest@...r.kernel.org
Subject: [PATCH bpf-next v1 0/3] bpf, sockmap: Improve performance with CPU affinity

Abstract
===
This patchset improves the performance of sockmap by providing CPU affinity,
resulting in a 1-10x increase in throughput.


Motivation
===
Traditional user-space reverse proxy:

              Reserve Proxy
            _________________
client  -> | fd1  <->  fd2   | -> server
           |_________________|

Using sockmap for reverse proxy:

              Reserve Proxy
            _________________
client ->  |  fd1  <->  fd2  |  -> server
         | |_________________| |
         |      |       |      |
         |      _________      |
         |     | sockmap |     |
          -->  |_________|  -->

By adding fds to sockmap and using a BPF program, we can quickly forward
data and avoid data copying between user space and kernel space.

Mainstream multi-process reverse proxy applications, such as Nginx and
HAProxy, support CPU affinity settings, which allow each process to be
pinned to a specific CPU, avoiding conflicts between data plane processes
and other processes, especially in multi-tenant environments.


Current Issues
===
The current design of sockmap uses a workqueue to forward ingress_skb and
wakes up the workqueue without specifying a CPU
(by calling schedule_delayed_work()). In the current implementation of
schedule_delayed_work, it tends to run the workqueue on the current CPU.

This approach has a high probability of running on the current CPU, which
is the same CPU that handles the net rx soft interrupt, especially for
programs that access each other using local interfaces.

The loopback driver's transmit interface, loopback_xmit(), directly calls
__netif_rx() on the current CPU, which means that the CPU handling
sockmap's workqueue and the client's sending CPU are the same, resulting
in contention.

For a TCP flow, if the request or response is very large, the
psock->ingress_skb queue can become very long. When the workqueue
traverses this queue to forward the data, it can consume a significant
amount of CPU time.


Solution
===
Configuring RPS on a loopback interface can be useful, but it will trigger
additional softirq, and furthermore, it fails to achieve our expected
effect of CPU isolation from other processes.

Instead, we provide a kfunc that allow users to specify the CPU on which
the workqueue runs through a BPF program.

We can use the existing benchmark to test the performance, which allows
us to evaluate the effectiveness of this optimization.

Because we use local interfaces for communication and the client consumes
a significant amount of CPU when sending data, this prevents the workqueue
from processing ingress_skb in a timely manner, ultimately causing the
server to fail to read data quickly.

Without cpu-affinity:
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --no-verify
Setting up benchmark 'sockmap'...
create socket fd c1:14 p1:15 c2:16 p2:17
Benchmark 'sockmap' started.
Iter   0 ( 36.031us): Send Speed 1143.693 MB/s ... Rcv Speed  109.572 MB/s
Iter   1 (  0.608us): Send Speed 1320.550 MB/s ... Rcv Speed   48.103 MB/s
Iter   2 ( -5.448us): Send Speed 1314.790 MB/s ... Rcv Speed   47.842 MB/s
Iter   3 ( -0.613us): Send Speed 1320.158 MB/s ... Rcv Speed   46.531 MB/s
Iter   4 ( -3.441us): Send Speed 1319.375 MB/s ... Rcv Speed   46.662 MB/s
Iter   5 (  3.764us): Send Speed 1166.667 MB/s ... Rcv Speed   42.467 MB/s
Iter   6 ( -4.404us): Send Speed 1319.508 MB/s ... Rcv Speed   47.973 MB/s
Summary: total trans     7758 MB ± 1293.506 MB/s

Without cpu-affinity(RPS enabled):
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --no-verify
Setting up benchmark 'sockmap'...
create socket fd c1:14 p1:15 c2:16 p2:17
Benchmark 'sockmap' started.
Iter   0 ( 28.925us): Send Speed 1630.357 MB/s ... Rcv Speed  850.960 MB/s
Iter   1 ( -2.042us): Send Speed 1644.564 MB/s ... Rcv Speed  822.478 MB/s
Iter   2 (  0.754us): Send Speed 1644.297 MB/s ... Rcv Speed  850.787 MB/s
Iter   3 (  0.159us): Send Speed 1644.429 MB/s ... Rcv Speed  850.198 MB/s
Iter   4 ( -2.898us): Send Speed 1646.924 MB/s ... Rcv Speed  830.867 MB/s
Iter   5 ( -0.210us): Send Speed 1649.410 MB/s ... Rcv Speed  824.246 MB/s
Iter   6 ( -1.448us): Send Speed 1650.723 MB/s ... Rcv Speed  808.256 MB/s

With cpu-affinity(RPS disabled):
./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --no-verify --cpu-affinity
Setting up benchmark 'sockmap'...
create socket fd c1:14 p1:15 c2:16 p2:17
Benchmark 'sockmap' started.
Iter   0 ( 36.051us): Send Speed 1883.437 MB/s ... Rcv Speed 1865.087 MB/s
Iter   1 (  1.246us): Send Speed 1900.542 MB/s ... Rcv Speed 1761.737 MB/s
Iter   2 ( -8.595us): Send Speed 1883.128 MB/s ... Rcv Speed 1860.714 MB/s
Iter   3 (  7.033us): Send Speed 1890.831 MB/s ... Rcv Speed 1806.684 MB/s
Iter   4 ( -8.397us): Send Speed 1884.700 MB/s ... Rcv Speed 1973.568 MB/s
Iter   5 ( -1.822us): Send Speed 1894.125 MB/s ... Rcv Speed 1775.046 MB/s
Iter   6 (  4.936us): Send Speed 1877.597 MB/s ... Rcv Speed 1959.320 MB/s
Summary: total trans    11328 MB ± 1888.507 MB/s


Appendix
===
Through this optimization, we discovered that sk_mem_charge()
and sk_mem_uncharge() have concurrency issues. The performance improvement
brought by this optimization has made these concurrency issues more
evident.

This concurrency issue can cause the WARN_ON_ONCE(sk->sk_forward_alloc)
check to be triggered when the socket is released. Since this patch is a
feature-type patch and does not intend to fix this bug, I will provide
additional patches to fix this issue later.

Jiayuan Chen (3):
  bpf, sockmap: Introduce a new kfunc for sockmap
  bpf, sockmap: Affinitize workqueue to a specific CPU
  selftest/bpf/benchs: Add cpu-affinity for sockmap bench

 Documentation/bpf/map_sockmap.rst             | 14 +++++++
 include/linux/skmsg.h                         | 15 +++++++
 kernel/bpf/btf.c                              |  3 ++
 net/core/skmsg.c                              | 10 +++--
 net/core/sock_map.c                           | 39 +++++++++++++++++++
 .../selftests/bpf/benchs/bench_sockmap.c      | 35 +++++++++++++++--
 tools/testing/selftests/bpf/bpf_kfuncs.h      |  6 +++
 .../selftests/bpf/progs/bench_sockmap_prog.c  |  7 ++++
 8 files changed, 122 insertions(+), 7 deletions(-)

-- 
2.47.1