linux-kernel - Re: [PATCH 1/2] bpf: Introduce cpu affinity for sockmap

Message-ID: <CAEf4BzbVqcCN1p8ydLN17LygK5R=gBYJV0A-cnycjtsUzrX34g@mail.gmail.com>
Date: Fri, 1 Nov 2024 12:25:51 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: mrpre <mrpre@....com>
Cc: yonghong.song@...ux.dev, john.fastabend@...il.com, martin.lau@...nel.org, 
	edumazet@...gle.com, jakub@...udflare.com, davem@...emloft.net, 
	dsahern@...nel.org, kuba@...nel.org, pabeni@...hat.com, 
	netdev@...r.kernel.org, bpf@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] bpf: Introduce cpu affinity for sockmap

On Thu, Oct 31, 2024 at 7:40 PM mrpre <mrpre@....com> wrote:
>
> Why we need cpu affinity:
> Mainstream data planes, like Nginx and HAProxy, utilize CPU affinity
> by binding user processes to specific CPUs. This avoids interference
> between processes and prevents impact from other processes.
>
> Sockmap, as an optimization to accelerate such proxy programs,
> currently lacks the ability to specify CPU affinity. The current
> implementation of sockmap backlog handling is based on a workqueue
> and is scheduled by calling 'schedule_delayed_work()'. Its current
> implementation prefers to schedule on the local CPU, i.e., the CPU
> that handled the packet under softirq.
>
> For extremely high traffic with large numbers of packets,
> 'sk_psock_backlog' becomes a large loop.
>
> For multi-threaded programs with only one map, we expect different
> sockets to run on different CPUs. It is important to note that this
> feature is not a general performance optimization. Instead, it
> provides users with the ability to bind to a specific CPU, allowing
> them to enhance overall operating system utilization based on their
> own system environments.
>
> Implementation:
> 1. When updating the sockmap, support passing a CPU parameter and
> save it to the psock.
> 2. When scheduling the psock, determine which CPU to run on using the
> psock's CPU information.
> 3. For sockmaps without CPU affinity, keep the original logic by using
> 'schedule_delayed_work()'.
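
The three steps above can be modelled in plain userspace C. This is only a rough sketch, not the patch's kernel code: the `sketch_*` names, the range check against `nr_cpus`, and the numeric value of `SKETCH_CPU_UNBOUND` (a stand-in for the kernel's `WORK_CPU_UNBOUND`) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sentinels; the values are arbitrary and not the kernel's. */
#define SKETCH_NO_CPU       (-1)  /* no affinity was requested */
#define SKETCH_CPU_UNBOUND  (-2)  /* stand-in for WORK_CPU_UNBOUND */

/* Hypothetical model of the per-socket state that stores the affinity. */
struct sketch_psock {
	int32_t target_cpu;
};

/* A new psock starts without affinity, so the original logic applies. */
static void sketch_psock_init(struct sketch_psock *psock)
{
	psock->target_cpu = SKETCH_NO_CPU;
}

/* Step 1: on map update, save the requested CPU in the psock.
 * The range check is an assumption of this sketch.
 * Returns 0 on success, -1 if the CPU is out of range.
 */
static int sketch_psock_set_cpu(struct sketch_psock *psock, int32_t cpu,
				int32_t nr_cpus)
{
	assert(nr_cpus > 0);
	if (cpu < 0 || cpu >= nr_cpus)
		return -1;
	psock->target_cpu = cpu;
	return 0;
}

/* Steps 2 and 3: pick the CPU for the backlog work, falling back to
 * "unbound" when no affinity was configured.
 */
static int32_t sketch_psock_pick_cpu(const struct sketch_psock *psock)
{
	if (psock->target_cpu != SKETCH_NO_CPU)
		return psock->target_cpu;
	return SKETCH_CPU_UNBOUND;
}
```

In this model the caller would hand the result of `sketch_psock_pick_cpu()` to whatever work-scheduling primitive it uses, so sockets without a configured CPU keep the existing behaviour.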
>
> Performance Testing:
> 'client <-> sockmap proxy <-> server'
>
> Using 'iperf3' tests, with the iperf server bound to CPU5 and the iperf
> client bound to CPU6, performance without using CPU affinity is
> around 34 Gbits/s, and CPU usage is concentrated on CPU5 and CPU6.
> '''
> [  5] local 127.0.0.1 port 57144 connected to 127.0.0.1 port 10000
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  3.95 GBytes  33.9 Gbits/sec
> [  5]   1.00-2.00   sec  3.95 GBytes  34.0 Gbits/sec
> ......
> '''
>
> With CPU affinity, the performance is close to that of a direct
> connection (without any proxy).
> '''
> [  5] local 127.0.0.1 port 56518 connected to 127.0.0.1 port 10000
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  7.76 GBytes  66.6 Gbits/sec
> [  5]   1.00-2.00   sec  7.76 GBytes  66.7 Gbits/sec
> ......
> '''
>
> Signed-off-by: Jiayuan Chen <mrpre@....com>
> ---
>  include/linux/bpf.h      |  3 ++-
>  include/linux/skmsg.h    |  8 ++++++++
>  include/uapi/linux/bpf.h |  4 ++++
>  kernel/bpf/syscall.c     | 23 +++++++++++++++++------
>  net/core/skmsg.c         | 11 +++++++----
>  net/core/sock_map.c      | 12 +++++++-----
>  6 files changed, 45 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index c3ba4d475174..a56028c389e7 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -3080,7 +3080,8 @@ int bpf_prog_test_run_syscall(struct bpf_prog *prog,
>
>  int sock_map_get_from_fd(const union bpf_attr *attr, struct bpf_prog *prog);
>  int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype);
> -int sock_map_update_elem_sys(struct bpf_map *map, void *key, void *value, u64 flags);
> +int sock_map_update_elem_sys(struct bpf_map *map, void *key, void *value, u64 flags,
> +                            s32 target_cpu);
>  int sock_map_bpf_prog_query(const union bpf_attr *attr,
>                             union bpf_attr __user *uattr);
>  int sock_map_link_create(const union bpf_attr *attr, struct bpf_prog *prog);
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index d9b03e0746e7..919425a92adf 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -117,6 +117,7 @@ struct sk_psock {
>         struct delayed_work             work;
>         struct sock                     *sk_pair;
>         struct rcu_work                 rwork;
> +       s32                             target_cpu;
>  };
>
>  int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
> @@ -514,6 +515,13 @@ static inline bool sk_psock_strp_enabled(struct sk_psock *psock)
>         return !!psock->saved_data_ready;
>  }
>
> +static inline int sk_psock_strp_get_cpu(struct sk_psock *psock)
> +{
> +       if (psock->target_cpu != -1)
> +               return psock->target_cpu;
> +       return WORK_CPU_UNBOUND;
> +}
> +
>  #if IS_ENABLED(CONFIG_NET_SOCK_MSG)
>
>  #define BPF_F_STRPARSER        (1UL << 1)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f28b6527e815..2019a87b5d4a 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1509,6 +1509,10 @@ union bpf_attr {
>                         __aligned_u64 next_key;
>                 };
>                 __u64           flags;
> +               union {
> +                       /* specify the CPU where the sockmap job run on */
> +                       __aligned_u64 target_cpu;

I have no opinion on the feature itself, I'll leave this to others.
But from UAPI perspective:

a) why is this a u64 and not, say, int?
b) maybe we should just specify this as flags and not have to update
all the UAPIs (including libbpf-side)? Just add a new
BPF_F_SOCKMAP_TARGET_CPU flag or something, and specify that the
highest 32 bits carry the CPU itself?

We have a similar scheme for some other helpers, so not *that* unusual.
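
A minimal userspace sketch of such a flags-based encoding, assuming one marker bit plus the CPU number in the upper 32 bits of the 64-bit flags word. The flag name, its bit position, and the helper names are hypothetical and not part of any existing UAPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker bit; the position is chosen arbitrarily here. */
#define SKETCH_F_SOCKMAP_TARGET_CPU  (1ULL << 3)
#define SKETCH_CPU_SHIFT             32
#define SKETCH_LOW_FLAGS_MASK        0xffffffffULL

/* Combine ordinary update flags (low 32 bits) with a target CPU. */
static uint64_t sketch_flags_with_cpu(uint64_t flags, uint32_t cpu)
{
	/* The upper half must be free, since it will carry the CPU. */
	assert((flags >> SKETCH_CPU_SHIFT) == 0);
	return flags | SKETCH_F_SOCKMAP_TARGET_CPU |
	       ((uint64_t)cpu << SKETCH_CPU_SHIFT);
}

/* Extract the CPU; returns false when the marker bit is not set. */
static bool sketch_flags_get_cpu(uint64_t flags, uint32_t *cpu)
{
	if (!(flags & SKETCH_F_SOCKMAP_TARGET_CPU))
		return false;
	*cpu = (uint32_t)(flags >> SKETCH_CPU_SHIFT);
	return true;
}

/* Recover the ordinary flags, without the marker bit or the CPU. */
static uint64_t sketch_flags_strip_cpu(uint64_t flags)
{
	return flags & SKETCH_LOW_FLAGS_MASK & ~SKETCH_F_SOCKMAP_TARGET_CPU;
}
```

With an encoding like this, the generic map update path keeps its existing signature, and only the sockmap-specific code needs to decode the upper bits.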

> +               };
>         };
>
>         struct { /* struct used by BPF_MAP_*_BATCH commands */
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8254b2973157..95f719b9c3f3 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -239,10 +239,9 @@ static int bpf_obj_pin_uptrs(struct btf_record *rec, void *obj)
>  }
>
>  static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
> -                               void *key, void *value, __u64 flags)
> +                               void *key, void *value, __u64 flags, s32 target_cpu)

yeah, this is what I'm talking about. Think how ridiculous it is for a
generic "BPF map update" operation to accept the "target_cpu"
parameter.

pw-bot: cr

>  {
>         int err;
> -

why? don't break whitespace formatting

>         /* Need to create a kthread, thus must support schedule */
>         if (bpf_map_is_offloaded(map)) {
>                 return bpf_map_offload_update_elem(map, key, value, flags);
> @@ -252,7 +251,7 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
>                 return map->ops->map_update_elem(map, key, value, flags);
>         } else if (map->map_type == BPF_MAP_TYPE_SOCKHASH ||
>                    map->map_type == BPF_MAP_TYPE_SOCKMAP) {
> -               return sock_map_update_elem_sys(map, key, value, flags);
> +               return sock_map_update_elem_sys(map, key, value, flags, target_cpu);
>         } else if (IS_FD_PROG_ARRAY(map)) {
>                 return bpf_fd_array_map_update_elem(map, map_file, key, value,
>                                                     flags);

[...]