Date:   Wed, 6 Jan 2021 11:47:56 -0800
From:   Martin KaFai Lau <kafai@...com>
To:     Stanislav Fomichev <sdf@...gle.com>
CC:     <netdev@...r.kernel.org>, <bpf@...r.kernel.org>, <ast@...nel.org>,
        <daniel@...earbox.net>, Song Liu <songliubraving@...com>,
        Eric Dumazet <edumazet@...gle.com>
Subject: Re: [PATCH bpf-next v3 3/3] bpf: remove extra lock_sock for
 TCP_ZEROCOPY_RECEIVE

On Tue, Jan 05, 2021 at 01:43:50PM -0800, Stanislav Fomichev wrote:
> Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
> We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
> call in do_tcp_getsockopt using the on-stack data. This removes
> 3% overhead for locking/unlocking the socket.
> 
> Also:
> - Removed BUILD_BUG_ON (zerocopy doesn't depend on the buf size anymore)
> - Separated on-stack buffer into bpf_sockopt_buf and downsized to 32 bytes
>   (let's keep it to help with the other options)
> 
> (I can probably split this patch into two: add new features and rework
>  bpf_sockopt_buf; can follow up if the approach in general sounds
>  good).
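(For reference, the call site being described is presumably shaped like the
following; this is a rough sketch rather than the exact hunk, with the
user-copy and length-validation details elided, and the _KERN macro name
inferred from the __cgroup_bpf_run_filter_getsockopt_kern symbol in the
perf output below:)

	case TCP_ZEROCOPY_RECEIVE: {
		struct tcp_zerocopy_receive zc;
		int err;

		/* ... copy zc in from optval and validate len ... */

		lock_sock(sk);
		err = tcp_zerocopy_receive(sk, &zc);
		release_sock(sk);

		/* kernel-flavoured hook: the prog sees the on-stack zc
		 * directly, so no extra copies and no second
		 * lock_sock()/release_sock() round trip
		 */
		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
							  &zc, &len, err);

		/* ... copy zc back out to userspace ... */
		return err;
	}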
> 
> Without this patch:
>      3.29%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
>             |
>              --3.22%--__cgroup_bpf_run_filter_getsockopt
>                        |
>                        |--0.66%--lock_sock_nested
A general question for sockopt progs: why do the BPF_CGROUP_(GET|SET)SOCKOPT
progs have to run under lock_sock()?

>                        |
>                        |--0.57%--__might_fault
>                        |
>                         --0.56%--release_sock
> 
> With the patch applied:
>      0.42%     0.10%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern
>      0.02%     0.02%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
> 
[ ... ]

> @@ -1445,15 +1442,29 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
>  				       int __user *optlen, int max_optlen,
>  				       int retval)
>  {
> -	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> -	struct bpf_sockopt_kern ctx = {
> -		.sk = sk,
> -		.level = level,
> -		.optname = optname,
> -		.retval = retval,
> -	};
> +	struct bpf_sockopt_kern ctx;
> +	struct bpf_sockopt_buf buf;
> +	struct cgroup *cgrp;
>  	int ret;
>  
> +#ifdef CONFIG_INET
> +	/* TCP do_tcp_getsockopt has optimized getsockopt implementation
> +	 * to avoid extra socket lock for TCP_ZEROCOPY_RECEIVE.
> +	 */
> +	if (sk->sk_prot->getsockopt == tcp_getsockopt &&
> +	    level == SOL_TCP && optname == TCP_ZEROCOPY_RECEIVE)
> +		return retval;
> +#endif
That seems like too much protocol detail and is not very scalable.
It does not really belong in kernel/bpf/cgroup.c, which has very little idea
whether a specific protocol has optimized things in some way (e.g. by
directly calling the cgroup's bpf prog at some strategic places, as in this
patch).  Let's see if it can be done better.

At least, these protocol checks belong in the net code's socket.c
more than in bpf's cgroup.c here.  If that also looks like a layering
breakage in socket.c, maybe add a signal in sk_prot (for example)
to tell whether sk_prot->getsockopt has already called the cgroup's bpf
prog?  (e.g. tcp_getsockopt() could directly call the cgroup's bpf prog for
all optnames instead of only TCP_ZEROCOPY_RECEIVE.)

For example:

int __sys_getsockopt(...)
{
	/* ... */

	/* skip the generic hook when the protocol's own getsockopt
	 * has already run the cgroup bpf prog
	 */
	if (!sk->sk_prot->bpf_getsockopt_handled)
		BPF_CGROUP_RUN_PROG_GETSOCKOPT(...);
}
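
And on the protocol side, something like (again only a sketch;
"bpf_getsockopt_handled" is a name made up for this example, and the
hook's argument list is abridged):

	/* tcp_prot advertises that its getsockopt already runs the
	 * cgroup bpf prog, so __sys_getsockopt() can skip the generic
	 * hook above.
	 */
	struct proto tcp_prot = {
		/* ... */
		.getsockopt		= tcp_getsockopt,
		.bpf_getsockopt_handled	= true,
	};

	int tcp_getsockopt(struct sock *sk, int level, int optname,
			   char __user *optval, int __user *optlen)
	{
		int err = do_tcp_getsockopt(sk, level, optname,
					    optval, optlen);

		/* run the cgroup bpf prog for every optname here, at
		 * a point where the socket lock is not held
		 */
		return BPF_CGROUP_RUN_PROG_GETSOCKOPT(sk, level, optname,
						      optval, optlen, ...);
	}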

Thoughts?

> +
> +	memset(&buf, 0, sizeof(buf));
> +	memset(&ctx, 0, sizeof(ctx));
> +
> +	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> +	ctx.sk = sk;
> +	ctx.level = level;
> +	ctx.optname = optname;
> +	ctx.retval = retval;
> +
>  	/* Opportunistic check to see whether we have any BPF program
>  	 * attached to the hook so we don't waste time allocating
>  	 * memory and locking the socket.
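
(FWIW, the check this comment guards is presumably along these lines,
bailing out before any of that work when nothing is attached:)

	if (__cgroup_bpf_prog_array_is_empty(cgrp, BPF_CGROUP_GETSOCKOPT))
		return retval;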
