netdev - Re: [PATCH bpf-next v6 1/3] bpf: remove extra lock_sock for TCP_ZEROCOPY

Message-ID: <CAKH8qBscw4NkOavRZ2nDiB7Yz_BbO5nLwmczkMraMYgrDWWxGg@mail.gmail.com>
Date:   Mon, 11 Jan 2021 10:50:14 -0800
From:   Stanislav Fomichev <sdf@...gle.com>
To:     Martin KaFai Lau <kafai@...com>
Cc:     Netdev <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Song Liu <songliubraving@...com>,
        Eric Dumazet <edumazet@...gle.com>
Subject: Re: [PATCH bpf-next v6 1/3] bpf: remove extra lock_sock for TCP_ZEROCOPY_RECEIVE

On Fri, Jan 8, 2021 at 5:37 PM Martin KaFai Lau <kafai@...com> wrote:
>
> On Fri, Jan 08, 2021 at 01:02:21PM -0800, Stanislav Fomichev wrote:
> > Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
> > We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
> > call in do_tcp_getsockopt using the on-stack data. This removes
> > 3% overhead for locking/unlocking the socket.
> >
> > Without this patch:
> >      3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
> >             |
> >              --3.30%--__cgroup_bpf_run_filter_getsockopt
> >                        |
> >                         --0.81%--__kmalloc
> >
> > With the patch applied:
> >      0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern
> >
> > Signed-off-by: Stanislav Fomichev <sdf@...gle.com>
> > Cc: Martin KaFai Lau <kafai@...com>
> > Cc: Song Liu <songliubraving@...com>
> > Cc: Eric Dumazet <edumazet@...gle.com>
> > ---
> >  include/linux/bpf-cgroup.h                    | 27 +++++++++++--
> >  include/linux/indirect_call_wrapper.h         |  6 +++
> >  include/net/sock.h                            |  2 +
> >  include/net/tcp.h                             |  1 +
> >  kernel/bpf/cgroup.c                           | 38 +++++++++++++++++++
> >  net/ipv4/tcp.c                                | 14 +++++++
> >  net/ipv4/tcp_ipv4.c                           |  1 +
> >  net/ipv6/tcp_ipv6.c                           |  1 +
> >  net/socket.c                                  |  3 ++
> >  .../selftests/bpf/prog_tests/sockopt_sk.c     | 22 +++++++++++
> >  .../testing/selftests/bpf/progs/sockopt_sk.c  | 15 ++++++++
> >  11 files changed, 126 insertions(+), 4 deletions(-)
> >
> [ ... ]
>
> > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > index 6ec088a96302..c41bb2f34013 100644
> > --- a/kernel/bpf/cgroup.c
> > +++ b/kernel/bpf/cgroup.c
> > @@ -1485,6 +1485,44 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
> >       sockopt_free_buf(&ctx);
> >       return ret;
> >  }
> > +
> > +int __cgroup_bpf_run_filter_getsockopt_kern(struct sock *sk, int level,
> > +                                         int optname, void *optval,
> > +                                         int *optlen, int retval)
> > +{
> > +     struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > +     struct bpf_sockopt_kern ctx = {
> > +             .sk = sk,
> > +             .level = level,
> > +             .optname = optname,
> > +             .retval = retval,
> > +             .optlen = *optlen,
> > +             .optval = optval,
> > +             .optval_end = optval + *optlen,
> > +     };
> > +     int ret;
> > +
> The current behavior only passes kernel optval to bpf prog when
> retval == 0.  Can you explain a few words here about
> the difference and why it is fine?
> Just in case some other options may want to reuse the
> __cgroup_bpf_run_filter_getsockopt_kern() in the future.
IIRC, skipping the copy for retval != 0 in
__cgroup_bpf_run_filter_getsockopt is just an optimization.
I was assuming that, on error, the kernel wouldn't copy
anything back to the user (not sure how true that is in real
life). I'll add a comment here to spell out the difference.

> > +     ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT],
> > +                              &ctx, BPF_PROG_RUN);
> > +     if (!ret)
> > +             return -EPERM;
> > +
> > +     if (ctx.optlen > *optlen)
> > +             return -EFAULT;
> > +
> > +     /* BPF programs only allowed to set retval to 0, not some
> > +      * arbitrary value.
> > +      */
> > +     if (ctx.retval != 0 && ctx.retval != retval)
> > +             return -EFAULT;
> > +
> > +     /* BPF programs can shrink the buffer, export the modifications.
> > +      */
> > +     if (ctx.optlen != 0)
> > +             *optlen = ctx.optlen;
> > +
> > +     return ctx.retval;
> > +}
> >  #endif
> >
> >  static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp,
>
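For anyone who wants to reuse the _kern variant for other options, the
checks applied after the programs run can be modelled in userspace roughly
as below. This is only a sketch: struct prog_result and
model_getsockopt_kern() are made-up names standing in for the context left
behind by the BPF programs, and nothing here is kernel code.

```c
#include <errno.h>

/* Userspace model of the post-run checks in the quoted
 * __cgroup_bpf_run_filter_getsockopt_kern(). All names here are
 * hypothetical; this is not kernel code.
 */
struct prog_result {
	int allowed;	/* nonzero if the programs allowed the call */
	int retval;	/* retval as left in the context by the programs */
	int optlen;	/* optlen as left in the context by the programs */
};

static int model_getsockopt_kern(const struct prog_result *res,
				 int *optlen, int retval)
{
	if (!res->allowed)
		return -EPERM;

	/* Programs may shrink the buffer but never grow it. */
	if (res->optlen > *optlen)
		return -EFAULT;

	/* Programs may reset retval to 0 or leave it unchanged. */
	if (res->retval != 0 && res->retval != retval)
		return -EFAULT;

	/* An optlen of 0 leaves the caller's length untouched. */
	if (res->optlen != 0)
		*optlen = res->optlen;

	return res->retval;
}
```

With an original optlen of 16, a result of { allowed = 1, retval = 0,
optlen = 8 } shrinks the length to 8 and returns 0, while a result that
asks for optlen = 32 is rejected with -EFAULT.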
> [ ... ]
>
> > diff --git a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> > index b25c9c45c148..6bb18b1d8578 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/sockopt_sk.c
> > @@ -11,6 +11,7 @@ static int getsetsockopt(void)
> >               char u8[4];
> >               __u32 u32;
> >               char cc[16]; /* TCP_CA_NAME_MAX */
> > +             struct tcp_zerocopy_receive zc;
> I suspect it won't compile at least in my setup.
>
> However, I compile tools/testing/selftests/net/tcp_mmap.c fine though.
> I _guess_ it is because the net's test has included kernel/usr/include.
>
> AFAIK, bpf's tests use tools/include/uapi/.
>
> Others LGTM.
Sure, let me export it to tools/include/uapi. I didn't do it
because it also compiled for me, and I assumed that
tcp_zerocopy_receive was exported long enough ago not to matter (we are
using only the first field anyway, so we don't really need the latest layout).
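
To illustrate the "first field only" point: a reader that touches nothing
but the leading address does not depend on how the struct has grown since.
The sketch below is an assumption-laden illustration, not the selftest
code. struct zc_prefix and zc_read_address() are hypothetical names, and it
assumes the 64-bit address is the first member of struct
tcp_zerocopy_receive.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical prefix of struct tcp_zerocopy_receive: only the first
 * field is declared, on the assumption that later fields are not read.
 */
struct zc_prefix {
	uint64_t address;
};

/* Copy the leading address out of an optval buffer. Returns 0 on
 * success, -1 if the buffer is too short to hold the first field.
 */
static int zc_read_address(const void *optval, size_t optlen, uint64_t *out)
{
	struct zc_prefix p;

	if (optlen < sizeof(p))
		return -1;
	memcpy(&p, optval, sizeof(p));	/* avoids alignment assumptions */
	*out = p.address;
	return 0;
}
```

A buffer of any length of at least 8 bytes works the same way here, which
is why an older header layout would be enough for such a check.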