Message-ID: <CAHS8izMWbcKSr3uOVWQDmo5=aQvFdcD6o_myz1bw=a1rzrJE_A@mail.gmail.com>
Date: Tue, 5 Nov 2024 13:28:48 -0800
From: Mina Almasry <almasrymina@...gle.com>
To: Stanislav Fomichev <stfomichev@...il.com>
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org, 
	linux-kselftest@...r.kernel.org, "David S. Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	Simon Horman <horms@...nel.org>, Jesper Dangaard Brouer <hawk@...nel.org>, 
	Ilias Apalodimas <ilias.apalodimas@...aro.org>, Shuah Khan <shuah@...nel.org>, 
	Yi Lai <yi1.lai@...ux.intel.com>
Subject: Re: [PATCH net-next v1 6/7] net: fix SO_DEVMEM_DONTNEED looping too long

On Wed, Oct 30, 2024 at 8:07 AM Stanislav Fomichev <stfomichev@...il.com> wrote:
>
> On 10/30, Mina Almasry wrote:
> > On Wed, Oct 30, 2024 at 7:33 AM Stanislav Fomichev <stfomichev@...il.com> wrote:
> > >
> > > On 10/29, Mina Almasry wrote:
> > > > Check we're going to free a reasonable number of frags in token_count
> > > > before starting the loop, to prevent looping too long.
> > > >
> > > > Also minor code cleanups:
> > > > - Flip checks to reduce indentation.
> > > > - Use sizeof(*tokens) everywhere for consistency.
> > > >
> > > > Cc: Yi Lai <yi1.lai@...ux.intel.com>
> > > >
> > > > Signed-off-by: Mina Almasry <almasrymina@...gle.com>
> > > >
> > > > ---
> > > >  net/core/sock.c | 46 ++++++++++++++++++++++++++++------------------
> > > >  1 file changed, 28 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > > index 7f398bd07fb7..8603b8d87f2e 100644
> > > > --- a/net/core/sock.c
> > > > +++ b/net/core/sock.c
> > > > @@ -1047,11 +1047,12 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
> > > >
> > > >  #ifdef CONFIG_PAGE_POOL
> > > >
> > > > -/* This is the number of tokens that the user can SO_DEVMEM_DONTNEED in
> > > > +/* This is the number of frags that the user can SO_DEVMEM_DONTNEED in
> > > >   * 1 syscall. The limit exists to limit the amount of memory the kernel
> > > > - * allocates to copy these tokens.
> > > > + * allocates to copy these tokens, and to prevent looping over the frags for
> > > > + * too long.
> > > >   */
> > > > -#define MAX_DONTNEED_TOKENS 128
> > > > +#define MAX_DONTNEED_FRAGS 1024
> > > >
> > > >  static noinline_for_stack int
> > > >  sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> > > > @@ -1059,43 +1060,52 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> > > >       unsigned int num_tokens, i, j, k, netmem_num = 0;
> > > >       struct dmabuf_token *tokens;
> > > >       netmem_ref netmems[16];
> > > > +     u64 num_frags = 0;
> > > >       int ret = 0;
> > > >
> > > >       if (!sk_is_tcp(sk))
> > > >               return -EBADF;
> > > >
> > > > -     if (optlen % sizeof(struct dmabuf_token) ||
> > > > -         optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
> > > > +     if (optlen % sizeof(*tokens) ||
> > > > +         optlen > sizeof(*tokens) * MAX_DONTNEED_FRAGS)
> > > >               return -EINVAL;
> > > >
> > > > -     tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL);
> > > > +     num_tokens = optlen / sizeof(*tokens);
> > > > +     tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL);
> > > >       if (!tokens)
> > > >               return -ENOMEM;
> > > >
> > > > -     num_tokens = optlen / sizeof(struct dmabuf_token);
> > > >       if (copy_from_sockptr(tokens, optval, optlen)) {
> > > >               kvfree(tokens);
> > > >               return -EFAULT;
> > > >       }
> > > >
> > > > +     for (i = 0; i < num_tokens; i++) {
> > > > +             num_frags += tokens[i].token_count;
> > > > +             if (num_frags > MAX_DONTNEED_FRAGS) {
> > > > +                     kvfree(tokens);
> > > > +                     return -E2BIG;
> > > > +             }
> > > > +     }
> > > > +
> > > >       xa_lock_bh(&sk->sk_user_frags);
> > > >       for (i = 0; i < num_tokens; i++) {
> > > >               for (j = 0; j < tokens[i].token_count; j++) {
> > > >                       netmem_ref netmem = (__force netmem_ref)__xa_erase(
> > > >                               &sk->sk_user_frags, tokens[i].token_start + j);
> > > >
> > > > -                     if (netmem &&
> > > > -                         !WARN_ON_ONCE(!netmem_is_net_iov(netmem))) {
> > > > -                             netmems[netmem_num++] = netmem;
> > > > -                             if (netmem_num == ARRAY_SIZE(netmems)) {
> > > > -                                     xa_unlock_bh(&sk->sk_user_frags);
> > > > -                                     for (k = 0; k < netmem_num; k++)
> > > > -                                             WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
> > > > -                                     netmem_num = 0;
> > > > -                                     xa_lock_bh(&sk->sk_user_frags);
> > > > -                             }
> > > > -                             ret++;
> > >
> > > [..]
> > >
> > > > +                     if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
> > > > +                             continue;
> > >
> > > Any reason we are not returning explicit error to the callers here?
> > > That probably needs some mechanism to signal which particular one failed
> > > so the users can restart?
> >
> > Only because I can't think of a simple way to return an array of frags
> > failed to DONTNEED to the user.
>
> I'd expect the call to return as soon as it hits the invalid frag
> entry (plus the number of entries that it successfully refilled up to
> the invalid one). But too late I guess.
>
> > Also, this error should be extremely rare or never hit really. I don't
> > know how we end up not finding a netmem here or the netmem is page.
> > The only way is if the user is malicious (messing with the token ids
> > passed to the kernel) or if a kernel bug is happening.
>
> I do hit this error with 1500 mtu, so it would've been nice to
> understand which particular token triggered that. It might be
> something buggy on the driver side, I need to investigate. (it's
> super low prio because 1500)
>

Hmm, I've never seen this in production (the code there is similar to
what's upstreamed, but I guess not exactly the same) or in my upstream
ncdevmem testing.

FWIW, leaked frags are extremely bad, because there is no opportunity
to reap them until the entire dmabuf has been rebound. You will need
to root cause this if you're seeing it and are interested in using
devmem TCP in prod.

sk_user_frags is only really touched in:
- sock_devmem_dontneed(), where they are removed from the xarray.
- tcp_recvmsg_dmabuf(), where they are added to the xarray.
- tcp_v4_destroy_sock(), where they are freed (but not removed from
the xarray?!).

The only root causes I can see for this bug are:

1. You're racing tcp_v4_destroy_sock() with sock_devmem_dontneed(), so
somehow you're trying to release a frag already released in that loop?
Or,
2. You're releasing a frag that was never added by
tcp_recvmsg_dmabuf(). I.e. there is a bug in tcp_recvmsg_dmabuf()
where it put_cmsg()s the frag_id to userspace but never adds it to
sk_user_frags. That should be accompanied by an ncdevmem validation
error.

The way to debug #2 is really to test with the ncdevmem validation. I
got the sense from reviewing the test series that you don't like to
use it, but it's how I root cause issues like this, and you should
familiarize yourself with it if you want to do the same. To use it:

client: yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | tr \\n \\0 \
        | head -c 1G | nc <server ip> -p 5224
server: ncdevmem -s <server IP> -c <client IP> -f eth1 -l -p 5224 -v 7

If you see a validation error with your missing frag, send me the
logs, I may be able to guess what's wrong.

> > Also, the information is useless to the user. If the user sees 'frag
> > 128 failed to free', there is nothing really the user can do to
> > recover at runtime. The only usefulness would be for the user to
> > log the error. We already WARN_ON_ONCE on the error, which the user
> > should not be able to trigger.
>
> I'm thinking from the pov of the user application. It might have bugs
> as well and try to refill something that should not have been
> refilled. Having info about which particular token failed (even just
> for logging purposes) would have been nice.

Yeah, it may have been nice. On the flip side, it complicates calling
sock_devmem_dontneed(): the userspace needs to count the freed frags
in its input, remove them, skip the leaked one, and re-call the
syscall. In exchange, the userspace gets to know the id of the frag
that leaked, but the usefulness of that information is slightly
questionable to me. :shrug:


-- 
Thanks,
Mina
