netdev - Re: [PATCH net] sk_msg: Keep reference on socket file while psock lives

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <5ce81306aacbe_39402ae86c50a5bc2f@john-XPS-13-9360.notmuch>
Date:   Fri, 24 May 2019 08:51:34 -0700
From:   John Fastabend <john.fastabend@...il.com>
To:     Jakub Sitnicki <jakub@...udflare.com>,
        John Fastabend <john.fastabend@...il.com>
Cc:     Daniel Borkmann <daniel@...earbox.net>, netdev@...r.kernel.org,
        Marek Majkowski <marek@...udflare.com>
Subject: Re: [PATCH net] sk_msg: Keep reference on socket file while psock
 lives

Jakub Sitnicki wrote:
> On Thu, May 23, 2019 at 05:58 PM CEST, John Fastabend wrote:
> > [...]
> >
> >>
> >> Thanks for taking a look at it. Setting MSG_DONTWAIT works great for
> >> me. No more crashes in sk_stream_wait_memory. I've tested it on top of
> >> current bpf-next (f49aa1de9836). Here's my:
> >>
> >>   Tested-by: Jakub Sitnicki <jakub@...udflare.com>
> >>
> >> The actual I've tested is below, for completeness.
> >>
> >> BTW. I've ran into another crash which I haven't seen before while
> >> testing sockmap-echo, but it looks unrelated:
> >>
> >>   https://lore.kernel.org/netdev/20190522100142.28925-1-jakub@cloudflare.com/
> >>
> >> -Jakub
> >>
> >> --- 8< ---
> >>
> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >> index e89be6282693..4a7c656b195b 100644
> >> --- a/net/core/skbuff.c
> >> +++ b/net/core/skbuff.c
> >> @@ -2337,6 +2337,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
> >>                 kv.iov_base = skb->data + offset;
> >>                 kv.iov_len = slen;
> >>                 memset(&msg, 0, sizeof(msg));
> >> +               msg.msg_flags = MSG_DONTWAIT;
> >>
> >>                 ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
> >>                 if (ret <= 0)
> >
> > I went ahead and submitted this feel free to add your signed-off-by.
> 
> Thanks! The fix was all your idea :-)

If I can push the correct patch to Daniel it should be in bpf tree
soon.

> 
> Now that those pesky crashes are gone, we plan to look into drops when
> doing echo with sockmap. Marek tried running echo-sockmap [1] with
> latest bpf-next (plus mentioned crash fixes) and reports that not all
> data bounces back:
> 
> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
> 971832
> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
> 867352
> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
> 952648
> 
> I'm tring to turn echo-sockmap into a selftest but as you can probably
> guess over loopback all works fine.
> 

Right, sockmap when used from recvmsg with redirect is lossy. This
was a design choice I made that apparently caught a few people
by surprise. The original rationale for this was when doing a
multiplex operation, e.g. single ingress socket to many egress
sockets blocking would cause head of line blocking on all
sockets. To resolve this I simply dropped the packet and then allow
the flow to continue. This pushes the logic up to the application
to do retries, etc. when this happens. FWIW userspace proxies I
tested also had similar points where they fell over and dropped
packets. In hind sight though it probably would have made more
sense to make this behavior opt-in vs the default. But, the
use case I was solving at the time I wrote this could handle
drops and was actually a NxM sockets with N ingress sockets and M
egress sockets so head of line blocking was a real problem.

Adding a flag to turn this into a blocking op has been on my
todo list for awhile. Especially when sockmap is being used as
a single ingress to single egress socket then blocking vs dropping
makes much more sense.

The place to look is in sk_psock_verdict_apply() in __SK_REDIRECT
case there is a few checks and notice we can fallthrough to a
kfree_skb(skb). This is most likely the drops you are hitting.
Maybe annotate it with a dbg statement to check.

To fix this we could have a flag to _not_ drop but enqueue the
packet regardless of the test or hold it until space is
available. I even think sk_psock_strp_read could push back
on the stream parser which would eventually push back via TCP
and get the behavior you want.

Also, I have a couple items on my TODO list that I'll eventually
get to. First we run without stream parsers in some Cilium
use cases. I'll push some patches to allow this in the next
months or so. This avoids the annoying stream parser prog that
simply returns skb->len. This is mostly an optimizations. A
larger change I want to make at some point is to remove the
backlog workqueue altogether. Originally it was added for
simplicity but actually causes some latency spikes when
looking at 99+ percentiles. It really doesn't need to be
there it was a hold over from some original architecture that
got pushed upstream. If you have time and want to let me know
if you would like to tackle removing it.

Thanks,
John