[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <87k1ecz1q4.fsf@cloudflare.com>
Date: Mon, 27 May 2019 11:27:31 +0200
From: Jakub Sitnicki <jakub@...udflare.com>
To: John Fastabend <john.fastabend@...il.com>
Cc: Daniel Borkmann <daniel@...earbox.net>, netdev@...r.kernel.org,
Marek Majkowski <marek@...udflare.com>
Subject: Re: [PATCH net] sk_msg: Keep reference on socket file while psock lives
On Fri, May 24, 2019 at 05:51 PM CEST, John Fastabend wrote:
> Jakub Sitnicki wrote:
>>
>> Now that those pesky crashes are gone, we plan to look into drops when
>> doing echo with sockmap. Marek tried running echo-sockmap [1] with
>> latest bpf-next (plus mentioned crash fixes) and reports that not all
>> data bounces back:
>>
>> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
>> 971832
>> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
>> 867352
>> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
>> 952648
>>
>> I'm tring to turn echo-sockmap into a selftest but as you can probably
>> guess over loopback all works fine.
>>
>
> Right, sockmap when used from recvmsg with redirect is lossy. This
> was a design choice I made that apparently caught a few people
> by surprise. The original rationale for this was when doing a
> multiplex operation, e.g. single ingress socket to many egress
> sockets blocking would cause head of line blocking on all
> sockets. To resolve this I simply dropped the packet and then allow
> the flow to continue. This pushes the logic up to the application
> to do retries, etc. when this happens. FWIW userspace proxies I
> tested also had similar points where they fell over and dropped
> packets. In hind sight though it probably would have made more
> sense to make this behavior opt-in vs the default. But, the
> use case I was solving at the time I wrote this could handle
> drops and was actually a NxM sockets with N ingress sockets and M
> egress sockets so head of line blocking was a real problem.
>
> Adding a flag to turn this into a blocking op has been on my
> todo list for awhile. Especially when sockmap is being used as
> a single ingress to single egress socket then blocking vs dropping
> makes much more sense.
>
> The place to look is in sk_psock_verdict_apply() in __SK_REDIRECT
> case there is a few checks and notice we can fallthrough to a
> kfree_skb(skb). This is most likely the drops you are hitting.
> Maybe annotate it with a dbg statement to check.
>
> To fix this we could have a flag to _not_ drop but enqueue the
> packet regardless of the test or hold it until space is
> available. I even think sk_psock_strp_read could push back
> on the stream parser which would eventually push back via TCP
> and get the behavior you want.
>
> Also, I have a couple items on my TODO list that I'll eventually
> get to. First we run without stream parsers in some Cilium
> use cases. I'll push some patches to allow this in the next
> months or so. This avoids the annoying stream parser prog that
> simply returns skb->len. This is mostly an optimizations. A
> larger change I want to make at some point is to remove the
> backlog workqueue altogether. Originally it was added for
> simplicity but actually causes some latency spikes when
> looking at 99+ percentiles. It really doesn't need to be
> there it was a hold over from some original architecture that
> got pushed upstream. If you have time and want to let me know
> if you would like to tackle removing it.
This is great stuff. Thanks for explaining the sockmap's design and
decisions behind it. The opt-in blocking mode idea is spot on.
I imagine we'll get back to sockmap once we have a replacement for
TPROXY figured out (unrelated to sockmap). Let's sync then.
-Jakub
Powered by blists - more mailing lists