[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <64ff278e16f06_2e8f2083a@john.notmuch>
Date: Mon, 11 Sep 2023 07:43:26 -0700
From: John Fastabend <john.fastabend@...il.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>,
John Fastabend <john.fastabend@...il.com>
Cc: bpf <bpf@...r.kernel.org>,
Networking <netdev@...r.kernel.org>,
"davidhwei@...a.com" <davidhwei@...a.com>
Subject: Re: Sockmap's parser/verdict programs and epoll notifications
Andrii Nakryiko wrote:
> On Sun, Jul 16, 2023 at 9:37 PM John Fastabend <john.fastabend@...il.com> wrote:
> >
> > Andrii Nakryiko wrote:
> > > Hey John,
> >
> > Sorry missed this while I was on PTO that week.
>
> yeah, vacations tend to cause missing things :)
>
> >
> > >
> > > We've been recently experimenting with using BPF_SK_SKB_STREAM_PARSER
> > > and BPF_SK_SKB_STREAM_VERDICT with sockmap/sockhash to perform
> > > in-kernel parsing of RSocket frames. A very simple format ([0]) where
> > > the first 3 bytes specify the size of the frame payload. The idea was
> > > to collect the entire frame in the kernel before notifying user-space
> > > that data is available. This is meant to minimize unnecessary wakeups
> > > due to incomplete logical frames, saving CPU.
> >
> > Nice.
> >
> > >
> > > You can find the BPF source code I've used at [1], it has lots of
> > > extra logging and stuff, but the idea is to read the first 3 bytes of
> > > each logical frame, and return the expected full frame size from the
> > > parser program. The verdict program always just returns SK_PASS.
> > >
> > > This seems to work exactly as expected in manual simulations of
> > > various packet size distributions, and even for a bunch of
> > > ping/pong-like benchmark (which are very sensitive to correct frame
> > > length determination, so I'm reasonably confident we don't screw that
> > > up much). And yet, when benchmarking sending multiple logical RPC
> > > streams over the same single socket (so many interleaving RSocket
> > > frames on single socket, but in terms of logical frames nothing should
> > > change), we often see that while full frame hasn't been accumulated in
> > > socket receive buffer yet, epoll_wait() for that socket would return
> > > with success notifying user space that there is data on socket.
> > > Subsequent recvfrom() call would immediately return -EAGAIN and no
> > > data, and our benchmark would go on this loop of useless
> > > epoll_wait()+recvfrom() calls back to back, many times over.
> >
> > Aha yes this sounds bad.
> >
> > >
> > > So I have a few questions:
> > > - is the above use case something that was meant to be handled by
> > > sockmap+parser/verdict?
> >
> > We shouldn't wake up user space if there is nothing to read. So
> > yes this seems like a valid use case to me.
> >
> > > - is it correct to assume that epoll won't wake up until amount of
> > > bytes requested by parser program is accumulated (this seems to be the
> > > case from manually experimenting with various "packet delays");
> >
> > Seems there is some bug that races and causes it to wake up
> > user space. I'm aware of a couple bugs in the stream parser
> > that I wanted to fix. Not sure I can get to them this week
> > but should have time next week. We have a couple more fixes
> > to resolve a few HTTPS server compliance tests as well.
> >
> > > - is there some known bug or race in how sockmap and strparser
> > > framework interacts with epoll subsystem that could cause this weird
> > > epoll_wait() behavior?
> >
> > Yes I know of some races in strparser. I'll elaborate later
> > probably with patches as I don't recall them readily at the
> > moment.
>
> So I missed a good chunk of BPF mailing list traffic while I was on my
> PTO. Did you end up getting to these bugs in strparser logic? Should I
> try running the latest bpf-next/net-next on our production workload to
> see if this is still happening?
You will likely still hit there error I haven't got it out of my queue
yet. I just knocked off a couple things last week so could probably
take a look at flushing my queue this week. Then it would make sense
to retest to see if its something new or not.
I'll at least send an RFC with the idea even if I don't get to testing
it yet.
Thanks,
John
Powered by blists - more mailing lists