netdev - Sockmap's parser/verdict programs and epoll notifications

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAEf4BzYMAAhwscTWWTenvyr-PQ7E5tMg_iqXsPj_dyZEMVCrKg@mail.gmail.com>
Date: Fri, 7 Jul 2023 11:30:30 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: john fastabend <john.fastabend@...il.com>, bpf <bpf@...r.kernel.org>, 
	Networking <netdev@...r.kernel.org>
Cc: "davidhwei@...a.com" <davidhwei@...a.com>
Subject: Sockmap's parser/verdict programs and epoll notifications

Hey John,

We've been recently experimenting with using BPF_SK_SKB_STREAM_PARSER
and BPF_SK_SKB_STREAM_VERDICT with sockmap/sockhash to perform
in-kernel parsing of RSocket frames. A very simple format ([0]) where
the first 3 bytes specify the size of the frame payload. The idea was
to collect the entire frame in the kernel before notifying user-space
that data is available. This is meant to minimize unnecessary wakeups
due to incomplete logical frames, saving CPU.

You can find the BPF source code I've used at [1], it has lots of
extra logging and stuff, but the idea is to read the first 3 bytes of
each logical frame, and return the expected full frame size from the
parser program. The verdict program always just returns SK_PASS.

This seems to work exactly as expected in manual simulations of
various packet size distributions, and even for a bunch of
ping/pong-like benchmark (which are very sensitive to correct frame
length determination, so I'm reasonably confident we don't screw that
up much). And yet, when benchmarking sending multiple logical RPC
streams over the same single socket (so many interleaving RSocket
frames on single socket, but in terms of logical frames nothing should
change), we often see that while full frame hasn't been accumulated in
socket receive buffer yet, epoll_wait() for that socket would return
with success notifying user space that there is data on socket.
Subsequent recvfrom() call would immediately return -EAGAIN and no
data, and our benchmark would go on this loop of useless
epoll_wait()+recvfrom() calls back to back, many times over.

So I have a few questions:
  - is the above use case something that was meant to be handled by
sockmap+parser/verdict?
  - is it correct to assume that epoll won't wake up until amount of
bytes requested by parser program is accumulated (this seems to be the
case from manually experimenting with various "packet delays");
  - is there some known bug or race in how sockmap and strparser
framework interacts with epoll subsystem that could cause this weird
epoll_wait() behavior?

It does seem like some sort of timing issue, but I couldn't pin down
exactly what are the conditions that this happens in. But it's quite
reproducible with a pretty high frequency using our internal benchmark
when multiple logical streams are involved.

Any thoughts or suggestions?

  [0] https://rsocket.io/about/protocol/#framing-format
  [1] https://github.com/anakryiko/libbpf-bootstrap/blob/thrift-coalesce-rcvlowat/examples/c/bootstrap.bpf.c

-- Andrii