lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Message-ID: <ZUp-gYT7OMb9wun3@google.com> Date: Tue, 7 Nov 2023 10:14:25 -0800 From: Stanislav Fomichev <sdf@...gle.com> To: Willem de Bruijn <willemdebruijn.kernel@...il.com> Cc: Mina Almasry <almasrymina@...gle.com>, David Ahern <dsahern@...nel.org>, netdev@...r.kernel.org, linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org, linux-kselftest@...r.kernel.org, linux-media@...r.kernel.org, dri-devel@...ts.freedesktop.org, linaro-mm-sig@...ts.linaro.org, "David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Jesper Dangaard Brouer <hawk@...nel.org>, Ilias Apalodimas <ilias.apalodimas@...aro.org>, Arnd Bergmann <arnd@...db.de>, Shuah Khan <shuah@...nel.org>, Sumit Semwal <sumit.semwal@...aro.org>, "Christian König" <christian.koenig@....com>, Shakeel Butt <shakeelb@...gle.com>, Jeroen de Borst <jeroendb@...gle.com>, Praveen Kaligineedi <pkaligineedi@...gle.com>, Willem de Bruijn <willemb@...gle.com>, Kaiyuan Zhang <kaiyuanz@...gle.com> Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags On 11/07, Willem de Bruijn wrote: > On Tue, Nov 7, 2023 at 12:44 PM Stanislav Fomichev <sdf@...gle.com> wrote: > > > > On 11/06, Willem de Bruijn wrote: > > > > > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that > > > > > > > it somehow implies that I have an option of passing or not passing it > > > > > > > for an individual system call. > > > > > > > If we know that we're going to use dmabuf with the socket, maybe we > > > > > > > should move this flag to the socket() syscall? > > > > > > > > > > > > > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM); > > > > > > > > > > > > > > ? > > > > > > > > > > > > I think it should then be a setsockopt called before any data is > > > > > > exchanged, with no change of modifying mode later. We generally use > > > > > > setsockopts for the mode of a socket. This use of the protocol field > > > > > > in socket() for setting a mode would be novel. Also, it might miss > > > > > > passively opened connections, or be overly restrictive: one approach > > > > > > for all accepted child sockets. > > > > > > > > > > I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There > > > > > are plenty of bits we can grab. But setsockopt works as well! > > > > > > > > To follow up: if we have this flag on a socket, not on a per-message > > > > basis, can we also use recvmsg for the recycling part maybe? > > > > > > > > while (true) { > > > > memset(msg, 0, ...); > > > > > > > > /* receive the tokens */ > > > > ret = recvmsg(fd, &msg, 0); > > > > > > > > /* recycle the tokens from the above recvmsg() */ > > > > ret = recvmsg(fd, &msg, MSG_RECYCLE); > > > > } > > > > > > > > recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg > > > > exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option > > > > to recycle a range. > > > > > > > > Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)? > > > > Or is it more confusing? > > > > > > It would have to be sendmsg, as recvmsg is a copy_to_user operation. > > > > > > > > > I am not aware of any precedent in multiplexing the data stream and a > > > control operation stream in this manner. It would also require adding > > > a branch in the sendmsg hot path. > > > > Is it too much plumbing to copy_from_user msg_control deep in recvmsg > > stack where we need it? Mixing in sendmsg is indeed ugly :-( > > I tried exactly the inverse of that when originally adding > MSG_ZEROCOPY: to allow piggy-backing zerocopy completion notifications > on sendmsg calls by writing to sendmsg msg_control on return to user. > It required significant code churn, which the performance gains did > not warrant. Doing so also breaks the simple rule that recv is for > reading and send is for writing. We're breaking so many rules here, so not sure we should be super constrained :-D > > Regarding hot patch: aren't we already doing copy_to_user for the tokens in > > this hot path, so having one extra condition shouldn't hurt too much? > > We're doing that in the optional cmsg handling of recvmsg, which is > already a slow path (compared to the data read() itself). > > > > The memory is associated with the socket, freed when the socket is > > > closed as well as on SO_DEVMEM_DONTNEED. Fundamentally it is a socket > > > state operation, for which setsockopt is the socket interface. > > > > > > Is your request purely a dislike, or is there some technical concern > > > with BPF and setsockopt? > > > > It's mostly because I've been bitten too much by custom socket options that > > are not really on/off/update-value operations: > > > > 29ebbba7d461 - bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen > > 00e74ae08638 - bpf: Don't EFAULT for getsockopt with optval=NULL > > 9cacf81f8161 - bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE > > d8fe449a9c51 - bpf: Don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE > > > > I do agree that this particular case of SO_DEVMEM_DONTNEED seems ok, but > > things tend to evolve and change. > > I see. I'm a bit concerned if we start limiting what we can do in > sockets because of dependencies that BPF processing places on them. > The use case for BPF [gs]etsockopt is limited to specific control mode > calls. Would it make sense to just exclude calls like > SO_DEVMEM_DONTNEED from this interpositioning? Yup, that's why I'm asking. We already have ->bpf_bypass_getsockopt() to special-case tcp zerocopy. We might add another bpf_bypass_setsockopt to special case SO_DEVMEM_DONTNEED. That's why I'm trying to see if there is a better alternative. > At a high level what we really want is a high rate metadata path from > user to kernel. And there are no perfect solutions. From kernel to > user we use the socket error queue for this. That was never intended > for high event rate itself, dealing with ICMP errors and the like > before timestamps and zerocopy notifications were added. > > If I squint hard enough I can see some prior art in mixing data and > high rate state changes within the same channel in NIC descriptor > queues, where some devices do this, e.g., { "insert encryption key", > "send packet" }. But fundamentally I think we should keep the socket > queues for data only. +1, we keep taking an easy route with using sockopt for this :-( Anyway, let's see if any better suggestions pop up. Worst case - we stick with a socket option and will add a bypass on the bpf side.
Powered by blists - more mailing lists