[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <56624947-a319-4781-a9f7-1cf09ee71d8c@bytedance.com>
Date: Tue, 23 Apr 2024 12:51:59 -0700
From: Zijian Zhang <zijianzhang@...edance.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>, netdev@...r.kernel.org
Cc: edumazet@...gle.com, davem@...emloft.net, kuba@...nel.org,
cong.wang@...edance.com, xiaochun.lu@...edance.com
Subject: Re: [External] Re: [PATCH net-next v2 2/3] sock: add MSG_ZEROCOPY
notification mechanism based on msg_control
On 4/21/24 8:45 AM, Willem de Bruijn wrote:
> zijianzhang@ wrote:
>> From: Zijian Zhang <zijianzhang@...edance.com>
>>
>> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
>> However, zerocopy is not a free lunch. Apart from the management of user
>> pages, the combination of poll + recvmsg to receive notifications incurs
>> unignorable overhead in the applications. The overhead of such sometimes
>> might be more than the CPU savings from zerocopy. We try to solve this
>> problem with a new notification mechanism based on msgcontrol.
>> This new mechanism aims to reduce the overhead associated with receiving
>> notifications by embedding them directly into user arguments passed with
>> each sendmsg control message. By doing so, we can significantly reduce
>> the complexity and overhead for managing notifications. In an ideal
>> pattern, the user will keep calling sendmsg with SO_ZC_NOTIFICATION
>> msg_control, and the notification will be delivered as soon as possible.
>>
>> Signed-off-by: Zijian Zhang <zijianzhang@...edance.com>
>> Signed-off-by: Xiaochun Lu <xiaochun.lu@...edance.com>
>> ---
>> arch/alpha/include/uapi/asm/socket.h | 2 +
>> arch/mips/include/uapi/asm/socket.h | 2 +
>> arch/parisc/include/uapi/asm/socket.h | 2 +
>> arch/sparc/include/uapi/asm/socket.h | 2 +
>> include/uapi/asm-generic/socket.h | 2 +
>> include/uapi/linux/socket.h | 16 ++++++
>> net/core/sock.c | 70 +++++++++++++++++++++++++++
>> 7 files changed, 96 insertions(+)
>>
>> diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
>> index e94f621903fe..b24622a9cd47 100644
>> --- a/arch/alpha/include/uapi/asm/socket.h
>> +++ b/arch/alpha/include/uapi/asm/socket.h
>> @@ -140,6 +140,8 @@
>> #define SO_PASSPIDFD 76
>> #define SO_PEERPIDFD 77
>>
>> +#define SO_ZC_NOTIFICATION 78
>> +
>
> SCM_ for cmsgs
>
Ack.
>> /*
>> * Desired design of maximum size and alignment (see RFC2553)
>> */
>> @@ -35,4 +37,18 @@ struct __kernel_sockaddr_storage {
>> #define SOCK_TXREHASH_DISABLED 0
>> #define SOCK_TXREHASH_ENABLED 1
>>
>> +#define SOCK_ZC_INFO_MAX 256
>> +
>> +struct zc_info_elem {
>> + __u32 lo;
>> + __u32 hi;
>> + __u8 zerocopy;
>> +};
>> +
>> +struct zc_info_usr {
>> + __u64 usr_addr;
>> + unsigned int length;
>> + struct zc_info_elem info[];
>> +};
>> +
>
> Don't pass a pointer to user memory, just have msg_control point to an
> array of zc_info_elem.
>
Ack.
>> #endif /* _UAPI_LINUX_SOCKET_H */
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index fe9195186c13..13f06480f2d8 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -2809,6 +2809,13 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
>> struct sockcm_cookie *sockc)
>> {
>> u32 tsflags;
>> + int ret, zc_info_size, i = 0;
>> + unsigned long flags;
>> + struct sk_buff_head *q, local_q;
>> + struct sk_buff *skb, *tmp;
>> + struct sock_exterr_skb *serr;
>> + struct zc_info_usr *zc_info_usr_p, *zc_info_kern_p;
>> + void __user *usr_addr;
>
> Please wrap the case in parentheses and define variables in that scope
> (Since there are so many variables for this case only.)
>
>>
>> switch (cmsg->cmsg_type) {
>> case SO_MARK:
>> @@ -2842,6 +2849,69 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
>> case SCM_RIGHTS:
>> case SCM_CREDENTIALS:
>> break;
>> + case SO_ZC_NOTIFICATION:
>> + if (!sock_flag(sk, SOCK_ZEROCOPY) || sk->sk_family == PF_RDS)
>> + return -EINVAL;
>> +
>
> Why allow PF_RDS without the sock flag set?
>
PF_RDS uses POLLIN instead of POLLERR, thus this mechanism cannot work
for PF_RDS. I am rejecting any PF_RDS socket here.
>> + zc_info_usr_p = (struct zc_info_usr *)CMSG_DATA(cmsg);
>> + if (zc_info_usr_p->length <= 0 || zc_info_usr_p->length > SOCK_ZC_INFO_MAX)
>> + return -EINVAL;
>> +
>> + zc_info_size = struct_size(zc_info_usr_p, info, zc_info_usr_p->length);
>> + if (cmsg->cmsg_len != CMSG_LEN(zc_info_size))
>> + return -EINVAL;
>
> By passing a straightforward array, the array len can be inferred from
> cmsg_len, simplifying all these checks.
>
> See for instance how SO_DEVMEM_DONTNEED returns an array of tokens to
> the kernel.
>
Ack.
>> +
>> + usr_addr = (void *)(uintptr_t)(zc_info_usr_p->usr_addr);
>> + if (!access_ok(usr_addr, zc_info_size))
>> + return -EFAULT;
>> +
>> + zc_info_kern_p = kmalloc(zc_info_size, GFP_KERNEL);
>> + if (!zc_info_kern_p)
>> + return -ENOMEM;
>> +
>> + q = &sk->sk_error_queue;
>> + skb_queue_head_init(&local_q);
>> + spin_lock_irqsave(&q->lock, flags);
>> + skb = skb_peek(q);
>> + while (skb && i < zc_info_usr_p->length) {
>> + struct sk_buff *skb_next = skb_peek_next(skb, q);
>> +
>> + serr = SKB_EXT_ERR(skb);
>> + if (serr->ee.ee_errno == 0 &&
>> + serr->ee.ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
>> + zc_info_kern_p->info[i].hi = serr->ee.ee_data;
>> + zc_info_kern_p->info[i].lo = serr->ee.ee_info;
>> + zc_info_kern_p->info[i].zerocopy = !(serr->ee.ee_code
>> + & SO_EE_CODE_ZEROCOPY_COPIED);
>> + __skb_unlink(skb, q);
>> + __skb_queue_tail(&local_q, skb);
>> + i++;
>> + }
>> + skb = skb_next;
>> + }
>> + spin_unlock_irqrestore(&q->lock, flags);
>
> In almost all sane cases, all outstanding notifications can be passed
> to userspace.
>
> It may be interesting to experiment with briefly taking the lock to
> move to a private list. See for instance net_rx_action.
>
Nice catch.
> Then if userspace cannot handle all notifications, the rest have to be
> spliced back. This can reorder notifications. But rare reordering is
> not a correctness issue.
>
> I would choose the more complex splice approach only if it shows
> benefit, i.e., if taking the lock does contend with error enqueue
> events.
>
Maybe when the network is very busy, it will contend with
__msg_zerocopy_callback(where new notifications are added)?
I think splice is a better idea.
>> +
>> + zc_info_kern_p->usr_addr = zc_info_usr_p->usr_addr;
>> + zc_info_kern_p->length = i;
>> +
>> + ret = copy_to_user(usr_addr,
>> + zc_info_kern_p,
>> + struct_size(zc_info_kern_p, info, i));
>
> You'll still need to support the gnarly MSG_CMSG_COMPAT version too.
>
Assume users pass in zc_info_elem array pointer here, I may use
in_compat_syscall function to check if it's compat. If so, I can
use compat_ptr to convert it. Is it correct?
> Wait, is this the reason to pass a usr_addr explicitly? To get around
> any compat issues?
>
Yes, I try to make it u64 regardless of 32-bit or 64-bit program, but
it is indeed ugly though.
> Or even the entire issue of having to copy msg_sys->msg_control to
> user if !msg_control_is_user.
>
> I suppose this simplifies a lot in terms of development. If making the
> user interface uglier.
>
> IMHO the sane interface should be used eventually. There may also be
> other uses of passing msg_control data up to userspace from sendmsg.
>
> But this approach works for now for evaluation and discussion.
>
>
Powered by blists - more mailing lists