[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <3ce746bd-e1a3-4e3d-bff9-9d692f4d7f20@bytedance.com>
Date: Mon, 3 Jun 2024 18:01:55 -0700
From: Zijian Zhang <zijianzhang@...edance.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>, netdev@...r.kernel.org
Cc: xiaochun.lu@...edance.com, cong.wang@...edance.com
Subject: Re: [External] Re: [PATCH net-next v4 2/3] sock: add MSG_ZEROCOPY
notification mechanism based on msg_control
On 6/2/24 3:29 PM, Willem de Bruijn wrote:
> On Fri, May 31, 2024 at 7:20 PM Zijian Zhang <zijianzhang@...edance.com> wrote:
>>
>>
>>
>> On 5/28/24 2:21 PM, zijianzhang@...edance.com wrote:
>>> From: Zijian Zhang <zijianzhang@...edance.com>
>>>
>>> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
>>> However, zerocopy is not a free lunch. Apart from the management of user
>>> pages, the combination of poll + recvmsg to receive notifications incurs
>>> unignorable overhead in the applications. The overhead of such sometimes
>>> might be more than the CPU savings from zerocopy. We try to solve this
>>> problem with a new notification mechanism based on msgcontrol.
>>> This new mechanism aims to reduce the overhead associated with receiving
>>> notifications by embedding them directly into user arguments passed with
>>> each sendmsg control message. By doing so, we can significantly reduce
>>> the complexity and overhead for managing notifications. In an ideal
>>> pattern, the user will keep calling sendmsg with SCM_ZC_NOTIFICATION
>>> msg_control, and the notification will be delivered as soon as possible.
>>>
>>> Signed-off-by: Zijian Zhang <zijianzhang@...edance.com>
>>> Signed-off-by: Xiaochun Lu <xiaochun.lu@...edance.com>
>>> ---
>>> arch/alpha/include/uapi/asm/socket.h | 2 +
>>> arch/mips/include/uapi/asm/socket.h | 2 +
>>> arch/parisc/include/uapi/asm/socket.h | 2 +
>>> arch/sparc/include/uapi/asm/socket.h | 2 +
>>> include/uapi/asm-generic/socket.h | 2 +
>>> include/uapi/linux/socket.h | 10 ++++
>>> net/core/sock.c | 68 +++++++++++++++++++++++++++
>>> 7 files changed, 88 insertions(+)
>>>
>>> ...
>>> diff --git a/include/uapi/linux/socket.h b/include/uapi/linux/socket.h
>>> index d3fcd3b5ec53..15cec8819f34 100644
>>> --- a/include/uapi/linux/socket.h
>>> +++ b/include/uapi/linux/socket.h
>>> @@ -2,6 +2,8 @@
>>> #ifndef _UAPI_LINUX_SOCKET_H
>>> #define _UAPI_LINUX_SOCKET_H
>>>
>>> +#include <linux/types.h>
>>> +
>>> /*
>>> * Desired design of maximum size and alignment (see RFC2553)
>>> */
>>> @@ -35,4 +37,12 @@ struct __kernel_sockaddr_storage {
>>> #define SOCK_TXREHASH_DISABLED 0
>>> #define SOCK_TXREHASH_ENABLED 1
>>>
>>> +#define SOCK_ZC_INFO_MAX 128
>>> +
>>> +struct zc_info_elem {
>>> + __u32 lo;
>>> + __u32 hi;
>>> + __u8 zerocopy;
>>> +};
>>> +
>>> #endif /* _UAPI_LINUX_SOCKET_H */
>>> diff --git a/net/core/sock.c b/net/core/sock.c
>>> index 521e6373d4f7..21239469d75c 100644
>>> --- a/net/core/sock.c
>>> +++ b/net/core/sock.c
>>> @@ -2847,6 +2847,74 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
>>> case SCM_RIGHTS:
>>> case SCM_CREDENTIALS:
>>> break;
>>> + case SCM_ZC_NOTIFICATION: {
>>> + int ret, i = 0;
>>> + int cmsg_data_len, zc_info_elem_num;
>>> + void __user *usr_addr;
>>> + struct zc_info_elem zc_info_kern[SOCK_ZC_INFO_MAX];
>>> + unsigned long flags;
>>> + struct sk_buff_head *q, local_q;
>>> + struct sk_buff *skb, *tmp;
>>> + struct sock_exterr_skb *serr;
>>> +
>>> + if (!sock_flag(sk, SOCK_ZEROCOPY) || sk->sk_family == PF_RDS)
>>> + return -EINVAL;
>>> +
>>> + cmsg_data_len = cmsg->cmsg_len - sizeof(struct cmsghdr);
>>> + if (cmsg_data_len % sizeof(struct zc_info_elem))
>>> + return -EINVAL;
>>> +
>>> + zc_info_elem_num = cmsg_data_len / sizeof(struct zc_info_elem);
>>> + if (!zc_info_elem_num || zc_info_elem_num > SOCK_ZC_INFO_MAX)
>>> + return -EINVAL;
>>> +
>>> + if (in_compat_syscall())
>>> + usr_addr = compat_ptr(*(compat_uptr_t *)CMSG_DATA(cmsg));
>>> + else
>>> + usr_addr = (void __user *)*(void **)CMSG_DATA(cmsg);
>>
>> First of all, thanks for your efforts and time to review this series of
>> patchsets!
>
> Please try to keep this conversation on the netdev list. What I
> respond below would be good to have in public discourse. Among
> others for others to come disagree with me and tell you that they
> prefer your patchset just the way it is ;-)
>
>> I believe compat issue has been resolved in this if code block? I know
>> that the current design is quite hacky, and want to discuss next steps
>> with you,
>>
>> 1. Is it possible to change ____sys_sendmsg? So that we can copy
>> msg_controldata back to the user space.
>
> I do think that this is the clean way to support passing metadata
> up to userspace.
>
> The current approach looks like a hack to me. It works, but arguably
> is not how you implement a serious user API (ABI).
>
> A more complete solution can potentially also be reused by other
> features that want to piggy-back information from the kernel back
> up to userspace with sendmsg. Timestamps is an example.
>
> I did implement the full method once.
>
> Initially, don't spend time on modifying ___sys_sendmsg *and*
> on supporting the compat version. With the right structs (that are
> not ambiguous between 32 and 64 bit) it may not even be needed.
>
> But I would be more supportive of this full interface.
>
> It also ties into the performance benefits observed. If they were
> shockingly good, I would still argue in favor of a cleaner API, to be
> clear. But a minor improvement is even less reason to consider a
> hacky API.
>
> All of this is just one person's opinion, of course.
>
>> 2. Is it possible to support this feature in recvmsg instead of
>> sendmsg? In the case of selftest where one thread keeps sending
>> and another keeps recving, this feature is useless. But in other
>> cases where one socket will send and recv, this might help?
>
> That's a lot easier. And halves the cost of having to calll both
> recvmsg + recvmsg MSG_ERRQUEUE.
>
>> For sockets that send many times but recv very few times, a hybrid
>> mode of notification using msg_control and errmsg_queue can be used.
>>
>> 3. Or, do you have any idea?
>>
I did not remember the reason why we directly pass in the user address
and have ctl_len to be the array size. Can I pass in a struct like
{
__user void *user_addr;
__u32 array_size;
}
and have ctl_len to be the sizeof the struct, just like my first patch
set? Of course, I will consider the compat issue for user_addr this time.
>> If the above solutions are hard to be accepted, I can totally
>> understand, and I will submit the fix to msg_zerocopy selftest only.
Powered by blists - more mailing lists