lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <3ce746bd-e1a3-4e3d-bff9-9d692f4d7f20@bytedance.com>
Date: Mon, 3 Jun 2024 18:01:55 -0700
From: Zijian Zhang <zijianzhang@...edance.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>, netdev@...r.kernel.org
Cc: xiaochun.lu@...edance.com, cong.wang@...edance.com
Subject: Re: [External] Re: [PATCH net-next v4 2/3] sock: add MSG_ZEROCOPY
 notification mechanism based on msg_control

On 6/2/24 3:29 PM, Willem de Bruijn wrote:
> On Fri, May 31, 2024 at 7:20 PM Zijian Zhang <zijianzhang@...edance.com> wrote:
>>
>>
>>
>> On 5/28/24 2:21 PM, zijianzhang@...edance.com wrote:
>>> From: Zijian Zhang <zijianzhang@...edance.com>
>>>
>>> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
>>> However, zerocopy is not a free lunch. Apart from the management of user
>>> pages, the combination of poll + recvmsg to receive notifications incurs
>>> unignorable overhead in the applications. The overhead of such sometimes
>>> might be more than the CPU savings from zerocopy. We try to solve this
>>> problem with a new notification mechanism based on msgcontrol.
>>> This new mechanism aims to reduce the overhead associated with receiving
>>> notifications by embedding them directly into user arguments passed with
>>> each sendmsg control message. By doing so, we can significantly reduce
>>> the complexity and overhead for managing notifications. In an ideal
>>> pattern, the user will keep calling sendmsg with SCM_ZC_NOTIFICATION
>>> msg_control, and the notification will be delivered as soon as possible.
>>>
>>> Signed-off-by: Zijian Zhang <zijianzhang@...edance.com>
>>> Signed-off-by: Xiaochun Lu <xiaochun.lu@...edance.com>
>>> ---
>>>    arch/alpha/include/uapi/asm/socket.h  |  2 +
>>>    arch/mips/include/uapi/asm/socket.h   |  2 +
>>>    arch/parisc/include/uapi/asm/socket.h |  2 +
>>>    arch/sparc/include/uapi/asm/socket.h  |  2 +
>>>    include/uapi/asm-generic/socket.h     |  2 +
>>>    include/uapi/linux/socket.h           | 10 ++++
>>>    net/core/sock.c                       | 68 +++++++++++++++++++++++++++
>>>    7 files changed, 88 insertions(+)
>>>
>>> ...
>>> diff --git a/include/uapi/linux/socket.h b/include/uapi/linux/socket.h
>>> index d3fcd3b5ec53..15cec8819f34 100644
>>> --- a/include/uapi/linux/socket.h
>>> +++ b/include/uapi/linux/socket.h
>>> @@ -2,6 +2,8 @@
>>>    #ifndef _UAPI_LINUX_SOCKET_H
>>>    #define _UAPI_LINUX_SOCKET_H
>>>
>>> +#include <linux/types.h>
>>> +
>>>    /*
>>>     * Desired design of maximum size and alignment (see RFC2553)
>>>     */
>>> @@ -35,4 +37,12 @@ struct __kernel_sockaddr_storage {
>>>    #define SOCK_TXREHASH_DISABLED      0
>>>    #define SOCK_TXREHASH_ENABLED       1
>>>
>>> +#define SOCK_ZC_INFO_MAX 128
>>> +
>>> +struct zc_info_elem {
>>> +     __u32 lo;
>>> +     __u32 hi;
>>> +     __u8 zerocopy;
>>> +};
>>> +
>>>    #endif /* _UAPI_LINUX_SOCKET_H */
>>> diff --git a/net/core/sock.c b/net/core/sock.c
>>> index 521e6373d4f7..21239469d75c 100644
>>> --- a/net/core/sock.c
>>> +++ b/net/core/sock.c
>>> @@ -2847,6 +2847,74 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
>>>        case SCM_RIGHTS:
>>>        case SCM_CREDENTIALS:
>>>                break;
>>> +     case SCM_ZC_NOTIFICATION: {
>>> +             int ret, i = 0;
>>> +             int cmsg_data_len, zc_info_elem_num;
>>> +             void __user     *usr_addr;
>>> +             struct zc_info_elem zc_info_kern[SOCK_ZC_INFO_MAX];
>>> +             unsigned long flags;
>>> +             struct sk_buff_head *q, local_q;
>>> +             struct sk_buff *skb, *tmp;
>>> +             struct sock_exterr_skb *serr;
>>> +
>>> +             if (!sock_flag(sk, SOCK_ZEROCOPY) || sk->sk_family == PF_RDS)
>>> +                     return -EINVAL;
>>> +
>>> +             cmsg_data_len = cmsg->cmsg_len - sizeof(struct cmsghdr);
>>> +             if (cmsg_data_len % sizeof(struct zc_info_elem))
>>> +                     return -EINVAL;
>>> +
>>> +             zc_info_elem_num = cmsg_data_len / sizeof(struct zc_info_elem);
>>> +             if (!zc_info_elem_num || zc_info_elem_num > SOCK_ZC_INFO_MAX)
>>> +                     return -EINVAL;
>>> +
>>> +             if (in_compat_syscall())
>>> +                     usr_addr = compat_ptr(*(compat_uptr_t *)CMSG_DATA(cmsg));
>>> +             else
>>> +                     usr_addr = (void __user *)*(void **)CMSG_DATA(cmsg);
>>
>> First of all, thanks for your efforts and time to review this series of
>> patchsets!
> 
> Please try to keep this conversation on the netdev list. What I
> respond below would be good to have in public discourse. Among
> others for others to come disagree with me and tell you that they
> prefer your patchset just the way it is ;-)
> 
>> I believe compat issue has been resolved in this if code block? I know
>> that the current design is quite hacky, and want to discuss next steps
>> with you,
>>
>> 1. Is it possible to change ____sys_sendmsg? So that we can copy
>> msg_controldata back to the user space.
> 
> I do think that this is the clean way to support passing metadata
> up to userspace.
> 
> The current approach looks like a hack to me. It works, but arguably
> is not how you implement a serious user API (ABI).
> 
> A more complete solution can potentially also be reused by other
> features that want to piggy-back information from the kernel back
> up to userspace with sendmsg. Timestamps is an example.
> 
> I did implement the full method once.
> 
> Initially, don't spend time on modifying ___sys_sendmsg *and*
> on supporting the compat version. With the right structs (that are
> not ambiguous between 32 and 64 bit) it may not even be needed.
> 
> But I would be more supportive of this full interface.
> 
> It also ties into the performance benefits observed. If they were
> shockingly good, I would still argue in favor of a cleaner API, to be
> clear. But a minor improvement is even less reason to consider a
> hacky API.
> 
> All of this is just one person's opinion, of course.
> 
>> 2. Is it possible to support this feature in recvmsg instead of
>> sendmsg? In the case of selftest where one thread keeps sending
>> and another keeps recving, this feature is useless. But in other
>> cases where one socket will send and recv, this might help?
> 
> That's a lot easier. And halves the cost of having to calll both
> recvmsg + recvmsg MSG_ERRQUEUE.
> 
>> For sockets that send many times but recv very few times, a hybrid
>> mode of notification using msg_control and errmsg_queue can be used.
>>
>> 3. Or, do you have any idea?
>>

I did not remember the reason why we directly pass in the user address
and have ctl_len to be the array size. Can I pass in a struct like
{
   __user void *user_addr;
   __u32 array_size;
}
and have ctl_len to be the sizeof the struct, just like my first patch
set? Of course, I will consider the compat issue for user_addr this time.

>> If the above solutions are hard to be accepted, I can totally
>> understand, and I will submit the fix to msg_zerocopy selftest only.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ