lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <56624947-a319-4781-a9f7-1cf09ee71d8c@bytedance.com>
Date: Tue, 23 Apr 2024 12:51:59 -0700
From: Zijian Zhang <zijianzhang@...edance.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>, netdev@...r.kernel.org
Cc: edumazet@...gle.com, davem@...emloft.net, kuba@...nel.org,
 cong.wang@...edance.com, xiaochun.lu@...edance.com
Subject: Re: [External] Re: [PATCH net-next v2 2/3] sock: add MSG_ZEROCOPY
 notification mechanism based on msg_control

On 4/21/24 8:45 AM, Willem de Bruijn wrote:
> zijianzhang@ wrote:
>> From: Zijian Zhang <zijianzhang@...edance.com>
>>
>> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
>> However, zerocopy is not a free lunch. Apart from the management of user
>> pages, the combination of poll + recvmsg to receive notifications incurs
>> unignorable overhead in the applications. The overhead of such sometimes
>> might be more than the CPU savings from zerocopy. We try to solve this
>> problem with a new notification mechanism based on msgcontrol.
>> This new mechanism aims to reduce the overhead associated with receiving
>> notifications by embedding them directly into user arguments passed with
>> each sendmsg control message. By doing so, we can significantly reduce
>> the complexity and overhead for managing notifications. In an ideal
>> pattern, the user will keep calling sendmsg with SO_ZC_NOTIFICATION
>> msg_control, and the notification will be delivered as soon as possible.
>>
>> Signed-off-by: Zijian Zhang <zijianzhang@...edance.com>
>> Signed-off-by: Xiaochun Lu <xiaochun.lu@...edance.com>
>> ---
>>   arch/alpha/include/uapi/asm/socket.h  |  2 +
>>   arch/mips/include/uapi/asm/socket.h   |  2 +
>>   arch/parisc/include/uapi/asm/socket.h |  2 +
>>   arch/sparc/include/uapi/asm/socket.h  |  2 +
>>   include/uapi/asm-generic/socket.h     |  2 +
>>   include/uapi/linux/socket.h           | 16 ++++++
>>   net/core/sock.c                       | 70 +++++++++++++++++++++++++++
>>   7 files changed, 96 insertions(+)
>>
>> diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
>> index e94f621903fe..b24622a9cd47 100644
>> --- a/arch/alpha/include/uapi/asm/socket.h
>> +++ b/arch/alpha/include/uapi/asm/socket.h
>> @@ -140,6 +140,8 @@
>>   #define SO_PASSPIDFD		76
>>   #define SO_PEERPIDFD		77
>>   
>> +#define SO_ZC_NOTIFICATION 78
>> +
> 
> SCM_ for cmsgs
> 

Ack.

>>   /*
>>    * Desired design of maximum size and alignment (see RFC2553)
>>    */
>> @@ -35,4 +37,18 @@ struct __kernel_sockaddr_storage {
>>   #define SOCK_TXREHASH_DISABLED	0
>>   #define SOCK_TXREHASH_ENABLED	1
>>   
>> +#define SOCK_ZC_INFO_MAX 256
>> +
>> +struct zc_info_elem {
>> +	__u32 lo;
>> +	__u32 hi;
>> +	__u8 zerocopy;
>> +};
>> +
>> +struct zc_info_usr {
>> +	__u64 usr_addr;
>> +	unsigned int length;
>> +	struct zc_info_elem info[];
>> +};
>> +
> 
> Don't pass a pointer to user memory, just have msg_control point to an
> array of zc_info_elem.
> 

Ack.

>>   #endif /* _UAPI_LINUX_SOCKET_H */
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index fe9195186c13..13f06480f2d8 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -2809,6 +2809,13 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
>>   		     struct sockcm_cookie *sockc)
>>   {
>>   	u32 tsflags;
>> +	int ret, zc_info_size, i = 0;
>> +	unsigned long flags;
>> +	struct sk_buff_head *q, local_q;
>> +	struct sk_buff *skb, *tmp;
>> +	struct sock_exterr_skb *serr;
>> +	struct zc_info_usr *zc_info_usr_p, *zc_info_kern_p;
>> +	void __user	*usr_addr;
> 
> Please wrap the case in parentheses and define variables in that scope
> (Since there are so many variables for this case only.)
> 
>>   
>>   	switch (cmsg->cmsg_type) {
>>   	case SO_MARK:
>> @@ -2842,6 +2849,69 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
>>   	case SCM_RIGHTS:
>>   	case SCM_CREDENTIALS:
>>   		break;
>> +	case SO_ZC_NOTIFICATION:
>> +		if (!sock_flag(sk, SOCK_ZEROCOPY) || sk->sk_family == PF_RDS)
>> +			return -EINVAL;
>> +
> 
> Why allow PF_RDS without the sock flag set?
> 

PF_RDS uses POLLIN instead of POLLERR, thus this mechanism cannot work
for PF_RDS. I am rejecting any PF_RDS socket here.

>> +		zc_info_usr_p = (struct zc_info_usr *)CMSG_DATA(cmsg);
>> +		if (zc_info_usr_p->length <= 0 || zc_info_usr_p->length > SOCK_ZC_INFO_MAX)
>> +			return -EINVAL;
>> +
>> +		zc_info_size = struct_size(zc_info_usr_p, info, zc_info_usr_p->length);
>> +		if (cmsg->cmsg_len != CMSG_LEN(zc_info_size))
>> +			return -EINVAL;
> 
> By passing a straightforward array, the array len can be inferred from
> cmsg_len, simplifying all these checks.
> 
> See for instance how SO_DEVMEM_DONTNEED returns an array of tokens to
> the kernel.
> 

Ack.

>> +
>> +		usr_addr = (void *)(uintptr_t)(zc_info_usr_p->usr_addr);
>> +		if (!access_ok(usr_addr, zc_info_size))
>> +			return -EFAULT;
>> +
>> +		zc_info_kern_p = kmalloc(zc_info_size, GFP_KERNEL);
>> +		if (!zc_info_kern_p)
>> +			return -ENOMEM;
>> +
>> +		q = &sk->sk_error_queue;
>> +		skb_queue_head_init(&local_q);
>> +		spin_lock_irqsave(&q->lock, flags);
>> +		skb = skb_peek(q);
>> +		while (skb && i < zc_info_usr_p->length) {
>> +			struct sk_buff *skb_next = skb_peek_next(skb, q);
>> +
>> +			serr = SKB_EXT_ERR(skb);
>> +			if (serr->ee.ee_errno == 0 &&
>> +			    serr->ee.ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
>> +				zc_info_kern_p->info[i].hi = serr->ee.ee_data;
>> +				zc_info_kern_p->info[i].lo = serr->ee.ee_info;
>> +				zc_info_kern_p->info[i].zerocopy = !(serr->ee.ee_code
>> +								& SO_EE_CODE_ZEROCOPY_COPIED);
>> +				__skb_unlink(skb, q);
>> +				__skb_queue_tail(&local_q, skb);
>> +				i++;
>> +			}
>> +			skb = skb_next;
>> +		}
>> +		spin_unlock_irqrestore(&q->lock, flags);
> 
> In almost all sane cases, all outstanding notifications can be passed
> to userspace.
> 
> It may be interesting to experiment with briefly taking the lock to
> move to a private list. See for instance net_rx_action.
> 

Nice catch.

> Then if userspace cannot handle all notifications, the rest have to be
> spliced back. This can reorder notifications. But rare reordering is
> not a correctness issue.
> 
> I would choose the more complex splice approach only if it shows
> benefit, i.e., if taking the lock does contend with error enqueue
> events.
> 

Maybe when the network is very busy, it will contend with
__msg_zerocopy_callback(where new notifications are added)?
I think splice is a better idea.

>> +
>> +		zc_info_kern_p->usr_addr = zc_info_usr_p->usr_addr;
>> +		zc_info_kern_p->length = i;
>> +
>> +		ret = copy_to_user(usr_addr,
>> +				   zc_info_kern_p,
>> +					struct_size(zc_info_kern_p, info, i));
> 
> You'll still need to support the gnarly MSG_CMSG_COMPAT version too.
> 

Assume users pass in zc_info_elem array pointer here, I may use
in_compat_syscall function to check if it's compat. If so, I can
use compat_ptr to convert it. Is it correct?

> Wait, is this the reason to pass a usr_addr explicitly? To get around
> any compat issues?
> 

Yes, I try to make it u64 regardless of 32-bit or 64-bit program, but
it is indeed ugly though.

> Or even the entire issue of having to copy msg_sys->msg_control to
> user if !msg_control_is_user.
> 
> I suppose this simplifies a lot in terms of development. If making the
> user interface uglier.
> 
> IMHO the sane interface should be used eventually. There may also be
> other uses of passing msg_control data up to userspace from sendmsg.
> 
> But this approach works for now for evaluation and discussion.
> 
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ