netdev - Re: [PATCH net-next] tls: Add opt-in zerocopy mode of sendfile()

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d99c36fd-2bd3-acc6-6c37-7eb439b04949@nvidia.com>
Date:   Fri, 29 Apr 2022 17:21:59 +0300
From:   Maxim Mikityanskiy <maximmi@...dia.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     "David S. Miller" <davem@...emloft.net>,
        Daniel Borkmann <daniel@...earbox.net>,
        Paolo Abeni <pabeni@...hat.com>,
        Boris Pismenny <borisp@...dia.com>,
        Tariq Toukan <tariqt@...dia.com>,
        Saeed Mahameed <saeedm@...dia.com>,
        Gal Pressman <gal@...dia.com>, netdev@...r.kernel.org
Subject: Re: [PATCH net-next] tls: Add opt-in zerocopy mode of sendfile()

On 2022-04-29 01:11, Jakub Kicinski wrote:
>> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
>> index b12f81a2b44c..715401b20c8b 100644
>> --- a/net/tls/tls_device.c
>> +++ b/net/tls/tls_device.c
>> @@ -411,10 +411,16 @@ static int tls_device_copy_data(void *addr, size_t bytes, struct iov_iter *i)
>>   	return 0;
>>   }
>>   
>> +union tls_iter_offset {
>> +	struct iov_iter *msg_iter;
>> +	int offset;
>> +};
> 
> Is this sort of atrocity used elsewhere in the kernel?
> If you can't refactor the code you can pack args into
> a structure

What's the point of packing arguments into a struct in this particular 
case? Depending on zc_page, I need either msg_iter or offset, and I'm 
reusing the same CPU register to pass either of them. The performance 
isn't affected, and the amount of memory used is the same. A struct 
won't allow to achieve this, it would force me to drag 8 extra bytes, 
but we already use all 6 registers used to pass parameters on x86_64.

> but I've not seen people cast mutually exclusive
> arguments to a union type.

It's the purpose of a union, to hold one of mutually exclusive values, 
isn't it?

> Is this "inspired" by some higher
> level language?

It's unfortunately inspired by C and its freedom to allow 
microoptimizations/hacks. The hack here is that I use a pointer being 
NULL or not-NULL as an indicator what type the other argument has.

The closest alternative from high-level languages I can think of is 
enums with attached data from rust or swift. However, rust isn't smart 
enough to perform the optimization I described, so no, it's not inspired 
by it :)

Options that I see here:

1. Union.

2. Just pass both parameters and use one of them. Drawbacks: now we have 
7 parameters, one will be passed through the stack, and it's datapath code.

3. Pass `struct iov_iter *` and cast it to `int offset` when zc_page 
isn't NULL. As we are compiling with -fno-strict-aliasing, and int 
shouldn't be bigger than a pointer on all supported architectures, it's 
going to work, and we still have 6 parameters.

4. Combine `int offset` and `int flags` into a single 64-bit parameter.

Which one do you prefer, or do you have anything better in mind?

>>   static int tls_push_data(struct sock *sk,
>> -			 struct iov_iter *msg_iter,
>> +			 union tls_iter_offset iter_offset,
>>   			 size_t size, int flags,
>> -			 unsigned char record_type)
>> +			 unsigned char record_type,
>> +			 struct page *zc_page)
>>   {
>>   	struct tls_context *tls_ctx = tls_get_ctx(sk);
>>   	struct tls_prot_info *prot = &tls_ctx->prot_info;
>> @@ -480,15 +486,29 @@ static int tls_push_data(struct sock *sk,
>>   		}
>>   
>>   		record = ctx->open_record;
>> -		copy = min_t(size_t, size, (pfrag->size - pfrag->offset));
>> -		copy = min_t(size_t, copy, (max_open_record_len - record->len));
>> -
>> -		if (copy) {
>> -			rc = tls_device_copy_data(page_address(pfrag->page) +
>> -						  pfrag->offset, copy, msg_iter);
>> -			if (rc)
>> -				goto handle_error;
>> -			tls_append_frag(record, pfrag, copy);
>> +
>> +		if (!zc_page) {
>> +			copy = min_t(size_t, size, pfrag->size - pfrag->offset);
>> +			copy = min_t(size_t, copy, max_open_record_len - record->len);
> 
> Nope, refactor this please. 95% sure you don't need 4 indentation
> levels here. Space left in record can be calculated before the if,
> then you can do
> 
> if (zc_page) {
> 	..
> } else if (copy) {
> 	..
> }

It'll save one indentation level for the zc_page case, but not for the 
other:

copy = min_t(size_t, size, max_open_record_len - record->len);
if (zc_page && copy) {
     ...
} else {
     // I still have to do this in non-zc case:
     copy = min_t(size_t, copy, pfrag->size - pfrag->offset);
     if (copy) {
         // Same indentation level as in my patch.
         ...
     }
}

Is it good enough?

>> +			if (copy) {
>> +				rc = tls_device_copy_data(page_address(pfrag->page) +
>> +							  pfrag->offset, copy,
>> +							  iter_offset.msg_iter);
>> +				if (rc)
>> +					goto handle_error;
>> +				tls_append_frag(record, pfrag, copy);
>> +			}
>> +		} else {
>> +			copy = min_t(size_t, size, max_open_record_len - record->len);
>> +			if (copy) {
>> +				struct page_frag _pfrag;
> 
> And name your variables right :/
> 
>> +				_pfrag.page = zc_page;
>> +				_pfrag.offset = iter_offset.offset;
>> +				_pfrag.size = copy;
>> +				tls_append_frag(record, &_pfrag, copy);
>> +			}
>>   		}
> 
>>   		size -= copy;
>> @@ -551,8 +571,8 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
>>   			goto out;
>>   	}
>>   
>> -	rc = tls_push_data(sk, &msg->msg_iter, size,
>> -			   msg->msg_flags, record_type);
>> +	rc = tls_push_data(sk, (union tls_iter_offset)&msg->msg_iter, size,
>> +			   msg->msg_flags, record_type, NULL);
>>   
>>   out:
>>   	release_sock(sk);
>> @@ -564,11 +584,14 @@ int tls_device_sendpage(struct sock *sk, struct page *page,
>>   			int offset, size_t size, int flags)
>>   {
>>   	struct tls_context *tls_ctx = tls_get_ctx(sk);
>> +	struct tls_offload_context_tx *ctx;
>>   	struct iov_iter	msg_iter;
>>   	char *kaddr;
>>   	struct kvec iov;
>>   	int rc;
>>   
>> +	ctx = tls_offload_ctx_tx(tls_ctx);
>> +
>>   	if (flags & MSG_SENDPAGE_NOTLAST)
>>   		flags |= MSG_MORE;
>>   
>> @@ -580,12 +603,18 @@ int tls_device_sendpage(struct sock *sk, struct page *page,
>>   		goto out;
>>   	}
>>   
>> +	if (ctx->zerocopy_sendfile) {
>> +		rc = tls_push_data(sk, (union tls_iter_offset)offset, size,
>> +				   flags, TLS_RECORD_TYPE_DATA, page);
>> +		goto out;
>> +	}
>> +
>>   	kaddr = kmap(page);
>>   	iov.iov_base = kaddr + offset;
>>   	iov.iov_len = size;
>>   	iov_iter_kvec(&msg_iter, WRITE, &iov, 1, size);
>> -	rc = tls_push_data(sk, &msg_iter, size,
>> -			   flags, TLS_RECORD_TYPE_DATA);
>> +	rc = tls_push_data(sk, (union tls_iter_offset)&msg_iter, size,
>> +			   flags, TLS_RECORD_TYPE_DATA, NULL);
>>   	kunmap(page);
>>   
>>   out:
>> @@ -659,7 +688,8 @@ static int tls_device_push_pending_record(struct sock *sk, int flags)
>>   	struct iov_iter	msg_iter;
>>   
>>   	iov_iter_kvec(&msg_iter, WRITE, NULL, 0, 0);
>> -	return tls_push_data(sk, &msg_iter, 0, flags, TLS_RECORD_TYPE_DATA);
>> +	return tls_push_data(sk, (union tls_iter_offset)&msg_iter, 0, flags,
>> +			     TLS_RECORD_TYPE_DATA, NULL);
>>   }
>>   
>>   void tls_device_write_space(struct sock *sk, struct tls_context *ctx)
>> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
>> index 7b2b0e7ffee4..8ef86e04f571 100644
>> --- a/net/tls/tls_main.c
>> +++ b/net/tls/tls_main.c
>> @@ -513,6 +513,31 @@ static int do_tls_getsockopt_conf(struct sock *sk, char __user *optval,
>>   	return rc;
>>   }
>>   
>> +static int do_tls_getsockopt_tx_zc(struct sock *sk, char __user *optval,
>> +				   int __user *optlen)
>> +{
>> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
>> +	struct tls_offload_context_tx *ctx;
>> +	int len, value;
>> +
>> +	if (get_user(len, optlen))
>> +		return -EFAULT;
>> +
>> +	if (len != sizeof(value))
> 
> the size check should match the one in setsockopt()
> 
>> +		return -EINVAL;
>> +
>> +	if (!tls_ctx || tls_ctx->tx_conf != TLS_HW)
>> +		return -EBUSY;
> 
> see setsockopt()
> 
>> +	ctx = tls_offload_ctx_tx(tls_ctx);
>> +
>> +	value = ctx->zerocopy_sendfile;
>> +	if (copy_to_user(optval, &value, sizeof(value)))
>> +		return -EFAULT;
>> +
>> +	return 0;
>> +}
> 
>> +static int do_tls_setsockopt_tx_zc(struct sock *sk, sockptr_t optval,
>> +				   unsigned int optlen)
>> +{
>> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
> 
> Highly annoying but each file has its own scheme of naming variables,
> tls_main uses ctx for tls_context.
> 
>> +	struct tls_offload_context_tx *ctx;
>> +	int val;
> 
> unsigned
> 
>> +	if (!tls_ctx || tls_ctx->tx_conf != TLS_HW)
> 
> How's tls_ctx ever gonna be NULL here?

Shouldn't be possible, I'll drop the check.

> We should allow setting the option for non-HW. It's an opt-in, the app
> has opted in, the fact that the kernel will not make use of the liberty
> to apply the optimization is not important, no?

Yes, I agree that if the application opted in, it should work properly 
regardless of whether the optimization actually did turn on. However, 
the indication could be useful, for example, for diagnostic purposes, to 
show the user whether zerocopy mode was enabled, if someone is trying to 
debug some performance issue. If you insist, though, I can make 
setsockopt succeed and getsockopt return 1. What do you think?

>> +		return -EINVAL;
>> +
>> +	if (sockptr_is_null(optval) || optlen < sizeof(val))
>> +		return -EINVAL;
>> +
>> +	if (copy_from_sockptr(&val, optval, sizeof(val)))
>> +		return -EFAULT;
>> +
>> +	ctx = tls_offload_ctx_tx(tls_ctx);
>> +	ctx->zerocopy_sendfile = val;
> 
> if (val > 1)
> 	EINVAL.
> 
>> +	return 0;
>> +}

Thanks for the comments!