netdev - Re: [PATCH net-next] tls: Add opt-in zerocopy mode of sendfile()

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <db461463-23ac-de03-806b-6ce2b7ea1d6b@nvidia.com>
Date:   Tue, 3 May 2022 21:56:48 +0300
From:   Maxim Mikityanskiy <maximmi@...dia.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     "David S. Miller" <davem@...emloft.net>,
        Daniel Borkmann <daniel@...earbox.net>,
        Paolo Abeni <pabeni@...hat.com>,
        Boris Pismenny <borisp@...dia.com>,
        Tariq Toukan <tariqt@...dia.com>,
        Saeed Mahameed <saeedm@...dia.com>,
        Gal Pressman <gal@...dia.com>, netdev@...r.kernel.org
Subject: Re: [PATCH net-next] tls: Add opt-in zerocopy mode of sendfile()

On 2022-04-29 22:11, Jakub Kicinski wrote:
> On Fri, 29 Apr 2022 17:21:59 +0300 Maxim Mikityanskiy wrote:
>> On 2022-04-29 01:11, Jakub Kicinski wrote:
>>>> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
>>>> index b12f81a2b44c..715401b20c8b 100644
>>>> --- a/net/tls/tls_device.c
>>>> +++ b/net/tls/tls_device.c
>>>> @@ -411,10 +411,16 @@ static int tls_device_copy_data(void *addr, size_t bytes, struct iov_iter *i)
>>>>    	return 0;
>>>>    }
>>>>    
>>>> +union tls_iter_offset {
>>>> +	struct iov_iter *msg_iter;
>>>> +	int offset;
>>>> +};
>>>
>>> Is this sort of atrocity used elsewhere in the kernel?
>>> If you can't refactor the code you can pack args into
>>> a structure
>>
>> What's the point of packing arguments into a struct in this particular
>> case? Depending on zc_page, I need either msg_iter or offset, and I'm
>> reusing the same CPU register to pass either of them. The performance
>> isn't affected, and the amount of memory used is the same. A struct
>> won't allow to achieve this, it would force me to drag 8 extra bytes,
>> but we already use all 6 registers used to pass parameters on x86_64.
> 
> I know why you're doing this, but you're not writing assembly:
> 
> +	rc = tls_push_data(sk, (union tls_iter_offset)&msg_iter, size,
> 
> +	return tls_push_data(sk, (union tls_iter_offset)&msg_iter, 0, flags,
> 
> even if it's legal C (i.e. not UB) it looks awkward.

It's not UB, cast between a union and its element type is well-defined in C.

>>> but I've not seen people cast mutually exclusive
>>> arguments to a union type.
>>
>> It's the purpose of a union, to hold one of mutually exclusive values,
>> isn't it?
> 
> The union itself is not the problem.
> 
>>> Is this "inspired" by some higher
>>> level language?
>>
>> It's unfortunately inspired by C and its freedom to allow
>> microoptimizations/hacks. The hack here is that I use a pointer being
>> NULL or not-NULL as an indicator what type the other argument has.
>>
>> The closest alternative from high-level languages I can think of is
>> enums with attached data from rust or swift. However, rust isn't smart
>> enough to perform the optimization I described, so no, it's not inspired
>> by it :)
>>
>> Options that I see here:
>>
>> 1. Union.
>>
>> 2. Just pass both parameters and use one of them. Drawbacks: now we have
>> 7 parameters, one will be passed through the stack, and it's datapath code.
>>
>> 3. Pass `struct iov_iter *` and cast it to `int offset` when zc_page
>> isn't NULL. As we are compiling with -fno-strict-aliasing, and int
>> shouldn't be bigger than a pointer on all supported architectures, it's
>> going to work, and we still have 6 parameters.
>>
>> 4. Combine `int offset` and `int flags` into a single 64-bit parameter.
>>
>> Which one do you prefer, or do you have anything better in mind?
> 
> If you declare the union on the stack in the callers, and pass by value
> - is the compiler not going to be clever enough to still DDRT?

Ah, OK, it should do the thing. I thought you wanted me to ditch the 
union altogether.

>>>>    static int tls_push_data(struct sock *sk,
>>>> -			 struct iov_iter *msg_iter,
>>>> +			 union tls_iter_offset iter_offset,
>>>>    			 size_t size, int flags,
>>>> -			 unsigned char record_type)
>>>> +			 unsigned char record_type,
>>>> +			 struct page *zc_page)
>>>>    {
>>>>    	struct tls_context *tls_ctx = tls_get_ctx(sk);
>>>>    	struct tls_prot_info *prot = &tls_ctx->prot_info;
>>>> @@ -480,15 +486,29 @@ static int tls_push_data(struct sock *sk,
>>>>    		}
>>>>    
>>>>    		record = ctx->open_record;
>>>> -		copy = min_t(size_t, size, (pfrag->size - pfrag->offset));
>>>> -		copy = min_t(size_t, copy, (max_open_record_len - record->len));
>>>> -
>>>> -		if (copy) {
>>>> -			rc = tls_device_copy_data(page_address(pfrag->page) +
>>>> -						  pfrag->offset, copy, msg_iter);
>>>> -			if (rc)
>>>> -				goto handle_error;
>>>> -			tls_append_frag(record, pfrag, copy);
>>>> +
>>>> +		if (!zc_page) {
>>>> +			copy = min_t(size_t, size, pfrag->size - pfrag->offset);
>>>> +			copy = min_t(size_t, copy, max_open_record_len - record->len);
>>>
>>> Nope, refactor this please. 95% sure you don't need 4 indentation
>>> levels here. Space left in record can be calculated before the if,
>>> then you can do
>>>
>>> if (zc_page) {
>>> 	..
>>> } else if (copy) {
>>> 	..
>>> }
>>
>> It'll save one indentation level for the zc_page case, but not for the
>> other:
>>
>> copy = min_t(size_t, size, max_open_record_len - record->len);
>> if (zc_page && copy) {
>>       ...
>> } else {
>>       // I still have to do this in non-zc case:
>>       copy = min_t(size_t, copy, pfrag->size - pfrag->offset);
> 
> Can pfrag->size - pfrag->offset really be 0 provided
> tls_do_allocation() did not return an error?

No, it can't. Now it makes more sense to me. Thanks for the clarification.

>>       if (copy) {
>>           // Same indentation level as in my patch.
>>           ...
>>       }
>> }
>>
>> Is it good enough?
> 
>>> We should allow setting the option for non-HW. It's an opt-in, the app
>>> has opted in, the fact that the kernel will not make use of the liberty
>>> to apply the optimization is not important, no?
>>
>> Yes, I agree that if the application opted in, it should work properly
>> regardless of whether the optimization actually did turn on. However,
>> the indication could be useful, for example, for diagnostic purposes, to
>> show the user whether zerocopy mode was enabled, if someone is trying to
>> debug some performance issue. If you insist, though, I can make
>> setsockopt succeed and getsockopt return 1. What do you think?
> 
> I'd say "whether the optimization is applicable" rather than "whether
> the optimization is turned on". User can check whether the connection
> is using SW or HW TLS if they want to make sure it's taken advantage of.
> 
> Speaking of which, should we report the state of this knob via socket
> diag?

That sounds like an option, I'll take a look. TLS doesn't expose 
anything via diag yet, does it? The only option to distinguish SW/HW TLS 
is ethtool, and there is no per-socket check, right? Cause a HW TLS 
socket can downgrade to SW after tls_device_down, and ethtool won't show it.