[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <db461463-23ac-de03-806b-6ce2b7ea1d6b@nvidia.com>
Date: Tue, 3 May 2022 21:56:48 +0300
From: Maxim Mikityanskiy <maximmi@...dia.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: "David S. Miller" <davem@...emloft.net>,
Daniel Borkmann <daniel@...earbox.net>,
Paolo Abeni <pabeni@...hat.com>,
Boris Pismenny <borisp@...dia.com>,
Tariq Toukan <tariqt@...dia.com>,
Saeed Mahameed <saeedm@...dia.com>,
Gal Pressman <gal@...dia.com>, netdev@...r.kernel.org
Subject: Re: [PATCH net-next] tls: Add opt-in zerocopy mode of sendfile()
On 2022-04-29 22:11, Jakub Kicinski wrote:
> On Fri, 29 Apr 2022 17:21:59 +0300 Maxim Mikityanskiy wrote:
>> On 2022-04-29 01:11, Jakub Kicinski wrote:
>>>> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
>>>> index b12f81a2b44c..715401b20c8b 100644
>>>> --- a/net/tls/tls_device.c
>>>> +++ b/net/tls/tls_device.c
>>>> @@ -411,10 +411,16 @@ static int tls_device_copy_data(void *addr, size_t bytes, struct iov_iter *i)
>>>> return 0;
>>>> }
>>>>
>>>> +union tls_iter_offset {
>>>> + struct iov_iter *msg_iter;
>>>> + int offset;
>>>> +};
>>>
>>> Is this sort of atrocity used elsewhere in the kernel?
>>> If you can't refactor the code you can pack args into
>>> a structure
>>
>> What's the point of packing arguments into a struct in this particular
>> case? Depending on zc_page, I need either msg_iter or offset, and I'm
>> reusing the same CPU register to pass either of them. The performance
>> isn't affected, and the amount of memory used is the same. A struct
>> won't allow to achieve this, it would force me to drag 8 extra bytes,
>> but we already use all 6 registers used to pass parameters on x86_64.
>
> I know why you're doing this, but you're not writing assembly:
>
> + rc = tls_push_data(sk, (union tls_iter_offset)&msg_iter, size,
>
> + return tls_push_data(sk, (union tls_iter_offset)&msg_iter, 0, flags,
>
> even if it's legal C (i.e. not UB) it looks awkward.
It's not UB, cast between a union and its element type is well-defined in C.
>>> but I've not seen people cast mutually exclusive
>>> arguments to a union type.
>>
>> It's the purpose of a union, to hold one of mutually exclusive values,
>> isn't it?
>
> The union itself is not the problem.
>
>>> Is this "inspired" by some higher
>>> level language?
>>
>> It's unfortunately inspired by C and its freedom to allow
>> microoptimizations/hacks. The hack here is that I use a pointer being
>> NULL or not-NULL as an indicator what type the other argument has.
>>
>> The closest alternative from high-level languages I can think of is
>> enums with attached data from rust or swift. However, rust isn't smart
>> enough to perform the optimization I described, so no, it's not inspired
>> by it :)
>>
>> Options that I see here:
>>
>> 1. Union.
>>
>> 2. Just pass both parameters and use one of them. Drawbacks: now we have
>> 7 parameters, one will be passed through the stack, and it's datapath code.
>>
>> 3. Pass `struct iov_iter *` and cast it to `int offset` when zc_page
>> isn't NULL. As we are compiling with -fno-strict-aliasing, and int
>> shouldn't be bigger than a pointer on all supported architectures, it's
>> going to work, and we still have 6 parameters.
>>
>> 4. Combine `int offset` and `int flags` into a single 64-bit parameter.
>>
>> Which one do you prefer, or do you have anything better in mind?
>
> If you declare the union on the stack in the callers, and pass by value
> - is the compiler not going to be clever enough to still DDRT?
Ah, OK, it should do the thing. I thought you wanted me to ditch the
union altogether.
>>>> static int tls_push_data(struct sock *sk,
>>>> - struct iov_iter *msg_iter,
>>>> + union tls_iter_offset iter_offset,
>>>> size_t size, int flags,
>>>> - unsigned char record_type)
>>>> + unsigned char record_type,
>>>> + struct page *zc_page)
>>>> {
>>>> struct tls_context *tls_ctx = tls_get_ctx(sk);
>>>> struct tls_prot_info *prot = &tls_ctx->prot_info;
>>>> @@ -480,15 +486,29 @@ static int tls_push_data(struct sock *sk,
>>>> }
>>>>
>>>> record = ctx->open_record;
>>>> - copy = min_t(size_t, size, (pfrag->size - pfrag->offset));
>>>> - copy = min_t(size_t, copy, (max_open_record_len - record->len));
>>>> -
>>>> - if (copy) {
>>>> - rc = tls_device_copy_data(page_address(pfrag->page) +
>>>> - pfrag->offset, copy, msg_iter);
>>>> - if (rc)
>>>> - goto handle_error;
>>>> - tls_append_frag(record, pfrag, copy);
>>>> +
>>>> + if (!zc_page) {
>>>> + copy = min_t(size_t, size, pfrag->size - pfrag->offset);
>>>> + copy = min_t(size_t, copy, max_open_record_len - record->len);
>>>
>>> Nope, refactor this please. 95% sure you don't need 4 indentation
>>> levels here. Space left in record can be calculated before the if,
>>> then you can do
>>>
>>> if (zc_page) {
>>> ..
>>> } else if (copy) {
>>> ..
>>> }
>>
>> It'll save one indentation level for the zc_page case, but not for the
>> other:
>>
>> copy = min_t(size_t, size, max_open_record_len - record->len);
>> if (zc_page && copy) {
>> ...
>> } else {
>> // I still have to do this in non-zc case:
>> copy = min_t(size_t, copy, pfrag->size - pfrag->offset);
>
> Can pfrag->size - pfrag->offset really be 0 provided
> tls_do_allocation() did not return an error?
No, it can't. Now it makes more sense to me. Thanks for the clarification.
>> if (copy) {
>> // Same indentation level as in my patch.
>> ...
>> }
>> }
>>
>> Is it good enough?
>
>>> We should allow setting the option for non-HW. It's an opt-in, the app
>>> has opted in, the fact that the kernel will not make use of the liberty
>>> to apply the optimization is not important, no?
>>
>> Yes, I agree that if the application opted in, it should work properly
>> regardless of whether the optimization actually did turn on. However,
>> the indication could be useful, for example, for diagnostic purposes, to
>> show the user whether zerocopy mode was enabled, if someone is trying to
>> debug some performance issue. If you insist, though, I can make
>> setsockopt succeed and getsockopt return 1. What do you think?
>
> I'd say "whether the optimization is applicable" rather than "whether
> the optimization is turned on". User can check whether the connection
> is using SW or HW TLS if they want to make sure it's taken advantage of.
>
> Speaking of which, should we report the state of this knob via socket
> diag?
That sounds like an option, I'll take a look. TLS doesn't expose
anything via diag yet, does it? The only option to distinguish SW/HW TLS
is ethtool, and there is no per-socket check, right? Cause a HW TLS
socket can downgrade to SW after tls_device_down, and ethtool won't show it.
Powered by blists - more mailing lists