Date:   Tue, 28 Feb 2023 16:48:38 +0100
From:   Hannes Reinecke <hare@...e.de>
To:     Chuck Lever III <chuck.lever@...cle.com>
Cc:     Chuck Lever <cel@...nel.org>, "kuba@...nel.org" <kuba@...nel.org>,
        "pabeni@...hat.com" <pabeni@...hat.com>,
        "edumazet@...gle.com" <edumazet@...gle.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "kernel-tls-handshake@...ts.linux.dev" 
        <kernel-tls-handshake@...ts.linux.dev>
Subject: Re: [PATCH v5 1/2] net/handshake: Create a NETLINK service for
 handling handshake requests

On 2/28/23 15:28, Chuck Lever III wrote:
> 
> 
>> On Feb 28, 2023, at 1:58 AM, Hannes Reinecke <hare@...e.de> wrote:
>>
>> On 2/27/23 19:10, Chuck Lever III wrote:
>>>> On Feb 27, 2023, at 12:21 PM, Hannes Reinecke <hare@...e.de> wrote:
>>>>
>>>>> On 2/27/23 16:39, Chuck Lever III wrote:
>>>>>> On Feb 27, 2023, at 10:14 AM, Hannes Reinecke <hare@...e.de> wrote:
>>>>>>
>>>>>> Problem here is with using different key materials.
>>>>>> As the current handshake can only deal with one key at a time
>>>>>> the only chance we have for several possible keys is to retry
>>>>>> the handshake with the next key.
>>>>>> But out of necessity we have to use the _same_ connection
>>>>>> (as tlshd doesn't control the socket). So we cannot close
>>>>>> the socket, and hence we can't notify userspace to give up the handshake attempt.
>>>>>> Being able to send a signal would be simple: send SIGHUP to userspace and wait for the 'done' call.
>>>>>> If it doesn't come we can terminate all attempts.
>>>>>> But if we get the 'done' call we know it's safe to start with the next attempt.
>>>>> We solve this problem by enabling the kernel to provide all those
>>>>> materials to tlshd in one go.
>>>> Ah. Right, that would work, too; provide all possible keys to the
>>>> 'accept' call and let the userspace agent figure out what to do with
>>>> them. That certainly makes life easier for the kernel side.
>>>>
>>>>> I don't think there's a "retry" situation here. Once the handshake
>>>>> has failed, the client peer has to know to try again. That would
>>>>> mean retrying would have to be part of the upper layer protocol.
>>>>> Does an NVMe initiator know it has to drive another handshake if
>>>>> the first one fails, or does it rely on the handshake itself to
>>>>> try all available identities?
>>>>> We don't have a choice but to provide all the keys at once and
>>>>> let the handshake negotiation deal with it.
>>>>> I'm working on DONE passing multiple remote peer IDs back to the
>>>>> kernel now. I don't see why ACCEPT couldn't pass multiple peer IDs
>>>>> the other way.
>>>> Nope. That's not required.
>>>> DONE can only ever have one peer id (TLS 1.3 specifies that the client
>>>> sends a list of identities, the server picks one, and sends that one back
>>>> to the client). So for DONE we will only ever have 1 peer ID.
>>>> If we allow for several peer IDs to be present in the client ACCEPT message
>>>> then we'd need to include the resulting peer ID in the client DONE, too;
>>>> otherwise we'll need it for the server DONE only.
>>>>
>>>> So all in all I think we should be going with the multiple IDs in the
>>>> ACCEPT call (i.e. move the key id from being part of the message into an
>>>> attribute), and have a peer id present in the DONE call for both versions,
>>>> server and client.
>>> To summarize:
>>> ---
>>> The ACCEPT request (from tlshd) would have just the handler class
>>> "Which handler is responding". The kernel uses that to find a
>>> handshake request waiting for that type of handler. In our case,
>>> "tlshd".
>>> The ACCEPT response (from the kernel) would have the socket fd,
>>> the handshake parameters, and zero or more peer ID key serial
>>> numbers. (Today, just zero or one peer IDs).
>>>
>>> There is also an errno status in the ACCEPT response, which
>>> the kernel can use to indicate things like "no requests in that
>>> class were found" or that the request was otherwise improperly
>>> formed.
>>> ---
>>> The DONE request (from tlshd) would have the socket fd (and
>>> implicitly, the handler's PID), the session status, and zero
>>> or one remote peer ID key serial numbers.
>>>
>>> The DONE response (from the kernel) is an ACK. (Today it's
>>> more than that, but that's broken and will be removed).
>>> ---
>>> For the DONE request, the session status is one of:
>>> 0: session established -- see @peerid for authentication status
>>> EIO: local error
>>> EACCES: handshake rejected
>>> For server handshake completion:
>>> @peerid contains the remote peer ID if the session was
>>> authenticated, or TLS_NO_PEERID if the session was not
>>> authenticated.
>>> status == EACCES if authentication material was present from
>>> both peers but verification failed.
>>> For client handshake completion:
>>> @peerid contains the remote peer ID if authentication was
>>> requested and the session was authenticated
>>> status == EACCES if authentication was requested and the
>>> session was not authenticated, or if verification failed.
>>> (Maybe client could work like the server side, and the
>>> kernel consumer would need to figure out if it cares
>>> whether there was authentication).
>> Yes, that would be my preference. Always return @peerid
>> for DONE if the TLS session was established.
> 
> You mean if the TLS session was authenticated. The server
> won't receive a remote peer identity if the client peer
> doesn't authenticate.
> 
Ah, yes, forgot about that.
(PSK always 'authenticates', as the identity is the one used to
find the appropriate PSK ...)
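
Roughly, in tlshd's gnutls PSK server callback it would look something
like this (names and the "user" key type below are made up for
illustration, not necessarily how tlshd actually does it):

#include <gnutls/gnutls.h>
#include <keyutils.h>
#include <string.h>
#include <stdlib.h>

/* hypothetical: where tlshd would remember the peer ID for DONE */
static key_serial_t tlshd_peerid;

/*
 * gnutls calls this on the server side with the identity the client
 * sent; looking the PSK up by that identity is what makes the PSK
 * case "always authenticated" -- the identity *is* the peer ID.
 */
static int psk_server_cb(gnutls_session_t session, const char *username,
			 gnutls_datum_t *key)
{
	key_serial_t serial;
	void *payload;
	long len;

	serial = keyctl_search(KEY_SPEC_SESSION_KEYRING, "user", username, 0);
	if (serial < 0)
		return -1;		/* unknown identity: handshake fails */

	len = keyctl_read_alloc(serial, &payload);
	if (len < 0)
		return -1;

	key->data = gnutls_malloc(len);
	if (!key->data) {
		free(payload);
		return -1;
	}
	memcpy(key->data, payload, len);
	key->size = len;
	free(payload);

	tlshd_peerid = serial;		/* reported as @peerid in DONE */
	return 0;
}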

> 
>> We might also consider returning @peerid with EACCES
>> to indicate the offending ID.
> 
> I'll look into that.
> 
> 
>>> Is that adequate?
>> Yes, it is.
> 
> What about the narrow set of DONE status values? You've
> recently wanted to add ENOMEM, ENOKEY, and EINVAL to
> this set. My experience is that these status values are
> nearly always obscured before they can get back to the
> requesting user.
> 
> Can the kernel make use of ENOMEM, for example? It might
> be able to retry, I suppose... retrying is not sensible
> for the server side.
> 
The usual problem: retry or no retry.
Sadly, error numbers are not a good indicator for that.
Maybe we should take the NVMe approach and add a _different_
attribute indicating whether this particular error status
should be retried.
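
Purely as a sketch of that idea (all names below are made up, not the
actual net/handshake code), the DONE message would carry an explicit
'retryable' flag next to the status instead of overloading the errno:

#include <net/netlink.h>

enum {
	HANDSHAKE_A_DONE_STATUS = 1,	/* u32: 0, EIO, EACCES, ... */
	HANDSHAKE_A_DONE_RETRYABLE,	/* flag: a retry might succeed */
	HANDSHAKE_A_DONE_REMOTE_PEERID,	/* u32: key serial of the remote peer */
};

struct handshake_req;			/* hypothetical request object */
void handshake_requeue(struct handshake_req *req);
void handshake_finish(struct handshake_req *req, int status);

/*
 * Kernel side of DONE: the retry decision is driven by the explicit
 * flag, not by guessing from the errno value.
 */
static void handshake_done(struct handshake_req *req, struct nlattr **tb)
{
	u32 status = 0;

	if (tb[HANDSHAKE_A_DONE_STATUS])
		status = nla_get_u32(tb[HANDSHAKE_A_DONE_STATUS]);

	if (status && tb[HANDSHAKE_A_DONE_RETRYABLE])
		handshake_requeue(req);		/* try the handshake again */
	else
		handshake_finish(req, status);	/* hand the result to the consumer */
}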

> 
>> So the only bone of contention is the timeout; as we won't
>> be implementing signals I still think that we should have
>> a 'timeout' attribute, if only to feed the TLS timeout
>> parameter for gnutls ...
> 
> I'm still not seeing the case for making it an individual
> parameter for each handshake request. Maybe a config
> parameter, if a short timeout is actually needed... even
> then, maybe a built-in timeout is preferable to yet another
> tuning knob that can be abused.
> 
The problem I see is that the kernel side needs to make forward
progress eventually, and calling into userspace is a good recipe
for violating that principle.
Sending a timeout value as a netlink parameter has the advantage
that both sides are aware that there _is_ a timeout.
The alternative would be an unconditional wait in the kernel,
and a very real possibility of a stuck process.
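
A rough sketch of the kernel side I have in mind (the field and helper
names are assumptions); the point is only that the wait is bounded, and
that the very same value gets handed to tlshd as a netlink attribute:

#include <linux/completion.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

struct handshake_req {
	struct completion	hr_completion;	/* completed by the DONE call */
};

static int handshake_wait(struct handshake_req *req, unsigned long timeout_ms)
{
	long left;

	/*
	 * The same timeout_ms value is passed to tlshd as a netlink
	 * attribute, so both sides agree on when the attempt is dead.
	 */
	left = wait_for_completion_interruptible_timeout(&req->hr_completion,
						msecs_to_jiffies(timeout_ms));
	if (left == 0)
		return -ETIMEDOUT;	/* userspace never sent DONE */
	if (left < 0)
		return left;		/* interrupted */
	return 0;
}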

> I'd like to see some testing results to determine that a
> short timeout is the only way to handle corner cases.
> 
Short timeouts are especially useful for testing and debugging;
timeout handlers are prone to bugs, and hence need a really good
bashing to shake them out.
And not having a timeout at all is not a good idea, see above.

But yeah, in theory we could use a configuration timeout in tlshd.

In the end, it's _just_ another netlink attribute, which might
(or might not) be present and which replaces a built-in value.
I hadn't thought this would be such an issue ...
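
For concreteness, on the tlshd side it would boil down to something
like this; the attribute name is made up, gnutls_handshake_set_timeout()
is the existing GnuTLS call:

#include <gnutls/gnutls.h>
#include <netlink/attr.h>

/* HANDSHAKE_A_ACCEPT_TIMEOUT_MS is an assumed attribute name; if the
 * kernel didn't include it, fall back to the GnuTLS default. */
static void tlshd_set_handshake_timeout(gnutls_session_t session,
					struct nlattr *timeout_attr)
{
	unsigned int timeout_ms = GNUTLS_DEFAULT_HANDSHAKE_TIMEOUT;

	if (timeout_attr)		/* attribute is optional */
		timeout_ms = nla_get_u32(timeout_attr);

	gnutls_handshake_set_timeout(session, timeout_ms);
}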

Cheers,

Hannes
