netdev - Re: [PATCH net] net/tls: Fix driver request resync

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <3525a763-17dd-e126-8684-3bfcd7c26818@mellanox.com>
Date:   Sun, 31 May 2020 14:38:58 +0300
From:   Boris Pismenny <borisp@...lanox.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     Tariq Toukan <tariqt@...lanox.com>,
        "David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
        Saeed Mahameed <saeedm@...lanox.com>
Subject: Re: [PATCH net] net/tls: Fix driver request resync

>>
>> Yes, although I would phrase it differently: the kernel would indicate to the driver,
>> that the resync request is wrong, and that it can go back to searching for a header.
>> If there are any drivers that need an extra check, then we can add it in the driver itself.
> 
> I'd rather make the API clear and use a different op to indicate this is
> a reset rather than a valid sync response. sync callback already has
> the enum for sync type.
> 

Sure, so we will add another flag to `tls_offload_rx_resync_request` and respin this patch with driver changes which use it.

>>> Kernel usually can't guarantee that the notification will happen,
>>> (memory allocation errors, etc.) so the device needs to do the
>>> restarting itself. The notification should not be necessary.
>>>  
>>
>> Usually, all is best effort, but in principle, reliability should be guaranteed by higher layers to simplify the design.
> 
> Since we're talking high level design perspectives here - IMO when you
> talk to FW on a device - it's a distributed system. The days when you
> could say driver on the host is higher layer ended when people started
> putting fat firmwares on the NICs. So no, restart has to be handled by
> the system making the request. In this case the NIC.
> 

When the driver talks to the device, then the driver is responsible to ensure messages are received reliably - no doubt about that.

>> On the one hand, resync depends on packet arrival, which may take a while, and implementing different heuristics in each driver to timeout is complex.
>> On the other hand, assuming the user reads the record data eventually, ktls will be able to deliver the resync request, so implementing this in the tls layer is simple.
> 
> We definitely not want any driver logic here - the resync restart logic
> has to be implemented on the device.
> 

Restart can be triggered by the device if there is traffic, as it would identify bad record headers. But, otherwise, software should help restart the device.
>From an API perspective, I think it makes little sense for the device to send a request without any response from the TLS layer, as it is now; this patches fixes this.

>> In this case, I see no reason for the tls layer to fail --- did you have a specific flow in mind?
>> AFAICT, there are no memory allocation/error flows that will prevent the driver to receive a resync without an error on the socket (bad tls header).
>> If tls received the header, then the driver will receive the resync call, and it will take responsibility for reliably delivering it to HW.
> 
> So you're saying the request path and the response path for resync are
> both 100% lossless both on the NIC and the host? There is no scenario
> in which queue overflows, PCIe gets congested, etc.?
> 

Not at all.
As I've mentioned above "all is best effort", but we try our best to make things work when all is in proper order. In our case, resync is fully implemented in ASIC, so no FW failures. This leaves only hardware and host software failures. Hardware failures will likely cause the entire NIC to restart, closing all queues in the process. Software failures should be handled by each layer respectively. In the TLS layer, all is very simple and I see no reason for failure. Each driver should handle its own reliability trade-offs according to its own goals, which might vary from vendor to vendor and device to device.