Message-ID: <253wmxvuq2z.fsf@nvidia.com>
Date: Wed, 16 Aug 2023 15:30:44 +0300
From: Aurelien Aptel <aaptel@...dia.com>
To: Sagi Grimberg <sagi@...mberg.me>, linux-nvme@...ts.infradead.org,
netdev@...r.kernel.org, hch@....de, kbusch@...nel.org, axboe@...com,
chaitanyak@...dia.com, davem@...emloft.net, kuba@...nel.org
Cc: Boris Pismenny <borisp@...dia.com>, aurelien.aptel@...il.com,
smalin@...dia.com, malin1024@...il.com, ogerlitz@...dia.com,
yorayz@...dia.com, galshalom@...dia.com, mgurtovoy@...dia.com
Subject: Re: [PATCH v12 07/26] nvme-tcp: Add DDP offload control path

Sagi Grimberg <sagi@...mberg.me> writes:
>>>> + if (!netdev || !queue)
>>>> + return false;
>>>
>>> Is it reasonable to be called here with !netdev or !queue ?
>>
>> The check is needed only for the IO queue case but we can move it
>> earlier in nvme_tcp_start_queue().
>
> I still don't understand even on io queues how this can happen.
When the netdev does not support DDP, queue->ctrl->offloading_netdev is
not set when the admin queue is started, so it is NULL.
Later, when the IO queue is started:
    netdev = queue->ctrl->offloading_netdev;              <== NULL
    if (is_netdev_ulp_offload_active(netdev, queue)) {    <== we pass NULL
We can move the NULL check higher-up if you prefer, like so:
    if (queue->ctrl->offloading_netdev &&
        is_netdev_ulp_offload_active(queue->ctrl->offloading_netdev, queue)) {
>>>> +
>>>> + /* If we cannot query the netdev limitations, do not offload */
>>>> + if (!nvme_tcp_ddp_query_limits(netdev, queue))
>>>> + return false;
>>>> +
>>>> + /* If netdev supports nvme-tcp ddp offload, we can offload */
>>>> + if (test_bit(ULP_DDP_C_NVME_TCP_BIT, netdev->ulp_ddp_caps.active))
>>>> + return true;
>>>
>>> This should be coming from the API itself, have the limits query
>>> api fail if this is off.
>>
>> We can move the function to the ULP DDP layer.
>>
>>> btw, what is the active thing? is this driven from ethtool enable?
>>> what happens if the user disables it while there is a ulp using it?
>>
>> The active bits are indeed driven by ethtool according to the design
>> Jakub suggested.
>> The nvme-tcp connection will have to be reconnected to see the effect of
>> changing the bit.
>
> It should move inside the api as well. Don't want to care about it in
> nvme.
Ok, we will move it there.
>>>> +
>>>> + return false;
>>>
>>> This can be folded to the above function.
>>
>> We won't be able to check for TLS in a common wrapper. We think this
>> should be kept.
>
> Why? any tcp ddp need to be able to support tls. Nothing specific to
> nvme here.
True, we will move it to the ULP wrapper.
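To make this concrete, here is roughly the shape of the ULP-layer helper
we have in mind, folding together the ops check, the per-protocol
capability bit and the TLS exclusion (function and parameter names below
are illustrative, not final):

    /* ulp_ddp layer: can this socket be offloaded at all? */
    bool ulp_ddp_is_offload_possible(struct net_device *netdev,
                                     struct sock *sk, int cap_bit)
    {
            if (!netdev || !netdev->netdev_ops->ulp_ddp_ops)
                    return false;

            /* the per-protocol offload must be enabled via ethtool */
            if (!test_bit(cap_bit, netdev->ulp_ddp_caps.active))
                    return false;

            /* a socket already running TLS cannot also use DDP offload
             * (the actual TLS detection is left out of this sketch) */

            return true;
    }

nvme-tcp would then only call
ulp_ddp_is_offload_possible(netdev, queue->sock->sk, ULP_DDP_C_NVME_TCP_BIT)
and not look at the caps bits or TLS itself.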
>>>> + /*
>>>> + * The atomic operation guarantees that we don't miss any NIC driver
>>>> + * resync requests submitted after the above checks.
>>>> + */
>>>> + if (atomic64_cmpxchg(&queue->resync_req, pdu_val,
>>>> + pdu_val & ~ULP_DDP_RESYNC_PENDING) !=
>>>> + atomic64_read(&queue->resync_req))
>>>> + netdev->netdev_ops->ulp_ddp_ops->resync(netdev,
>>>> + queue->sock->sk,
>>>> + pdu_seq);
>>>
>>> Who else is doing an atomic on this value? and what happens
>>> if the cmpxchg fails?
>>
>> The driver thread can set queue->resync_req concurrently in patch
>> "net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload" in function
>> nvmeotcp_update_resync().
>>
>> If the cmpxchg fails it means a new resync request was triggered by the
>> HW, the old request will be dropped and the new one will be processed by
>> a later PDU.
>
> So resync_req is actually the current tcp sequence number or something?
> The name resync_req is very confusing.
queue->resync_req is the TCP sequence for which the HW requested a
resync operation. We can rename it to queue->resync_tcp_seq.
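For context, the concurrent writer is the resync_request callback invoked
by the NIC driver; simplified, it looks roughly like this (the exact code
is in the data-path patch):

    static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
    {
            struct nvme_tcp_queue *queue = sk->sk_user_data;

            /* remember the TCP sequence the HW wants to resync at, with
             * the pending flag folded into the same 64-bit value */
            atomic64_set(&queue->resync_req,
                         ((u64)seq << 32) | ULP_DDP_RESYNC_PENDING);
            return true;
    }

The cmpxchg in the receive path then only clears the pending flag if no
newer request overwrote the value in the meantime.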
>>>> + ret = nvme_tcp_offload_socket(queue);
>>>> + if (ret) {
>>>> + dev_info(nctrl->device,
>>>> + "failed to setup offload on queue %d ret=%d\n",
>>>> + idx, ret);
>>>> + }
>>>> + }
>>>> + } else {
>>>> ret = nvmf_connect_admin_queue(nctrl);
>>>> + if (ret)
>>>> + goto err;
>>>>
>>>> - if (!ret) {
>>>> - set_bit(NVME_TCP_Q_LIVE, &queue->flags);
>>>> - } else {
>>>> - if (test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>>>> - __nvme_tcp_stop_queue(queue);
>>>> - dev_err(nctrl->device,
>>>> - "failed to connect queue: %d ret=%d\n", idx, ret);
>>>> + netdev = get_netdev_for_sock(queue->sock->sk);
>>>
>>> Is there any chance that this is a different netdev than what is
>>> already recorded? doesn't make sense to me.
>>
>> The idea is that we first start the admin queue, which looks up the
>> netdev associated with the socket and stores it in the queue. Later,
>> when the IO queues are started, we use the recorded netdev.
>>
>> In cases of bonding or vlan, a netdev can have lower device links, which
>> get_netdev_for_sock() will look up.
>
> I think the code should in high level do:
> if (idx) {
> ret = nvmf_connect_io_queue(nctrl, idx);
> if (ret)
> goto err;
> if (nvme_tcp_ddp_query_limits(queue))
> nvme_tcp_offload_socket(queue);
>
> } else {
> ret = nvmf_connect_admin_queue(nctrl);
> if (ret)
> goto err;
> ctrl->ddp_netdev = get_netdev_for_sock(queue->sock->sk);
> if (nvme_tcp_ddp_query_limits(queue))
> nvme_tcp_offload_apply_limits(queue);
> }
Ok, we will follow this design.
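For reference, the lower-device lookup done by get_netdev_for_sock() in
the bonding/vlan case mentioned above is essentially the same as the
net/tls helper of the same name, roughly:

    static struct net_device *get_netdev_for_sock(struct sock *sk)
    {
            struct dst_entry *dst = sk_dst_get(sk);
            struct net_device *netdev = NULL;

            if (likely(dst)) {
                    /* walk vlan/bond uppers down to the real device */
                    netdev = netdev_sk_get_lowest_dev(dst->dev, sk);
                    dev_hold(netdev);
            }
            dst_release(dst);

            return netdev;
    }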
> ctrl->ddp_netdev should be cleared and put when the admin queue
> is stopped/freed, similar to how async_req is handled.
Thanks, we will clear ddp_netdev on queue stop/free.
This will also prevent reusing a potentially wrong netdev after a reconnection.
>>>> + /*
>>>> + * release the device as no offload context is
>>>> + * established yet.
>>>> + */
>>>> + dev_put(netdev);
>>>
>>> the put is unclear, what does it pair with? the get_netdev_for_sock?
>>
>> Yes, get_netdev_for_sock() takes a reference, which we don't need at
>> that point so we put it.
>
> Well, you store a pointer to it, what happens if it goes away while
> the controller is being set up?
That is indeed a problem. We will remove the dev_put() so the initial
reference is kept, and only release it when ddp_netdev is cleared on
admin queue stop/free.
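Concretely, the teardown side would look something like this (sketch only,
the final placement in the stop/free path may differ):

    /* when the admin queue is stopped/freed */
    if (ctrl->ddp_netdev) {
            /* drop the reference taken by get_netdev_for_sock() at
             * admin queue start and forget the device, so a later
             * reconnect looks it up again */
            dev_put(ctrl->ddp_netdev);
            ctrl->ddp_netdev = NULL;
    }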