Message-ID: <40509bb4-3ca4-3bb1-3c28-1b0e90aa92be@grimberg.me>
Date: Mon, 3 Jul 2023 17:10:02 +0300
From: Sagi Grimberg <sagi@...mberg.me>
To: Hannes Reinecke <hare@...e.de>, David Howells <dhowells@...hat.com>
Cc: Keith Busch <kbusch@...nel.org>, Christoph Hellwig <hch@....de>,
linux-nvme@...ts.infradead.org, Jakub Kicinski <kuba@...nel.org>,
Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
netdev@...r.kernel.org
Subject: Re: [PATCHv6 0/5] net/tls: fixes for NVMe-over-TLS
On 7/3/23 16:57, Hannes Reinecke wrote:
> On 7/3/23 15:42, Sagi Grimberg wrote:
>>
>>>>> Hannes Reinecke <hare@...e.de> wrote:
>>>>>
>>>>>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>>>>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>>>>>> sock_sendmsg() as it's trying to access invalid pages :-(
>>>>>
>>>>> Can you be more specific about the crash?
>>>>
>>>> Hannes,
>>>>
>>>> See:
>>>> [PATCH net] nvme-tcp: Fix comma-related oops
>>>
>>> Ah, right. That solves _that_ issue.
>>>
>>> But now I'm deadlocking on the tls_rx_reader_lock() (patched as to
>>> your suggestion). Investigating.
>>
>> Are you sure it is a deadlock? or maybe you returned EAGAIN and nvme-tcp
>> does not interpret this as a transient status and simply returns from
>> io_work?
>>
> Unfortunately, yes.
>
> static int tls_rx_reader_acquire(struct sock *sk,
>                                  struct tls_sw_context_rx *ctx,
>                                  bool nonblock)
> {
>         long timeo;
>
>         timeo = sock_rcvtimeo(sk, nonblock);
>
>         while (unlikely(ctx->reader_present)) {
>                 DEFINE_WAIT_FUNC(wait, woken_wake_function);
>
>                 ctx->reader_contended = 1;
>
>                 add_wait_queue(&ctx->wq, &wait);
>                 sk_wait_event(sk, &timeo,
>                               !READ_ONCE(ctx->reader_present), &wait);
>
> and sk_wait_event() does:
> #define sk_wait_event(__sk, __timeo, __condition, __wait) \
>         ({ int __rc; \
>                 __sk->sk_wait_pending++; \
>                 release_sock(__sk); \
>                 __rc = __condition; \
>                 if (!__rc) { \
>                         *(__timeo) = wait_woken(__wait, \
>                                                 TASK_INTERRUPTIBLE, \
>                                                 *(__timeo)); \
>                 } \
>                 sched_annotate_sleep(); \
>                 lock_sock(__sk); \
>                 __sk->sk_wait_pending--; \
>                 __rc = __condition; \
>                 __rc; \
>         })
>
> so not calling 'lock_sock()' in tls_rx_reader_acquire() helps only _so_
> much, we're still deadlocking.

That is still legal, assuming the sock lock is taken prior to
sk_wait_event()...

What are the blocked threads from sysrq-trigger?
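
To illustrate the locking rule in question, here is a rough userspace sketch, not kernel code. All names in it (struct rx_ctx, reader_acquire(), reader_release(), run_demo()) are made up for the example; a pthread mutex stands in for the sock lock and a condition variable for ctx->wq. The wait drops the lock, sleeps, and re-takes the lock before rechecking the condition, so it only works if the caller already holds the lock when it enters the wait:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the rx context; not the kernel struct. */
struct rx_ctx {
	pthread_mutex_t lock;	/* plays the role of the sock lock */
	pthread_cond_t wq;	/* plays the role of ctx->wq */
	bool reader_present;
};

/* Caller must hold ctx->lock, as the wait below releases and re-takes it. */
static void reader_acquire(struct rx_ctx *ctx)
{
	while (ctx->reader_present)
		pthread_cond_wait(&ctx->wq, &ctx->lock);
	ctx->reader_present = true;
}

/* Caller must hold ctx->lock. */
static void reader_release(struct rx_ctx *ctx)
{
	assert(ctx->reader_present);
	ctx->reader_present = false;
	pthread_cond_signal(&ctx->wq);
}

static int shared_counter;

static void *worker(void *arg)
{
	struct rx_ctx *ctx = arg;

	for (int i = 0; i < 1000; i++) {
		pthread_mutex_lock(&ctx->lock);
		reader_acquire(ctx);
		pthread_mutex_unlock(&ctx->lock);

		shared_counter++;	/* exclusive "reader" section */

		pthread_mutex_lock(&ctx->lock);
		reader_release(ctx);
		pthread_mutex_unlock(&ctx->lock);
	}
	return NULL;
}

/* Runs two contending workers and returns the final counter value. */
int run_demo(void)
{
	struct rx_ctx ctx = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.wq = PTHREAD_COND_INITIALIZER,
		.reader_present = false,
	};
	pthread_t t[2];

	shared_counter = 0;
	for (int i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, worker, &ctx);
	for (int i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	return shared_counter;
}
```

Calling run_demo() from a small main() (built with -pthread) exercises two contending "readers". If both finish, the release/re-acquire inside the wait is not what blocks them.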