Message-ID: <422141ca-53aa-16fd-e29d-9cf7adf4f07d@grimberg.me>
Date: Wed, 15 Mar 2023 12:40:43 +0200
From: Sagi Grimberg <sagi@...mberg.me>
To: zhangyanjun@...tc.cn
Cc: kbusch@...nel.org, axboe@...com, hch@....de,
linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org,
xiang.li@...tc.cn, shaopeijie@...tc.cn
Subject: Re: [RFC PATCH] nvme-tcp: fix possible crash when only partial
queues are allocated by nvme_tcp_alloc_queue
On 3/15/23 11:41, zhangyanjun@...tc.cn wrote:
> From: Yanjun Zhang <zhangyanjun@...tc.cn>
>
> Thanks for your reply and suggestions!
>
>> Hi Yanjun,
>>
>>> In storage clusters constructed by nvme-tcp driver, we have encountered
>>> the following two kinds of crashes on the host kernel repeatedly.
>>>
>>> one crash info:
>>> [16802.107284] nvme nvme4: starting error recovery
>>> [16802.109166] nvme nvme4: Reconnecting in 10 seconds...
>>> [16812.173535] nvme nvme4: failed to connect socket: -111
>>> [16812.173745] nvme nvme4: Failed reconnect attempt 1
>>> [16812.173747] nvme nvme4: Reconnecting in 10 seconds...
>>> [16822.413555] nvme nvme4: failed to connect socket: -111
>>> [16822.413762] nvme nvme4: Failed reconnect attempt 2
>>> [16822.413765] nvme nvme4: Reconnecting in 10 seconds...
>>> [16832.661274] nvme nvme4: creating 32 I/O queues.
>>> [16833.919887] BUG: kernel NULL pointer dereference, address: 0000000000000088
>>> [16833.920068] nvme nvme4: Failed reconnect attempt 3
>>> [16833.920094] #PF: supervisor write access in kernel mode
>>> [16833.920261] nvme nvme4: Reconnecting in 10 seconds...
>>> [16833.920368] #PF: error_code(0x0002) - not-present page
>>> [16833.920600] PGD a21a12067 P4D b998cf067 PUD c00be5067 PMD 0
>>> [16833.920731] Oops: 0002 [#1] SMP NOPTI
>>> [16833.920852] CPU: 28 PID: 1324 Comm: kworker/28:1H Kdump: loaded Tainted: G S 5.15.67.cl9.x86_64 #1
>>> [16833.920980] Hardware name: Inspur NF5280M6/NF5280M6, BIOS 06.00.04 07/07/2022
>>> [16833.921086] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
>>> [16833.921191] RIP: 0010:_raw_spin_lock_bh+0x17/0x30
>>> [16833.921294] Code: ff 66 90 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 65 81 05 f0 ab 35 4a 00 02 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 05 c3 cc cc cc cc 89 c6 e8 c7 09 69 ff 66 90 c3 cc
>>> [16833.921539] RSP: 0018:ff5d9770a078fdf8 EFLAGS: 00010246
>>> [16833.921658] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>> [16833.921787] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000088
>>> [16833.921912] RBP: 0000000000000088 R08: ff4f755fca433f00 R09: ff4f7558374a7874
>>> [16833.922040] R10: 000000000000000c R11: 000000000000000c R12: ff4f7563dbbdad80
>>> [16833.922177] R13: 0000000100fc48f3 R14: ff4f754b6dbfa278 R15: ff8f97707db0b905
>>> [16833.922317] FS: 0000000000000000(0000) GS:ff4f756fccd00000(0000) knlGS:0000000000000000
>>> [16833.922454] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [16833.922593] CR2: 0000000000000088 CR3: 0000000b8477e002 CR4: 0000000000771ee0
>>> [16833.922731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [16833.922869] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [16833.923007] PKRU: 55555554
>>> [16833.923138] Call Trace:
>>> [16833.923271] <TASK>
>>> [16833.923402] lock_sock_nested+0x1e/0x50
>>> [16833.923545] nvme_tcp_try_recv+0x40/0xa0 [nvme_tcp]
>>> [16833.923685] nvme_tcp_io_work+0x68/0xa0 [nvme_tcp]
>>> [16833.923824] process_one_work+0x1e8/0x390
>>> [16833.923969] worker_thread+0x53/0x3d0
>>> [16833.924104] ? process_one_work+0x390/0x390
>>> [16833.924240] kthread+0x124/0x150
>>> [16833.924376] ? set_kthread_struct+0x50/0x50
>>> [16833.924518] ret_from_fork+0x1f/0x30
>>> [16833.924655] </TASK>
>>>
>>> By analyzing the vmcore, we confirmed the direct cause is sock->sk=NULL
>>> and sock->ops=NULL on the socket attached to the struct nvme_tcp_queue.
>>> And we found this issue occurred during the execution of
>>> __nvme_tcp_alloc_io_queues every time. For example, when
>>> nvme_tcp_alloc_queue failed to allocate the 29th I/O queue, the error
>>> path would free the previously allocated 28 I/O queues, but some io_works
>>> attached to these allocated I/O queues may already have been queued.
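>>>
>>> For reference, a rough sketch of the crashing path as we understand it
>>> (simplified from our 5.15-based driver, not the exact code): lock_sock()
>>> takes sk->sk_lock, so a NULL sock->sk faults at a small offset, matching
>>> the 0000000000000088 address above.
>>> --
>>> static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>>> {
>>> 	struct socket *sock = queue->sock;
>>> 	struct sock *sk = sock->sk;	/* NULL once the socket was torn down */
>>>
>>> 	lock_sock(sk);		/* spin_lock_bh(&sk->sk_lock.slock) -> oops */
>>> 	/* ... sock->ops->read_sock(sk, ...) would equally fault ... */
>>> 	release_sock(sk);
>>> 	return 0;
>>> }
>>> --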
>>
>> That should not be possible, io_work is only triggered after a queue is
>> started. If io_work is running before the queue is started, we have
>> a wider issue. Did you confirm that the io_work belongs to an io queue
>> and not the admin queue?
>
> Yes, we think the io_work belongs to an io queue: comparing the
> nvme_tcp_queue pointer against the queues[] array attached to nvme_tcp_ctrl,
> its qid is not zero. The size of struct nvme_tcp_queue in our kernel is 0x130.
> crash> nvme_tcp_queue.sock ff4f754b6dbfa270
> sock = 0xff4f7563dbbdad80,
> crash> socket.sk 0xff4f7563dbbdad80
> sk = 0x0,
> crash> socket.ops 0xff4f7563dbbdad80
> ops = 0x0,
> crash> nvme_tcp_queue.ctrl ff4f754b6dbfa270
> ctrl = 0xff4f75608c9a6000,
> crash> nvme_tcp_ctrl.queues 0xff4f75608c9a6000
> queues = 0xff4f754b6dbf8000,
>
> Obviously, ff4f754b6dbfa270, which we got from the stack, is not
> the admin queue pointer.
ff4f754b6dbfa270 is the sock pointer, not the queue pointer, but I
believe you that it is different from queues.
> And we found that the sock attached to the
> 29th I/O queue is NULL, while the socks attached to the other queues are
> all valid kernel addresses. We think it was handled by the err_sock path
> in nvme_tcp_alloc_queue due to some error.
> err_sock:
> sock_release(queue->sock);
> queue->sock = NULL;
> crash> nvme_tcp_queue.sock ff4f754b6dbfa3a0
> sock = 0x0,
> crash> nvme_tcp_queue.sock ff4f754b6dbfa4d0
> sock = 0xff4f75641f4aa700,
>
Makes sense, thanks for confirming.
>>
>>> So we suspect that directly freeing these allocated
>>> I/O queues may result in this issue. To avoid it, we made
>>> the following code modifications, but have not verified them yet
>>> because the issue only occurs occasionally.
>>>
>>> Now, we cannot confirm what queues the io_work while the I/O queues are
>>> being allocated, but we suspect that the socket callback
>>> sk_data_ready=nvme_tcp_data_ready installed by nvme_tcp_alloc_queue
>>> may do so in this window. Hope to get some suggestions, thanks!
>>
>> I would start by determining whether the running io_work belongs to an io
>> queue or the admin queue. It is also possible that the admin queue cleanup
>> after the io queue allocation failed is the problematic teardown that is
>> causing this bug.
>>
>>>
>>> Signed-off-by: Yanjun Zhang <zhangyanjun@...tc.cn>
>>> ---
>>> drivers/nvme/host/tcp.c | 7 +++++--
>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 9b47dcb2a..5782183b8 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1761,6 +1761,7 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
>>> static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
>>> {
>>> int i, ret;
>>> + struct nvme_tcp_ctrl *tcp_ctrl = NULL;
>>>
>>> for (i = 1; i < ctrl->queue_count; i++) {
>>> ret = nvme_tcp_alloc_queue(ctrl, i);
>>> @@ -1771,9 +1772,11 @@ static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
>>> return 0;
>>>
>>> out_free_queues:
>>> - for (i--; i >= 1; i--)
>>> + for (i--; i >= 1; i--) {
>>> + tcp_ctrl = to_tcp_ctrl(ctrl);
>>> + __nvme_tcp_stop_queue(&tcp_ctrl->queues[i]);
>>> nvme_tcp_free_queue(ctrl, i);
>>> -
>>> + }
>>> return ret;
>>> }
>>>
>>
>> The crash dump reminds me of a reported issue fixed by:
>> 160f3549a907 ("nvme-tcp: fix UAF when detecting digest errors")
>>
>> Is this applied in your host kernel?
>
> I checked the above issue, and the kernel we used has this fix applied.
Good to know.
> And I have added the following trace to our reproduction attempts.
>
>>
>> I'd also add this trace to your reproduction attempts:
>> --
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 0ec7d3329595..0cbc8afb41c8 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -919,6 +919,10 @@ static void nvme_tcp_data_ready(struct sock *sk)
>>
>> read_lock_bh(&sk->sk_callback_lock);
>> queue = sk->sk_user_data;
>> + if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags)) {
>> + pr_err("got data on a deallocated queue %d\n", nvme_tcp_queue_id(queue));
>> + WARN_ON_ONCE(1);
>> + }
>> if (likely(queue && queue->rd_enabled) &&
>> !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>> --
>>
>
>> If indeed this is the io_queue and for some reason it is running,
>> perhaps the correct fix would be to set the socket ops later in
>> start_queue. It will also be symmetric because we clear them in
>> stop_queue:
>> --
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 0ec7d3329595..b3b1e39966c0 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1617,22 +1617,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
>> if (ret)
>> goto err_init_connect;
>>
>> - queue->rd_enabled = true;
>> set_bit(NVME_TCP_Q_ALLOCATED, &queue->flags);
>> - nvme_tcp_init_recv_ctx(queue);
>> -
>> - write_lock_bh(&queue->sock->sk->sk_callback_lock);
>> - queue->sock->sk->sk_user_data = queue;
>> - queue->state_change = queue->sock->sk->sk_state_change;
>> - queue->data_ready = queue->sock->sk->sk_data_ready;
>> - queue->write_space = queue->sock->sk->sk_write_space;
>> - queue->sock->sk->sk_data_ready = nvme_tcp_data_ready;
>> - queue->sock->sk->sk_state_change = nvme_tcp_state_change;
>> - queue->sock->sk->sk_write_space = nvme_tcp_write_space;
>> - #ifdef CONFIG_NET_RX_BUSY_POLL
>> - queue->sock->sk->sk_ll_usec = 1;
>> - #endif
>> - write_unlock_bh(&queue->sock->sk->sk_callback_lock);
>>
>> return 0;
>>
>> @@ -1690,6 +1675,21 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
>> struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>> int ret;
>>
>> + queue->rd_enabled = true;
>> + nvme_tcp_init_recv_ctx(queue);
>> + write_lock_bh(&queue->sock->sk->sk_callback_lock);
>> + queue->sock->sk->sk_user_data = queue;
>> + queue->state_change = queue->sock->sk->sk_state_change;
>> + queue->data_ready = queue->sock->sk->sk_data_ready;
>> + queue->write_space = queue->sock->sk->sk_write_space;
>> + queue->sock->sk->sk_data_ready = nvme_tcp_data_ready;
>> + queue->sock->sk->sk_state_change = nvme_tcp_state_change;
>> + queue->sock->sk->sk_write_space = nvme_tcp_write_space;
>> + #ifdef CONFIG_NET_RX_BUSY_POLL
>> + queue->sock->sk->sk_ll_usec = 1;
>> + #endif
>> + write_unlock_bh(&queue->sock->sk->sk_callback_lock);
>> +
>> if (idx)
>> ret = nvmf_connect_io_queue(nctrl, idx);
>> else
>> --
>
> Yes, your fix setting the socket ops later in start_queue can correct
> this issue properly. Then I have another doubt: there may be a hidden
> problem if the out_free_queues label in __nvme_tcp_alloc_io_queues does
> not call kernel_sock_shutdown before freeing the io queue. So do you think
> the following patch is needed?
> Hope to get your reply again!
>
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index b55b5a298..824c0ddb3 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1761,6 +1761,8 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
> static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
> {
> int i, ret;
> + struct nvme_tcp_ctrl *tcp_ctrl = to_tcp_ctrl(ctrl);
> + struct nvme_tcp_queue *queue;
>
> for (i = 1; i < ctrl->queue_count; i++) {
> ret = nvme_tcp_alloc_queue(ctrl, i);
> @@ -1771,8 +1773,11 @@ static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
> return 0;
>
> out_free_queues:
> - for (i--; i >= 1; i--)
> + for (i--; i >= 1; i--) {
> + queue = &tcp_ctrl->queues[i];
> + kernel_sock_shutdown(queue->sock, SHUT_RDWR);
> nvme_tcp_free_queue(ctrl, i);
> + }
>
> return ret;
> }
> --
I don't see why this is needed. All the shutdown does is prevent further
reads/writes on a socket that is going away anyway. And at this point, if
we have not triggered io_work (and if we haven't assigned data_ready yet,
we can't), there should be no way some other context is consuming the
socket. This is the assumption of the current code.
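
To illustrate the intended symmetry, a rough sketch (function names follow
the existing driver, the bodies are simplified, and it assumes the
set-callbacks-in-start_queue change above):
--
/* queue allocated but never started: no callbacks were installed, so
 * nothing can queue io_work and a plain release is enough */
static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
{
	struct nvme_tcp_queue *queue = &to_tcp_ctrl(nctrl)->queues[qid];

	if (queue->sock)
		sock_release(queue->sock);
}

/* queue that was started: quiesce the socket before anything is freed */
static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
{
	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
	nvme_tcp_restore_sock_calls(queue);	/* put back the original sk callbacks */
	cancel_work_sync(&queue->io_work);
}
--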