Date:   Tue, 29 Jun 2021 13:50:27 +0200
From:   Daniel Wagner <dwagner@...e.de>
To:     Ming Lei <ming.lei@...hat.com>
Cc:     Wen Xiong <wenxiong@...ibm.com>, james.smart@...adcom.com,
        linux-kernel@...r.kernel.org, sagi@...mberg.me,
        wenxiong@...ux.vnet.ibm.com
Subject: Re: [PATCH 1/1] block: System crashes when cpu hotplug + bouncing
 port

On Tue, Jun 29, 2021 at 06:06:21PM +0800, Ming Lei wrote:
> > No, I don't see any errors. I am still trying to reproduce it on real
> > hardware. The setup with blktests running in Qemu did work with all
> > patches applied (the ones from me and your patches).
> > 
> > About the error argument: Later in the code path, e.g. in
> > __nvme_submit_sync_cmd(), transport errors (including canceled
> > requests) are handled as well, so the upper layer will see errors
> > during connection attempts. My point is that there is nothing special
> > about a failing connection attempt. We have error handling code in
> > place and the above state machine has to deal with it.
> 
> My two patches not only avoid the kernel panic, but also allow the
> request to be allocated successfully, so the connect io queue request
> can be submitted to the driver even though all CPUs in hctx->cpumask
> are offline, and nvmef can be set up fine.
> 
> That is the difference from yours, which fails the request allocation:
> then the connect io queues can't be done, the whole host can't be set
> up successfully, and it becomes a brick. The point is that cpu offline
> shouldn't make the setup of nvme fc/rdma/tcp/loop fail.

Right, I think I see your point now.
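
If I restate it as a toy model (purely illustrative, not the actual code
paths; the function and struct names below are made up):

/*
 * Toy model of the two outcomes, not kernel code. Failing the allocation
 * when every CPU in hctx->cpumask is offline means the connect command
 * never reaches the driver and the controller setup can't complete;
 * allowing the allocation lets the connect be submitted and nvmef come up.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_hctx {
	bool all_cpus_offline;
};

/* approach A: fail the allocation -> host setup becomes a brick */
static int connect_io_queue_strict(const struct toy_hctx *h)
{
	if (h->all_cpus_offline)
		return -EIO;	/* connect is never submitted */
	return 0;
}

/* approach B (your patches): allocate anyway and submit the connect */
static int connect_io_queue_relaxed(const struct toy_hctx *h)
{
	(void)h;		/* offline CPUs don't block the setup path */
	return 0;
}

int main(void)
{
	const struct toy_hctx h = { .all_cpus_offline = true };

	printf("strict : %d\n", connect_io_queue_strict(&h));
	printf("relaxed: %d\n", connect_io_queue_relaxed(&h));
	return 0;
}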

> > Anyway, avoiding the if in the hotpath is a good thing. I just don't
> > think your argument that no error can happen is correct.
> 
> Again, it isn't related to avoiding the if, and it isn't in the hot
> path at all.

I mixed up blk_mq_alloc_request() with blk_mq_alloc_request_hctx().
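
For my own notes, a rough sketch of the distinction as I understand it
(a userspace toy, not kernel code; the CPU-to-hctx mapping is simplified):

/*
 * blk_mq_alloc_request():      general path, the hctx is derived from
 *                              the CPU the caller happens to run on.
 * blk_mq_alloc_request_hctx(): the caller names the hctx explicitly;
 *                              as far as I can tell it is only used for
 *                              the NVMe-oF connect commands, so not a
 *                              hot path.
 */
#include <stdio.h>

#define NR_HW_QUEUES	4

static unsigned int current_cpu = 2;	/* stand-in for the submitting CPU */

/* hctx picked from the submitting CPU, like blk_mq_alloc_request() */
static unsigned int toy_alloc_request(void)
{
	return current_cpu % NR_HW_QUEUES;
}

/* hctx picked by the caller, like blk_mq_alloc_request_hctx() */
static unsigned int toy_alloc_request_hctx(unsigned int hctx_idx)
{
	return hctx_idx;
}

int main(void)
{
	printf("regular request lands on hctx %u\n", toy_alloc_request());
	printf("connect for qid 1 targets hctx %u\n", toy_alloc_request_hctx(0));
	return 0;
}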

Thanks for the explanation. I'll keep trying to replicate the problem
on real hardware and see if these patches mitigate it.

Thanks,
Daniel
