linux-kernel - RE: [PATCH v2 2/2] nvme: handle persistent internal error AER from NVMe controller

Message-ID: <PH0PR21MB3025781A702070304BB8A282D7A09@PH0PR21MB3025.namprd21.prod.outlook.com>
Date:   Sat, 4 Jun 2022 14:28:11 +0000
From:   "Michael Kelley (LINUX)" <mikelley@...rosoft.com>
To:     Keith Busch <kbusch@...nel.org>
CC:     "axboe@...com" <axboe@...com>, "hch@....de" <hch@....de>,
        "sagi@...mberg.me" <sagi@...mberg.me>,
        "linux-nvme@...ts.infradead.org" <linux-nvme@...ts.infradead.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Caroline Subramoney <Caroline.Subramoney@...rosoft.com>,
        Richard Wurdack <riwurd@...rosoft.com>,
        Nathan Obr <Nathan.Obr@...rosoft.com>
Subject: RE: [PATCH v2 2/2] nvme: handle persistent internal error AER from
 NVMe controller

From: Keith Busch <kbusch@...nel.org> Sent: Friday, June 3, 2022 12:23 PM
> 
> On Fri, Jun 03, 2022 at 10:56:01AM -0700, Michael Kelley wrote:
> 
> This series looks good to me. Just one concern below that may amount to
> nothing.
> 
> > +static void nvme_handle_aer_persistent_error(struct nvme_ctrl *ctrl)
> > +{
> > +	u32 csts;
> > +
> > +	trace_nvme_async_event(ctrl, NVME_AER_ERROR);
> > +
> > +	if (ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &csts) != 0 ||
> 
> The reg_read32() is non-blocking for pcie, so this is safe to call from that
> driver's irq handler. The other transports block on register reads, though, so
> they can't call this from an atomic context. The TCP context looks safe, but
> I'm not sure about RDMA or FC.

Good point.  But even if the RDMA and FC contexts are safe, if a
persistent error is reported, the controller is already in trouble and
may not respond to a request to retrieve the CSTS anyway.  Perhaps
we should just trust the AER error report and not bother checking
CSTS to decide whether to do the reset.  We can still check ctrl->state
and skip the reset if there's already one in progress.
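
To make the proposal concrete, here is a minimal user-space sketch of
the "trust the AER, check only the state" logic. Everything in it is a
stand-in invented for illustration: struct mock_ctrl, enum mock_state,
and mock_handle_persistent_error() are hypothetical and are not the
kernel's struct nvme_ctrl, ctrl->state values, or nvme_reset_ctrl().
It shows only the decision, with no register read involved.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the controller states; not the kernel enum. */
enum mock_state { MOCK_LIVE, MOCK_RESETTING };

/* Hypothetical stand-in for struct nvme_ctrl. */
struct mock_ctrl {
	enum mock_state state;
	int resets_requested;	/* counts how many resets were started */
};

/*
 * Trust the persistent-error AER: do not read CSTS at all.
 * Start a reset only if the controller is live, so a reset that is
 * already in progress is not requested a second time.
 * Returns true if a reset was started by this call.
 */
static bool mock_handle_persistent_error(struct mock_ctrl *ctrl)
{
	if (ctrl->state != MOCK_LIVE)
		return false;

	ctrl->state = MOCK_RESETTING;	/* stand-in for scheduling the reset */
	ctrl->resets_requested++;
	return true;
}
```

Because no register is read, the sketch has nothing that could block,
which is the property that matters for the transports where the
calling context is in question.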

> 
> > +	    nvme_should_reset(ctrl, csts)) {
> > +		dev_warn(ctrl->device, "resetting controller due to AER\n");
> > +		nvme_reset_ctrl(ctrl);
> > +	}
> > +}
> > +
> >  void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
> >  		volatile union nvme_result *res)
> >  {
> >  	u32 result = le32_to_cpu(res->u32);
> >  	u32 aer_type = result & 0x07;
> > +	u32 aer_subtype = (result & 0xff00) >> 8;
> 
> Since the above mask + shift is duplicated with nvme_handle_aen_notice(), an
> inline helper function seems reasonable.

Yep.  Will do in v3.
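
As a rough idea of what such a helper could look like, here is a
self-contained sketch using the same mask and shift as the patch. The
names aer_type() and aer_subtype() are placeholders I made up for this
example, and uint32_t stands in for the kernel's u32 so the snippet
compiles outside the kernel tree. The final v3 naming and placement
may differ.

```c
#include <stdint.h>

/* Placeholder name: extracts the AER type, same mask as "result & 0x07". */
static inline uint32_t aer_type(uint32_t result)
{
	return result & 0x07;
}

/* Placeholder name: extracts the AER subtype, same as "(result & 0xff00) >> 8". */
static inline uint32_t aer_subtype(uint32_t result)
{
	return (result & 0xff00) >> 8;
}
```

Both nvme_complete_async_event() and nvme_handle_aen_notice() could
then call the helper instead of open-coding the mask and shift.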

Michael