linux-kernel - Re: [PATCH 2/3] cxl/mbox: Add GET_POISON

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20220620123002.000041be@Huawei.com>
Date:   Mon, 20 Jun 2022 12:30:02 +0100
From:   Jonathan Cameron <Jonathan.Cameron@...wei.com>
To:     Dan Williams <dan.j.williams@...el.com>
CC:     <alison.schofield@...el.com>, Ira Weiny <ira.weiny@...el.com>,
        "Vishal Verma" <vishal.l.verma@...el.com>,
        Ben Widawsky <bwidawsk@...nel.org>,
        "Steven Rostedt" <rostedt@...dmis.org>,
        Ingo Molnar <mingo@...hat.com>, <linux-cxl@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/3] cxl/mbox: Add GET_POISON_LIST mailbox command
 support

On Fri, 17 Jun 2022 12:27:38 -0700
Dan Williams <dan.j.williams@...el.com> wrote:

> Jonathan Cameron wrote:
> > On Tue, 14 Jun 2022 17:10:27 -0700
> > alison.schofield@...el.com wrote:
> >   
> > > From: Alison Schofield <alison.schofield@...el.com>
> > > 
> > > CXL devices that support persistent memory maintain a list of locations
> > > that are poisoned or result in poison if the addresses are accessed by
> > > the host.
> > > 
> > > Per the spec (CXL 2.0 8.2.8.5.4.1), the device returns this Poison
> > > list as a set of  Media Error Records that include the source of the
> > > error, the starting device physical address and length. The length is
> > > the number of adjacent DPAs in the record and is in units of 64 bytes.
> > > 
> > > Retrieve the list and log each Media Error Record as a trace event of
> > > type cxl_poison_list.
> > > 
> > > Signed-off-by: Alison Schofield <alison.schofield@...el.com>  
> > 
> > A few more things inline.
> > 
> > Otherwise, can confirm it works with some hack QEMU code.
> > I'll tidy that up and post soon.
> >   
> > > +int cxl_mem_get_poison_list(struct device *dev)
> > > +{
> > > +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > > +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > > +	struct cxl_mbox_poison_payload_out *po;
> > > +	struct cxl_mbox_poison_payload_in pi;
> > > +	int nr_records = 0;
> > > +	int rc, i;
> > > +
> > > +	if (range_len(&cxlds->pmem_range)) {
> > > +		pi.offset = cpu_to_le64(cxlds->pmem_range.start);
> > > +		pi.length = cpu_to_le64(range_len(&cxlds->pmem_range));
> > > +	} else {
> > > +		return -ENXIO;
> > > +	}
> > > +
> > > +	po = kvmalloc(cxlds->payload_size, GFP_KERNEL);
> > > +	if (!po)
> > > +		return -ENOMEM;
> > > +
> > > +	do {
> > > +		rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_POISON, &pi,
> > > +				       sizeof(pi), po, cxlds->payload_size);
> > > +		if (rc)
> > > +			goto out;
> > > +
> > > +		if (po->flags & CXL_POISON_FLAG_OVERFLOW) {
> > > +			time64_t o_time = le64_to_cpu(po->overflow_timestamp);
> > > +
> > > +			dev_err(dev, "Poison list overflow at %ptTs UTC\n",
> > > +				&o_time);
> > > +			rc = -ENXIO;
> > > +			goto out;
> > > +		}
> > > +
> > > +		if (po->flags & CXL_POISON_FLAG_SCANNING) {
> > > +			dev_err(dev, "Scan Media in Progress\n");
> > > +			rc = -EBUSY;
> > > +			goto out;
> > > +		}
> > > +
> > > +		for (i = 0; i < le16_to_cpu(po->count); i++) {
> > > +			u64 addr = le64_to_cpu(po->record[i].address);  
> >   
> > > +			u32 len = le32_to_cpu(po->record[i].length);  
> > 
> >   
> > > +			int source = FIELD_GET(CXL_POISON_SOURCE_MASK, addr);
> > > +
> > > +			if (!CXL_POISON_SOURCE_VALID(source)) {
> > > +				dev_dbg(dev, "Invalid poison source %d",
> > > +					source);
> > > +				source = CXL_POISON_SOURCE_INVALID;
> > > +			}
> > > +
> > > +			trace_cxl_poison_list(dev, source, addr, len);  
> > 
> > Need to mask off the lower 6 bits of addr as they contain the source
> > + a few reserved bits.
> > 
> > I was confused how you were geting better than 64 byte precision in your
> > example.
> >   
> > > +		}
> > > +
> > > +		/* Protect against an uncleared _FLAG_MORE */
> > > +		nr_records = nr_records + le16_to_cpu(po->count);
> > > +		if (nr_records >= cxlds->poison_max)
> > > +			goto out;
> > > +
> > > +	} while (po->flags & CXL_POISON_FLAG_MORE);  
> > So.. A conundrum here.  What happens if:
> > 
> > 1. We get an error mid way through a set of multiple reads
> >    (something intermittent - maybe a software issue)
> > 2. We will drop out of here fine and report the error.
> > 3. We run this function again.
> > 
> > It will (I think) currently pick up where we left off, but we have
> > no way of knowing that as there isn't a 'total records' count or
> > any other form of index in the output payload.
> > 
> > So, software solutions I think should work (though may warrant a note
> > to be added to the spec).
> > 
> > 1. Read whole thing twice. First time is just to ensure we get
> >    to the end and flush out any prior half done reads.
> > 2. Issue a read for a different region (perhaps length 0) first
> >    and assume everything starts from scratch when we go back to
> >    this region.  
> 
> It would be nice if this was codified as *the* way to reset the
> retrieval, but I worry that neither length==0 or length==1 can be used
> for this purpose since the "more" bit is attached to the last passed in
> *range*, not request. I.e. spec seems to allow for overlapping
> retrievals, although I doubt any implementation gets that fancy.
> 
> I think it is sufficient to just include the "more" flag in the trace
> event and if userspace suspects that it is getting "more" results from a
> previous run it can reissue the scan. 

Meaning is a bit ugly if attached to an individual trace event, though
I guess we could do something nicer like only have one that doesn't have
MORE set, thus indicating that one trace event is the last one from a
query.  i.e. fill in MORE for all the other events in the last GET_POISON_LIST
reply. 

> This is another reason that the
> trace event should include the pid of the process that triggered the
> results so it can delineate re-requests. Otherwise, the poison list
> cache is opportunistic so I am not sure that missing records in this
> corner case is fatal.

Ok, for now let's document the limitation with an appropriate comment.
In parallel I've started a thread in appropriate venue to discuss if
we can clarify the spec and potentially do better in future.  So that
discussion should shift over there.

Thanks,

Jonathan