Message-ID: <CAPcyv4gn_AvT6BA7g4jLKRFODSpt7_ORowVd3KgyWxyaFG0k9g@mail.gmail.com>
Date: Mon, 8 Mar 2021 10:01:52 -0800
From: Dan Williams <dan.j.williams@...el.com>
To: "ruansy.fnst@...itsu.com" <ruansy.fnst@...itsu.com>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linux-xfs <linux-xfs@...r.kernel.org>,
linux-nvdimm <linux-nvdimm@...ts.01.org>,
Linux MM <linux-mm@...ck.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>,
device-mapper development <dm-devel@...hat.com>,
"Darrick J. Wong" <darrick.wong@...cle.com>,
david <david@...morbit.com>, Christoph Hellwig <hch@....de>,
Alasdair Kergon <agk@...hat.com>,
Mike Snitzer <snitzer@...hat.com>,
Goldwyn Rodrigues <rgoldwyn@...e.de>,
"qi.fuli@...itsu.com" <qi.fuli@...itsu.com>,
"y-goto@...itsu.com" <y-goto@...itsu.com>
Subject: Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
On Mon, Mar 8, 2021 at 3:34 AM ruansy.fnst@...itsu.com
<ruansy.fnst@...itsu.com> wrote:
> > > > > 1 file changed, 8 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > > > index 79c49e7f5c30..0bcf2b1e20bd 100644
> > > > > --- a/include/linux/memremap.h
> > > > > +++ b/include/linux/memremap.h
> > > > > @@ -87,6 +87,14 @@ struct dev_pagemap_ops {
> > > > > * the page back to a CPU accessible page.
> > > > > */
> > > > > vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > > > > +
> > > > > + /*
> > > > > + * Handle the memory failure happens on one page. Notify the processes
> > > > > + * who are using this page, and try to recover the data on this page
> > > > > + * if necessary.
> > > > > + */
> > > > > + int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> > > > > + int flags);
> > > > > };
> > > >
> > > > After the conversation with Dave I don't see the point of this. If
> > > > there is a memory_failure() on a page, why not just call
> > > > memory_failure()? That already knows how to find the inode and the
> > > > filesystem can be notified from there.
> > >
> > > We want memory_failure() to support reflinked files. In this case, we are not
> > > able to track multiple files from a page (this broken page) because
> > > page->mapping and page->index can only track one file. Thus, I introduce this
> > > ->memory_failure(), implemented in the pmem driver, to call ->corrupted_range()
> > > level by level up the stack, and finally find the files that are
> > > using (mmapping) this page.
> > >
> >
> > I know the motivation, but this implementation seems backwards. It's
> > already the case that memory_failure() looks up the address_space
> > associated with a mapping. From there I would expect a new 'struct
> > address_space_operations' op to let the fs handle the case when there
> > are multiple address_spaces associated with a given file.
> >
>
> Let me think about it. In this way, we
> 1. associate file mapping with dax page in dax page fault;
I think this needs to be a new type of association that proxies the
representation of the reflink across all involved address_spaces.
> 2. iterate over the reflinked files to send the `kill processes` signal via the
> new address_space_operation;
> 3. re-associate to another reflinked file mapping when unmapping
> (rmap query in the filesystem to get the other file).
Perhaps the proxy object is reference counted per reflink. It seems
error prone to keep changing the association of the pfn while the
reflink is intact.
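
To make the reference-counting idea concrete, here is a minimal userspace
sketch, not kernel code. Everything in it is hypothetical: the names
reflink_proxy, proxy_init(), proxy_get() and proxy_put() are made up for
illustration, and it assumes one reference per reflink sharing the pfn.

```c
#include <assert.h>

/* Hypothetical proxy: one object per pfn shared by reflinked files. */
struct reflink_proxy {
	unsigned long pfn;	/* page frame the proxy stands for */
	unsigned int refs;	/* one reference per reflink sharing the pfn */
};

/* The first file to map the pfn creates the proxy with one reference. */
static void proxy_init(struct reflink_proxy *p, unsigned long pfn)
{
	p->pfn = pfn;
	p->refs = 1;
}

/* Another reflinked file starts sharing the pfn. */
static void proxy_get(struct reflink_proxy *p)
{
	assert(p->refs > 0);
	p->refs++;
}

/* A reflink goes away; returns 1 when the last reference is dropped. */
static int proxy_put(struct reflink_proxy *p)
{
	assert(p->refs > 0);
	return --p->refs == 0;
}
```

The point of the sketch is that the pfn keeps pointing at the same proxy
for as long as any reflink exists, so nothing has to be re-associated on
unmap; only the count changes.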
> It did not handle dax pages that are not in use, because their ->mapping is
> not associated with any file. I didn't think it through until reading your
> conversation. Here is my understanding: this case should be handled by the
> badblock mechanism in the pmem driver. This badblock mechanism will call
> ->corrupted_range() to tell the filesystem to repair the data if possible.
There are 2 types of notifications. There are badblocks discovered by
the driver (see notify_pmem()) and there are memory_failure() events
signalled by the CPU machine-check handler, or the platform BIOS. In
the case of badblocks, that information needs to be considered by the
fs block allocator to avoid / try-to-repair badblocks on allocate, and
to allow listing damaged files that need repair. The memory_failure()
notification needs immediate handling to tear down mappings to that
pfn and signal processes that have consumed it with
SIGBUS-action-required. Processes that have the poison mapped, but
have not consumed it, receive SIGBUS-action-optional.
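
As a small illustration of the action-required vs. action-optional split,
the sketch below picks the SIGBUS si_code for a process. BUS_MCEERR_AR and
BUS_MCEERR_AO are the Linux si_code values for the two cases; the helper
poison_si_code() and its 'consumed' flag are hypothetical, and the #ifndef
fallbacks are only there so the sketch builds where <signal.h> does not
provide the constants.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Fallbacks using the Linux values, in case <signal.h> lacks them. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4	/* action required: poison was consumed */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5	/* action optional: poison mapped, not consumed */
#endif

/*
 * Hypothetical helper: choose the SIGBUS si_code for a process that has
 * the poisoned pfn mapped. 'consumed' says whether the process actually
 * touched the poisoned data.
 */
static int poison_si_code(bool consumed)
{
	return consumed ? BUS_MCEERR_AR : BUS_MCEERR_AO;
}
```

A process that only has the page mapped gets the optional variant and can
keep running as long as it never touches the poisoned range.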
> So, we split it into two parts. And dax device and block device won't be mixed
> up again. Is my understanding right?
Right, it's only the filesystem that knows that the block_device and
the dax_device alias data at the same logical offset. The requirements
for sector error handling and page error handling are separate, just as
block_device_operations and dax_operations are.
> But the solution above is to solve hwpoison on one or a couple of pages, which
> happens rarely (I think). Does the 'pmem remove' operation cause hwpoison too?
> Does it call memory_failure() that many times? I haven't understood this yet.
I'm working on a patch here to call memory_failure() on a wide range
for the surprise remove of a dax_device while a filesystem might be
mounted. It won't be efficient, but there is no other way to notify
the kernel that it needs to immediately stop referencing a page.
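
For a rough idea of what "memory_failure() on a wide range" amounts to,
here is a userspace sketch of a per-pfn walk. It is not the patch
mentioned above. fail_pfn_range(), pfn_fail_fn and count_pfn() are
hypothetical names, and the callback merely stands in for a per-page
failure handler.

```c
#include <assert.h>

/* Hypothetical per-page handler: returns 0 on success, non-zero on error. */
typedef int (*pfn_fail_fn)(unsigned long pfn, void *ctx);

/*
 * Walk [start_pfn, start_pfn + nr_pages) and invoke the handler once per
 * pfn. Returns how many pfns the handler failed on. This is O(nr_pages),
 * which is why a wide range is not efficient.
 */
static unsigned long fail_pfn_range(unsigned long start_pfn,
				    unsigned long nr_pages,
				    pfn_fail_fn fn, void *ctx)
{
	unsigned long failed = 0;

	for (unsigned long i = 0; i < nr_pages; i++)
		if (fn(start_pfn + i, ctx) != 0)
			failed++;
	return failed;
}

/* Example handler: counts calls in *ctx, reports an error for odd pfns. */
static int count_pfn(unsigned long pfn, void *ctx)
{
	(*(unsigned long *)ctx)++;
	return (pfn & 1) ? -1 : 0;
}
```

Calling fail_pfn_range(100, 4, count_pfn, &calls) visits pfns 100 through
103, so the handler runs four times and fails on the two odd pfns.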