linux-kernel - Re: [PATCH v3 01/11] pagemap: Introduce ->memory

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAPcyv4jhUU3NVD8HLZnJzir+SugB6LnnrgJZ-jP45BZrbJ1dJQ@mail.gmail.com>
Date:   Tue, 23 Mar 2021 19:19:28 -0700
From:   Dan Williams <dan.j.williams@...el.com>
To:     "ruansy.fnst@...itsu.com" <ruansy.fnst@...itsu.com>
Cc:     Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-xfs <linux-xfs@...r.kernel.org>,
        linux-nvdimm <linux-nvdimm@...ts.01.org>,
        Linux MM <linux-mm@...ck.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        device-mapper development <dm-devel@...hat.com>,
        "Darrick J. Wong" <darrick.wong@...cle.com>,
        david <david@...morbit.com>, Christoph Hellwig <hch@....de>,
        Alasdair Kergon <agk@...hat.com>,
        Mike Snitzer <snitzer@...hat.com>,
        Goldwyn Rodrigues <rgoldwyn@...e.de>,
        "qi.fuli@...itsu.com" <qi.fuli@...itsu.com>,
        "y-goto@...itsu.com" <y-goto@...itsu.com>
Subject: Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()

On Thu, Mar 18, 2021 at 7:18 PM ruansy.fnst@...itsu.com
<ruansy.fnst@...itsu.com> wrote:
>
>
>
> > -----Original Message-----
> > From: ruansy.fnst@...itsu.com <ruansy.fnst@...itsu.com>
> > Subject: RE: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()
> > > > > > >
> > > > > > > After the conversation with Dave I don't see the point of this.
> > > > > > > If there is a memory_failure() on a page, why not just call
> > > > > > > memory_failure()? That already knows how to find the inode and
> > > > > > > the filesystem can be notified from there.
> > > > > >
> > > > > > We want memory_failure() supports reflinked files.  In this
> > > > > > case, we are not able to track multiple files from a page(this
> > > > > > broken
> > > > > > page) because
> > > > > > page->mapping,page->index can only track one file.  Thus, I
> > > > > > page->introduce this
> > > > > > ->memory_failure() implemented in pmem driver, to call
> > > > > > ->->corrupted_range()
> > > > > > upper level to upper level, and finally find out files who are
> > > > > > using(mmapping) this page.
> > > > > >
> > > > >
> > > > > I know the motivation, but this implementation seems backwards.
> > > > > It's already the case that memory_failure() looks up the
> > > > > address_space associated with a mapping. From there I would expect
> > > > > a new 'struct address_space_operations' op to let the fs handle
> > > > > the case when there are multiple address_spaces associated with a given
> > file.
> > > > >
> > > >
> > > > Let me think about it.  In this way, we
> > > >     1. associate file mapping with dax page in dax page fault;
> > >
> > > I think this needs to be a new type of association that proxies the
> > > representation of the reflink across all involved address_spaces.
> > >
> > > >     2. iterate files reflinked to notify `kill processes signal` by the
> > > >           new address_space_operation;
> > > >     3. re-associate to another reflinked file mapping when unmmaping
> > > >         (rmap qeury in filesystem to get the another file).
> > >
> > > Perhaps the proxy object is reference counted per-ref-link. It seems
> > > error prone to keep changing the association of the pfn while the reflink is
> > in-tact.
> > Hi, Dan
> >
> > I think my early rfc patchset was implemented in this way:
> >  - Create a per-page 'dax-rmap tree' to store each reflinked file's (mapping,
> > offset) when causing dax page fault.
> >  - Mount this tree on page->zone_device_data which is not used in fsdax, so
> > that we can iterate reflinked file mappings in memory_failure() easily.
> > In my understanding, the dax-rmap tree is the proxy object you mentioned.  If
> > so, I have to say, this method was rejected. Because this will cause huge
> > overhead in some case that every dax page have one dax-rmap tree.
> >
>
> Hi, Dan
>
> How do you think about this?  I am still confused.  Could you give me some advice?

So I think the primary driver of this functionality is dax-devices and
the architectural model for memory failure where several architectures
and error handlers know how to route pfn failure to the
memory_failure() frontend.

Compare that to block-devices where sector failure has no similar
framework, and despite some initial interest about reusing 'struct
badblocks' for this type of scenario there has been no real uptake to
expand 'struct badblocks' outside of the pmem driver.

I think the work you have done for ->corrupted_range() just needs to
be repurposed away from a block-device operation to dax-device
infrastructure. Christoph's pushback on extending
block_device_operations makes sense to me because there is likely no
other user of this facility than the pmem driver, and the pmem driver
only needs it for the vestigial reason that filesystems mount on
block-devices and not dax-devices.

Recently Dave drove home the point that a filesystem can't do anything
with pfns, it needs LBAs. A dax-device does not have LBA's, but it
does operate on the concept of device-relative offsets. The filesystem
is allowed to assume that dax-device:PFN[device_byte_offset >>
PAGE_SHIFT] aliases the same data as the associated
block-device:LBA[device_byte_offset >> SECTOR_SHIFT]. He also
reiterated that this interface should be range based, which you
already had, but I did not include in my attempt to communicate the
mass failure of an entire surprise-removed device.

So I think the path forward is:

- teach memory_failure() to allow for ranged failures

- let interested drivers register for memory failure events via a
blocking_notifier_head

- teach memory_failure() to optionally let the notifier chain claim
the event vs its current default of walking page->mapping

- teach the pmem driver to register for memory_failure() events and
filter the ones that apply to pfns that the driver owns

- drop the nfit driver's usage of the mce notifier chain since
memory_failure() is a superset of what the mce notifier communicates

- augment the pmem driver's view of badblocks that it gets from
address range scrub with one's it gets from memory_failure() events

- when pmem handles a memory_failure() event or an address range scrub
event fire a new event on a new per-dax-device blocking_notifier_head
indicating the dax-relative offset ranges of the translated PFNs. This
notification can optionally indicate failure, offline (for removal),
and online (for repaired ranges).

- teach dm to receive dax-device notifier events from its leaf devices
and then translate them into dax-device notifications relative to the
dm-device offset. This would seem to be a straightforward conversion
of what you have done with ->corrupted_range()

- teach filesystems to register for dax-device notifiers

With all of that in place an interested filesystem can take ownership
of a memory failure that impacts a range of pfns it is responsible for
via a dax-device, but it also allows a not interested filesystem to
default to standard single-pfn-at-a-time error handling and
assumptions about page->mapping only referring to a single address
space.

This obviously does not solve Dave's desire to get this type of error
reporting on block_devices, but I think there's nothing stopping a
parallel notifier chain from being created for block-devices, but
that's orthogonal to requirements and capabilities provided by
dax-devices.