Message-ID: <YYDYUCCiEPXhZEw0@infradead.org>
Date: Mon, 1 Nov 2021 23:18:56 -0700
From: Christoph Hellwig <hch@...radead.org>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: Christoph Hellwig <hch@...radead.org>,
Jane Chu <jane.chu@...cle.com>,
"david@...morbit.com" <david@...morbit.com>,
"dan.j.williams@...el.com" <dan.j.williams@...el.com>,
"vishal.l.verma@...el.com" <vishal.l.verma@...el.com>,
"dave.jiang@...el.com" <dave.jiang@...el.com>,
"agk@...hat.com" <agk@...hat.com>,
"snitzer@...hat.com" <snitzer@...hat.com>,
"dm-devel@...hat.com" <dm-devel@...hat.com>,
"ira.weiny@...el.com" <ira.weiny@...el.com>,
"willy@...radead.org" <willy@...radead.org>,
"vgoyal@...hat.com" <vgoyal@...hat.com>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>
Subject: Re: [dm-devel] [PATCH 0/6] dax poison recovery with RWF_RECOVERY_DATA flag
On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
> ...so would you happen to know if anyone's working on solving this
> problem for us by putting the memory controller in charge of dealing
> with media errors?
The only one who could know is Intel..
> The trouble is, we really /do/ want to be able to (re)write the failed
> area, and we probably want to try to read whatever we can. Those are
> reads and writes, not {pre,f}allocation activities. This is where Dave
> and I arrived at a month ago.
>
> Unless you'd be ok with a second IO path for recovery where we're
> allowed to be slow? That would probably have the same user interface
> flag, just a different path into the pmem driver.
Which is fine with me. If you look at the API here we do have the
RWF_ API, which then maps to the IOMAP API, which maps to the DAX_
API, which then gets special casing over three methods.
And while Pavel pointed out that he and Jens are now optimizing for
single branches like this, I think this actually is silly, and it is
not my point.
The point is that the DAX in-kernel API is a mess, and before we make
it even worse we need to sort it out first. What is directly relevant
here is that the copy_from_iter and copy_to_iter APIs do not make
sense. Most of the DAX API is based around getting a memory mapping
using ->direct_access; it is only the read/write path, which is a slow
path, that actually uses them. I have a very WIP patch series to try
to sort this out here:
http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
But back to this series. The basic DAX model is that the caller gets a
memory mapping and just works on that, maybe calling a sync after a write
in a few cases. So any kind of recovery really needs to be able to
work with that model as going forward the copy_to/from_iter path will
be used less and less. i.e. file systems can and should use
direct_access directly instead of using the block layer implementation
in the pmem driver. As an example, the dm-writecache driver, the pending
bcache nvdimm support and the (horrible and out of tree) nova file system
won't even use this path. We need to find a way to support recovery
for them. And overloading it over the read/write path, which is not
the main path for DAX but is the absolute fast path for 99% of the
kernel users, is a horrible idea.
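To make the "get a mapping, work on it, maybe sync" model concrete, here is a
rough userspace analogue. It is only a sketch: it uses an ordinary file with
mmap() and msync() rather than ->direct_access or any in-kernel DAX interface,
and the helper names write_via_mapping() and read_back() are made up for
illustration. The thing to notice is that the store is a plain memcpy() with
no read()/write() call on that path, so an error flag passed to read/write
would never be seen by a caller working this way.

```c
/* Userspace analogue only: an ordinary file stands in for a DAX mapping. */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: create `path` with size `len`, map it, store `msg`
 * through the mapping, then sync. Returns 0 on success, -1 on error.
 */
int write_via_mapping(const char *path, const char *msg, size_t len)
{
    assert(path != NULL && msg != NULL && len > 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    size_t n = strlen(msg);
    if (n > len)
        n = len;
    memcpy(p, msg, n); /* plain store into the mapping, no write() call */

    int rc = msync(p, len, MS_SYNC); /* the "sync after a write" step */
    munmap(p, len);
    return rc;
}

/* Hypothetical helper: read up to `n` bytes back through the normal I/O path. */
ssize_t read_back(const char *path, char *buf, size_t n)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t got = read(fd, buf, n);
    close(fd);
    return got;
}
```

For example, write_via_mapping("/tmp/demo.bin", "hello", 4096) followed by
read_back() on the same path should return the bytes that were stored through
the mapping.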
So how can we work around the horrible nvdimm design for data recovery
in a way that:
a) actually works with the intended direct memory map use case
b) doesn't really affect the normal kernel too much
?