Date:   Wed, 3 Nov 2021 23:21:39 -0700
From:   Dan Williams <dan.j.williams@...el.com>
To:     Jane Chu <jane.chu@...cle.com>
Cc:     Christoph Hellwig <hch@...radead.org>,
        "Darrick J. Wong" <djwong@...nel.org>,
        "david@...morbit.com" <david@...morbit.com>,
        "vishal.l.verma@...el.com" <vishal.l.verma@...el.com>,
        "dave.jiang@...el.com" <dave.jiang@...el.com>,
        "agk@...hat.com" <agk@...hat.com>,
        "snitzer@...hat.com" <snitzer@...hat.com>,
        "dm-devel@...hat.com" <dm-devel@...hat.com>,
        "ira.weiny@...el.com" <ira.weiny@...el.com>,
        "willy@...radead.org" <willy@...radead.org>,
        "vgoyal@...hat.com" <vgoyal@...hat.com>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>
Subject: Re: [dm-devel] [PATCH 0/6] dax poison recovery with RWF_RECOVERY_DATA flag

On Wed, Nov 3, 2021 at 11:10 AM Jane Chu <jane.chu@...cle.com> wrote:
>
> On 11/1/2021 11:18 PM, Christoph Hellwig wrote:
> > On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
> >> ...so would you happen to know if anyone's working on solving this
> >> problem for us by putting the memory controller in charge of dealing
> >> with media errors?
> >
> > The only one who could know is Intel..
> >
> >> The trouble is, we really /do/ want to be able to (re)write the failed
> >> area, and we probably want to try to read whatever we can.  Those are
> >> reads and writes, not {pre,f}allocation activities.  This is where Dave
> >> and I arrived at a month ago.
> >>
> >> Unless you'd be ok with a second IO path for recovery where we're
> >> allowed to be slow?  That would probably have the same user interface
> >> flag, just a different path into the pmem driver.
> >
> > Which is fine with me.  If you look at the API here we do have the
> > RWF_ API, which then maps to the IOMAP API, which maps to the DAX_
> > API which then gets special casing over three methods.
> >
> > And while Pavel pointed out that he and Jens are now optimizing for
> > single branches like this, I think this actually is silly and it is
> > not my point.
> >
> > The point is that the DAX in-kernel API is a mess, and before we make
> > it even worse we need to sort it first.  What is directly relevant
> > here is that the copy_from_iter and copy_to_iter APIs do not make
> > sense.  Most of the DAX API is based around getting a memory mapping
> > using ->direct_access, it is just the read/write path which is a slow
> > path that actually uses this.  I have a very WIP patch series to try
> > to sort this out here:
> >
> > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
> >
> > But back to this series.  The basic DAX model is that the caller gets a
> > memory mapping and just works on that, maybe calling a sync after a write
> > in a few cases.  So any kind of recovery really needs to be able to
> > work with that model as going forward the copy_to/from_iter path will
> > be used less and less.  i.e. file systems can and should use
> > direct_access directly instead of using the block layer implementation
> > in the pmem driver.  As an example the dm-writecache driver, the pending
> > bcache nvdimm support and the (horrible and out of tree) nova file system
> > won't even use this path.  We need to find a way to support recovery
> > for them.  And overloading it over the read/write path which is not
> > the main path for DAX, but the absolutely fast path for 99% of the
> > kernel users is a horrible idea.
> >
> > So how can we work around the horrible nvdimm design for data recovery
> > in a way that:
> >
> >     a) actually works with the intended direct memory map use case
> >     b) doesn't really affect the normal kernel too much
> >
> > ?
> >
>
> This is clearer. I've looked at your 'dax-devirtualize' patch, which
> removes pmem_copy_to/from_iter, and as you mentioned before,
> a separate API for poison-clearing is needed. So how about I go ahead
> and rebase my earlier patch
>
> https://lore.kernel.org/lkml/20210914233132.3680546-2-jane.chu@oracle.com/
> on 'dax-devirtualize', and provide dm support for clear-poison?
> That way, the non-dax 99% of the pwrite use-cases aren't impacted at all
> and we resolve the urgent pmem poison-clearing issue?
>
> Dan, are you okay with this?  I am getting pressure from our customers
> who are basically stuck at the moment.

The concern I have with dax_clear_poison() is that it precludes atomic
error clearing. Also, as Boris and I discussed, poisoned pages should
be marked NP (not present) rather than UC (uncacheable) [1]. With
those two properties combined, I think this calls for a custom pmem
fault handler that knows how to carefully write to pmem pages with
poison present, rather than an additional explicit dax operation. That
also meets Christoph's requirement of "works with the intended direct
memory map use case".

[1]: https://lore.kernel.org/r/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com
