linux-kernel - Re: [PATCH v9 02/10] dax: Introduce holder for dax

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220105235407.GN656707@magnolia>
Date:   Wed, 5 Jan 2022 15:54:07 -0800
From:   "Darrick J. Wong" <djwong@...nel.org>
To:     Dan Williams <dan.j.williams@...el.com>
Cc:     Shiyang Ruan <ruansy.fnst@...itsu.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-xfs <linux-xfs@...r.kernel.org>,
        Linux NVDIMM <nvdimm@...ts.linux.dev>,
        Linux MM <linux-mm@...ck.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        david <david@...morbit.com>,
        Christoph Hellwig <hch@...radead.org>,
        Jane Chu <jane.chu@...cle.com>
Subject: Re: [PATCH v9 02/10] dax: Introduce holder for dax_device

On Wed, Jan 05, 2022 at 03:01:22PM -0800, Dan Williams wrote:
> On Wed, Jan 5, 2022 at 2:47 PM Darrick J. Wong <djwong@...nel.org> wrote:
> >
> > On Wed, Jan 05, 2022 at 11:20:12AM -0800, Dan Williams wrote:
> > > On Wed, Jan 5, 2022 at 10:56 AM Darrick J. Wong <djwong@...nel.org> wrote:
> > > >
> > > > On Wed, Jan 05, 2022 at 10:23:08AM -0800, Dan Williams wrote:
> > > > > On Wed, Jan 5, 2022 at 10:12 AM Darrick J. Wong <djwong@...nel.org> wrote:
> > > > > >
> > > > > > On Sun, Dec 26, 2021 at 10:34:31PM +0800, Shiyang Ruan wrote:
> > > > > > > To easily track filesystem from a pmem device, we introduce a holder for
> > > > > > > dax_device structure, and also its operation.  This holder is used to
> > > > > > > remember who is using this dax_device:
> > > > > > >  - When it is the backend of a filesystem, the holder will be the
> > > > > > >    instance of this filesystem.
> > > > > > >  - When this pmem device is one of the targets in a mapped device, the
> > > > > > >    holder will be this mapped device.  In this case, the mapped device
> > > > > > >    has its own dax_device and it will follow the first rule.  So that we
> > > > > > >    can finally track to the filesystem we needed.
> > > > > > >
> > > > > > > The holder and holder_ops will be set when filesystem is being mounted,
> > > > > > > or an target device is being activated.
> > > > > > >
> > > > > > > Signed-off-by: Shiyang Ruan <ruansy.fnst@...itsu.com>
> > > > > > > ---
> > > > > > >  drivers/dax/super.c | 62 +++++++++++++++++++++++++++++++++++++++++++++
> > > > > > >  include/linux/dax.h | 29 +++++++++++++++++++++
> > > > > > >  2 files changed, 91 insertions(+)
> > > > > > >
> > > > > > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > > > > > index c46f56e33d40..94c51f2ee133 100644
> > > > > > > --- a/drivers/dax/super.c
> > > > > > > +++ b/drivers/dax/super.c
> > > > > > > @@ -20,15 +20,20 @@
> > > > > > >   * @inode: core vfs
> > > > > > >   * @cdev: optional character interface for "device dax"
> > > > > > >   * @private: dax driver private data
> > > > > > > + * @holder_data: holder of a dax_device: could be filesystem or mapped device
> > > > > > >   * @flags: state and boolean properties
> > > > > > > + * @ops: operations for dax_device
> > > > > > > + * @holder_ops: operations for the inner holder
> > > > > > >   */
> > > > > > >  struct dax_device {
> > > > > > >       struct inode inode;
> > > > > > >       struct cdev cdev;
> > > > > > >       void *private;
> > > > > > >       struct percpu_rw_semaphore rwsem;
> > > > > > > +     void *holder_data;
> > > > > > >       unsigned long flags;
> > > > > > >       const struct dax_operations *ops;
> > > > > > > +     const struct dax_holder_operations *holder_ops;
> > > > > > >  };
> > > > > > >
> > > > > > >  static dev_t dax_devt;
> > > > > > > @@ -192,6 +197,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(dax_zero_page_range);
> > > > > > >
> > > > > > > +int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
> > > > > > > +                           u64 len, int mf_flags)
> > > > > > > +{
> > > > > > > +     int rc;
> > > > > > > +
> > > > > > > +     dax_read_lock(dax_dev);
> > > > > > > +     if (!dax_alive(dax_dev)) {
> > > > > > > +             rc = -ENXIO;
> > > > > > > +             goto out;
> > > > > > > +     }
> > > > > > > +
> > > > > > > +     if (!dax_dev->holder_ops) {
> > > > > > > +             rc = -EOPNOTSUPP;
> > > > > > > +             goto out;
> > > > > > > +     }
> > > > > > > +
> > > > > > > +     rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
> > > > > > > +out:
> > > > > > > +     dax_read_unlock(dax_dev);
> > > > > > > +     return rc;
> > > > > > > +}
> > > > > > > +EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
> > > > > > > +
> > > > > > >  #ifdef CONFIG_ARCH_HAS_PMEM_API
> > > > > > >  void arch_wb_cache_pmem(void *addr, size_t size);
> > > > > > >  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
> > > > > > > @@ -254,6 +282,10 @@ void kill_dax(struct dax_device *dax_dev)
> > > > > > >               return;
> > > > > > >       dax_write_lock(dax_dev);
> > > > > > >       clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > > > > > > +
> > > > > > > +     /* clear holder data */
> > > > > > > +     dax_dev->holder_ops = NULL;
> > > > > > > +     dax_dev->holder_data = NULL;
> > > > > > >       dax_write_unlock(dax_dev);
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(kill_dax);
> > > > > > > @@ -401,6 +433,36 @@ void put_dax(struct dax_device *dax_dev)
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(put_dax);
> > > > > > >
> > > > > > > +void dax_register_holder(struct dax_device *dax_dev, void *holder,
> > > > > > > +             const struct dax_holder_operations *ops)
> > > > > > > +{
> > > > > > > +     if (!dax_alive(dax_dev))
> > > > > > > +             return;
> > > > > > > +
> > > > > > > +     dax_dev->holder_data = holder;
> > > > > > > +     dax_dev->holder_ops = ops;
> > > > > >
> > > > > > Shouldn't this return an error code if the dax device is dead or if
> > > > > > someone already registered a holder?  I'm pretty sure XFS should not
> > > > > > bind to a dax device if someone else already registered for it...
> > > > >
> > > > > Agree, yes.
> > > > >
> > > > > >
> > > > > > ...unless you want to use a notifier chain for failure events so that
> > > > > > there can be multiple consumers of dax failure events?
> > > > >
> > > > > No, I would hope not. It should be 1:1 holders to dax-devices. Similar
> > > > > ownership semantics like bd_prepare_to_claim().
> > > >
> > > > Does each partition on a pmem device still have its own dax_device?
> > >
> > > No, it never did...
> > >
> > > Just as before, each dax-device is still associated with a gendisk /
> > > whole-block_device. The recent change is that instead of needing that
> > > partition-block_device plumbed to convert a relative block number to
> > > its absolute whole-block_device offset the filesystem now handles that
> > > at iomap_begin() time. See:
> > >
> > >                 if (mapping_flags & IOMAP_DAX)
> > >                         iomap->addr += target->bt_dax_part_off;
> > >
> > > ...in xfs_bmbt_to_iomap() (in -next). I.e. bdev_dax_pgoff() is gone
> > > with the lead-in reworks.
> >
> > OH.  Crap, Dan's right:
> >
> > XFS (pmem0p1): ddev dax = 0xffff88800304ed00 bdev = 0xffff8880480d6180
> > XFS (pmem0p2): ddev dax = 0xffff88800304ed00 bdev = 0xffff8880480d4380
> >
> > Unless you're planning to remove partition support too, this patch needs
> > to be reworked so that multiple filesystems in separate partitions can
> > each call dax_register_holder to receive memory_failure notifications
> > for their partition.
> 
> Oh, crap indeed. I think this gets back to the slow tip-toeing away
> from dax + partition support. While FSDAX can continue to support
> "legacy/experimental" operation on partitions of DAX capable block
> devices, I think failure notification + reflink support is where we
> draw the line and say "DAX on partitions was a mistake, it's too late
> to undo that mistake, but going forward for new FSDAX features it
> requires switching from block-device partitions to pmem-namespaces if
> you need sub-division support and new DAX features."
> 
> > /methinks this sharing is all a little scary...
> 
> Yes, I think we should just fail the holder registration and
> DAX+reflink unless the FS being mounted on a whole device. I know Ted
> and others had reservations about moving filesystems to be mounted on
> dax-devices directly, but hopefully the whole-block_device requirement
> is a workable middle ground?

I think you have to be /very/ careful about that kind of statement --

Take ext4 for example.  It has a lot of statically allocated ondisk
metadata.  Someone could decide that it's a good idea to wire up a media
failure notification so that we shut down the fs if (say) a giant hole
opens up in the middle of the inode table.  However, registering any
kind of media failure handler brings along this requirement for not
having partitions.

This means that if ext4 finds a filesystem on a partition on a pmem
device and someone else has already registered a media failure handler,
it will have to choose between foregoing media failure notifications or
failing the mount outright.

Or you could support notification call chains...

--D

PS: I was about to say that ext4 lacks reverse mapping and reflink, but
I forgot that ext4 *does* support reflink now.  It just doesn't support
copy on write.