lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ypbz735oinozgm5bsxk7gyzgjpfpzzeb3k4f2fxq5ogm3hthb2@ict4rjvogy25>
Date: Thu, 8 Jan 2026 08:32:09 -0600
From: John Groves <John@...ves.net>
To: Jonathan Cameron <jonathan.cameron@...wei.com>
Cc: Miklos Szeredi <miklos@...redi.hu>, 
	Dan Williams <dan.j.williams@...el.com>, Bernd Schubert <bschubert@....com>, 
	Alison Schofield <alison.schofield@...el.com>, John Groves <jgroves@...ron.com>, 
	Jonathan Corbet <corbet@....net>, Vishal Verma <vishal.l.verma@...el.com>, 
	Dave Jiang <dave.jiang@...el.com>, Matthew Wilcox <willy@...radead.org>, Jan Kara <jack@...e.cz>, 
	Alexander Viro <viro@...iv.linux.org.uk>, David Hildenbrand <david@...nel.org>, 
	Christian Brauner <brauner@...nel.org>, "Darrick J . Wong" <djwong@...nel.org>, 
	Randy Dunlap <rdunlap@...radead.org>, Jeff Layton <jlayton@...nel.org>, 
	Amir Goldstein <amir73il@...il.com>, Stefan Hajnoczi <shajnocz@...hat.com>, 
	Joanne Koong <joannelkoong@...il.com>, Josef Bacik <josef@...icpanda.com>, 
	Bagas Sanjaya <bagasdotme@...il.com>, Chen Linxuan <chenlinxuan@...ontech.com>, 
	James Morse <james.morse@....com>, Fuad Tabba <tabba@...gle.com>, 
	Sean Christopherson <seanjc@...gle.com>, Shivank Garg <shivankg@....com>, 
	Ackerley Tng <ackerleytng@...gle.com>, Gregory Price <gourry@...rry.net>, 
	Aravind Ramesh <arramesh@...ron.com>, Ajay Joshi <ajayjoshi@...ron.com>, venkataravis@...ron.com, 
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, nvdimm@...ts.linux.dev, 
	linux-cxl@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH V3 02/21] dax: add fsdev.c driver for fs-dax on character
 dax

On 26/01/08 11:31AM, Jonathan Cameron wrote:
> On Wed,  7 Jan 2026 09:33:11 -0600
> John Groves <John@...ves.net> wrote:
> 
> > The new fsdev driver provides pages/folios initialized compatibly with
> > fsdax - normal rather than devdax-style refcounting, and starting out
> > with order-0 folios.
> > 
> > When fsdev binds to a daxdev, it is usually (always?) switching from the
> > devdax mode (device.c), which pre-initializes compound folios according
> > to its alignment. Fsdev uses fsdev_clear_folio_state() to switch the
> > folios into a fsdax-compatible state.
> > 
> > A side effect of this is that raw mmap doesn't (can't?) work on an fsdev
> > dax instance. Accordingly, The fsdev driver does not provide raw mmap -
> > devices must be put in 'devdax' mode (drivers/dax/device.c) to get raw
> > mmap capability.
> > 
> > In this commit is just the framework, which remaps pages/folios compatibly
> > with fsdax.
> > 
> > Enabling dax changes:
> > 
> > * bus.h: add DAXDRV_FSDEV_TYPE driver type
> > * bus.c: allow DAXDRV_FSDEV_TYPE drivers to bind to daxdevs
> > * dax.h: prototype inode_dax(), which fsdev needs
> > 
> > Suggested-by: Dan Williams <dan.j.williams@...el.com>
> > Suggested-by: Gregory Price <gourry@...rry.net>
> > Signed-off-by: John Groves <john@...ves.net>
> 
> > diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> > index d656e4c0eb84..491325d914a8 100644
> > --- a/drivers/dax/Kconfig
> > +++ b/drivers/dax/Kconfig
> > @@ -78,4 +78,21 @@ config DEV_DAX_KMEM
> >  
> >  	  Say N if unsure.
> >  
> > +config DEV_DAX_FS
> > +	tristate "FSDEV DAX: fs-dax compatible device driver"
> > +	depends on DEV_DAX
> > +	default DEV_DAX
> 
> What's the logic for the default? Generally I'd not expect a
> default for something new like this (so default of default == no)
> 
> > +	help
> > +	  Support a device-dax driver mode that is compatible with fs-dax
> 
> ...
> 
> 
> 
> >  struct dax_device_driver {
> > diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> > new file mode 100644
> > index 000000000000..2a3249d1529c
> > --- /dev/null
> > +++ b/drivers/dax/fsdev.c
> > @@ -0,0 +1,276 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright(c) 2026 Micron Technology, Inc. */
> > +#include <linux/memremap.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/module.h>
> > +#include <linux/device.h>
> > +#include <linux/cdev.h>
> > +#include <linux/slab.h>
> > +#include <linux/dax.h>
> > +#include <linux/fs.h>
> > +#include <linux/mm.h>
> > +#include "dax-private.h"
> > +#include "bus.h"
> 
> ...
> 
> > +static void fsdev_cdev_del(void *cdev)
> > +{
> > +	cdev_del(cdev);
> > +}
> > +
> > +static void fsdev_kill(void *dev_dax)
> > +{
> > +	kill_dev_dax(dev_dax);
> > +}
> 
> ...
> 
> > +/*
> > + * Clear any stale folio state from pages in the given range.
> > + * This is necessary because device_dax pre-initializes compound folios
> > + * based on vmemmap_shift, and that state may persist after driver unbind.
> 
> What's the argument for not cleaning these out in the unbind path for device_dax?
> I can see that it might be an optimization if some other code path blindly
> overwrites all this state.

I prefer this because it doesn't rely on some other module having done the
right thing. Dax maintainers might have thoughts too though.

> 
> > + * Since fsdev_dax uses MEMORY_DEVICE_FS_DAX without vmemmap_shift, fs-dax
> > + * expects to find clean order-0 folios that it can build into compound
> > + * folios on demand.
> > + *
> > + * At probe time, no filesystem should be mounted yet, so all mappings
> > + * are stale and must be cleared along with compound state.
> > + */
> > +static void fsdev_clear_folio_state(struct dev_dax *dev_dax)
> > +{
> > +	int i;
> 
> It's becoming increasingly common to declare loop variables as
> for (int i = 0; i <...
> 
> and given that saves us a few lines here it seems worth doing.

Done thanks

> 
> > +
> > +	for (i = 0; i < dev_dax->nr_range; i++) {
> > +		struct range *range = &dev_dax->ranges[i].range;
> > +		unsigned long pfn, end_pfn;
> > +
> > +		pfn = PHYS_PFN(range->start);
> > +		end_pfn = PHYS_PFN(range->end) + 1;
> 
> Might as well do
> 		unsigned long pfn = PHY_PFN(range->start);
> 		unsigned long end_pfn = PHYS_PFN(range->end) + 1;

Sounds good, done

> > +
> > +		while (pfn < end_pfn) {
> > +			struct page *page = pfn_to_page(pfn);
> > +			struct folio *folio = (struct folio *)page;
> > +			struct dev_pagemap *pgmap = page_pgmap(page);
> > +			int order = folio_order(folio);
> > +
> > +			/*
> > +			 * Clear any stale mapping pointer. At probe time,
> > +			 * no filesystem is mounted, so any mapping is stale.
> > +			 */
> > +			folio->mapping = NULL;
> > +			folio->share = 0;
> > +
> > +			if (order > 0) {
> > +				int j;
> > +
> > +				folio_reset_order(folio);
> > +				for (j = 0; j < (1UL << order); j++) {
> > +					struct page *p = page + j;
> > +
> > +					ClearPageHead(p);
> > +					clear_compound_head(p);
> > +					((struct folio *)p)->mapping = NULL;
> 
> This code block is very similar to a chunk in dax_folio_put() in fs/dax.c
> 
> Can we create a helper for both to use?
> 
> I note that uses a local struct folio *new_folio to avoid multiple casts.
> I'd do similar here even if it's a long line.
>  
> If not possible to use a common helper, it is probably still worth
> having a helper here for the stuff in the while loop just to reduce indent
> and improve readability a little.

Good catch! You shall have a Suggested-by in the next version, which will
inject a commit right before this that factors out dax_folio_reset_order()
from dax_folio_put(). Then fsdev_clear_folio_state() will also call that.

> 
> > +					((struct folio *)p)->share = 0;
> > +					((struct folio *)p)->pgmap = pgmap;
> > +				}
> > +				pfn += (1UL << order);
> > +			} else {
> > +				folio->pgmap = pgmap;
> > +				pfn++;
> > +			}
> > +		}
> > +	}
> > +}
> > +
> > +static int fsdev_open(struct inode *inode, struct file *filp)
> > +{
> > +	struct dax_device *dax_dev = inode_dax(inode);
> > +	struct dev_dax *dev_dax = dax_get_private(dax_dev);
> > +
> > +	dev_dbg(&dev_dax->dev, "trace\n");
> 
> Hmm. This is a somewhat odd, but I see dax/device.c does
> the same thing and I guess that's because you are using
> dynamic debug with function names turned on to provide the
> 'real' information.

Actually I just have it from the gut-and-repurpose of device.c.
Dropping from fsdev.c as I'm not using it.

> 
> 
> 
> > +	filp->private_data = dev_dax;
> > +
> > +	return 0;
> > +}
> 
> > +static int fsdev_dax_probe(struct dev_dax *dev_dax)
> > +{
> > +	struct dax_device *dax_dev = dev_dax->dax_dev;
> > +	struct device *dev = &dev_dax->dev;
> > +	struct dev_pagemap *pgmap;
> > +	u64 data_offset = 0;
> > +	struct inode *inode;
> > +	struct cdev *cdev;
> > +	void *addr;
> > +	int rc, i;
> > +
> 
> A bunch of this is cut and paste from dax/device.c
> If it carries on looking like this, can we have a helper module that
> both drivers use with the common code in it? That would make the
> difference more obvious as well.

Makes sense. I'll wait for thoughts from the dax people before
flipping bits on this though.

> 
> > +	if (static_dev_dax(dev_dax))  {
> > +		if (dev_dax->nr_range > 1) {
> > +			dev_warn(dev,
> > +				"static pgmap / multi-range device conflict\n");
> > +			return -EINVAL;
> > +		}
> > +
> > +		pgmap = dev_dax->pgmap;
> > +	} else {
> > +		if (dev_dax->pgmap) {
> > +			dev_warn(dev,
> > +				 "dynamic-dax with pre-populated page map\n");
> Unless dax maintainers are very fussy about 80 chars, I'd go long on these as it's
> only just over 80 chars on one line.
> 
> Given you are failing probe, not sure why dev_warn() is considered sufficient.
> To me dev_err() seems more sensible. What you have matches dax/device.c though
> so maybe there is a sound reason.

I'm personally a bit fussy about 80 column code, being kinda old and favoring
80 column emacs windows :D - mulling it over.

> 
> > +			return -EINVAL;
> > +		}
> > +
> > +		pgmap = devm_kzalloc(dev,
> > +			struct_size(pgmap, ranges, dev_dax->nr_range - 1),
> > +				     GFP_KERNEL);
> Pick an alignment style and stick to it.  Either.
> 		pgmap = devm_kzalloc(dev,
> 			struct_size(pgmap, ranges, dev_dax->nr_range - 1),
> 			GFP_KERNEL);
> 
> or go long for readability and do
> 		pgmap = devm_kzalloc(dev,
> 				     struct_size(pgmap, ranges, dev_dax->nr_range - 1),
> 				     GFP_KERNEL);

Will do something cleaner. This is the aforementioned 80 column curmudgeonliness
at work...

> 
> 
> 
> > +		if (!pgmap)
> > +			return -ENOMEM;
> > +
> > +		pgmap->nr_range = dev_dax->nr_range;
> > +		dev_dax->pgmap = pgmap;
> > +
> > +		for (i = 0; i < dev_dax->nr_range; i++) {
> > +			struct range *range = &dev_dax->ranges[i].range;
> > +
> > +			pgmap->ranges[i] = *range;
> > +		}
> > +	}
> > +
> > +	for (i = 0; i < dev_dax->nr_range; i++) {
> > +		struct range *range = &dev_dax->ranges[i].range;
> > +
> > +		if (!devm_request_mem_region(dev, range->start,
> > +					range_len(range), dev_name(dev))) {
> > +			dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve range\n",
> > +					i, range->start, range->end);
> > +			return -EBUSY;
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * FS-DAX compatible mode: Use MEMORY_DEVICE_FS_DAX type and
> > +	 * do NOT set vmemmap_shift. This leaves folios at order-0,
> > +	 * allowing fs-dax to dynamically create compound folios as needed
> > +	 * (similar to pmem behavior).
> > +	 */
> > +	pgmap->type = MEMORY_DEVICE_FS_DAX;
> > +	pgmap->ops = &fsdev_pagemap_ops;
> > +	pgmap->owner = dev_dax;
> > +
> > +	/*
> > +	 * CRITICAL DIFFERENCE from device.c:
> > +	 * We do NOT set vmemmap_shift here, even if align > PAGE_SIZE.
> > +	 * This ensures folios remain order-0 and are compatible with
> > +	 * fs-dax's folio management.
> > +	 */
> > +
> > +	addr = devm_memremap_pages(dev, pgmap);
> > +	if (IS_ERR(addr))
> > +		return PTR_ERR(addr);
> > +
> > +	/*
> > +	 * Clear any stale compound folio state left over from a previous
> > +	 * driver (e.g., device_dax with vmemmap_shift).
> > +	 */
> > +	fsdev_clear_folio_state(dev_dax);
> > +
> > +	/* Detect whether the data is at a non-zero offset into the memory */
> > +	if (pgmap->range.start != dev_dax->ranges[0].range.start) {
> > +		u64 phys = dev_dax->ranges[0].range.start;
> > +		u64 pgmap_phys = dev_dax->pgmap[0].range.start;
> > +
> > +		if (!WARN_ON(pgmap_phys > phys))
> > +			data_offset = phys - pgmap_phys;
> > +
> > +		pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx offset=%llx\n",
> > +		       __func__, phys, pgmap_phys, data_offset);
> > +	}
> > +
> > +	inode = dax_inode(dax_dev);
> > +	cdev = inode->i_cdev;
> > +	cdev_init(cdev, &fsdev_fops);
> > +	cdev->owner = dev->driver->owner;
> > +	cdev_set_parent(cdev, &dev->kobj);
> > +	rc = cdev_add(cdev, dev->devt, 1);
> > +	if (rc)
> > +		return rc;
> > +
> > +	rc = devm_add_action_or_reset(dev, fsdev_cdev_del, cdev);
> > +	if (rc)
> > +		return rc;
> > +
> > +	run_dax(dax_dev);
> > +	return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
> > +}
> > +
> > +static struct dax_device_driver fsdev_dax_driver = {
> > +	.probe = fsdev_dax_probe,
> > +	.type = DAXDRV_FSDEV_TYPE,
> > +};
> > +
> > +static int __init dax_init(void)
> > +{
> > +	return dax_driver_register(&fsdev_dax_driver);
> > +}
> > +
> > +static void __exit dax_exit(void)
> > +{
> > +	dax_driver_unregister(&fsdev_dax_driver);
> > +}
> If these don't get more complex, maybe it's time for a dax specific define
> using module_driver()

I'll defer to the dax folks here

> 
> > +
> > +MODULE_AUTHOR("John Groves");
> > +MODULE_DESCRIPTION("FS-DAX Device: fs-dax compatible devdax driver");
> > +MODULE_LICENSE("GPL");
> > +module_init(dax_init);
> > +module_exit(dax_exit);
> > +MODULE_ALIAS_DAX_DEVICE(0);
> 
> Curious macro. Always has same parameter...  Maybe ripe for just dropping the parameter?
> 
> 

Thanks!
John

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ