linux-kernel - Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

Date: Fri, 2 Feb 2024 14:29:05 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Dan Williams <dan.j.williams@...el.com>, Arnd Bergmann <arnd@...db.de>,
 Dave Chinner <david@...morbit.com>
Cc: linux-kernel@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>,
 Linus Torvalds <torvalds@...ux-foundation.org>, linux-mm@...ck.org,
 linux-arch@...r.kernel.org, Vishal Verma <vishal.l.verma@...el.com>,
 Dave Jiang <dave.jiang@...el.com>, Matthew Wilcox <willy@...radead.org>,
 Russell King <linux@...linux.org.uk>, nvdimm@...ts.linux.dev,
 linux-cxl@...r.kernel.org, linux-fsdevel@...r.kernel.org,
 dm-devel@...ts.linux.dev
Subject: Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

On 2024-02-02 12:37, Dan Williams wrote:
> Mathieu Desnoyers wrote:
[...]
>>
> 
>> The alternative route I intend to take is to audit all callers
>> of alloc_dax() and make sure they all save the alloc_dax() return
>> value in a struct dax_device * local variable first for the sake
>> of checking for IS_ERR(). This will leave the xyz->dax_dev pointer
>> initialized to NULL in the error case and simplify the rest of
>> error checking.
> 
> I could maybe get on board with that, but it needs a comment somewhere
> about the asymmetric subtlety.

Is this "somewhere" at every alloc_dax() call site, or do you have
something else in mind ?
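
For reference, here is a minimal userspace sketch of the call-site pattern discussed above. It is only a model: the ERR_PTR helpers are simplified stand-ins for the kernel ones, and struct xyz, mock_alloc_dax() and xyz_setup() are hypothetical names used for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct dax_device { void *private; };

/* Hypothetical owner structure standing in for "xyz". */
struct xyz { struct dax_device *dax_dev; };

/* Mock allocator: returns ERR_PTR(-fail_err) when fail_err is non-zero. */
static struct dax_device *mock_alloc_dax(void *private, int fail_err)
{
	static struct dax_device dev;

	if (fail_err)
		return ERR_PTR(-fail_err);
	dev.private = private;
	return &dev;
}

/*
 * The return value goes into a local first. On error, x->dax_dev is
 * never written, so it keeps its initial NULL value and later cleanup
 * code only has to deal with NULL, never with an ERR_PTR.
 */
static int xyz_setup(struct xyz *x, int fail_err)
{
	struct dax_device *dax_dev = mock_alloc_dax(x, fail_err);

	if (IS_ERR(dax_dev))
		return (int)PTR_ERR(dax_dev);
	x->dax_dev = dax_dev;
	return 0;
}
```

The asymmetry is that the local variable may hold an ERR_PTR value while the structure member only ever holds NULL or a valid pointer.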

> 
>>
>>
>>>    		return;
>>>    
>>>    	if (dax_dev->holder_data != NULL)
>>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>>> index 4e8fdcb3f1c8..b69c9e442cf4 100644
>>> --- a/drivers/nvdimm/pmem.c
>>> +++ b/drivers/nvdimm/pmem.c
>>> @@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev,
>>>    	dax_dev = alloc_dax(pmem, &pmem_dax_ops);
>>>    	if (IS_ERR(dax_dev)) {
>>>    		rc = PTR_ERR(dax_dev);
>>> -		goto out;
>>> +		if (rc != -EOPNOTSUPP)
>>> +			goto out;
>>
>> If I compare the before / after this change, if previously
>> pmem_attach_disk() was called in a configuration with FS_DAX=n, it would
>> result in a NULL pointer dereference.
> 
> No, alloc_dax() only returns NULL in the CONFIG_DAX=n case, not in the
> CONFIG_FS_DAX=n case.

Indeed, I was wrong there.

> So that means that pmem devices on ARM have been
> possible without FS_DAX. So, in order for alloc_dax() returning
> ERR_PTR(-EOPNOTSUPP) to not regress pmem device availability this error
> path needs to be changed.

Good point. We're moving the "depends on !(ARM || MIPS || SPARC)" from the
FS_DAX Kconfig entry to a runtime check in alloc_dax(), which is used
whenever DAX=y, including configurations that previously had FS_DAX=n.

I'll change the error path in pmem_attach_disk() to treat an -EOPNOTSUPP
return value from alloc_dax() as non-fatal.
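
A rough userspace model of that error path follows. It is a sketch under assumptions, not the actual patch: struct pmem_model, mock_alloc_dax() and attach_disk_model() are hypothetical, and the ERR_PTR helpers are simplified stand-ins for the kernel ones.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct dax_device { int unused; };

/* Hypothetical, heavily reduced model of the pmem device state. */
struct pmem_model {
	struct dax_device *dax_dev;
	int disk_added;
};

/* Mock allocator: returns ERR_PTR(-fail_err) when fail_err is non-zero. */
static struct dax_device *mock_alloc_dax(int fail_err)
{
	static struct dax_device dev;

	return fail_err ? ERR_PTR(-fail_err) : &dev;
}

static int attach_disk_model(struct pmem_model *pmem, int alloc_err)
{
	struct dax_device *dax_dev = mock_alloc_dax(alloc_err);

	if (IS_ERR(dax_dev)) {
		int rc = (int)PTR_ERR(dax_dev);

		if (rc != -EOPNOTSUPP)
			return rc;	/* genuine failure: abort the attach */
		dax_dev = NULL;		/* DAX unsupported: carry on without it */
	}
	pmem->dax_dev = dax_dev;
	pmem->disk_added = 1;		/* stands in for device_add_disk() */
	return 0;
}
```

In this model the block device still becomes available when DAX is unsupported, while any other allocation error still aborts the attach.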

> 
>> This would be an error handling fix all by itself. Do we really want
>> to return successfully if dax is unsupported, or should we return
>> an error here ?
> 
> Per above, there is no error handling fix, and pmem block device
> availability should not depend on alloc_dax() succeeding.

I agree on treating alloc_dax() failure as non-fatal. There is
however one error handling fix to nvdimm/pmem which I plan to
introduce as an initial patch before this change:

     nvdimm/pmem: Fix leak on dax_add_host() failure
     
     Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
     before setting pmem->dax_dev, which therefore issues the two following
     calls on NULL pointers:
     
     out_cleanup_dax:
             kill_dax(pmem->dax_dev);
             put_dax(pmem->dax_dev);
     
     Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..9fe358090720 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev,
  	set_dax_nomc(dax_dev);
  	if (is_nvdimm_sync(nd_region))
  		set_dax_synchronous(dax_dev);
+	pmem->dax_dev = dax_dev;
  	rc = dax_add_host(dax_dev, disk);
  	if (rc)
  		goto out_cleanup_dax;
  	dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
-	pmem->dax_dev = dax_dev;
-
  	rc = device_add_disk(dev, disk, pmem_attribute_groups);
  	if (rc)
  		goto out_remove_host;
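
To illustrate why the position of the assignment matters, here is a small userspace model of the two orderings. Everything in it is a stand-in: the mock_* helpers, struct pmem_model, the refs/alive fields and the assign_early switch are hypothetical, and the cleanup helpers are assumed to be no-ops on NULL, which is what makes the old ordering a leak rather than a crash.

```c
#include <stddef.h>

struct dax_device { int refs; int alive; };
struct pmem_model { struct dax_device *dax_dev; };

/* Assumption: both cleanup helpers silently ignore a NULL argument. */
static void mock_kill_dax(struct dax_device *d) { if (d) d->alive = 0; }
static void mock_put_dax(struct dax_device *d) { if (d) d->refs--; }

/*
 * assign_early == 0 models the old ordering (pmem->dax_dev set after the
 * dax_add_host() step), assign_early == 1 models the ordering in the fix.
 * add_host_rc is the value the modelled dax_add_host() returns.
 */
static int attach_model(struct pmem_model *pmem, struct dax_device *dax_dev,
			int add_host_rc, int assign_early)
{
	int rc;

	dax_dev->refs = 1;		/* reference taken by the allocation */
	dax_dev->alive = 1;
	if (assign_early)
		pmem->dax_dev = dax_dev;
	rc = add_host_rc;		/* stands in for dax_add_host() */
	if (rc)
		goto out_cleanup_dax;
	pmem->dax_dev = dax_dev;
	return 0;

out_cleanup_dax:
	/* Cleanup goes through pmem->dax_dev, not the local variable. */
	mock_kill_dax(pmem->dax_dev);
	mock_put_dax(pmem->dax_dev);
	return rc;
}
```

With the old ordering and a failing dax_add_host() step, the cleanup calls receive NULL and the reference count stays at 1; with the early assignment it drops to 0.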

> 
> The real question is what to do about device-dax. I *think* it is not
> affected by cpu_dcache aliasing because it never accesses user mappings
> through a kernel alias. I doubt device-dax is in use on these platforms,
> but we might need another fixup for that if someone screams about the
> alloc_dax() behavior change making them lose device-dax access.

By "device-dax", I understand you mean drivers/dax/Kconfig:DEV_DAX.

Based on your analysis, is alloc_dax() still the right spot where
to place this runtime check ? Which call sites are responsible
for invoking alloc_dax() for device-dax ?

If we know which call sites do not intend to use the kernel linear
mapping, we could introduce a flag (or a new variant of the alloc_dax()
API) that would either enforce or skip the check.
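
As a strawman for discussion only, such a variant could look like the userspace model below. Nothing here exists in the kernel: DAX_ALLOC_NO_KERNEL_ALIAS, alloc_dax_flags_model() and model_dcache_aliasing are hypothetical names, and the ERR_PTR helpers are simplified stand-ins.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical flag: the caller promises it never accesses the memory
 * through the kernel linear mapping, so the aliasing check is skipped.
 */
#define DAX_ALLOC_NO_KERNEL_ALIAS (1UL << 0)

struct dax_device { void *private; };

/* Models the result of the runtime data cache aliasing check. */
static bool model_dcache_aliasing;

static struct dax_device *alloc_dax_flags_model(void *private,
						unsigned long flags)
{
	static struct dax_device dev;

	/* Enforce the check unless the caller opted out. */
	if (model_dcache_aliasing && !(flags & DAX_ALLOC_NO_KERNEL_ALIAS))
		return ERR_PTR(-EOPNOTSUPP);
	dev.private = private;
	return &dev;
}
```

Call sites that rely on the kernel linear mapping would pass 0 and keep the new behavior; a device-dax call site, if it is indeed unaffected, would pass the flag.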

[...]

>>
>> Here is what I'm seeing so far:
>>
>> - devm_release_mem_region() is never called after devm_request_mem_region(),
>>     neither on error nor on teardown,
> 
> devm_release_mem_region() is called from virtio_fs_probe() context. That

I guess you mean "devm_request_mem_region()" here.

> means that when virtio_fs_probe() returns an error the driver core will
> automatically call devm_request_mem_region().

And "devm_release_mem_region()" here.

> 
>> - pgmap is never freed on error after devm_kzalloc.
> 
> That is what the "devm_" in devm_kzalloc() does, free the memory on
> driver-probe failure, or after the driver remove callback is invoked.

Got it.

> 
>>
>>>    {
>>> +	struct dax_device *dax_dev __free(cleanup_dax) = NULL;
>>>    	struct virtio_shm_region cache_reg;
>>>    	struct dev_pagemap *pgmap;
>>>    	bool have_cache;
>>> @@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
>>>    	if (!IS_ENABLED(CONFIG_FUSE_DAX))
>>>    		return 0;
>>>    
>>> +	dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
>>> +	if (IS_ERR(dax_dev)) {
>>> +		int rc = PTR_ERR(dax_dev);
>>> +
>>> +		if (rc == -EOPNOTSUPP)
>>> +			return 0;
>>> +		return rc;
>>> +	}
>>
>> What is gained by moving this allocation here ?
> 
> The gain is to fail early in virtio_fs_setup_dax() since the fundamental
> dependency of alloc_dax() success is not met. For example why let the
> setup progress to devm_memremap_pages() when alloc_dax() is going to
> return ERR_PTR(-EOPNOTSUPP).

What I don't know is whether anything requires devm_request_mem_region(),
devm_kzalloc() and devm_memremap_pages() to be called before
alloc_dax() ?

Those 3 calls are used to populate:

         fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
         fs->window_len = (phys_addr_t) cache_reg.len;

and then alloc_dax() takes "fs" as private data parameter. So it's
unclear to me whether we can swap the invocation order. I suspect
that it is not an issue because it is only used to populate
dax_dev->private, but I prefer to confirm this with you just to be
on the safe side.
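
To spell out my suspicion, here is a userspace model of the reordering. It assumes, without having confirmed it, that the allocation only stores the private pointer and does not read the structure behind it; struct virtio_fs_model, alloc_model() and window_len_via_dax() are hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced model of the fields populated from cache_reg. */
struct virtio_fs_model {
	uint64_t window_phys_addr;
	uint64_t window_len;
};

struct dax_device { void *private; };

/* Assumption: allocation only records the pointer, nothing is read. */
static struct dax_device *alloc_model(void *private)
{
	static struct dax_device dev;

	dev.private = private;
	return &dev;
}

/* A later user dereferences the private pointer at call time. */
static uint64_t window_len_via_dax(const struct dax_device *dax_dev)
{
	const struct virtio_fs_model *fs = dax_dev->private;

	return fs->window_len;
}
```

Under that assumption, allocating first and filling in window_phys_addr and window_len afterwards is fine, as long as nothing goes through the private pointer before the fields are set.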

[...]

Thanks,

Mathieu



-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com