linux-kernel - Re: [PATCH 4/6] fs: xfs: Support atomic write for statx

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <36437378-de35-48dc-8b1e-b5c1370e38b0@oracle.com>
Date: Fri, 9 Feb 2024 17:30:50 +0000
From: John Garry <john.g.garry@...cle.com>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Cc: hch@....de, djwong@...nel.org, viro@...iv.linux.org.uk, brauner@...nel.org,
        dchinner@...hat.com, jack@...e.cz, chandan.babu@...cle.com,
        martin.petersen@...cle.com, linux-kernel@...r.kernel.org,
        linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        tytso@....edu, jbongio@...gle.com
Subject: Re: [PATCH 4/6] fs: xfs: Support atomic write for statx


>> +void xfs_get_atomic_write_attr(
>> +	struct xfs_inode *ip,
>> +	unsigned int *unit_min,
>> +	unsigned int *unit_max)
>> +{
>> +	xfs_extlen_t		extsz = xfs_get_extsz(ip);
>> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
>> +	struct block_device	*bdev = target->bt_bdev;
>> +	unsigned int		awu_min, awu_max, align;
>> +	struct request_queue	*q = bdev->bd_queue;
>> +	struct xfs_mount	*mp = ip->i_mount;
>> +
>> +	/*
>> +	 * Convert to multiples of the BLOCKSIZE (as we support a minimum
>> +	 * atomic write unit of BLOCKSIZE).
>> +	 */
>> +	awu_min = queue_atomic_write_unit_min_bytes(q);
>> +	awu_max = queue_atomic_write_unit_max_bytes(q);
>> +
>> +	awu_min &= ~mp->m_blockmask;
>> +	awu_max &= ~mp->m_blockmask;
> 
> I don't understand why we try to round down the awu_max to blocks size
> here and not just have an explicit check of (awu_max < blocksize).
We have later check for !awu_max:

if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
..

So what we are doing is ensuring that the awu_max which the device 
reports is at least FS block size. If it is not, then we cannot support 
atomic writes.

Indeed, my NVMe drive reports awu_max  = 4K. So for XFS configured for 
64K block size, we will say that we don't support atomic writes.

> 
> I think the issue with changing the awu_max is that we are using awu_max
> to also indirectly reflect the alignment so as to ensure we don't cross
> atomic boundaries set by the hw (eg we check uint_max % atomic alignment
> == 0 in scsi). So once we change the awu_max, there's a chance that even
> if an atomic write aligns to the new awu_max it still doesn't have the
> right alignment and fails.

All these values should be powers-of-2, so rounding down should not 
affect whether we would not cross an atomic write boundary.

> 
> It works right now since eveything is power of 2 but it should cause
> issues incase we decide to remove that limitation. 

Sure, but that is a fundamental principle of this atomic write support. 
Not having powers-of-2 requirement for atomic writes will affect many 
things.

> Anyways, I think
> this implicit behavior of things working since eveything is a power of 2
> should atleast be documented in a comment, so these things are
> immediately clear.
> 
>> +
>> +	align = XFS_FSB_TO_B(mp, extsz);
>> +
>> +	if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
>> +	    !is_power_of_2(align)) {
> 
> Correct me if I'm wrong but here as well, the is_power_of_2(align) is
> esentially checking if the align % uinit_max == 0 (or vice versa if
> unit_max is greater) 

yes

>so that an allocation of extsize will always align
> nicely as needed by the device.
>

I'm trying to keep things simple now.

In theory we could allow, say, align == 12 FSB, and then could say 
awu_max = 4.

The same goes for atomic write boundary in NVMe. Currently we say that 
it needs to be a power-of-2. However, it really just needs to be a 
multiple of awu_max. So if some HW did report a !power-of-2 atomic write 
boundary, we could reduce awu_max reported until to fits the power-of-2 
rule and also is cleanly divisible into atomic write boundary. But that 
is just not what HW will report (I expect). We live in a power-of-2 data 
granularity world.

> So maybe we should use the % expression explicitly so that the intention
> is immediately clear.

As mentioned, I wanted to keep it simple. In addition, it's a bit of a 
mess for the FS block allocator to work with odd sizes, like 12. And it 
does not suit RAID stripe alignment, which works in powers-of-2.

> 
>> +		*unit_min = 0;
>> +		*unit_max = 0;
>> +	} else {
>> +		if (awu_min)
>> +			*unit_min = min(awu_min, align);
> 
> How will the min() here work? If awu_min is the minumum set by the
> device, how can statx be allowed to advertise something smaller than
> that?
The idea is that is that if the awu_min reported by the device is less 
than the FS block size, then we report awu_min = FS block size. We 
already know that awu_max => FS block size, since we got this far, so 
saying that awu_min = FS block size is ok.

Otherwise it is the minimum of alignment and awu_min. I suppose that 
does not make much sense, and we should just always require awu_min = FS 
block size.

> 
> If I understand correctly, right now the way we set awu_min in scsi and
> nvme, the follwoing should usually be true for a sane device:
> 
>   awu_min <= blocks size of fs <= align
> 
>   so the min() anyways becomes redundant, but if we do assume that there
>   might be some weird devices with awu_min absurdly large (SCSI with
>   high atomic granularity) we still can't actually advertise a min
>   smaller than that of the device, or am I missing something here?

As above, I might just ensure that we can do awu_min = FS block size and 
not deal with crazy devices.

Thanks,
John