[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <36437378-de35-48dc-8b1e-b5c1370e38b0@oracle.com>
Date: Fri, 9 Feb 2024 17:30:50 +0000
From: John Garry <john.g.garry@...cle.com>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Cc: hch@....de, djwong@...nel.org, viro@...iv.linux.org.uk, brauner@...nel.org,
dchinner@...hat.com, jack@...e.cz, chandan.babu@...cle.com,
martin.petersen@...cle.com, linux-kernel@...r.kernel.org,
linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
tytso@....edu, jbongio@...gle.com
Subject: Re: [PATCH 4/6] fs: xfs: Support atomic write for statx
>> +void xfs_get_atomic_write_attr(
>> + struct xfs_inode *ip,
>> + unsigned int *unit_min,
>> + unsigned int *unit_max)
>> +{
>> + xfs_extlen_t extsz = xfs_get_extsz(ip);
>> + struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>> + struct block_device *bdev = target->bt_bdev;
>> + unsigned int awu_min, awu_max, align;
>> + struct request_queue *q = bdev->bd_queue;
>> + struct xfs_mount *mp = ip->i_mount;
>> +
>> + /*
>> + * Convert to multiples of the BLOCKSIZE (as we support a minimum
>> + * atomic write unit of BLOCKSIZE).
>> + */
>> + awu_min = queue_atomic_write_unit_min_bytes(q);
>> + awu_max = queue_atomic_write_unit_max_bytes(q);
>> +
>> + awu_min &= ~mp->m_blockmask;
>> + awu_max &= ~mp->m_blockmask;
>
> I don't understand why we try to round down the awu_max to blocks size
> here and not just have an explicit check of (awu_max < blocksize).
We have later check for !awu_max:
if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
..
So what we are doing is ensuring that the awu_max which the device
reports is at least FS block size. If it is not, then we cannot support
atomic writes.
Indeed, my NVMe drive reports awu_max = 4K. So for XFS configured for
64K block size, we will say that we don't support atomic writes.
>
> I think the issue with changing the awu_max is that we are using awu_max
> to also indirectly reflect the alignment so as to ensure we don't cross
> atomic boundaries set by the hw (eg we check uint_max % atomic alignment
> == 0 in scsi). So once we change the awu_max, there's a chance that even
> if an atomic write aligns to the new awu_max it still doesn't have the
> right alignment and fails.
All these values should be powers-of-2, so rounding down should not
affect whether we would not cross an atomic write boundary.
>
> It works right now since eveything is power of 2 but it should cause
> issues incase we decide to remove that limitation.
Sure, but that is a fundamental principle of this atomic write support.
Not having powers-of-2 requirement for atomic writes will affect many
things.
> Anyways, I think
> this implicit behavior of things working since eveything is a power of 2
> should atleast be documented in a comment, so these things are
> immediately clear.
>
>> +
>> + align = XFS_FSB_TO_B(mp, extsz);
>> +
>> + if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
>> + !is_power_of_2(align)) {
>
> Correct me if I'm wrong but here as well, the is_power_of_2(align) is
> esentially checking if the align % uinit_max == 0 (or vice versa if
> unit_max is greater)
yes
>so that an allocation of extsize will always align
> nicely as needed by the device.
>
I'm trying to keep things simple now.
In theory we could allow, say, align == 12 FSB, and then could say
awu_max = 4.
The same goes for atomic write boundary in NVMe. Currently we say that
it needs to be a power-of-2. However, it really just needs to be a
multiple of awu_max. So if some HW did report a !power-of-2 atomic write
boundary, we could reduce awu_max reported until to fits the power-of-2
rule and also is cleanly divisible into atomic write boundary. But that
is just not what HW will report (I expect). We live in a power-of-2 data
granularity world.
> So maybe we should use the % expression explicitly so that the intention
> is immediately clear.
As mentioned, I wanted to keep it simple. In addition, it's a bit of a
mess for the FS block allocator to work with odd sizes, like 12. And it
does not suit RAID stripe alignment, which works in powers-of-2.
>
>> + *unit_min = 0;
>> + *unit_max = 0;
>> + } else {
>> + if (awu_min)
>> + *unit_min = min(awu_min, align);
>
> How will the min() here work? If awu_min is the minumum set by the
> device, how can statx be allowed to advertise something smaller than
> that?
The idea is that is that if the awu_min reported by the device is less
than the FS block size, then we report awu_min = FS block size. We
already know that awu_max => FS block size, since we got this far, so
saying that awu_min = FS block size is ok.
Otherwise it is the minimum of alignment and awu_min. I suppose that
does not make much sense, and we should just always require awu_min = FS
block size.
>
> If I understand correctly, right now the way we set awu_min in scsi and
> nvme, the follwoing should usually be true for a sane device:
>
> awu_min <= blocks size of fs <= align
>
> so the min() anyways becomes redundant, but if we do assume that there
> might be some weird devices with awu_min absurdly large (SCSI with
> high atomic granularity) we still can't actually advertise a min
> smaller than that of the device, or am I missing something here?
As above, I might just ensure that we can do awu_min = FS block size and
not deal with crazy devices.
Thanks,
John
Powered by blists - more mailing lists