linux-kernel - Re: [PATCH 10/21] block: Add fops atomic write support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <c8ef6c57-5e3a-4c12-9f0f-a2e1b1e49ddc@oracle.com>
Date:   Tue, 5 Dec 2023 10:49:49 +0000
From:   John Garry <john.g.garry@...cle.com>
To:     Ming Lei <ming.lei@...hat.com>
Cc:     axboe@...nel.dk, kbusch@...nel.org, hch@....de, sagi@...mberg.me,
        jejb@...ux.ibm.com, martin.petersen@...cle.com, djwong@...nel.org,
        viro@...iv.linux.org.uk, brauner@...nel.org,
        chandan.babu@...cle.com, dchinner@...hat.com,
        linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-nvme@...ts.infradead.org, linux-xfs@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, tytso@....edu, jbongio@...gle.com,
        linux-api@...r.kernel.org
Subject: Re: [PATCH 10/21] block: Add fops atomic write support

On 05/12/2023 01:45, Ming Lei wrote:
>> Right, we do expect userspace to use a fixed block size, but we give scope
>> in the API to use variable size.
> Maybe it is enough to just take atomic_write_unit_min_bytes
> only, and allow length to be N * atomic_write_unit_min_bytes.
> 
> But it may violate atomic write boundary?

About atomic boundary, we just don't allow a merge which will result in 
a write which will straddle a boundary as there are no guarantees of 
atomicity then.

Having said this, atomic write boundary is just relevant to NVMe, so if 
we don't have merges there, then we could just omit this code.

> 
>>> The hardware should recognize unit size by start LBA, and check if length is
>>> valid, so probably the interface might be relaxed to:
>>>
>>> 1) start lba is unit aligned, and this unit is in the supported unit
>>> range(power_2 in [unit_min, unit_max])
>>>
>>> 2) length needs to be:
>>>
>>> - N * this_unit_size
>>> - <= atomic_write_max_bytes
>> Please note that we also need to consider:
>> - any atomic write boundary (from NVMe)
> Can you provide actual NVMe boundary value?
> 
> Firstly natural aligned write won't cross boundary, so boundary should
> be >= write_unit_max,

Correct

> see blow code from patch 10/21:
> 
> +static bool bio_straddles_atomic_write_boundary(loff_t bi_sector,
> +				unsigned int bi_size,
> +				unsigned int boundary)
> +{
> +	loff_t start = bi_sector << SECTOR_SHIFT;
> +	loff_t end = start + bi_size;
> +	loff_t start_mod = start % boundary;
> +	loff_t end_mod = end % boundary;
> +
> +	if (end - start > boundary)
> +		return true;
> +	if ((start_mod > end_mod) && (start_mod && end_mod))
> +		return true;
> +
> +	return false;
> +}
> +
> 
> Then if the WRITE size is <= boundary, the above function should return
> false, right? 

Actually if WRITE > boundary then we must be crossing a boundary and 
should return true, which is what the first condition checks.

However 2x naturally-aligned atomic writes could be less than 
atomic_write_max_bytes but still straddle if merged.

> Looks like it is power_of(2) & aligned atomic_write_max_bytes?
> 
>> - virt boundary (from NVMe)
> virt boundary is applied on bv_offset and bv_len, and NVMe's virt
> bounary is (4k - 1), it shouldn't be one issue in reality.

On a related topic, as I understand, for NVMe we just split bios 
according to virt boundary for PRP, but we won't always use PRP. So is 
there value in not splitting bio according to PRP if SGL will actually 
be used?

> 
>> And, as I mentioned elsewhere, I am still not 100% comfortable that we don't
>> pay attention to regular max_sectors_kb...
> max_sectors_kb should be bigger than atomic_write_max_bytes actually,
> then what is your concern?

My concern is that we don't enforce that, so we may issue atomic writes 
which exceed max_sectors_kb.

If we did enforce it, then atomic_write_unit_min_bytes, 
atomic_write_unit_max_bytes, and atomic_write_max_bytes all need to be 
limited according to max_sectors_kb.

Thanks,
John