linux-kernel - Re: [PATCH v2 00/16] block atomic writes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b8b0a9d7-88d2-45a9-877a-ecc5e0f1e645@oracle.com>
Date:   Wed, 13 Dec 2023 09:32:06 +0000
From:   John Garry <john.g.garry@...cle.com>
To:     Christoph Hellwig <hch@....de>
Cc:     axboe@...nel.dk, kbusch@...nel.org, sagi@...mberg.me,
        jejb@...ux.ibm.com, martin.petersen@...cle.com, djwong@...nel.org,
        viro@...iv.linux.org.uk, brauner@...nel.org, dchinner@...hat.com,
        jack@...e.cz, linux-block@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-nvme@...ts.infradead.org,
        linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        tytso@....edu, jbongio@...gle.com, linux-scsi@...r.kernel.org,
        ming.lei@...hat.com, jaswin@...ux.ibm.com, bvanassche@....org
Subject: Re: [PATCH v2 00/16] block atomic writes

On 12/12/2023 16:32, Christoph Hellwig wrote:
> On Tue, Dec 12, 2023 at 11:08:28AM +0000, John Garry wrote:
>> Two new fields are added to struct statx - atomic_write_unit_min and
>> atomic_write_unit_max. For each atomic individual write, the total length
>> of a write must be a between atomic_write_unit_min and
>> atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
>> at a natural offset in the file wrt the write length.
>>
>> SCSI sd.c and scsi_debug and NVMe kernel support is added.
>>
>> Some open questions:
>> - How to make API extensible for when we have no HW support? In that case,
>>    we would prob not have to follow rule of power-of-2 length et al.
>>    As a possible solution, maybe we can say that atomic writes are
>>    supported for the file via statx, but not set unit_min and max values,
>>    and this means that writes need to be just FS block aligned there.
> I don't think the power of two length is much of a problem to be
> honest, and if we every want to lift it we can still do that easily
> by adding a new flag or limit.

ok, but it would be nice to have some idea on what that flag or limit 
change would be.

> 
> What I'm a lot more worried about is how to tell the file system that
> allocations are done right for these requirement.  There is no way
> a user can know that allocations in an existing file are properly
> aligned, so atomic writes will just fail on existing files.
> 
> I suspect we need an on-disk flag that forces allocations to be
> aligned to the atomic write limit, in some ways similar how the
> XFS rt flag works.  You'd need to set it on an empty file, and all
> allocations after that are guaranteed to be properly aligned.

Hmmm... so how is this different to the XFS forcealign feature?

For XFS, I thought that your idea was to always CoW new extents for 
misaligned extents or writes which spanned multiple extents.

> 
>> - For block layer, should atomic_write_unit_max be limited by
>>    max_sectors_kb? Currently it is not.
> Well.  It must be limited to max_hw_sectors to actually work.

Sure, as max_hw_sectors may also be limited by DMA controller max 
mapping size.

> max_sectors is a software limit below that, which with modern hardware
> is actually pretty silly and a real performance issue with todays
> workloads when people don't tweak it..

Right, so we should limit atomic write queue limits to max_hw_sectors. 
But people can still tweak max_sectors, and I am inclined to say that 
atomic_write_unit_max et al should be (dynamically) limited to 
max_sectors also.

> 
>> - How to improve requirement that iovecs are PAGE-aligned.
>>    There are 2x issues:
>>    a. We impose this rule to not split BIOs due to virt boundary for
>>       NVMe, but there virt boundary is 4K (and not PAGE size, so broken for
>>       16K/64K pages). Easy solution is to impose requirement that iovecs
>>       are 4K-aligned.
>>    b. We don't enforce this rule for virt boundary == 0, i.e. SCSI
> .. we require any device that wants to support atomic writes to not
> have that silly limit.  For NVMe that would require SGL support
> (and some driver changes I've been wanting to make for long where
> we always use SGLs for transfers larger than a single PRP if supported)

If we could avoid dealing with a virt boundary, then that would be nice.

Are there any patches yet for the change to always use SGLs for 
transfers larger than a single PRP?

On a related topic, I am not sure about how - or if we even should - 
enforce iovec PAGE-alignment or length; rather, the user could just be 
advised that iovecs must be PAGE-aligned and min PAGE length to achieve 
atomic_write_unit_max.

> 
> 
>> - Since debugging torn-writes due to unwanted kernel BIO splitting/merging
>>    would be horrible, should we add some kernel storage stack software
>>    integrity checks?
> Yes, I think we'll need asserts in the drivers.  At least for NVMe I
> will insist on them. 

Please see 16/16 in this series.

> For SCSI I think the device actually checks
> because the atomic writes are a different command anyway, or am I
> misunderstanding how SCSI works here?

Right, for WRITE ATOMIC (16) the target will reject a command when it 
does not respect the device atomic write limits.

Thanks,
John