linux-kernel - Re: [PATCH v6 00/10] block atomic writes

Message-ID: <f3c1d321-0dfc-466f-9f6a-fe2f0513d944@oracle.com>
Date: Fri, 5 Apr 2024 11:06:00 +0100
From: John Garry <john.g.garry@...cle.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: axboe@...nel.dk, kbusch@...nel.org, hch@....de, sagi@...mberg.me,
        jejb@...ux.ibm.com, martin.petersen@...cle.com, djwong@...nel.org,
        viro@...iv.linux.org.uk, brauner@...nel.org, dchinner@...hat.com,
        jack@...e.cz, linux-block@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-nvme@...ts.infradead.org,
        linux-fsdevel@...r.kernel.org, tytso@....edu, jbongio@...gle.com,
        linux-scsi@...r.kernel.org, ojaswin@...ux.ibm.com, linux-aio@...ck.org,
        linux-btrfs@...r.kernel.org, io-uring@...r.kernel.org,
        nilay@...ux.ibm.com, ritesh.list@...il.com
Subject: Re: [PATCH v6 00/10] block atomic writes

On 04/04/2024 17:48, Matthew Wilcox wrote:
>>> The thing is that there's no requirement for an interface as complex as
>>> the one you're proposing here.  I've talked to a few database people
>>> and all they want is to increase the untorn write boundary from "one
>>> disc block" to one database block, typically 8kB or 16kB.
>>>
>>> So they would be quite happy with a much simpler interface where they
>>> set the inode block size at inode creation time,
>> We want to support untorn writes for bdev file operations - how can we set
>> the inode block size there? Currently it is based on logical block size.
> ioctl(BLKBSZSET), I guess?  That currently limits to PAGE_SIZE, but I
> think we can remove that limitation with the bs>PS patches.

We want a consistent interface for bdev and regular files, so that would
need to work for FSes also. FSes (e.g. XFS) work based on a homogeneous
inode block size, which is the superblock (SB) block size.

Furthermore, we would seem to be mixing different concepts here.
Currently in Linux we say that a write of logical block size (LBS) is
atomic. In the block layer, we split BIOs on LBS boundaries. iomap
creates BIOs based on LBS boundaries. But writing a FS block is not
always guaranteed to be atomic, as far as I'm concerned. So just
increasing the inode block size / FS block size does not, in itself,
really change anything.
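
To illustrate the distinction above, here is a minimal sketch of the
"only a single LBS-sized write is atomic" model. It is not kernel code:
the helper names lbs_blocks_spanned() and may_be_torn() are hypothetical,
and the model ignores any larger atomic unit a particular device might
offer. It only counts how many logical blocks a write touches, which is
why a larger FS block size does not by itself imply an untorn write.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of LBS-sized logical blocks touched by a
 * write of 'len' bytes starting at byte offset 'off'.
 * 'lbs' is assumed to be a non-zero power of two (e.g. 512 or 4096).
 */
static uint64_t lbs_blocks_spanned(uint64_t off, uint64_t len, uint32_t lbs)
{
	assert(lbs != 0 && (lbs & (lbs - 1)) == 0);

	if (len == 0)
		return 0;

	uint64_t first = off / lbs;             /* first block touched */
	uint64_t last = (off + len - 1) / lbs;  /* last block touched */

	return last - first + 1;
}

/*
 * Hypothetical helper: under the simple model where only one logical
 * block is written atomically, any write touching more than one
 * logical block may be torn.
 */
static int may_be_torn(uint64_t off, uint64_t len, uint32_t lbs)
{
	return lbs_blocks_spanned(off, len, lbs) > 1;
}
```

For example, under this model a 16kB database block written to a device
with a 4kB LBS spans four logical blocks, so may_be_torn(0, 16384, 4096)
is non-zero regardless of the FS block size.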

> 
>>> and then all writes to
>>> that inode were guaranteed to be untorn.  This would also be simpler to
>>> implement for buffered writes.
>> We did consider that. Won't that lead to the possibility of breaking
>> existing applications which want to do regular unaligned writes to these
>> files? We do know that mysql/innodb does have some "compressed" mode of
>> operation, which involves regular writes to the same file which wants untorn
>> writes.
> If you're talking about "regular unaligned buffered writes", then that
> won't break.  If you cross a folio boundary, the result may be torn,
> but if you're crossing a block boundary you expect that.
> 
>> Furthermore, untorn writes in HW are expensive - for SCSI anyway. Do we
>> always want these for such a file?
> Do untorn writes actually exist in SCSI?  I was under the impression
> nobody had actually implemented them in SCSI hardware.

I know that some SCSI targets actually write data atomically in chunks
larger than the LBS. Obviously atomic vs non-atomic performance is a
moot point there, as data is implicitly always written atomically.

We actually have a MySQL/InnoDB port of this API working on such a SCSI
target.

However I am not sure about atomic write support for other SCSI targets.

> 
>> We saw untorn writes as not being a property of the file or even the inode
>> itself, but rather an attribute of the specific IO being issued from the
>> userspace application.
> The problem is that keeping track of that is expensive for buffered
> writes.  It's a model that only works for direct IO.  Arguably we
> could make it work for O_SYNC buffered IO, but that'll require some
> surgery.

To me, O_ATOMIC would be required for buffered atomic write IO, as we
want a fixed-size IO, so that would mean no mixing of atomic and
non-atomic IO.
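
One possible reading of the "fixed-size IO" requirement, sketched below
purely as an assumption: every write to such a file is exactly one unit
long and aligned to that unit. The function name is_fixed_unit_write()
and the 'unit' parameter are hypothetical and are not taken from the
patch series.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical check: does a write of 'len' bytes at byte offset 'off'
 * match a fixed atomic unit of 'unit' bytes? Under the assumption
 * stated above, the write must be exactly one unit in size and start
 * on a unit boundary. 'unit' must be non-zero.
 */
static int is_fixed_unit_write(uint64_t off, uint64_t len, uint64_t unit)
{
	assert(unit != 0);

	return len == unit && (off % unit) == 0;
}
```

With a 16kB unit, a 16kB write at offset 0 or 16kB passes this check,
while an 8kB write or a 16kB write at offset 8kB does not.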

Thanks,
John