Message-ID: <f3c1d321-0dfc-466f-9f6a-fe2f0513d944@oracle.com>
Date: Fri, 5 Apr 2024 11:06:00 +0100
From: John Garry <john.g.garry@...cle.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: axboe@...nel.dk, kbusch@...nel.org, hch@....de, sagi@...mberg.me,
jejb@...ux.ibm.com, martin.petersen@...cle.com, djwong@...nel.org,
viro@...iv.linux.org.uk, brauner@...nel.org, dchinner@...hat.com,
jack@...e.cz, linux-block@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-nvme@...ts.infradead.org,
linux-fsdevel@...r.kernel.org, tytso@....edu, jbongio@...gle.com,
linux-scsi@...r.kernel.org, ojaswin@...ux.ibm.com, linux-aio@...ck.org,
linux-btrfs@...r.kernel.org, io-uring@...r.kernel.org,
nilay@...ux.ibm.com, ritesh.list@...il.com
Subject: Re: [PATCH v6 00/10] block atomic writes
On 04/04/2024 17:48, Matthew Wilcox wrote:
>>> The thing is that there's no requirement for an interface as complex as
>>> the one you're proposing here. I've talked to a few database people
>>> and all they want is to increase the untorn write boundary from "one
>>> disc block" to one database block, typically 8kB or 16kB.
>>>
>>> So they would be quite happy with a much simpler interface where they
>>> set the inode block size at inode creation time,
>> We want to support untorn writes for bdev file operations - how can we set
>> the inode block size there? Currently it is based on logical block size.
> ioctl(BLKBSZSET), I guess? That currently limits to PAGE_SIZE, but I
> think we can remove that limitation with the bs>PS patches.
We want a consistent interface for bdev and regular files, so that would
need to work for FSes also. FSes (XFS) work based on a homogeneous inode
blocksize, which is the SB blocksize.
Furthermore, we would seem to be mixing different concepts here.
Currently in Linux we say that a logical block size write is atomic. In
the block layer, we split BIOs on LBS boundaries. iomap creates BIOs
based on LBS boundaries. But writing a FS block is not always guaranteed
to be atomic, as far as I'm concerned. So just increasing the inode
block size / FS block size does not really change anything, in itself.
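
To make the distinction concrete, here is a minimal sketch of the model
described above, assuming only that a single logical-block-sized write is
atomic and that anything spanning more than one logical block may be torn
between blocks. The helper names (lbs_pieces, may_tear) are hypothetical and
are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the logical blocks touched by a write of `len` bytes at byte
 * offset `off`, for a logical block size of `lbs` bytes.
 * Hypothetical helper for illustration only, not a kernel function.
 */
static uint64_t lbs_pieces(uint64_t off, uint64_t len, uint32_t lbs)
{
    assert(lbs > 0);
    if (len == 0)
        return 0;
    uint64_t first = off / lbs;             /* first logical block touched */
    uint64_t last = (off + len - 1) / lbs;  /* last logical block touched */
    return last - first + 1;
}

/*
 * Nonzero if the write touches more than one logical block, i.e. it could
 * be torn under the assumed "only one LBS is atomic" model.
 */
static int may_tear(uint64_t off, uint64_t len, uint32_t lbs)
{
    return lbs_pieces(off, len, lbs) > 1;
}
```

Under this model, a 4096-byte FS block on a device with a 512-byte LBS
touches 8 logical blocks, so may_tear(0, 4096, 512) is nonzero no matter
what the inode or FS block size is set to.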
>
>>> and then all writes to
>>> that inode were guaranteed to be untorn. This would also be simpler to
>>> implement for buffered writes.
>> We did consider that. Won't that lead to the possibility of breaking
>> existing applications which want to do regular unaligned writes to these
>> files? We do know that mysql/innodb does have some "compressed" mode of
>> operation, which involves regular writes to the same file which wants untorn
>> writes.
> If you're talking about "regular unaligned buffered writes", then that
> won't break. If you cross a folio boundary, the result may be torn,
> but if you're crossing a block boundary you expect that.
>
>> Furthermore, untorn writes in HW are expensive - for SCSI anyway. Do we
>> always want these for such a file?
> Do untorn writes actually exist in SCSI? I was under the impression
> nobody had actually implemented them in SCSI hardware.
I know that some SCSI targets actually atomically write data in chunks >
LBS. Obviously atomic vs non-atomic performance is a moot point there,
as data is implicitly always atomically written.
We actually have a mysql/innodb port of this API working on such a SCSI
target.
However, I am not sure about atomic write support for other SCSI targets.
>
>> We saw untorn writes as not being a property of the file or even the inode
>> itself, but rather an attribute of the specific IO being issued from the
>> userspace application.
> The problem is that keeping track of that is expensive for buffered
> writes. It's a model that only works for direct IO. Arguably we
> could make it work for O_SYNC buffered IO, but that'll require some
> surgery.
To me, O_ATOMIC would be required for buffered atomic write IO, as we
want a fixed-sized IO, so that would mean no mixing of atomic and
non-atomic IO.
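
As a rough sketch of what "fixed-sized IO" could mean for such a file: every
write would have to be exactly one unit long. The function name
atomic_write_ok and the `unit` parameter are hypothetical, and the
power-of-two and alignment requirements are assumptions made for the
example rather than something stated in this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check, not kernel code: `unit` stands for the single
 * fixed IO size that a file opened for atomic buffered writes would accept.
 * Assumptions: `unit` is a power of two, and writes must be aligned to it.
 */
static bool atomic_write_ok(uint64_t off, uint64_t len, uint64_t unit)
{
    assert(unit != 0 && (unit & (unit - 1)) == 0);
    if (len != unit)                 /* only the one fixed size is allowed */
        return false;
    return (off & (unit - 1)) == 0;  /* offset must be a multiple of unit */
}
```

With a 16kB unit, a 16kB write at offset 0 or 32768 passes, while an 8kB
write, or a 16kB write at offset 4096, is rejected.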
Thanks,
John