Message-ID: <yq1y1m10w0d.fsf@ca-mkp.ca.oracle.com>
Date: Sat, 06 May 2023 21:59:30 -0400
From: "Martin K. Petersen" <martin.petersen@...cle.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: John Garry <john.g.garry@...cle.com>,
Dave Chinner <david@...morbit.com>, axboe@...nel.dk,
kbusch@...nel.org, hch@....de, sagi@...mberg.me,
martin.petersen@...cle.com, viro@...iv.linux.org.uk,
brauner@...nel.org, dchinner@...hat.com, jejb@...ux.ibm.com,
linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-nvme@...ts.infradead.org, linux-scsi@...r.kernel.org,
linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-security-module@...r.kernel.org, paul@...l-moore.com,
jmorris@...ei.org, serge@...lyn.com,
Himanshu Madhani <himanshu.madhani@...cle.com>
Subject: Re: [PATCH RFC 01/16] block: Add atomic write operations to
request_queue limits
Darrick,
> Could a SCSI device advertise 512b LBAs, 4096b physical blocks,
> a 64k atomic_write_unit_max, and a 1MB maximum transfer length
> (atomic_write_max_bytes)?
Yes.
> And does that mean that application software can send one 64k-aligned
> write and expect it either to be persisted completely or not at all?
Yes.
> And, does that mean that the application can send up to 16 of these
> 64k-aligned blocks as a single 1MB IO and expect that each of those 16
> blocks will either be persisted entirely or not at all?
Yes.
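
To make the arithmetic concrete: a 1MB write at a 64k-aligned offset
covers sixteen consecutive 64k units, [0k,64k), [64k,128k), ...,
[960k,1024k). Each unit persists all-or-nothing on its own; the 1MB
transfer as a whole carries no atomicity guarantee beyond that.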
> There doesn't seem to be any means for the device to report /which/ of
> the 16 were persisted, which is disappointing. But maybe the
> application encodes LSNs and can tell after the fact that something
> went wrong, and recover?
Correct. Although we traditionally haven't had too much fun with partial
completion for sequential I/O either.
> If the same device reports a 2048b atomic_write_unit_min, does that mean
> that I can send between 2 and 64k of data as a single atomic write and
> that's ok? I assume that this weird situation (512b LBA, 4k physical,
> 2k atomic unit min) requires some fancy RMW but that the device is
> prepared to cr^Wpersist that correctly?
Yes.
It would not make much sense for a device to report a minimum atomic
granularity smaller than the reported physical block size. But in theory
it could.
> What if the device also advertises a 128k atomic_write_boundary?
> That means that a 2k atomic block write will fail if it starts at 127k,
> but if it starts at 126k then that's ok. Right?
Correct.
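
To spell out the arithmetic: a 2k write at 127k spans [127k,129k) and
crosses the 128k boundary, while a 2k write at 126k spans [126k,128k)
and merely ends on it. A minimal sketch of the check (hypothetical
helper name, assuming a power-of-two boundary):

static bool crosses_atomic_boundary(u64 start, u32 len, u32 boundary)
{
	u64 mask = ~((u64)boundary - 1);

	/* Compare the boundary-sized windows that contain the first
	 * and the last byte of the write. */
	return (start & mask) != ((start + len - 1) & mask);
}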
> As for avoiding splits in the block layer, I guess that also means that
> someone needs to reduce atomic_write_unit_max and atomic_write_boundary
> if (say) some sysadmin decides to create a raid0 of these devices with a
> 32k stripe size?
Correct. Atomic limits will need to be stacked for MD and DM like we do
with the remaining queue limits.
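
A rough sketch of what that stacking could look like, loosely modeled
on blk_stack_limits() (the combination rules below are a guess, not
code from this series):

	t->atomic_write_unit_min = max(t->atomic_write_unit_min,
				       b->atomic_write_unit_min);
	t->atomic_write_unit_max = min(t->atomic_write_unit_max,
				       b->atomic_write_unit_max);
	t->atomic_write_max_bytes = min(t->atomic_write_max_bytes,
					b->atomic_write_max_bytes);

	/* A raid0 chunk acts like a boundary: clamp unit_max so no
	 * atomic write can straddle a 32k stripe. */
	if (t->chunk_sectors)
		t->atomic_write_unit_max =
			min(t->atomic_write_unit_max,
			    (unsigned int)(t->chunk_sectors
					   << SECTOR_SHIFT));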
> It sounds like NVMe is simpler in that it would report 64k for both the
> max unit and the max transfer length? And for the 1M write I mentioned
> above, the application must send 16 individual writes?
Correct.
> With my app developer hat on, the simplest mental model of this is that
> if I want to persist a blob of data that is larger than one device LBA,
> then atomic_write_unit_min <= blob size <= atomic_write_unit_max must be
> true, and the LBA range for the write cannot cross an atomic_write_boundary.
>
> Does that sound right?
Yep.
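
That mental model as a hypothetical userspace-side check (all names
invented for illustration):

#include <stdbool.h>
#include <stdint.h>

/* Can [off, off + len) be submitted as one atomic write? */
static bool blob_write_ok(uint64_t off, uint32_t len,
			  uint32_t unit_min, uint32_t unit_max,
			  uint32_t boundary)
{
	if (len < unit_min || len > unit_max)
		return false;
	/* The LBA range must not cross an atomic write boundary. */
	if (boundary &&
	    (off / boundary) != ((off + len - 1) / boundary))
		return false;
	return true;
}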
> Going back to my sample device above, the XFS buffer cache could write
> individual 4k filesystem metadata blocks using REQ_ATOMIC because 4k is
> between the atomic write unit min/max, 4k metadata blocks will never
> cross a 128k boundary, and we'd never have to worry about torn writes
> in metadata ever again?
Correct.
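
As a sketch of what that could look like, assuming the REQ_ATOMIC
flag this series introduces (completion plumbing omitted; this is
not code from the posted patches):

	struct bio *bio;

	bio = bio_alloc(bdev, 1, REQ_OP_WRITE | REQ_ATOMIC, GFP_NOFS);
	/* A 4k-aligned block can never straddle the 128k boundary. */
	bio->bi_iter.bi_sector = sector;
	__bio_add_page(bio, page, 4096, 0);
	submit_bio(bio);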
> Furthermore, if I want to persist a bunch of blobs in a contiguous LBA
> range and atomic_write_max_bytes > atomic_write_unit_max, then I can do
> that with a single direct write?
Yes.
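
For instance, assuming the RWF_ATOMIC pwritev2() flag proposed
elsewhere in this series (the helper below is illustrative only):

#define _GNU_SOURCE
#include <sys/uio.h>

/* Persist nblobs 64k blobs, laid out back to back, with a single
 * O_DIRECT write; each 64k-aligned unit is atomic on its own. */
static ssize_t write_blobs(int fd, void *buf, size_t nblobs, off_t off)
{
	struct iovec iov = {
		.iov_base = buf,		/* 64k-aligned buffer */
		.iov_len  = nblobs * 65536,	/* <= atomic_write_max_bytes */
	};

	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}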
> I'm assuming that the blobs in the middle of the range must all be
> exactly atomic_write_unit_max bytes in size?
If you care about each blob being written atomically, yes.
> And I had better be prepared to (I guess) re-read the entire range
> after the system goes down to find out if any of them did or did not
> persist?
If you crash or get an I/O error, then yes. There is no way to inquire
which blobs were written. Just like we don't know which LBAs were
written if the OS crashes in the middle of a regular write operation.
--
Martin K. Petersen Oracle Linux Engineering