lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZR86Z1OcO52a4BtH@dread.disaster.area>
Date:   Fri, 6 Oct 2023 09:36:23 +1100
From:   Dave Chinner <david@...morbit.com>
To:     Bart Van Assche <bvanassche@....org>
Cc:     "Martin K. Petersen" <martin.petersen@...cle.com>,
        John Garry <john.g.garry@...cle.com>, axboe@...nel.dk,
        kbusch@...nel.org, hch@....de, sagi@...mberg.me,
        jejb@...ux.ibm.com, djwong@...nel.org, viro@...iv.linux.org.uk,
        brauner@...nel.org, chandan.babu@...cle.com, dchinner@...hat.com,
        linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-nvme@...ts.infradead.org, linux-xfs@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, tytso@....edu, jbongio@...gle.com,
        linux-api@...r.kernel.org
Subject: Re: [PATCH 10/21] block: Add fops atomic write support

On Thu, Oct 05, 2023 at 10:10:45AM -0700, Bart Van Assche wrote:
> On 10/4/23 11:17, Martin K. Petersen wrote:
> > 
> > Hi Bart!
> > 
> > > In other words, also for the above example it is guaranteed that
> > > writes of a single logical block (512 bytes) are atomic, no matter
> > > what value is reported as the ATOMIC TRANSFER LENGTH GRANULARITY.
> > 
> > There is no formal guarantee that a disk drive sector read-modify-write
> > operation results in a readable sector after a power failure. We have
> > definitely seen blocks being mangled in the field.
> 
> Aren't block devices expected to use a capacitor that provides enough
> power to handle power failures cleanly?

Nope.

Any block device that says it operates in writeback cache mode (i.e.
almost every single consumer SATA and NVMe drive ever made) has a
volatile write back cache and so does not provide any power fail
data integrity guarantees. Simple to check, my less-than-1-yr-old
workstation tells me:

$ lspci |grep -i nvme
03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
$ cat /sys/block/nvme*n1/queue/write_cache
write back
write back
$

That they have volatile writeback caches....

> How about blacklisting block devices that mangle blocks if a power
> failure occurs? I think such block devices are not compatible with
> journaling filesystems nor with log-structured filesystems.

Statements like this from people working on storage hardware really
worry me. It demonstrates a lack of understanding of how filesystems
actually work, not to mention the fact that this architectural
problem (i.e. handling volatile device write caches correctly) was
solved in the Linux IO stack a couple of decades ago. This isn't
even 'state of the art' knowledge - this is foundational knowlege
that everyone working on storage should know.

The tl;dr summary is that filesystems will issue a cache flush
request (REQ_PREFLUSH) and/or write-through to stable storage
semantics (REQ_FUA) for any data, metadata or journal IO that has
data integrity and/or ordering requirements associated with it. The
block layer will then do the most optimal correct thing with that
request (e.g. ignore them for IO being directed at WC disabled
devices), but it guarantees the flush/fua semantics for those IOs
will be provided by all layers in the stack right down to the
persistent storage media itself. Hence all the filesystem has to do
is get it's IO and cache flush ordering correct, and everything
just works regardless of the underlying storage capabilities.

And, yes, any storage device with volatile caches that doesn't
implement cache flushes correctly is considered broken and will get
black listed....

-Dave.
-- 
Dave Chinner
david@...morbit.com

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ