linux-ext4 - Re: [LSF/MM TOPIC] More async operations for file systems


Message-ID: <20190222164504.GB10066@localhost.localdomain>
Date:   Fri, 22 Feb 2019 09:45:05 -0700
From:   Keith Busch <keith.busch@...el.com>
To:     "Martin K. Petersen" <martin.petersen@...cle.com>
Cc:     Ric Wheeler <ricwheeler@...il.com>,
        Dave Chinner <david@...morbit.com>,
        lsf-pc@...ts.linux-foundation.org,
        linux-xfs <linux-xfs@...r.kernel.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        linux-ext4 <linux-ext4@...r.kernel.org>,
        linux-btrfs <linux-btrfs@...r.kernel.org>,
        linux-block@...r.kernel.org
Subject: Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

On Thu, Feb 21, 2019 at 09:51:12PM -0500, Martin K. Petersen wrote:
> 
> Keith,
> 
> > With respect to fs block sizes, one thing making discards suck is that
> > many high capacity SSDs' physical page sizes are larger than the fs
> > block size, and a sub-page discard is worse than doing nothing.
> 
> That ties into the whole zeroing as a side-effect thing.
> 
> The devices really need to distinguish between discard-as-a-hint where
> it is free to ignore anything that's not a whole multiple of whatever
> the internal granularity is, and the WRITE ZEROES use case where the end
> result needs to be deterministic.

Exactly, yes, given the deterministic zeroing behavior. For devices
supporting that, a sub-page discard turns into a read-modify-write
instead of invalidating the page.  That increases write amplification
(WAF) instead of reducing it as intended, and large page SSDs are most
likely to have relatively poor write endurance in the first place.
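
To make the sub-page case concrete, here is a minimal sketch (not kernel
code) that splits a discard range into the part covering whole device
pages and the unaligned head and tail around it. The function name
split_discard and the 4 KiB block / 16 KiB page sizes are assumptions
chosen for illustration; real devices report their own granularity.

```python
def split_discard(start, length, granularity):
    """Split a byte range into (head, aligned, tail) pieces.

    Each piece is an (offset, length) tuple. Only the aligned piece
    covers whole device pages; a device that guarantees deterministic
    zeroing would have to read-modify-write the head and tail pieces.
    """
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    end = start + length
    # Round the start up and the end down to the device granularity.
    first = -(-start // granularity) * granularity
    last = (end // granularity) * granularity
    if first >= last:
        # No whole device page is covered: the entire range is sub-page.
        return (start, length), (start, 0), (end, 0)
    return (start, first - start), (first, last - first), (last, end - last)


# Hypothetical sizes: 4 KiB fs block, 16 KiB device page.
FS_BLOCK = 4096
DEV_PAGE = 16384

# Three fs blocks starting at block 1: nothing can simply be invalidated.
small = split_discard(1 * FS_BLOCK, 3 * FS_BLOCK, DEV_PAGE)

# Ten fs blocks starting at block 1: one whole device page in the middle.
large = split_discard(1 * FS_BLOCK, 10 * FS_BLOCK, DEV_PAGE)
```

In the first case the aligned piece is empty, so the whole request is
the expensive kind. In the second, only the middle 16 KiB is a cheap
invalidate, and the 12 KiB on either side is still sub-page.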

We have NVMe spec changes in the pipeline so devices can report this
granularity. My real concern isn't discard per se, however, but writes,
since we don't support "sector" sizes greater than the system's page
size. This is a bit of a different topic from where this thread
started, though.