linux-kernel - Re: [PATCH v9 1/2] fs: New zonefs file system

Message-ID: <20200130225921.GR18610@dread.disaster.area>
Date:   Fri, 31 Jan 2020 09:59:21 +1100
From:   Dave Chinner <david@...morbit.com>
To:     Damien Le Moal <Damien.LeMoal@....com>
Cc:     "hare@...e.de" <hare@...e.de>, Naohiro Aota <Naohiro.Aota@....com>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>,
        "darrick.wong@...cle.com" <darrick.wong@...cle.com>,
        "jth@...nel.org" <jth@...nel.org>,
        "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>
Subject: Re: [PATCH v9 1/2] fs: New zonefs file system

On Thu, Jan 30, 2020 at 03:00:32AM +0000, Damien Le Moal wrote:
> On Thu, 2020-01-30 at 08:33 +1100, Dave Chinner wrote:
> > On Wed, Jan 29, 2020 at 01:06:29PM +0000, Damien Le Moal wrote:
> > > Exactly. This is how the ZBC & ZAC (and upcoming ZNS) specifications
> > > define the write pointer behavior. That makes error recovery a lot
> > > easier and does not result in stale data accesses. Just note the
> > > off-by-one difference in the WP position from your example: the WP will
> > > be pointing at the error location, not the last written location. Indexing
> > > from 0, we get (wp - zone start) always being isize with all written
> > > and readable data in the sector range between zone start and zone write
> > > pointer.
> > 
> > Ok, I'm going to throw a curve ball here: volatile device caches.
> > 
> > How do the write pointer updates interact with device write
> > caches? i.e.  the first write could be sitting in the device write
> > cache, and the OS write pointer has been advanced. Then another write
> > occurs, the device decides to write both to physical media, and it
> > gets a write error in the area of the first write that only hit the
> > volatile cache.
> > 
> > So does this mean that, from the POV of the OS, the device zone
> > write pointer has gone backwards?
> 
> You are absolutely correct. Forgot to consider this case.
> Nice pitching :)

Potentially adverse IO ordering interactions with volatile device
caches are never that far from the mind of filesystem engineers...
:)

> > Unless there's some other magic that ensures device cached writes
> > that have been signalled as successfully completed to the OS
> > can never fail or that sequential zone writes are never cached in
> > volatile memory in drives, I can't see how the above guarantees
> > can be provided.
> 
> There is not, at least from the standards point of view. Such guarantees
> would be device implementation dependent and so we cannot rely on
> anything in this regard. The write pointer ending up below the position
> of the last issued direct IO is thus a possibility and not necessarily
> indicative of an external action (and we actually cannot distinguish
> which case it really is).

*nod*
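
As a rough illustration of the rule discussed above (file size equals
write pointer minus zone start), here is a minimal C sketch of how a
write pointer that has moved backwards could be detected. It is not
taken from the zonefs patch: struct zone_info, zone_isize_bytes() and
zone_wp_went_backwards() are hypothetical names, and positions are
assumed to be in 512-byte sectors.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* assumption: positions are in 512-byte sectors */

/* Hypothetical, simplified view of one sequential zone. */
struct zone_info {
	uint64_t start;	/* first sector of the zone */
	uint64_t wp;	/* device write pointer, in sectors */
};

/* Readable data in bytes: everything between zone start and the WP. */
static uint64_t zone_isize_bytes(const struct zone_info *z)
{
	assert(z->wp >= z->start);
	return (z->wp - z->start) << SECTOR_SHIFT;
}

/*
 * Returns 1 if the device reports less written data than the size the
 * OS had recorded, i.e. the write pointer appears to have gone
 * backwards (for example after a failed flush of cached writes).
 */
static int zone_wp_went_backwards(const struct zone_info *z,
				  uint64_t cached_isize)
{
	return zone_isize_bytes(z) < cached_isize;
}
```

With a zone starting at sector 1000 and a write pointer at sector 1008,
zone_isize_bytes() gives 4096 bytes. A recorded size of 8192 bytes
would then be flagged as a backwards move. As noted in the quoted text,
such a check cannot tell a lost cached write apart from an external
action on the zone.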

> > > It is hard to decide on the best action to take here considering the
> > > simple nature of zonefs (i.e. another better interface to do raw block
> > > device file accesses). Including your comments on mount options, I came
> > > up with these actions that the user can choose with mount options:
> > > * repair: Truncate the inode size only, nothing else
> > > * remount-ro (default): Truncate the inode size and remount read-only
> > > * zone-ro: Truncate the inode size and set the inode read-only
> > > * zone-offline: Truncate the inode size to 0 and assume that its zone 
> > > is offline (no reads nor writes possible).
> > > 
> > > This gives I think a good range of possible behaviors that the user may
> > > want, from almost nothing (repair) to extreme to avoid accessing bad
> > > data (zone-offline).
> > 
> > I would suggest that this is something that can be added later as it
> > is not critical to supporting the underlying functionality.  Right
> > now I'd just pick the safest option: shutdown to protect what data
> > is on the storage right now and then let the user take action to
> > recover/fix the issue.
> 
> By shutdown, do you mean remounting read-only ? Or do you mean
> something more aggressive like preventing all accesses and changes to
> files, i.e. assuming all zones are offline ? The former is already
> there and is the default.

"shutdown" in this context means "do whatever is necessary to
prevent the problem getting worse". So, at minimum, it would be to
prevent further writes to the zone that has gone bad.

If there's potential for other zones to be affected, then moving to
a global read-only state is the right thing to do.

If there's potential for the error to expose stale data, propagate
the error further into currently good on-disk structures, or walk
off the end of corrupt structures (kernel crash and/or memory
corruption), then an aggressive "error out as early as possible"
shutdown is the right solution....

I suspect that zonefs really only needs to go as far as remounting
read-only as long as the hardware write pointers prevent reading the
zone beyond that point....
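
The escalation described above can be sketched as a small decision
function. This is only an illustration under stated assumptions, not
zonefs code: the enum values, struct error_risk and
pick_shutdown_level() are hypothetical names, and deciding whether
each risk actually applies is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical shutdown levels, from least to most aggressive. */
enum shutdown_level {
	SHUTDOWN_ZONE_RO,	/* stop further writes to the bad zone only */
	SHUTDOWN_FS_RO,		/* move the whole filesystem to read-only */
	SHUTDOWN_FULL,		/* error out as early as possible */
};

/* Hypothetical summary of what the error could lead to. */
struct error_risk {
	bool other_zones_affected;
	bool stale_data_or_corruption;	/* stale data exposure, error
					 * propagation, or walking off
					 * corrupt structures */
};

/* Pick the least aggressive level that stops the problem getting worse. */
static enum shutdown_level pick_shutdown_level(const struct error_risk *r)
{
	assert(r);
	if (r->stale_data_or_corruption)
		return SHUTDOWN_FULL;
	if (r->other_zones_affected)
		return SHUTDOWN_FS_RO;
	return SHUTDOWN_ZONE_RO;
}
```

The most severe risk is checked first, so an error that both affects
other zones and could expose stale data results in the full shutdown.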

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com