[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140812170826.GH2808@birch.djwong.org>
Date: Tue, 12 Aug 2014 10:08:26 -0700
From: "Darrick J. Wong" <darrick.wong@...cle.com>
To: TR Reardon <thomas_reardon@...mail.com>
Cc: linux-ext4@...r.kernel.org
Subject: Re: journal_checksum_v2 on-disk format change? (was: Re: journal
recovery problems with metadata_csum, *non-64bit*)
On Tue, Aug 12, 2014 at 01:39:35AM -0400, TR Reardon wrote:
> On 8/11/14, Darrick J. Wong <darrick.wong@...cle.com> wrote:
> > Hi all,
> >
> > Mr. Reardon has discovered that due to a bug in journal_tag_bytes(), if the
> > the
> > "journal csum v2" feature flag is turned on, block tag records are being
> > written with two extra bytes of space because we don't need to execute
> > "x += sizeof(tag.t_checksum);". In 32-bit mode, other parts of the journal
> > then perform incorrect size comparisons, leading to BUG() being called. In
> > 64-bit mode, there's enough padding that bad things won't happen.
> >
> > This is a remnant of the days when I tried to enlarge journal_block_tag_t
> > to
> > hold the full 32-bit checksum for a data block that's stored in the
> > journal.
> > Back in 2011, we decided (though sadly I can't find the link; I think we
> > might
> > have discussed this in the concall) that it was better not to change the
> > size
> > of journal_block_tag_t than it was to make it bigger so that it could hold
> > the
> > full checksum.
> >
> > A simple fix for the problem has been proposed by Mr. Reardon which fixes
> > journal_tag_bytes() and leaves everything else unchanged. However, that is
> > technically a disk format change since the size of journal_block_tag_t on
> > disk
> > changes, albeit only for people running experimental metadata_csum
> > filesystems.
> > Since we've been allocating disk space for the enlarged checksum this whole
> > time, if we apply that patch, anyone with an unclean 64bit FS will not be
> > able
> > to recover the journal. (Anyone with an unclean 32-bit FS has been broken
> > the
> > whole time, and still will be.)
> >
> > The other thing we could do is actually use the two bytes to store the high
> > 16-bits of the checksum, fix the jbd2 helper functions to reflect that
> > reality
> > (so that they don't BUG()), and change the checksum verify routine to pass
> > the
> > block if either the full checksum matches, or if the lower 16 bits match
> > and
> > the upper 16 bits are zero. With this route, anybody with an uncleanly
> > unmounted FS could still recover the journal, since we're not changing the
> > size
> > of anything.
> >
> > Fortunately, journal tag blocks are fairly ephemeral and nobody ought to be
> > using metadata_csum on a production filesystem, so at this point we can
> > probably change the disk format without too many repercussions. If you
> > make
> > sure to cleanly unmount the filesystem when changing kernel/e2fsprogs
> > versions,
> > there will be no problems.
> >
> > So, uh... comments? How should we proceed?
> >
> > --D
>
> My only concern is that legacy applies to in-the-wild kernels, not
> just journals. Any post 3.4 kernel has this problem, which will be
> exposed as soon as e2fsprogs 1.43 is released. A common (enough) use
> case might be, say, a 2TB USB drive, being unplugged and replugged
> across machines without proper shutdown. Old machines will think
> they recognize the journal but don't actually, and replay something
> destructive.
>
> In other words, this is a future-retro problem. The problem is not so
> much with existing fs experimenting with metadata_csum, but a future
> properly-journaled drive being plugged into an legacy faulty machine.
> Those incompat flags are supposed to protect against just this
> scenario, but won't. So perhaps you'll need make a new
> "INCOMPAT_CSUM_V3" flag to protect? Actually, dwelling on this a bit
> today, I think whether or not you retain those 2 extra checksum bytes
> in final fix, you ought use a new flag for the format.
Definitely we need a new bit to keep the old kernels out; I wanted others to
weigh in on whether we ought to shrink the struct or put the two bytes to use.
That said... journal recovery is not so well tested, so I'm currently busy
building out debugfs to create journal transactions so we can test recovery
both with and without *_csum.
--D
>
> +Reardon
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists