linux-ext4 - Re: [PATCH 2/4] e2scrub: create online fsck tool of sorts

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180309225053.GB4856@magnolia>
Date:   Fri, 9 Mar 2018 14:50:53 -0800
From:   "Darrick J. Wong" <darrick.wong@...cle.com>
To:     Andreas Dilger <adilger@...ger.ca>
Cc:     Lukas Czerner <lczerner@...hat.com>, tytso@....edu,
        linux-ext4@...r.kernel.org
Subject: Re: [PATCH 2/4] e2scrub: create online fsck tool of sorts

On Fri, Mar 09, 2018 at 01:40:01PM -0700, Andreas Dilger wrote:
> On Mar 9, 2018, at 1:18 PM, Lukas Czerner <lczerner@...hat.com> wrote:
> > 
> > On Fri, Mar 09, 2018 at 12:40:17PM -0700, Andreas Dilger wrote:
> >> On Mar 9, 2018, at 6:31 AM, Lukas Czerner <lczerner@...hat.com> wrote:
> >>> 
> >>> On Thu, Mar 08, 2018 at 12:13:27PM -0800, Darrick J. Wong wrote:
> >>>> From: Darrick J. Wong <darrick.wong@...cle.com>
> >>>> 
> >>>> Implement online fsck for ext* filesystems which live on LVM-managed
> >>>> logical volumes.  The basic strategy mirrors that of e2croncheck --
> >>>> create a snapshot, fsck the snapshot, report whatever errors appear,
> >>>> remove snapshot.  Unlike e2croncheck, this utility accepts any LVM
> >>>> device path, knows about snapshots running out of space, and can call
> >>>> fstrim having validated that the fs metadata is ok.
> >>> 
> >>> Thanks Darrick this looks good, I have couple of comments below. Also
> >>> it's not a big deal, but I have a slight issue with the name e2scrub.
> >>> What we will call it if/when we have a real in-kernel fs scrub ?
> >> 
> >> I'd think in that case the e2scrub script will call the in-kernel version
> >> of the scrub?  That might even be "e2fsck" just calling an ioctl, like
> >> resize2fs decides whether to do an online or offline resize depending if
> >> the filesystem is mounted or not.
> > 
> > Yeah, I suppose.
> > 
> >> 
> >>>> diff --git a/scrub/e2scrub.8.in b/scrub/e2scrub.8.in
> >>>> new file mode 100644
> >>>> index 0000000..7cb6dfb
> >>>> --- /dev/null
> >>>> +++ b/scrub/e2scrub.8.in
> >>>> @@ -0,0 +1,34 @@
> >>>> +.TH E2SCRUB 8 "@E2FSPROGS_MONTH@ @E2FSPROGS_YEAR@" "E2fsprogs version @E2FSPROGS_VERSION@"
> >>>> +.SH NAME
> >>>> +e2scrub - check a mounted ext2/ext3/ext4 file system on an LVM volume for errors.
> >>>> +.SH SYNOPSYS
> >>>> +.B
> >>>> +e2scrub [OPTION] [LVM DEVICE PATH]
> >>>> +.SH DESCRIPTION
> >>>> +Given a live file system on a LVM volume, this program snapshots the
> >>>> +logical volume and runs a file system check to look for serious errors.
> >>>> +If no errors are found, fstrim can be called on the mounted file system.
> >>>> +However, if errors are found, the file system should be unmounted and
> >>>> +fixed.
> >>> 
> >>> e2scrub will _not_ fix you file system, I think we should make this part
> >>> really clear.
> >>> 
> >>> I am missing some more information on what are the requirements for this to
> >>> work - like free space in the vg.
> >>> 
> >>> How about few words about disaster recovery, I know you're trying to
> >>> always remove the snapshot, but what if it fails for whatever reason ? I
> >>> think there are things users should know about so that they can help
> >>> themselves.
> >> 
> >> Later on in the patch series there is a command to reap stale snapshots
> >> via cron or systemd once a day.
> > 
> > True, but it might not be in place and regardless of how it's set up it
> > usefull to document what do to if anything goes wrong. As it is now the
> > documentation is really light on details of what is it actually doing.
> 
> Yeah, the lvscan script that I had would delete the old snapshot if it
> found one, after checking that there was not an older scan still running
> on the device.

e2scrub tears down any existing .e2scrub snapshot prior to creating a
new one...

> One option would be to consolidate the "e2scrub_reap" functionality into
> e2scrub itself, and then allow "e2scrub -n" or "e2scrub --reap" to be
> used to only do the cleanup?  That way, at least the next e2scrub run
> will clean up the old snapshot rather than generating a useless error,
> or running a check on an old (stale) copy of the snapshot.

...so e2scrub_reap exists to clean up stale .e2scrub snapshots prior to
the regular e2scrub run so that we don't block the scheduled e2scrub
from doing its thing because the vg doesn't have enough space.

Granted if the vg reserves 256M per LV then yes we don't need the
reaper.

--D

> That would also mean one less script to be installed and documented.
> 
> Cheers, Andreas
> 
> >>> Also in case this is run on dm-thin lv it's worth mentioning that lvm
> >>> will use the "old" snaphost instead if the thin snapshot. However this
> >>> is something we can recognize and use the right command.
> >>> 
> >>>> +.SH OPTIONS
> >>>> +.TP
> >>>> +\fB-t\fR
> >>>> +Run
> >>>> +.B
> >>>> +fstrim(1)
> >>>> +on the mounted filesystem if no errors are found.
> >>>> +.TP
> >>>> +\fB-V\fR
> >>>> +Print version information and exit.
> >>>> +.SH EXIT CODE
> >>>> +The exit codes are the same as in
> >>>> +.BR e2fsck (8)
> >>>> +.SH BUGS
> >>>> +This utility is capable of checking any ext* filesystem on an LVM volume,
> >>>> +regardless of whether it is mounted.
> >>> 
> >>> If it's not mounted it'll still use lvm snapshots right ? It's not
> >>> necessarily clear from the documentation.
> >> 
> >> I think yes, which is safest in case the filesystem suddenly is mounted
> >> in the middle of the scrub.
> > 
> > Definitelly, but that needs to be documented.
> > 
> >> 
> >>>> +.SH SEE ALSO
> >>>> +.BR e2fsck (8)
> >>>> +.SH AUTHOR
> >>>> +Darrick J. Wong <darrick.wong@...cle.com>
> >>>> +.SH COPYRIGHT
> >>>> +Copyright ©2018 Darrick J. Wong.  License is GPLv2+. <http://www.gnu.org/licenses/gpl-2.0.html>
> >>>> diff --git a/scrub/e2scrub.conf.in b/scrub/e2scrub.conf.in
> >>>> new file mode 100644
> >>>> index 0000000..fec828a
> >>>> --- /dev/null
> >>>> +++ b/scrub/e2scrub.conf.in
> >>>> @@ -0,0 +1,10 @@
> >>>> +# e2scrub configuration file
> >>>> +
> >>>> +# Snapshots will be created to run fsck; the snapshot will be of this size.
> >>>> +# snap_size_mb=256
> >>> 
> >>> Do you have any idea what it takes to fill this up ? What kind of
> >>> workload for how long ? Why do you think it's a sane default ?
> >>> 
> >>> I am well aware that there is no right answer I am just curious.
> >> 
> >> I've used 256MB in my lvcheck scripts for years without issues.  It
> >> really depends on how long the e2fsck takes and how much churn there
> >> is in the filesystem during that time.  Reserving a bit too much is not
> >> really harmful, since the space is only used temporarily.
> > 
> > Good to know.
> > 
> >> 
> >> What might be difficult is that users will typically consume all of
> >> the space in the VG from the beginning.  It would be great if distros
> >> reserved a few hundred MB in the VG at installation time for this.
> > 
> > That's very much true, I think lvm people can tell a lot of stories about
> > this one :)
> > 
> >> 
> >>>> diff --git a/scrub/e2scrub.in b/scrub/e2scrub.in
> >>>> new file mode 100644
> >>>> index 0000000..e26b449
> >>>> --- /dev/null
> >>>> +++ b/scrub/e2scrub.in
> >>>> @@ -0,0 +1,162 @@
> >>>> +#!/bin/bash
> >>>> +
> >>>> +#  Copyright (C) 2018 Oracle.  All Rights Reserved.
> >>>> +#
> >>>> +#  Author: Darrick J. Wong <darrick.wong@...cle.com>
> >>>> +#
> >>>> +#  This program is free software; you can redistribute it and/or
> >>>> +#  modify it under the terms of the GNU General Public License
> >>>> +#  as published by the Free Software Foundation; either version 2
> >>>> +#  of the License, or (at your option) any later version.
> >>>> +#
> >>>> +#  This program is distributed in the hope that it would be useful,
> >>>> +#  but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>>> +#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >>>> +#  GNU General Public License for more details.
> >>>> +#
> >>>> +#  You should have received a copy of the GNU General Public License
> >>>> +#  along with this program; if not, write the Free Software Foundation,
> >>>> +#  Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> >>>> +
> >>>> +# Automatically check a LVM-managed filesystem online.
> >>>> +# We use lvm snapshots to do this, which means that we can only
> >>>> +# check filesystems in VGs that have at least 256mb (or so) of
> >>>> +# free space.
> >>>> +
> >>>> +snap_size_mb=256
> >>>> +fstrim=0
> >>>> +e2fsck_opts=""
> >>>> +conffile="@root_sysconfdir@...scrub.conf"
> >>>> +
> >>>> +test -f "${conffile}" && . "${conffile}"
> >>>> +
> >>>> +print_help() {
> >>>> +	echo "Usage: $0 [-t] device"
> >>>> +	echo
> >>>> +	echo "device must be a LVM-managed block device"
> >>>> +	echo "-t: Run fstrim if successful."
> >>>> +}
> >>>> +
> >>>> +print_version() {
> >>>> +	echo "e2scrub @E2FSPROGS_VERSION@ (@E2FSPROGS_DATE@)"
> >>>> +}
> >>>> +
> >>>> +while getopts "tV" opt; do
> >>>> +	case "${opt}" in
> >>>> +	"t") fstrim=1;;
> >>>> +	"V") print_version; exit 0;;
> >>>> +	*) print_help; exit 2;;
> >>>> +	esac
> >>>> +done
> >>>> +shift "$((OPTIND - 1))"
> >>>> +
> >>>> +dev="$1"
> >>>> +if [ -z "${dev}" ]; then
> >>>> +	print_help
> >>>> +	exit 1
> >>>> +elif [ ! -b "${dev}" ]; then
> >>>> +	echo "${dev}: Not a block device?"
> >>>> +	print_help
> >>>> +	exit 16
> >>>> +fi
> >>>> +
> >>>> +# Make sure this is an LVM device we can snapshot
> >>>> +vg="$(lvs --noheadings -o vg_name "${dev}" 2> /dev/null | sed -e 's/^  //g')"
> >>>> +lv="$(lvs --noheadings -o lv_name "${dev}" 2> /dev/null | sed -e 's/^  //g')"
> >>> 
> >>> You could use
> >>> 
> >>> eval $(lvs --nameprefixes -o name,vgname --noheadings "${dev}")
> >>> 
> >>> and have lvm set up $LVM2_VG_NAME and $LVM2_LV_NAME for you
> >>> 
> >>>> +if [ -z "${vg}" ] || [ -z "${lv}" ]; then
> >>>> +	echo "${dev}: Not a LVM device."
> >>>> +	exit 16
> >>>> +fi
> >>>> +start_time="$(date +'%Y%m%d%H%M%S')"
> >>>> +snap="${lv}.e2scrub"
> >>> 
> >>> Let's hope people are not creating these intentionally. Maybe using a
> >>> timestamp might be worthwhile ?
> >>> 
> >>>> +snap_dev="/dev/${vg}/${snap}"
> >>>> +fstype="$(blkid -p -s TYPE "${dev}" | sed -e 's/^.*TYPE="\(.*\)".*$/\1/g')"
> >>> 
> >>> blkid -o value -p -s TYPE "${dev}"
> >>> 
> >>>> +
> >>>> +case "${fstype}" in
> >>>> +"ext2"|"ext3"|"ext4")
> >>> 
> >>> ext[234]
> >>> 
> >>>> +	;;
> >>>> +*)
> >>>> +	echo "${dev}: Filesystem of type ${fstype} not supported."
> >>>> +	exit 16
> >>>> +	;;
> >>>> +esac
> >>>> +
> >>>> +teardown() {
> >>>> +	# Remove and wait for removal to succeed.
> >>>> +	lvremove -f "${vg}/${snap}" 3>&-
> >>>> +	while [ -b "${snap_dev}" ] && [ "$?" -eq "5" ]; do
> >>>> +		sleep 0.5
> >>>> +		lvremove -f "${vg}/${snap}" 3>&-
> >>>> +	done
> >>>> +}
> >>>> +
> >>>> +check() {
> >>>> +	# First we recover the journal, then we see if e2fsck tries any
> >>>> +	# non-optimization repairs.  If either of these two returns a
> >>>> +	# non-zero status (errors fixed or remaining) then this fs is bad.
> >>>> +	E2FSCK_FIXES_ONLY=1
> >>>> +	export E2FSCK_FIXES_ONLY
> >>>> +	${DBG} "@root_sbindir@...fsck" -E journal_only -p ${e2fsck_opts} "${snap_dev}" || return 1
> >>>> +	${DBG} "@root_sbindir@...fsck" -fy ${e2fsck_opts} "${snap_dev}" || return 1
> >>> 
> >>> Why to use -y ? Wouldn't it be better to just run with -p ?
> >>> 
> >>>> +	return 0
> >>>> +}
> >>>> +
> >>>> +mark_clean() {
> >>>> +	${DBG} "@root_sbindir@...ne2fs" -C 0 -T "${start_time}" "${dev}"
> >>> 
> >>> I am not sure about -C 0. I understand that it's there to elliminate the
> >>> mount-count based fsck, but it also means the mount-count is not going
> >>> to be reliable, not sure if that matters much.
> >> 
> >> IMHO, with the e2scrub functionality in place it would make sense to
> >> reinstate the time- and/or mount-based e2fsck.  That would catch the case
> >> if e2scrub or the cron/systemd jobs that run it were broken for some
> >> reason and hadn't checked the filesystem in months.
> > 
> > If the cron/systemd jobs are set up then it might make sense maybe. But
> > again after fs unmount the mount-count might be 0, not sure if that
> > matters to people ar all.
> > 
> >> 
> >>>> +}
> >>>> +
> >>>> +mark_corrupt() {
> >>>> +	${DBG} "@root_sbindir@...ne2fs" -E force_fsck "${dev}"
> >>>> +}
> >>>> +
> >>>> +setup() {
> >>>> +	# Create the snapshot, wait for device to appear
> >>>> +	teardown > /dev/null 2> /dev/null
> >>>> +	lvcreate -s -L "${snap_size_mb}m" -n "${snap}" "${vg}/${lv}" 3>&-
> >>>> +	test $? -ne 0 && return 1
> >>>> +	udevadm settle 2> /dev/null
> >>>> +	return 0
> >>>> +}
> >>>> +
> >>>> +trap "teardown; exit 1" EXIT INT QUIT TERM
> >>>> +if ! setup; then
> >>>> +	echo "Snapshot of ${dev} FAILED, will not check!"
> >>>> +	exit 1
> >>>> +fi
> >>>> +
> >>>> +# Check and react
> >>>> +if check; then
> >>>> +	echo "Scrub of ${dev} succeeded."
> >>>> +	mark_clean
> >>>> +
> >>>> +	if [ "${fstrim}" -eq 1 ] && type fstrim > /dev/null 2>&1; then
> >>>> +		dir="$(lsblk -o MOUNTPOINT -n "${dev}")"
> >>>> +		if [ -d "${dir}" ]; then
> >>>> +			# NB: fstrim fails with snapshot present
> >>>> +			trap '' EXIT
> >>>> +			teardown
> >>>> +			fstrim -v "${dir}"
> >>> 
> >>> This can take a long time, how about some information for the user what
> >>> we're doing ?
> >> 
> >> This will typically be run via cron/systemd, so you don't want it to be
> >> too verbose.
> > 
> > But it does not necessarily have to be run from cron. Then it'll say
> > 
> > Scrub of ${dev} succeeded.
> > 
> > ... then hangs for minutes or potentially even longer, that's not good.
> > 
> > 
> > Remember, if we get it into e2fsprogs it'll get to distributions and
> > people will start using and experimenting with it. People that not
> > necessarily know much about lvm or fstrim. It needs to be well documented
> > and verbose enough for people to at least have some idea of what's going on.
> > 
> > Especially since I think it might attract some attention, if nothing
> > else then just because btrfs has scrub ;)
> > 
> > Regards,
> > -Lukas
> > 
> >> 
> >> Cheers, Andreas
> 
> 
> Cheers, Andreas
> 
> 
> 
> 
>