Message-ID: <20140215031624.GI9176@birch.djwong.org>
Date: Fri, 14 Feb 2014 19:16:24 -0800
From: "Darrick J. Wong" <darrick.wong@...cle.com>
To: "Theodore Ts'o" <tytso@....edu>
Cc: Jon Bernard <jbernard@...ion.com>,
Dmitry Monakhov <dmonakhov@...nvz.org>,
linux-ext4@...r.kernel.org
Subject: Re: kernel bug at fs/ext4/resize.c:409
Per Ted's request, I've started editing a document on the ext4 wiki:
https://ext4.wiki.kernel.org/index.php/Ext4_VM_Images
[comments below too]
On Fri, Feb 14, 2014 at 06:46:31PM -0500, Theodore Ts'o wrote:
> On Fri, Feb 14, 2014 at 03:19:05PM -0500, Jon Bernard wrote:
> > Ahh, I see. Here's where this comes from: the particular use case is
> > provisioning of new cloud instances whose root volume is of unknown
> > size. The filesystem and its contents are created and bundled
> > beforehand into the smallest filesystem possible. The instance is PXE
> > booted for provisioning, the root filesystem is copied onto the
> > disk, and then resized to take advantage of the total amount of space.
> >
> > In order to support very large partitions, the filesystem is created
> > with an abnormally large inode table so that large resizes would be
> > possible. I traced it to this commit as best I can tell:
> >
> > https://github.com/openstack/diskimage-builder/commit/fb246a02eb2ed330d3cc37f5795b3ed026aabe07
> >
> > I assumed that additional inodes would be allocated along with block
> > groups during an online resize, but that commit contradicts my current
> > understanding.
>
> Additional inodes *are* allocated as the file system is grown.
> Whoever thought otherwise was wrong. What happens is that there is a
> fixed number of inodes per block group. When the file system is
> resized, either by growing or shrinking it, block groups
> are added to or removed from the file system, and the corresponding
> inodes are added or removed along with them.
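The fixed inodes-per-group relationship above is easy to see with a little
arithmetic. This is a sketch using assumed mke2fs defaults (4 KiB blocks,
32768 blocks per group, one inode per 16384 bytes); the real values come
from mke2fs.conf and the superblock, not from these constants:

```python
# Assumed mke2fs defaults -- check mke2fs.conf / dumpe2fs on a real fs.
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE   # one block bitmap block covers the group
INODE_RATIO = 16384                 # bytes of disk per inode ("inode_ratio")

def inodes_per_group():
    # Inodes are provisioned at a fixed density within every block group.
    return BLOCKS_PER_GROUP * BLOCK_SIZE // INODE_RATIO

def total_inodes(fs_blocks):
    # Resizing changes the group count; the inode count simply follows it.
    groups = -(-fs_blocks // BLOCKS_PER_GROUP)   # ceiling division
    return groups * inodes_per_group()

print(inodes_per_group())        # 8192 inodes in every group
print(total_inodes(32768))       # 1 group  -> 8192 inodes
print(total_inodes(4 * 32768))   # grown 4x -> 32768 inodes
```

So there is no way to grow the inode count independently of the group
count, which is why padding the inode table at mke2fs time doesn't buy
anything for online resize.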
>
> > I suggested that the filesystem be created during the time of
> > provisioning to allow a more optimal on-disk layout, and I believe this
> > is being considered now.
>
> What causes the most damage in terms of a non-optimal data block
> layout is installing the file system image on a large file system and
> then shrinking the file system to its minimum size using resize2fs -M.
> There is also some non-optimality that occurs as the file system gets
> filled beyond about 90% full, but it's not nearly so bad as shrinking
> the file system --- which you should avoid at all costs.
>
> From a performance point of view, the only time you should try to do
> an off-line resize2fs shrink is if you are shrinking the file system
> by a handful of blocks as part of converting a file system in place to
> use LVM or LUKS encryption, and you need to make room for some
> metadata blocks at the end of the partition.
>
> The other thing to note is that if you are using a format such
> as qcow2, or something like the device mapper's thin-provisioning
> (thinp) scheme, or if you are willing to deal with sparse files, one
> approach is to not resize the file system at all. You could just use
> a tool like zerofree[1] to zero out all of the unused blocks in the
> file system, and then use "/bin/cp --sparse=always" to cause all-zero
> blocks to be treated as sparse blocks in the destination file.
>
> [1] http://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kvm-xfstests/util/zerofree.c
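For what it's worth, the "--sparse=always" behaviour is simple enough to
sketch: copy block by block, but seek over (instead of writing) any block
that is all zeros, so the destination filesystem can leave a hole there.
A minimal Python illustration of the idea (BLK and the helper name are my
own; real cp uses the filesystem block size):

```python
import os

BLK = 4096  # copy granularity; GNU cp uses the fs block size here

def sparse_copy(src, dst):
    """Copy src to dst, turning all-zero blocks into holes."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(BLK)
            if not chunk:
                break
            if chunk.count(0) == len(chunk):
                # All zeros: advance the write position without writing,
                # leaving a hole in the destination file.
                fout.seek(len(chunk), os.SEEK_CUR)
            else:
                fout.write(chunk)
        # If the file ends in a hole, fix up the size explicitly.
        fout.truncate()
```

Run after zerofree has zeroed the unused blocks, this yields a destination
with the same logical contents but with the free space deallocated.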
I have a zerofree variant that knows how to punch/discard blocks that I'll
throw into contrib/ the next time I send out one of my megapatch sets.
> This is part of how I maintain the root filesystem that I use in a VM
> for testing ext4 changes upstream. After I install the latest
> Debian unstable package updates and the latest builds from the
> xfstests and e2fsprogs git repositories, I run the following
> script, which uses the zerofree.c program to compress the qcow2 root
> file system image that I use with kvm:
>
> http://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kvm-xfstests/compress-rootfs
>
>
> Also, starting with e2fsprogs 1.42.10, there's another way you can
These three options (-rap) are available in 1.42.9. Is there a particular
reason not to use them before 1.42.10?
> efficiently deploy a large file system image by only copying the
> blocks which are in use, by using a command like this:
>
> e2image -rap src_fs dest_fs
>
> (See also the -c flag as described in e2image's man page if you want
> to use this technique to do incremental image-based backups onto a
> flash-based backup medium; I was using this for a while to keep the
> root filesystems on two laptop SSDs in sync with one another.)
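The incremental trick Ted describes boils down to: before writing a block
to the destination, compare it with what is already there, and skip the
write when the contents match, which spares the flash a lot of erase
cycles. A hypothetical Python sketch of that idea (not e2image's actual
implementation; the function name and counters are mine):

```python
import os

BLK = 4096

def incremental_sync(src, dst):
    """Overwrite dst with src, writing only the blocks that differ."""
    written = skipped = 0
    # buffering=0 so interleaved reads and writes hit the file directly
    with open(src, "rb") as fin, open(dst, "r+b", buffering=0) as fout:
        while True:
            new = fin.read(BLK)
            if not new:
                break
            old = fout.read(BLK)
            if new == old:
                skipped += 1                        # already up to date
            else:
                fout.seek(-len(old), os.SEEK_CUR)   # back to block start
                fout.write(new)
                written += 1
        fout.truncate(os.path.getsize(src))         # trim any stale tail
    return written, skipped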
>
> So there are lots of ways that you can do what you need, all without
> playing games with resize2fs. Perhaps some of them would actually be
> better for your use case.
Calvin Watson noted on Ted's G+ repost that one can use fstrim in newer
versions of QEMU (1.5+?) to punch out unused blocks if the virtual disk is
emulated via virtio-scsi.
--D
>
>
> > If it turns out to be not terribly complicated and there is not an
> > immediate time constraint, I would love to try to help with this or at
> > least test patches.
>
> I will hopefully have a bug fix in the next week or two.
>
> Cheers,
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html