Message-ID: <06724CF51D6BC94E9BEE7A8A8CB82A6740FE22BD23@MX01A.corp.emc.com>
Date: Wed, 23 Sep 2015 12:04:49 -0400
From: "Pocas, Jamie" <Jamie.Pocas@....com>
To: "Theodore Ts'o" <tytso@....edu>
CC: Eric Sandeen <sandeen@...hat.com>,
"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>
Subject: RE: resize2fs stuck in ext4_group_extend with 100% CPU Utilization
With Small Volumes
Interesting. Thanks for the detailed break-down! I don't mind the workaround of using 4k "soft" block size on the filesystem, even for smaller filesystems. Now that I understand better, I think you were on target with your earlier explanation of bd_set_size(). So this means it's not an ext4 bug. I think the online resize of loopback device (or any other block device driver) should use something like the code in check_disk_size_change() instead of bd_set_size(). I will have to test this out. Thanks again.
Regards,
- Jamie
-----Original Message-----
From: Theodore Ts'o [mailto:tytso@....edu]
Sent: Wednesday, September 23, 2015 11:14 AM
To: Pocas, Jamie
Cc: Eric Sandeen; linux-ext4@...r.kernel.org
Subject: Re: resize2fs stuck in ext4_group_extend with 100% CPU Utilization With Small Volumes
On Wed, Sep 23, 2015 at 12:20:17AM -0400, Pocas, Jamie wrote:
> Ted, just to add another data point, with some minor adjustments to
> the script to use xfs instead, such as using "mkfs.xfs -b size=1024"
> to force 1k blocks, I cannot reproduce the issue and the data block
> size doesn't change from 1k.
Yes, that's not surprising, because XFS doesn't use the buffer cache layer. Ext4 does, because that's the basis of how the jbd2 layer works. It does change the block size as reported by the block device and which is used by the buffer cache layer, though. (Internally, this is known as the "soft" block size; it's basically the unit in which data is cached in the buffer cache layer):
root@...-xfstests:~# truncate -s 100M /tmp/foo.img
root@...-xfstests:~# mkfs.xfs -b size=1024 /tmp/foo.img
meta-data=/tmp/foo.img isize=512 agcount=4, agsize=25600 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0
data = bsize=1024 blocks=102400, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=1024 blocks=2573, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
root@...-xfstests:~# mount -o loop /tmp/foo.img /mnt
root@...-xfstests:~# blockdev --getbsz /dev/loop0
1024
root@...-xfstests:~# losetup -c /dev/loop0
root@...-xfstests:~# blockdev --getbsz /dev/loop0
4096 <--------- BUG, note the change in the block size
root@...-xfstests:~# touch /mnt/foo
root@...-xfstests:~# sync
<------ The reason why we don't hang is that XFS doesn't use the
<------ buffer cache
root@...-xfstests:~# umount /mnt
Also feel free to try my repro, but using "blockdev --getbsz /dev/loop0" before and after the losetup -c command, and note that it does hang even though there is no resize2fs in the command sequence at all:
root@...-xfstests:~# cp /dev/null /tmp/foo.img
root@...-xfstests:~# truncate -s 100M /tmp/foo.img
root@...-xfstests:~# mke2fs -t ext4 /tmp/foo.img
mke2fs 1.43-WIP (18-May-2015)
Discarding device blocks: done
Creating filesystem with 102400 1k blocks and 25688 inodes
Filesystem UUID: 27dfdbbe-f3a9-48a7-abe8-5a52798a9849
Superblock backups stored on blocks:
8193, 24577, 40961, 57345, 73729
Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done
root@...-xfstests:~# mount -o loop /tmp/foo.img /mnt
root@...-xfstests:~# blockdev --getbsz /dev/loop0
1024
root@...-xfstests:~# losetup -c /dev/loop0
root@...-xfstests:~# blockdev --getbsz /dev/loop0
4096 <------------ BUG
root@...-xfstests:~# touch /mnt/foo
<------- Should hang here, even though there is no resize2fs command
<------- If it doesn't hang right away, try typing the "sync" command
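A rough model of why the reported size jumps from 1024 to 4096 after "losetup -c" is sketched below. It assumes that bd_set_size() in kernels of that era recomputes the soft block size from the device size, starting at the logical block size and doubling while the size stays aligned, up to the page size. This is written from memory of that logic and is not a copy of the kernel code; the function name and the PAGE_SIZE value are illustrative assumptions.

```python
PAGE_SIZE = 4096  # assumption: 4k pages, as on typical x86 systems


def soft_block_size_after_set_size(size_bytes, logical_block_size=512):
    """Model (assumed) of how bd_set_size() picks a soft block size.

    Start from the logical block size and keep doubling while the
    device size is still a multiple of the doubled value, stopping
    at PAGE_SIZE.
    """
    bsize = logical_block_size
    while bsize < PAGE_SIZE:
        if size_bytes & bsize:
            # The size is not a multiple of 2 * bsize: stop here.
            break
        bsize <<= 1
    return bsize


# A 100M image is a multiple of 4096, so the model lands on 4096,
# regardless of the 1k block size the mounted file system asked for.
print(soft_block_size_after_set_size(100 * 1024 * 1024))
```

Under this model the result depends only on the device size, which is consistent with the mounted file system's 1k setting being overwritten in the transcript above.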
> Suffer this small analogy
> for me and let me know where I am wrong: say hypothetically I expand a
> small partition (or LVM for that matter). Then I try to use resize2fs
> to grow the ext filesystem on it. I expect that this should *not*
> change the block size of the underlying device (of course not!) nor
> the filesystem's block size.
Your misunderstanding comes from not realizing that there are actually 4 different concepts of block/sector size:
* The logical block/sector size of the underlying storage device
- Retrieved via "blockdev --getss /dev/sdXX"
- This is the smallest unit that can be sent to the disk from
the Host OS. If the logical sector size is different from
the physical block size, and a write is smaller than the
physical sector size (see below), then the disk will do a
read-modify-write.
- The file system block size MUST be greater than or equal to
the logical sector size.
* The physical block/sector size of the underlying storage device
- Retrieved via "blockdev --getpbsz /dev/sdXX"
- This is the smallest unit that can be physically written to the
storage media.
- The file system block size SHOULD be greater than or equal
to the physical sector size. (To avoid read-modify-write
operations by the hard drive that will be bad for performance.)
* The "soft" block size of the block device.
- Retrieved via "blockdev --getbsz /dev/sdXX"
- This represents the unit of storage which is used to cache
data in the buffer cache. This only matters if you are
using the buffer cache --- for example, if you are doing
buffered I/O to a block device, or if you are using a file
system such as ext4 which uses the buffer cache. Since data
is indexed in the buffer cache by the 3-tuple (block device,
block number, block size), Bad Things happen if you try to
change the block size while the file system is mounted.
Normally, the kernel will prevent you from changing the
block size under these circumstances.
* The file system block size.
- Retrieved by some file-system dependent command. For ext4,
this is "dumpe2fs -h".
- Set at format time. For file systems that use the buffer
cache, the file system driver will automatically set the
"soft" block size of the block device when the file system
is mounted.
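To make the "Bad Things" point about the soft block size concrete, here is a toy model of a cache indexed by the 3-tuple (block device, block number, block size). It is only an illustration: ToyBufferCache and its methods are hypothetical names, and the real kernel buffer cache is far more involved. The point is that once the block size changes, the same bytes on disk are looked up under a different key, so previously cached buffers are no longer found.

```python
class ToyBufferCache:
    """Toy stand-in for a buffer cache; not the kernel's data structure."""

    def __init__(self):
        self._buffers = {}

    @staticmethod
    def _key(device, byte_offset, block_size):
        # Index by (block device, block number, block size).
        return (device, byte_offset // block_size, block_size)

    def store(self, device, byte_offset, block_size, data):
        self._buffers[self._key(device, byte_offset, block_size)] = data

    def lookup(self, device, byte_offset, block_size):
        return self._buffers.get(self._key(device, byte_offset, block_size))


cache = ToyBufferCache()

# The file system caches some metadata at byte offset 8192 in 1k units.
cache.store("loop0", 8192, 1024, b"metadata")

# Same soft block size: the cached buffer is found.
hit = cache.lookup("loop0", 8192, 1024)

# Soft block size changed to 4k behind the file system's back:
# the key is now ("loop0", 2, 4096) instead of ("loop0", 8, 1024).
miss = cache.lookup("loop0", 8192, 4096)

print(hit, miss)
```

The device name "loop0" and the offsets are arbitrary values chosen for the illustration.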
Speaking of LVM, I can't reproduce the problem using LVM, at least not with a 4.3-rc2 kernel:
root@...-xfstests:~# pvcreate /dev/vdc
Physical volume "/dev/vdc" successfully created
root@...-xfstests:~# vgcreate test /dev/vdc
Volume group "test" successfully created
root@...-xfstests:~# lvcreate -L 100M -n small /dev/test
Logical volume "small" created
root@...-xfstests:~# mkfs.ext4 -Fq /dev/test/small
root@...-xfstests:~# mount -o loop /dev/test/small /mnt
root@...-xfstests:~# blockdev --getbsz /dev/loop0
1024
root@...-xfstests:~# lvresize -L 1G /dev/test/small
Size of logical volume test/small changed from 100.00 MiB (25 extents) to 1.00 GiB (256 extents).
Logical volume small successfully resized
root@...-xfstests:~# blockdev --getbsz /dev/loop0
1024 <------ NO BUG, see the block size has not changed
root@...-xfstests:~# lvcreate -L 100M -n small /dev/test^C
root@...-xfstests:~# touch /mnt/foo ; sync
root@...-xfstests:~# resize2fs /dev/test/small
resize2fs 1.43-WIP (18-May-2015)
Filesystem at /dev/test/small is mounted on /mnt; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 8
The filesystem on /dev/test/small is now 1048576 (1k) blocks long.
<------ Note that resize2fs works just fine!
root@...-xfstests:~# touch /mnt/bar ; sync
root@...-xfstests:~# umount /mnt
root@...-xfstests:~#
You might see if this works on CentOS; but if it doesn't, I'm pretty convinced this is a bug outside of ext4, and I've already given you a workaround --- using "-b 4096" on the command line to mkfs.ext4 or mke2fs.
Alternatively, here's another workaround; you can modify your /etc/mke2fs.conf so the "small" and "floppy" stanzas read:
[fs_types]
    small = {
        blocksize = 4096
        inode_size = 128
        inode_ratio = 4096
    }
    floppy = {
        blocksize = 4096
        inode_size = 128
        inode_ratio = 8192
    }
I'm pretty certain your failures won't reproduce if you either change how you call mke2fs for small file systems, or change your /etc/mke2fs.conf file as shown above.
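For scripts that create small file systems, the first workaround could be wrapped along these lines. This is a sketch under assumptions: the size thresholds (under 3 MiB selects "floppy", under 512 MiB selects "small") are taken from my reading of the mke2fs(8) description of the -T option and should be checked against your e2fsprogs version, and the helper names are hypothetical. The function only builds the command line and does not run it.

```python
MIB = 1024 * 1024


def mke2fs_size_type(size_bytes):
    """Guess which size-based fs_types stanza mke2fs would pick.

    Thresholds are assumptions based on mke2fs(8); sizes of 512 MiB
    and above are lumped together here as "other".
    """
    if size_bytes < 3 * MIB:
        return "floppy"
    if size_bytes < 512 * MIB:
        return "small"
    return "other"


def mke2fs_command(target, size_bytes):
    """Build an mke2fs command, forcing 4k blocks on small targets."""
    cmd = ["mke2fs", "-t", "ext4"]
    if mke2fs_size_type(size_bytes) in ("floppy", "small"):
        # Override the 1k default that the small/floppy stanzas select.
        cmd += ["-b", "4096"]
    cmd.append(target)
    return cmd


# The 100M image from the repro falls into the "small" category.
print(mke2fs_command("/tmp/foo.img", 100 * MIB))
```

Passing the resulting list to subprocess.run() would then format the image with 4k blocks, which is the same effect as editing the two stanzas in /etc/mke2fs.conf.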
Cheers,
- Ted