[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <53C7A5CA.4050903@free.fr>
Date: Thu, 17 Jul 2014 12:30:34 +0200
From: Mason <mpeg.blue@...e.fr>
To: Andreas Dilger <adilger@...ger.ca>
CC: Ext4 Developers List <linux-ext4@...r.kernel.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: After unlinking a large file on ext4, the process stalls for a
long time
Hello,
Andreas Dilger wrote:
> Mason wrote:
>
>> The use case is
>> - allocate a large file
>> - stick a file system on it
>> - store stuff (typically video files) inside this "private" FS
>> - when the user decides he doesn't need it anymore, unmount and unlink
>> (I also have a resize operation in there, but I wanted to get the
>> basics before taking the hard stuff head on.)
>>
>> So, in the limit, we don't store anything at all: just create and
>> immediately delete. This was my test.
>
> I would agree that LVM is the real solution that you want to use.
> It is specifically designed for this, and has much less overhead than
> a filesystem on a loopback device on a file on another filesystem.
> The amount of space overhead is tuneable, but typically the volumes
> are allocated in multiples of 4MB chunks.
I'll take a look at LVM. (But, at this point, it's too late to change
the architecture of the system.)
> That said, I think you've found some kind of strange performance problem,
> and it is worthwhile to figure this out.
>
>>>> /tmp # time ./foo /mnt/hdd/xxx 5
>>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms]
>>>> unlink(filename): 0 [0 ms]
>>>> 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k
>>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>>
>>>> /tmp # time ./foo /mnt/hdd/xxx 10
>>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms]
>>>> unlink(filename): 0 [0 ms]
>>>> 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
>>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>>
>>>> /tmp # time ./foo /mnt/hdd/xxx 100
>>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms]
>>>> unlink(filename): 0 [0 ms]
>>>> 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k
>>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>>
>>>> /tmp # time ./foo /mnt/hdd/xxx 300
>>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms]
>>>> unlink(filename): 0 [0 ms]
>>>> 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
>>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
Preliminary info:
The partition was created/mounted with
$ mkfs.ext4 -m 0 -i 1024000 -L ZOZO -O ^has_journal,^huge_file /dev/sda1
$ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime
(mount is busybox, in case it matters)
mke2fs 1.42.10 (18-May-2014)
/dev/sda1 contains a ext4 file system labelled 'ZOZO'
last mounted on /mnt/hdd on Wed Jul 16 15:40:40 2014
Proceed anyway? (y,n) y
Creating filesystem with 104857600 4k blocks and 460800 inodes
Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000
Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done
/dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1)
/* No support for xattr in this kernel */
# dumpe2fs -h /dev/sda1
dumpe2fs 1.42.10 (18-May-2014)
Filesystem volume name: ZOZO
Last mounted on: <not available>
Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: not clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 460800
Block count: 104857600
Reserved block count: 0
Free blocks: 104803944
Free inodes: 460789
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 999
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 144
Inode blocks per group: 9
Flex block group size: 16
Filesystem created: Thu Jul 17 11:14:27 2014
Last mount time: Thu Jul 17 11:14:29 2014
Last write time: Thu Jul 17 11:14:29 2014
Mount count: 1
Maximum mount count: -1
Last checked: Thu Jul 17 11:14:27 2014
Check interval: 0 (<none>)
Lifetime writes: 4883 kB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group unknown)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Default directory hash: half_md4
Directory Hash Seed: 157f2107-76fc-417b-9a07-491951c873b7
> Firstly, have you tried using "fallocate()" directly, instead of
> posix_fallocate()? It may be (depending on your userspace) that
> posix_fallocate() is writing zeroes to the file instead of using
> the fallocate() syscall, and the kernel is busy cleaning up all
> of the dirty pages when the file is unlinked. You could try using
> strace to see what system calls are actually being used.
Unfortunately, I'm using a prehistoric version of glibc (2.8)
that doesn't support the fallocate wrapper (imported in 2.10).
I'm 70% sure that posix_fallocate() is not actually writing zeros
to the file, because when I tested it on ext2, creating a 300-GB
file took hours, literally (approx. 3 hours). The same operation
on ext4 takes a few seconds. (Although, now that I think of it,
it could be working asynchronously, or defer some operation, that
I eventually have to pay for on deletion.)
# time strace -tt -T ./foo /mnt/hdd/xxx 300 2> strace.out
posix_fallocate(fd, 0, size_in_GiB << 30): 0 [414 ms]
unlink(filename): 0 [1 ms]
12:23:27.218838 open("/mnt/hdd/xxx", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 3 <0.000486>
12:23:27.220121 clock_gettime(CLOCK_MONOTONIC, {79879, 926227018}) = 0 <0.000105>
12:23:27.221029 SYS_4320() = 0 <0.412013>
12:23:27.633673 clock_gettime(CLOCK_MONOTONIC, {79880, 339646593}) = 0 <0.000104>
12:23:27.634657 fstat64(1, {st_mode=S_IFCHR|0755, st_rdev=makedev(4, 64), ...}) = 0 <0.000116>
12:23:27.636187 ioctl(1, TIOCNXCL, {B115200 opost isig icanon echo ...}) = 0 <0.000146>
12:23:27.637509 old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77248000 <0.000143>
12:23:27.638306 write(1, "posix_fallocate(fd, 0, size_in_G"..., 54) = 54 <0.000237>
12:23:27.639496 clock_gettime(CLOCK_MONOTONIC, {79880, 345448452}) = 0 <0.000102>
12:23:27.640168 unlink("/mnt/hdd/xxx") = 0 <0.000231>
12:23:27.641174 clock_gettime(CLOCK_MONOTONIC, {79880, 347202581}) = 0 <0.000100>
12:23:27.641984 write(1, "unlink(filename): 0 [1 ms]\n", 27) = 27 <0.000157>
12:23:27.643056 exit_group(0) = ?
0.02user 111.51system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 864maxresident)k
0inputs+0outputs (0major+459minor)pagefaults 0swaps
AFAICT, SYS_4320() is fallocate.
/*
* Linux o32 style syscalls are in the range from 4000 to 4999.
*/
#define __NR_Linux 4000
#define __NR_fallocate (__NR_Linux + 320)
Where is the process stalling? That is a mystery. Seems it's stuck
in exit_group(), waiting for the kernel to clean up on its behalf?
Maybe I need ftrace, or something to profile the kernel?
> Secondly, where is the process actually stuck? From your output
> above, the unlink() call takes no measurable time before returning,
> so I don't see where it is actually stuck. Again, running your
> test with "strace -tt -T ./foo /mnt/hdd/xxx 300" will show which
> syscall is actually taking so much time to complete. I don't
> think it is unlink().
See above, the process is stalled, but I don't know where!
--
Regards.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists