linux-kernel - Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120420095247.GA30070@gmail.com>
Date:	Fri, 20 Apr 2012 17:52:47 +0800
From:	Zheng Liu <gnehzuil.liu@...il.com>
To:	Lukas Czerner <lczerner@...hat.com>
Cc:	Eric Sandeen <sandeen@...hat.com>, linux-kernel@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
	Zheng Liu <wenqing.lz@...bao.com>,
	Ric Wheeler <rwheeler@...hat.com>, Ted Ts'o <tytso@....edu>,
	Andreas Dilger <adilger@...mcloud.com>,
	Dave Chinner <david@...morbit.com>
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate

Hi all,

These days I dig this slowdown and find that the *root cause* might be
journal.  I use blktrace to grab a detailed behaviour and find the below
phenonmenon.  I post the blktrace's result in here.

[Test command]
time for((i=0;i<2000;i++)); do \
	dd if=/dev/zero of=/mnt/sda1/testfile conv=notrunc bs=4k count=1 \
	seek=`expr $i \* 16` oflag=sync,direct 2>/dev/null; \
	done

[scripts]
blktrace -d /dev/sda1
blkparse -i sda1.* -o blktrace.log
cat blktrace.log | grep 'D' | grep 'W' > result.log

[result (ext4)]
[cut]
  8,1    0       14    49.969995286    10  D  WS 38011023 + 40 [kworker/0:1]
  8,1    0       20    49.996170768     0  D  WS 38011063 + 8 [swapper/0]
  8,1    0       29    50.006811878 10011  D  WS 278719 + 8 [dd] 
  8,1    0       31    50.013996421    10  D  WS 38011071 + 24 [kworker/0:1]
  8,1    0       37    50.029656811     0  D  WS 38011095 + 8 [swapper/0]
  8,1    1       70    50.039768259 10013  D  WS 278847 + 8 [dd] 
  8,1    0       41    50.046996403    10  D  WS 38011103 + 24 [kworker/0:1]
  8,1    0       47    50.071458802     0  D  WS 38011127 + 8 [swapper/0]
  8,1    1       89    50.081529060 10015  D  WS 278975 + 8 [dd] 
  8,1    0       51    50.088996276    10  D  WS 38011135 + 24 [kworker/0:1]
  8,1    0       57    50.113247880     0  D  WS 38011159 + 8 [swapper/0]
  8,1    1      108    50.123329330 10017  D  WS 279103 + 8 [dd] 
  8,1    0       61    50.130995672    10  D  WS 38011167 + 24 [kworker/0:1]
  8,1    0       67    50.155052076     0  D  WS 38011191 + 8 [swapper/0]
  8,1    1      126    50.165154127 10019  D  WS 279231 + 8 [dd] 
  8,1    0       71    50.172995678    10  D  WS 38011199 + 24 [kworker/0:1]
  8,1    0       77    50.196855020     0  D  WS 38011223 + 8 [swapper/0]
  8,1    1      145    50.206945237 10021  D  WS 279359 + 8 [dd] 
  8,1    0       81    50.214997236    10  D  WS 38011231 + 24 [kworker/0:1]
  8,1    0       87    50.238643778     0  D  WS 38011255 + 8 [swapper/0]
  8,1    1      164    50.248738960 10023  D  WS 279487 + 8 [dd] 
  8,1    0       91    50.255996776    10  D  WS 38011263 + 24 [kworker/0:1]
  8,1    0       97    50.280447549     0  D  WS 38011287 + 8 [swapper/0]
  8,1    1      183    50.290550957 10025  D  WS 279615 + 8 [dd] 
[cut]

We can see that every one write operation needs to do two journal
writes, one writes journal header and data, and another writes commit.

Then I run the same benchmark in xfs to do a comparison.  This
comparison just aims to explain why the slowdown occurs in ext4.

[result (xfs)]
[cut]
  8,1    0       70     0.256951000     0  D WSM 40162600 + 3 [swapper/0]
  8,1    1       50     0.271551873 12575  D  WS 1311 + 8 [dd] 
  8,1    0       78     0.282466586     0  D WSM 40162603 + 3 [swapper/0]
  8,1    1       55     0.296547264 12577  D  WS 1439 + 8 [dd] 
  8,1    0       86     0.307978442     0  D WSM 40162606 + 3 [swapper/0]
  8,1    1       60     0.321578789 12579  D  WS 1567 + 8 [dd] 
  8,1    0       94     0.333494988     0  D WSM 40162609 + 3 [swapper/0]
  8,1    1       65     0.346582549 12581  D  WS 1695 + 8 [dd] 
  8,1    0      102     0.359005937     0  D WSM 40162612 + 3 [swapper/0]
  8,1    1       70     0.371613387 12583  D  WS 1823 + 8 [dd] 
  8,1    0      110     0.384552158     0  D WSM 40162615 + 3 [swapper/0]
  8,1    1       75     0.396604067 12585  D  WS 1951 + 8 [dd] 
  8,1    0      118     0.410062404     0  D WSM 40162618 + 3 [swapper/0]
  8,1    1       80     0.421614702 12587  D  WS 2079 + 8 [dd] 
  8,1    0      126     0.436783655     0  D WSM 40162621 + 3 [swapper/0]
  8,1    1       85     0.454989457 12589  D  WS 2207 + 8 [dd] 
  8,1    0      134     0.470633321     0  D WSM 40162624 + 3 [swapper/0]
  8,1    1       90     0.488311574 12591  D  WS 2335 + 8 [dd] 
  8,1    0      142     0.504477295     0  D WSM 40162627 + 3 [swapper/0]
  8,1    1       95     0.521675622 12593  D  WS 2463 + 8 [dd] 
  8,1    0      150     0.538326978     0  D WSM 40162630 + 3 [swapper/0]
  8,1    1      100     0.555016257 12595  D  WS 2591 + 8 [dd] 
  8,1    0      158     0.563839349     0  D WSM 40162633 + 4 [swapper/0]
  8,1    1      105     0.580049767 12597  D  WS 2719 + 8 [dd] 
  8,1    0      166     0.589336947     0  D WSM 40162637 + 4 [swapper/0]
  8,1    1      110     0.605037173 12599  D  WS 2847 + 8 [dd] 
  8,1    0      174     0.614850369     0  D WSM 40162641 + 4 [swapper/0]
  8,1    1      115     0.630078920 12601  D  WS 2975 + 8 [dd] 
[cut]

The result shows that every one write in xfs only needs to do one
journal write.  Meanwhile the size of journal write is smaller than the
size of ext4.  Certainly, in modern disk, I believe that the disk writes
64k as easily as 4k.  So, IMO, twice journal writes is the *root
*cause*.  I use 'wc -l' to roughly count the number of I/Os.

$ wc -l result.log
ext4:	6018
xfs:	4021
ratio:	1.5:1

The benchmark result:
ext4:	75.615s
xfs:	54.395s
ratio:	1.4:1

Now, in JBD2, it at least needs to do two journal writes.  I have an
idea to solve this problem, and I will send a new RFC to the mailing
list.

Please feel free to review it.  Any comments are welcome.  Thank you.

Regards,
Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/