linux-ext4 - [3.9] Parallel unlinks serialise completely

Message-ID: <20130504013643.GC19978@dastard>
Date:	Sat, 4 May 2013 11:36:43 +1000
From:	Dave Chinner <david@...morbit.com>
To:	linux-ext4@...r.kernel.org
Subject: [3.9] Parallel unlinks serialise completely

Hi folks,

Just an FYI.  I was running a few fsmark workloads to compare
xfs/btrfs/ext4 performance (as I do every so often), and found that
ext4 is serialising unlinks on the orphan list mutex completely. The
script I've been running:

$ cat fsmark-50-test-ext4.sh 
#!/bin/bash

sudo umount /mnt/scratch > /dev/null 2>&1
sudo mkfs.ext4 /dev/vdc
sudo mount /dev/vdc /mnt/scratch
sudo chmod 777 /mnt/scratch
cd /home/dave/src/fs_mark-3.3/
time ./fs_mark  -D  10000  -S0  -n  100000  -s  0  -L  63 \
        -d  /mnt/scratch/0  -d  /mnt/scratch/1 \
        -d  /mnt/scratch/2  -d  /mnt/scratch/3 \
        -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
        -d  /mnt/scratch/6  -d  /mnt/scratch/7 \
        | tee >(stats --trim-outliers | tail -1 1>&2)
sync
sleep 30
sync

echo walking files
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
time (
        for d in /mnt/scratch/[0-9]* ; do

                for i in $d/*; do
                        (
                                echo $i
                                find $i -ctime 1 > /dev/null
                        ) > /dev/null 2>&1
                done &
        done
        wait
)

echo removing files
for f in /mnt/scratch/* ; do time rm -rf $f &  done
wait
$

This is on a 100TB sparse VM image on a RAID0 of 4xSSDs, but that's
pretty much irrelevant to the problem being seen. That is, I'm seeing
just a little over 1 CPU being expended during the unlink phase, and
only one of the 8 rm processes is running at a time.
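
One way to spot-check how many of the rm processes are actually on a CPU
is to count those in state R. This is only a sketch: the post does not say
how the observation was made, and the helper name count_running is
hypothetical. It reads one process state per line, as printed by
`ps -o state=`.

```shell
#!/bin/sh
# Hypothetical helper, not from the original post.
# Reads process states (one per line) on stdin and prints how many
# are in state R (running or runnable).
count_running() {
    awk '$1 ~ /^R/ { n++ } END { print n + 0 }'
}
```

During the unlink phase it could be used as
`ps -C rm -o state= | count_running`; a value that stays around 1 while
8 rm processes exist would match the behaviour described above.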

`perf top -U -G` shows this as the leading 2 CPU consumers:

  11.99%  [kernel]  [k] __mutex_unlock_slowpath
   - __mutex_unlock_slowpath
      - 99.79% mutex_unlock
         + 51.06% ext4_orphan_add
         + 46.86% ext4_orphan_del
           1.04% do_unlinkat
              sys_unlinkat
              system_call_fastpath
              unlinkat
           0.95% vfs_unlink
              do_unlinkat
              sys_unlinkat
              system_call_fastpath
              unlinkat
-   7.14%  [kernel]  [k] __mutex_lock_slowpath
   - __mutex_lock_slowpath
      - 99.83% mutex_lock
         + 81.84% ext4_orphan_add
           11.21% ext4_orphan_del
              ext4_evict_inode
              evict
              iput
              do_unlinkat
              sys_unlinkat
              system_call_fastpath
              unlinkat
         + 3.47% vfs_unlink
         + 3.24% do_unlinkat

and the workload is running at roughly 40,000 context switches/s at
roughly 7000 iops.

All of this looks rather like the unlinks are all serialising on the
orphan list mutex.
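
As a rough userspace analogy (not the ext4 code), the sketch below shows
what a single global lock does to parallel workers: however many are
started, at most one is ever inside the critical section. The function
names, the log format and the mkdir-based lock are all assumptions made
for illustration.

```shell
#!/bin/sh
# Illustrative analogy only; names and lock scheme are hypothetical.
# run_serialised WORKERS ITERATIONS LOGFILE
# Starts WORKERS background workers that each enter a critical section
# ITERATIONS times, all guarded by one shared lock directory.
run_serialised() {
    workers=$1; iters=$2; log=$3
    lock="$log.lock"
    : > "$log"
    rmdir "$lock" 2>/dev/null || true
    w=0
    while [ "$w" -lt "$workers" ]; do
        (
            i=0
            while [ "$i" -lt "$iters" ]; do
                # mkdir is atomic: only one worker can create the directory,
                # so it acts as the single global lock.
                until mkdir "$lock" 2>/dev/null; do sleep 0.01; done
                echo "enter $w" >> "$log"
                echo "exit $w" >> "$log"
                rmdir "$lock"
                i=$((i + 1))
            done
        ) &
        w=$((w + 1))
    done
    wait
}

# max_concurrency LOGFILE
# Prints the peak number of workers inside the critical section at once.
max_concurrency() {
    awk '$1 == "enter" { n++; if (n > max) max = n }
         $1 == "exit"  { n-- }
         END { print max + 0 }' "$1"
}
```

Running `run_serialised 8 100 /tmp/lock.log` followed by
`max_concurrency /tmp/lock.log` reports the peak concurrency under the
lock, which is the same shape as 8 rm processes taking turns on one mutex.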

The overall results of the test are roughly:

        create          find            unlink
ext4    24m21s          8m17s           37m51s
xfs      9m52s          6m53s           13m59s

The other notable thing about the unlink completion is this:

        first rm        last rm
ext4    30m26s          37m51s
xfs     13m52s          13m59s

There is significant unfairness in the behaviour of the parallel
unlinks. The first 3 processes completed by 30m39s, but the last 5
processes all completed between 37m40s and 37m51s, 7 minutes later...
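
For anyone wanting to put numbers on the two tables above, here is a small
sketch that converts the XmYs times to seconds and derives the ext4/xfs
ratio and the first-to-last rm spread. The helper names are hypothetical;
the input values are the ones quoted in this message.

```shell
#!/bin/sh
# Hypothetical helpers for the elapsed times quoted in this message.

# to_seconds 24m21s  prints the time in whole seconds
to_seconds() {
    awk -v t="$1" 'BEGIN { split(t, p, /[ms]/); print p[1] * 60 + p[2] }'
}

# ratio A B  prints A/B to two decimal places (e.g. ext4 time / xfs time)
ratio() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", a / b }'
}

# spread FIRST LAST  prints seconds between first and last rm completing
spread() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}
```

With the figures above, `ratio 37m51s 13m59s` gives 2.71 for the unlink
phase, `spread 30m26s 37m51s` gives 445 seconds for ext4, and
`spread 13m52s 13m59s` gives 7 seconds for xfs.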

FWIW, there is also significant serialisation of the create
workload, but I didn't look at that at all.

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com