Date:	Fri, 4 Sep 2015 13:32:33 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Dave Chinner <david@...morbit.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Waiman Long <Waiman.Long@...com>,
	Ingo Molnar <mingo@...nel.org>
Subject: Re: [4.2, Regression] Queued spinlocks cause major XFS performance
 regression

On Fri, Sep 04, 2015 at 06:12:34PM +1000, Dave Chinner wrote:
> You probably don't even need a VM to reproduce it - that would
> certainly be an interesting counterpoint if it didn't....

Even though you managed to restore your DEBUG_SPINLOCK performance by
changing virt_queued_spin_lock() to use __delay(1), I ran the thing on
actual hardware just to test.
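
For reference, that fallback looks roughly like this (a from-memory
sketch of the v4.2 arch/x86/include/asm/qspinlock.h code, not a
verbatim quote; Dave's tweak swaps the cpu_relax() in the spin loop
for __delay(1)):

static inline bool virt_queued_spin_lock(struct qspinlock *lock)
{
        if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
                return false;

        /* Plain test-and-set spin on the shared lock word; this
         * cpu_relax() is what the __delay(1) change replaces. */
        while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
                cpu_relax();

        return true;
}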

[ Note: In any case, I would recommend you use (or at least try)
  PARAVIRT_SPINLOCKS if you use VMs, as that is where we were looking
  for performance; the test-and-set fallback really wasn't meant as a
  performance option (although it clearly sucks worse than expected).

  Pre qspinlock, your setup would have used regular ticket locks on
  vCPUs, which mostly works as long as there is almost no vCPU
  preemption; if you overload your machine such that the vCPU threads
  get preempted, that will implode into silly-land (see the sketch
  below). ]
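
To make that failure mode concrete, here is a minimal ticket lock in
plain C11 (illustrative only, names included; the kernel's
arch_spinlock_t differs in detail). Handoff is strict FIFO, so if the
vCPU holding the next ticket is preempted, every waiter behind it
burns cycles until the host scheduler gets back to it:

#include <stdatomic.h>

struct ticket_lock {
        atomic_uint next;         /* next ticket to hand out */
        atomic_uint now_serving;  /* ticket currently allowed in */
};

static void ticket_spin_lock(struct ticket_lock *l)
{
        unsigned int me = atomic_fetch_add(&l->next, 1);

        /* Strict FIFO handoff: we cannot get in before everyone who
         * queued ahead of us, even if they are preempted vCPUs. */
        while (atomic_load(&l->now_serving) != me)
                ; /* cpu_relax() in kernel code */
}

static void ticket_spin_unlock(struct ticket_lock *l)
{
        atomic_fetch_add(&l->now_serving, 1);
}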

So on to native performance:

 - IVB-EX, 4-socket, 15 core, hyperthreaded, for a total of 120 CPUs
 - 1.1T of md-stripe (5x200GB) SSDs
 - Linux v4.2 (distro style .config)
 - Debian "testing" base system
 - xfsprogs v3.2.1


# mkfs.xfs -f -m "crc=1,finobt=1" /dev/md0
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md0               isize=512    agcount=32, agsize=9157504 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1
data     =                       bsize=4096   blocks=293038720, imaxpct=5
         =                       sunit=128    swidth=640 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=143088, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mount -o logbsize=262144,nobarrier /dev/md0 /mnt/scratch

# ./fs_mark  -D  10000  -S0  -n  50000  -s  0  -L  32 \
         -d  /mnt/scratch/0  -d  /mnt/scratch/1 \
         -d  /mnt/scratch/2  -d  /mnt/scratch/3 \
         -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
         -d  /mnt/scratch/6  -d  /mnt/scratch/7 \
         -d  /mnt/scratch/8  -d  /mnt/scratch/9 \
         -d  /mnt/scratch/10  -d  /mnt/scratch/11 \
         -d  /mnt/scratch/12  -d  /mnt/scratch/13 \
         -d  /mnt/scratch/14  -d  /mnt/scratch/15
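
(That works out to 16 directories x 50,000 zero-length files per
iteration x 32 iterations, i.e. 25.6M file creates in total; each
iteration adds 800k to the cumulative count column below.)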


Regular v4.2 (qspinlock) does:

FSUse%        Count         Size    Files/sec     App Overhead
     0      6400000            0     286491.9          3500179
     0      7200000            0     293229.5          3963140
     0      8000000            0     271182.4          3708212
     0      8800000            0     300592.0          3595722

Modified v4.2 (ticket) does:

FSUse%        Count         Size    Files/sec     App Overhead
     0      6400000            0     310419.6          3343821
     0      7200000            0     348346.5          4721133
     0      8000000            0     328098.2          3235753
     0      8800000            0     316765.3          3238971


Which shows that qspinlock is clearly slower (ticket comes out roughly
13% faster on files/sec across these samples, ~326k vs ~288k), even on
these large-ish NUMA boxes where it was supposed to be better.
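
For contrast with the ticket lock above: qspinlock queues waiters
MCS-style so each CPU spins on its own cache line rather than on the
shared lock word, which is exactly why it was expected to win on big
NUMA boxes. A bare-bones MCS sketch in the same illustrative C11 (the
real qspinlock packs this state into a 32-bit lock word, which this
ignores):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool locked;
};

struct mcs_lock {
        _Atomic(struct mcs_node *) tail;
};

static void mcs_spin_lock(struct mcs_lock *l, struct mcs_node *me)
{
        struct mcs_node *prev;

        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);

        /* Atomically append ourselves to the waiter queue. */
        prev = atomic_exchange(&l->tail, me);
        if (prev) {
                atomic_store(&prev->next, me);
                /* Spin on our *own* node, not the shared lock word. */
                while (atomic_load(&me->locked))
                        ;
        }
}

static void mcs_spin_unlock(struct mcs_lock *l, struct mcs_node *me)
{
        struct mcs_node *next = atomic_load(&me->next);

        if (!next) {
                /* No successor visible: try to reset tail to empty. */
                struct mcs_node *expected = me;
                if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
                        return;
                /* A successor is mid-enqueue; wait for the link. */
                while (!(next = atomic_load(&me->next)))
                        ;
        }
        atomic_store(&next->locked, false);
}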

Clearly the benchmarks we used before this were not sufficient, and
more work needs to be done.


Also, I note that after running to completion, there is only 14G of
actual data on the device, so you don't need silly large storage to run
this -- I expect your previous 275G quote was due to XFS populating the
sparse file with meta-data or something along those lines.

Further note: rm -rf /mnt/scratch/0/* takes for bloody ever :-)