Date:	Fri, 4 Sep 2015 13:32:33 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Dave Chinner <david@...morbit.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Waiman Long <Waiman.Long@...com>,
	Ingo Molnar <mingo@...nel.org>
Subject: Re: [4.2, Regression] Queued spinlocks cause major XFS performance
 regression

On Fri, Sep 04, 2015 at 06:12:34PM +1000, Dave Chinner wrote:
> You probably don't even need a VM to reproduce it - that would
> certainly be an interesting counterpoint if it didn't....

Even though you managed to restore your DEBUG_SPINLOCK performance by
changing virt_queued_spin_lock() to use __delay(1), I ran the thing on
actual hardware just to test.
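
For reference, that fallback looks roughly like this (a from-memory
sketch of the v4.2 arch/x86/include/asm/qspinlock.h code, not a
verbatim quote; Dave's tweak swaps the cpu_relax() in the spin loop
for __delay(1)):

static inline bool virt_queued_spin_lock(struct qspinlock *lock)
{
        if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
                return false;

        /* Plain test-and-set spin on the shared lock word; this
         * cpu_relax() is what the __delay(1) change replaces. */
        while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
                cpu_relax();

        return true;
}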

[ Note: In any case, I would recommend you use (or at least try)
  PARAVIRT_SPINLOCKS if you use VMs, as that is where we were looking
  for performance; the test-and-set fallback really wasn't meant as a
  performance option (although it clearly sucks worse than expected).

  Pre qspinlock, your setup would have used regular ticket locks on
  vCPUs, which mostly works as long as there is almost no vCPU
  preemption; if you overload your machine such that the vCPU threads
  get preempted, that will implode into silly-land (see the sketch
  below). ]
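
To make that failure mode concrete, here is a minimal ticket lock in
plain C11 (illustrative only, names included; the kernel's
arch_spinlock_t differs in detail). Handoff is strict FIFO, so if the
vCPU holding the next ticket is preempted, every waiter behind it
burns cycles until the host scheduler gets back to it:

#include <stdatomic.h>

struct ticket_lock {
        atomic_uint next;         /* next ticket to hand out */
        atomic_uint now_serving;  /* ticket currently allowed in */
};

static void ticket_spin_lock(struct ticket_lock *l)
{
        unsigned int me = atomic_fetch_add(&l->next, 1);

        /* Strict FIFO handoff: we cannot get in before everyone who
         * queued ahead of us, even if they are preempted vCPUs. */
        while (atomic_load(&l->now_serving) != me)
                ; /* cpu_relax() in kernel code */
}

static void ticket_spin_unlock(struct ticket_lock *l)
{
        atomic_fetch_add(&l->now_serving, 1);
}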

So on to native performance:

 - IVB-EX, 4-socket, 15 core, hyperthreaded, for a total of 120 CPUs
 - 1.1T of md-stripe (5x200GB) SSDs
 - Linux v4.2 (distro style .config)
 - Debian "testing" base system
 - xfsprogs v3.2.1


# mkfs.xfs -f -m "crc=1,finobt=1" /dev/md0
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md0               isize=512    agcount=32, agsize=9157504 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1
data     =                       bsize=4096   blocks=293038720, imaxpct=5
         =                       sunit=128    swidth=640 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=143088, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mount -o logbsize=262144,nobarrier /dev/md0 /mnt/scratch

# ./fs_mark  -D  10000  -S0  -n  50000  -s  0  -L  32 \
         -d  /mnt/scratch/0  -d  /mnt/scratch/1 \
         -d  /mnt/scratch/2  -d  /mnt/scratch/3 \
         -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
         -d  /mnt/scratch/6  -d  /mnt/scratch/7 \
         -d  /mnt/scratch/8  -d  /mnt/scratch/9 \
         -d  /mnt/scratch/10  -d  /mnt/scratch/11 \
         -d  /mnt/scratch/12  -d  /mnt/scratch/13 \
         -d  /mnt/scratch/14  -d  /mnt/scratch/15
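
(That works out to 16 directories x 50,000 zero-length files per
iteration x 32 iterations, i.e. 25.6M file creates in total; each
iteration adds 800k to the cumulative count column below.)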


Regular v4.2 (qspinlock) does:

FSUse%        Count         Size    Files/sec     App Overhead
     0      6400000            0     286491.9          3500179
     0      7200000            0     293229.5          3963140
     0      8000000            0     271182.4          3708212
     0      8800000            0     300592.0          3595722

Modified v4.2 (ticket) does:

FSUse%        Count         Size    Files/sec     App Overhead
     0      6400000            0     310419.6          3343821
     0      7200000            0     348346.5          4721133
     0      8000000            0     328098.2          3235753
     0      8800000            0     316765.3          3238971


Which shows that qspinlock is clearly slower (ticket comes out roughly
13% faster on files/sec across these samples, ~326k vs ~288k), even on
these large-ish NUMA boxes where it was supposed to be better.
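
For contrast with the ticket lock above: qspinlock queues waiters
MCS-style so each CPU spins on its own cache line rather than on the
shared lock word, which is exactly why it was expected to win on big
NUMA boxes. A bare-bones MCS sketch in the same illustrative C11 (the
real qspinlock packs this state into a 32-bit lock word, which this
ignores):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool locked;
};

struct mcs_lock {
        _Atomic(struct mcs_node *) tail;
};

static void mcs_spin_lock(struct mcs_lock *l, struct mcs_node *me)
{
        struct mcs_node *prev;

        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);

        /* Atomically append ourselves to the waiter queue. */
        prev = atomic_exchange(&l->tail, me);
        if (prev) {
                atomic_store(&prev->next, me);
                /* Spin on our *own* node, not the shared lock word. */
                while (atomic_load(&me->locked))
                        ;
        }
}

static void mcs_spin_unlock(struct mcs_lock *l, struct mcs_node *me)
{
        struct mcs_node *next = atomic_load(&me->next);

        if (!next) {
                /* No successor visible: try to reset tail to empty. */
                struct mcs_node *expected = me;
                if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
                        return;
                /* A successor is mid-enqueue; wait for the link. */
                while (!(next = atomic_load(&me->next)))
                        ;
        }
        atomic_store(&next->locked, false);
}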

Clearly the benchmarks we used before this were not sufficient, and
more work needs to be done.


Also, I note that after running to completion, there is only 14G of
actual data on the device, so you don't need silly large storage to run
this -- I expect your previous 275G quote was due to XFS populating the
sparse file with meta-data or something along those lines.

Further note: rm -rf /mnt/scratch/0/* takes for bloody ever :-)