linux-kernel - Re: [PATCH 2/7] locking/rwsem: more aggressive use of optimistic spinning

Message-ID: <20140815033447.GJ20518@dastard>
Date:	Fri, 15 Aug 2014 13:34:48 +1000
From:	Dave Chinner <david@...morbit.com>
To:	Waiman Long <waiman.long@...com>
Cc:	Jason Low <jason.low2@...com>, Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	linux-kernel@...r.kernel.org, Davidlohr Bueso <davidlohr@...com>,
	Scott J Norton <scott.norton@...com>
Subject: Re: [PATCH 2/7] locking/rwsem: more aggressive use of optimistic
 spinning

On Wed, Aug 13, 2014 at 12:41:06PM -0400, Waiman Long wrote:
> On 08/13/2014 01:51 AM, Dave Chinner wrote:
> >On Mon, Aug 04, 2014 at 11:44:19AM -0400, Waiman Long wrote:
> >>On 08/04/2014 12:10 AM, Jason Low wrote:
> >>>On Sun, 2014-08-03 at 22:36 -0400, Waiman Long wrote:
> >>>>The rwsem_can_spin_on_owner() function currently allows optimistic
> >>>>spinning only if the owner field is defined and is running. That is
> >>>>too conservative as it will cause some tasks to miss the opportunity
> >>>>of doing spinning in case the owner hasn't been able to set the owner
> >>>>field in time or the lock has just become available.
> >>>>
> >>>>This patch enables more aggressive use of optimistic spinning by
> >>>>assuming that the lock is spinnable unless proved otherwise.
> >>>>
> >>>>Signed-off-by: Waiman Long<Waiman.Long@...com>
> >>>>---
> >>>>  kernel/locking/rwsem-xadd.c |    2 +-
> >>>>  1 files changed, 1 insertions(+), 1 deletions(-)
> >>>>
> >>>>diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> >>>>index d058946..dce22b8 100644
> >>>>--- a/kernel/locking/rwsem-xadd.c
> >>>>+++ b/kernel/locking/rwsem-xadd.c
> >>>>@@ -285,7 +285,7 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
> >>>>  static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
> >>>>  {
> >>>>  	struct task_struct *owner;
> >>>>-	bool on_cpu = false;
> >>>>+	bool on_cpu = true;	/* Assume spinnable unless proved not to be */
> >>>Hi,
> >>>
> >>>So "on_cpu = true" was recently converted to "on_cpu = false" in order
> >>>to address issues such as a 5x performance regression in the xfs_repair
> >>>workload that was caused by the original rwsem optimistic spinning code.
> >>>
> >>>However, patch 4 in this patchset does address some of the problems with
> >>>spinning when there are readers. CC'ing Dave Chinner, who did the
> >>>testing with the xfs_repair workload.
> >>>
> >>This patch set enables proper reader spinning and so the problem
> >>that we see with xfs_repair workload should go away. I should have
> >>this patch after patch 4 to make it less confusing. BTW, patch 3 can
> >>significantly reduce spinlock contention in rwsem. So I believe the
> >>xfs_repair workload should run faster with this patch than both 3.15
> >>and 3.16.
> >I see lots of handwaving. I documented the test I ran when I
> >reported the problem so anyone with a 16p system and an SSD can
> >reproduce it. I don't have the bandwidth to keep track of the lunacy
> >of making locks scale these days - that's what you guys are doing.
> >
> >I gave you a simple, reliable workload that is extremely sensitive
> >to rwsem perturbations, so you should be adding it to your
> >regression tests rather than leaving it for others to notice you
> >screwed up....
> >
> >Cheers,
> >
> >Dave.
> 
> If you can send me a rwsem workload that I can use for testing
> purpose, it will be highly appreciated.

<create sparse vm image file of 500TB on ssd with XFS on it>
xfs_io -f -c "truncate 500t" -c "extsize 1m" /path/to/vm/image/file

<start 16p/16GB RAM vm with image file configured as:
-drive file=/path/to/vm/image/file,if=virtio,cache=none >

In vm:

Download and build fs_mark from here:

git://oss.sgi.com/dgc/fs_mark

Download and install xfsprogs v3.2.1 from here:

git://oss.sgi.com/xfs/cmds/xfsprogs.git tags/v3.2.1

Set up the target filesystem:

# mkfs.xfs -f -m "crc=1,finobt=1" /dev/vda
# mount -o logbsize=262144,nobarrier /dev/vda /mnt/scratch


Run:

# fs_mark  -D  10000  -S0  -n  50000  -s  0  -L  32 \
        -d  /mnt/scratch/0  -d  /mnt/scratch/1 \
        -d  /mnt/scratch/2  -d  /mnt/scratch/3 \
        -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
        -d  /mnt/scratch/6  -d  /mnt/scratch/7 \
        -d  /mnt/scratch/8  -d  /mnt/scratch/9 \
        -d  /mnt/scratch/10  -d  /mnt/scratch/11 \
        -d  /mnt/scratch/12  -d  /mnt/scratch/13 \
        -d  /mnt/scratch/14  -d  /mnt/scratch/15
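
If typing out sixteen -d arguments is error-prone, the directory list can be
generated instead. This is only a sketch: fs_mark_dirs is a hypothetical
helper name, not part of fs_mark, and the script prints the command rather
than running it so it can be checked against the invocation above first.

```shell
#!/bin/sh
# Hypothetical helper (not part of fs_mark): print
# "-d <base>/0 -d <base>/1 ..." for <count> directories.
fs_mark_dirs() {
    base=$1
    count=$2
    i=0
    args=""
    while [ "$i" -lt "$count" ]; do
        args="$args -d $base/$i"
        i=$((i + 1))
    done
    # Strip the leading space before printing.
    printf '%s\n' "${args# }"
}

# Print (do not run) the full command so it can be inspected first.
echo "fs_mark -D 10000 -S0 -n 50000 -s 0 -L 32 $(fs_mark_dirs /mnt/scratch 16)"
```

Piping the printed line to sh inside the vm runs the same command as above.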

If you've got everything set up right, that should run at around
200,000-250,000 file creates/s. When finished, unmount and run:

# xfs_repair -o bhash=500000 /dev/vda

And that should spend quite a long while pounding on the mmap_sem
until the userspace buffer cache stops growing.

I just ran the above on 3.16 and saw this from perf:

  37.30%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - _raw_spin_unlock_irqrestore
      - 62.00% rwsem_wake
         - call_rwsem_wake
            + 83.52% sys_mprotect
            + 16.23% __do_page_fault
      + 35.15% try_to_wake_up
      + 0.96% update_blocked_averages
      + 0.61% pagevec_lru_move_fn
-  23.35%  [kernel]  [k] _raw_spin_unlock_irq
   - _raw_spin_unlock_irq
      + 51.37% finish_task_switch
      + 39.37% rwsem_down_write_failed
      + 8.49% rwsem_down_read_failed
        0.62% run_timer_softirq
+   5.22%  [kernel]  [k] native_read_tsc
+   3.89%  [kernel]  [k] rwsem_down_write_failed
.....
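
To read the nested numbers as a share of all samples, multiply each child
percentage by its parent's. This assumes the callchain percentages are
relative to the parent entry (perf's "fractal" callchain mode); the children
under each symbol above sum to roughly 100%, which is consistent with that,
but the mode is an assumption here. abs_share is a hypothetical helper name.

```shell
#!/bin/sh
# abs_share PARENT_PCT CHILD_PCT
# Share of all samples for a child entry whose percentage is relative
# to its parent (assumed fractal callchain mode).
abs_share() {
    awk -v p="$1" -v c="$2" 'BEGIN { printf "%.2f\n", p * c / 100 }'
}

abs_share 37.30 62.00   # rwsem_wake under _raw_spin_unlock_irqrestore   -> 23.13
abs_share 23.35 39.37   # rwsem_down_write_failed under _raw_spin_unlock_irq -> 9.19
abs_share 23.35 8.49    # rwsem_down_read_failed under _raw_spin_unlock_irq  -> 1.98
```

Under that assumption, about 23% of all samples are the spinlock release in
rwsem_wake alone, and roughly another 11% are the unlock in the rwsem
down-failed slow paths.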

Cheers,

Dave.

-- 
Dave Chinner
david@...morbit.com