Message-ID: <20080218123717.GW23197@kernel.dk>
Date:	Mon, 18 Feb 2008 13:37:17 +0100
From:	Jens Axboe <jens.axboe@...cle.com>
To:	"Alan D. Brunelle" <Alan.Brunelle@...com>
Cc:	linux-kernel@...r.kernel.org, npiggin@...e.de, dgc@....com,
	arjan@...ux.intel.com
Subject: Re: IO queueing and complete affinity w/ threads: Some results

On Thu, Feb 14 2008, Alan D. Brunelle wrote:
> Taking a step back, I went to a very simple test environment:
> 
> o  4-way IA64
> o  2 disks (on separate RAID controllers, handled by separate ports on the same FC HBA, so they generate different IRQs).
> o  Using write-cached tests - keep all IOs inside the RAID controller's cache, so there are no perturbations due to platter accesses.
> 
> Basically:
> 
> o  CPU 0 handled IRQs for /dev/sds
> o  CPU 2 handled IRQs for /dev/sdaa
> 
> We placed an IO generator on CPU1 (for /dev/sds) and CPU3 (for /dev/sdaa). The IO generator performed 4KiB sequential direct AIOs in a very small range (2MB - well within the controller cache on the external storage device). We have found that this is a simple way to maximize throughput, and thus to watch the system for effects without worrying about seek & other platter-induced issues. Each test took about 6 minutes to run (each ran a specific amount of IO, so we could compare & contrast system measurements).
> 
> First: overall performance
> 
> 2.6.24 (no patches)              : 106.90 MB/sec
> 
> 2.6.24 + original patches + rq=0 : 103.09 MB/sec
>                             rq=1 :  98.81 MB/sec
> 
> 2.6.24 + kthreads patches + rq=0 : 106.85 MB/sec
>                             rq=1 : 107.16 MB/sec
> 
> So, the kthreads patches work much better here - on par with or better than straight 2.6.24. I also ran Caliper (akin to Oprofile, but proprietary and ia64-specific, sorry), and looked at the cycles used. On ia64, back-end bubbles are deadly, and can be caused by cache misses &c. Looking at the gross data:
> 
> Kernel                                CPU_CYCLES       BACK END BUBBLES  100.0 * (BEB/CC)
> --------------------------------   -----------------  -----------------  ----------------
> 2.6.24 (no patches)              : 2,357,215,454,852    231,547,237,267   9.8%
> 
> 2.6.24 + original patches + rq=0 : 2,444,895,579,790    242,719,920,828   9.9%
>                             rq=1 : 2,551,175,203,455    148,586,145,513   5.8%
> 
> 2.6.24 + kthreads patches + rq=0 : 2,359,376,156,043    255,563,975,526  10.8%
>                             rq=1 : 2,350,539,631,362    208,888,961,094   8.9%
> 
> For both the original & kthreads patches we see a /significant/ drop in bubbles when setting rq=1 over rq=0. This shows up as extra CPU cycles available (not spent in %system) - a graph is provided at http://free.linux.hp.com/~adb/jens/cached_mps.png - it shows stats extracted from mpstat run in conjunction with the IO runs.
> 
> Combining %sys & %soft IRQ, we see:
> 
> Kernel                              % user     % sys   % iowait   % idle
> --------------------------------   --------  --------  --------  --------
> 2.6.24 (no patches)              :   0.141%   10.088%   43.949%   45.819%
> 
> 2.6.24 + original patches + rq=0 :   0.123%   11.361%   43.507%   45.008%
>                             rq=1 :   0.156%    6.030%   44.021%   49.794%
> 
> 2.6.24 + kthreads patches + rq=0 :   0.163%   10.402%   43.744%   45.686%
>                             rq=1 :   0.156%    8.160%   41.880%   49.804%
> 
> The good news (I think) is that even with rq=0 the kthreads patches are getting on-par performance w/ 2.6.24, so the default case should be ok...
> 
> I've only done a few runs by hand with this - these results are from one representative run out of the bunch - but at least this (I believe) shows what this patch stream intends to do: optimize placement of IO completion handling to minimize cache & TLB disruptions. Freeing up cycles in the kernel is always helpful! :-)
> 
> I'm going to try similar runs on an AMD64 w/ Oprofile and see what results I get there... (BTW: I'll be dropping testing of the original patch sequence; the kthreads patches look better in general - both in terms of code & results, coincidence?)
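
For reference, the workload described above (4KiB O_DIRECT writes, issued
sequentially but confined to a 2MB region so the RAID controller cache
absorbs everything) can be approximated with a small libaio program along
these lines. This is only a rough sketch - the device path, iteration
count and queue depth of 1 are assumptions for illustration, not details
of the actual generator:

/*
 * Minimal approximation of the cache-bound workload: 4KiB O_DIRECT
 * writes, sequential but wrapping inside a 2MB region.
 *
 * Build: gcc -O2 -o cachegen cachegen.c -laio
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BS	4096			/* 4KiB per IO */
#define RANGE	(2 * 1024 * 1024)	/* stay inside a 2MB region */

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/sds";	/* assumed path */
	struct io_event ev;
	struct iocb cb, *cbp = &cb;
	io_context_t ctx = 0;
	long long off = 0;
	void *buf;
	long i;
	int fd;

	fd = open(dev, O_WRONLY | O_DIRECT);
	if (fd < 0 || io_setup(1, &ctx) < 0) {
		perror("setup");
		return 1;
	}
	if (posix_memalign(&buf, BS, BS))
		return 1;
	memset(buf, 0, BS);

	/* queue depth 1: submit one 4KiB write, reap it, advance, wrap at 2MB */
	for (i = 0; i < 1000000; i++) {
		io_prep_pwrite(&cb, fd, buf, BS, off);
		if (io_submit(ctx, 1, &cbp) != 1)
			break;
		if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
			break;
		off = (off + BS) % RANGE;
	}

	io_destroy(ctx);
	free(buf);
	close(fd);
	return 0;
}

The real generator presumably keeps more IOs in flight; the point here is
just the 4KiB block size and the wrapping offset within 2MB.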

Alan, thanks for your very nice testing efforts on this! It's very
encouraging to see that the kthread-based approach is even faster than
the softirq one, since the code is indeed much simpler and doesn't
require any arch modifications. So I'd agree that just testing the
kthread approach is the best way forward, and that scrapping the remote
softirq trigger stuff is sanest.

My main worry with the current code is the ->lock in the per-cpu
completion structure. If we do a lot of migrations to other CPUs, then
that cacheline will be bounced around. But we'll be dirtying the list in
that per-cpu structure anyway, so playing games to make that part
lockless is probably pretty pointless. So if you get around to testing
on bigger SMP boxes, that'd be an interesting thing to look out for. So
far it looks like a net win with more idle time; the benefit of keeping
the rq completion queue local must be outweighing the cost of diddling
with the per-cpu data.
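
To make that concern concrete, the per-cpu completion structure being
discussed has roughly this shape - a sketch only, the names and the
helper below are illustrative and not the actual kthreads patch:

#include <linux/blkdev.h>
#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

struct blk_cpu_done {
	spinlock_t		lock;	/* the ->lock in question */
	struct list_head	list;	/* completed requests queued here */
	struct task_struct	*task;	/* this CPU's completion kthread */
};

static DEFINE_PER_CPU(struct blk_cpu_done, blk_cpu_done);

/*
 * Called from the completing CPU (typically the one taking the IRQ):
 * queue the request on the chosen CPU's list and kick its kthread.
 * When ccpu != smp_processor_id(), both the ->lock and ->list
 * cachelines bounce between the two CPUs - the cost being weighed
 * against keeping the rq completion work local to the submitter.
 */
static void blk_cpu_complete_request(struct request *rq, int ccpu)
{
	struct blk_cpu_done *bcd = &per_cpu(blk_cpu_done, ccpu);
	unsigned long flags;

	spin_lock_irqsave(&bcd->lock, flags);
	list_add_tail(&rq->queuelist, &bcd->list);
	spin_unlock_irqrestore(&bcd->lock, flags);

	wake_up_process(bcd->task);
}

On a bigger SMP box with lots of cross-CPU completions, that
spin_lock_irqsave/list_add_tail pair is where the lock and list
cachelines would ping-pong.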

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
