linux-kernel - Re: [RFC]numa: improve I/O performance by optimizing numa interleave allocation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <1322019383.22361.346.camel@sli10-conroe>
Date:	Wed, 23 Nov 2011 11:36:23 +0800
From:	Shaohua Li <shaohua.li@...el.com>
To:	Andi Kleen <ak@...ux.intel.com>
Cc:	lkml <linux-kernel@...r.kernel.org>, linux-mm <linux-mm@...ck.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Jens Axboe <axboe@...nel.dk>, Christoph Lameter <cl@...ux.com>,
	"lee.schermerhorn@...com" <lee.schermerhorn@...com>
Subject: Re: [RFC]numa: improve I/O performance by optimizing numa
 interleave allocation

On Mon, 2011-11-21 at 09:39 +0800, Shaohua Li wrote:
> On Sat, 2011-11-19 at 01:30 +0800, Andi Kleen wrote:
> > On Fri, Nov 18, 2011 at 03:12:12PM +0800, Shaohua Li wrote:
> > > If mem plicy is interleaves, we will allocated pages from nodes in a round
> > > robin way. This surely can do interleave fairly, but not optimal.
> > > 
> > > Say the pages will be used for I/O later. Interleave allocation for two pages
> > > are allocated from two nodes, so the pages are not physically continuous. Later
> > > each page needs one segment for DMA scatter-gathering. But maxium hardware
> > > segment number is limited. The non-continuous pages will use up maxium
> > > hardware segment number soon and we can't merge I/O to bigger DMA. Allocating
> > > pages from one node hasn't such issue. The memory allocator pcp list makes
> > > we can get physically continuous pages in several alloc quite likely.
> > 
> > FWIW it depends a lot on the IO hardware if the SG limitation
> > really makes a measurable difference for IO performance. I saw some wins from 
> > clustering using the IOMMU before, but that was a long time ago. I wouldn't 
> > consider it a truth without strong numbers, and then also only
> > for that particular device measured.
> > 
> > My understanding is that modern IO devices like NHM Express will
> > be faster at large SG lists.
> This is a LSI SAS1068E HBA card attaching some hard disks. The
> clustering has real benefit here. I/O throughput increases 3% or so.
> Not sure about NHM Express, wondering why large SG list could be faster.
> doesn't large SG means large DMA descriptor?
> 
> > > So can we make both interleave fairness and continuous allocation happy?
> > > Simplily we can adjust the round robin algorithm. We switch to another node
> > > after several (N) allocation happens. If N isn't too big, we can still get
> > > fair allocation. And we get N continuous pages. I use N=8 in below patch.
> > > I thought 8 isn't too big for modern NUMA machine. Applications which use
> > > interleave are unlikely run short time, so I thought fairness still works.
> > 
> > It depends a lot on the CPU access pattern.
> > 
> > Some workloads seem to do reasonable well with 2MB huge page interleaving.
> > But others actually prefer the cache line interleaving supplied by 
> > the BIOS.
> > 
> > So you can have a trade off between IO and CPU performance.
> > When in doubt I usually opt for CPU performance by default.
> Can you elaborate this more? the cache line interleaving can only be
> supplied by BIOS. OS can provide N*PAGE_SIZE interleave. I'm wondering
> what's the difference for example a 4k or 8k interleave for CPU
> performance. Actually if adjacent pages interleaved in two nodes could
> be in the same coloring, while two adjacent pages allocated from one
> node not. So clustering could be more cache efficient from coloring
> point of view.
> 
> > I definitely wouldn't make it default, but if there are workloads
> > that benefits a lot it could be an additional parameter to the
> > interleave policy.
> Christoph suggested the same way. the problem is we need change the API,
> right? And how are users supposed to use it? It would be difficult to
> determine the correct parameter.
> 
> If 8 pages clustering is too big, maybe we can use small. I guess a 2
> pages clustering is a big win too.
> 
> And I didn't change the allocation with a VMA case, which is supposed to
> be used for anonymous pages.
I tried a 2 pages clustering, it has the same effect like 8 page
clustering in my test environment.
would making the clustering a config option or sysctl be better?

Thanks,
Shaohua


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/