[<prev] [next>] [day] [month] [year] [list]
Message-ID: <4FFBFCA9.2060307@att.net>
Date: Tue, 10 Jul 2012 04:58:01 -0500
From: Daniel Santos <danielfsantos@....net>
To: Andrew Morton <akpm@...ux-foundation.org>,
Christopher Li <sparse@...isli.org>,
David Daney <david.daney@...ium.com>,
David Howells <dhowells@...hat.com>,
David Rientjes <rientjes@...gle.com>,
Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>,
"H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...e.hu>,
Ingo Molnar <mingo@...nel.org>, Joe Perches <joe@...ches.com>,
Konstantin Khlebnikov <khlebnikov@...nvz.org>,
linux-doc@...r.kernel.org, linux-sparse@...r.kernel.org,
LKML <linux-kernel@...r.kernel.org>,
Paul Gortmaker <paul.gortmaker@...driver.com>,
Paul Turner <pjt@...gle.com>,
Pavel Pisa <pisa@....felk.cvut.cz>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Richard Weinberger <richard@....at>,
Rob Landley <rob@...dley.net>,
Steven Rostedt <rostedt@...dmis.org>,
Suresh Siddha <suresh.b.siddha@...el.com>
Subject: Generic Red-Black Trees: preliminary performance results
I've completed some rudimentary test code. It is designed to compile
both in user & kernel space but only currently compiles in userland. I
ran this on 9 different versions of gcc, all on x86_64 with CFLAGS="-O2
-g3 -pipe -march=k8". The below summary data shows the % increase in
time consumed (decrease in performance) using my generic red-black trees
over hand-coded functions for insertion.
gcc ver % decrease in speed
4.7.1 -5.39%
4.6.2 2.60%
4.5.3 18.07%
4.4.6 20.52%
4.3.6 13.53%
4.2.4 11.84%
4.1.2 16.36%
4.0.4 35.70%
3.4.6 47.28%
I don't understand why the generic code ran faster than hand-coded on
gcc 4.7.1 as I haven't examined the assembly output yet. However, I'm
pretty certain I understand why it was 2% slower on 4.6. This has to do
with an optimization flaw. In the hand-coded insert/find functions, I
used the "if (a->key > b->key) .. else if (a->key < b->key)" construct,
where as the generic code calculates a diff and compares that against
zero and 4.6.2 is adding an unnecessary cmp instruction:
3f2: 8b 48 18 mov 0x18(%rax),%ecx
3f5: 8b 7a 18 mov 0x18(%rdx),%edi
3f8: 48 29 f9 sub %rdi,%rcx
3fb: 48 83 f9 00 cmp $0x0,%rcx
3ff: 7f df jg 3e0 <grbtest_insert+0xb0>
401: 0f 84 c9 00 00 00 je 4d0 <grbtest_insert+0x1a0>
The test configuration was for a tree that tracks both leftmost,
rightmost and count and uses unique keys where the insert function
replaces an existing object. These tests weren't ideal. While I
allocated 4096 objects (32 bytes each) to stick in my tree, I used 0xff
for a key mask, so only 256 object would be in the tree at once and I
didn't notice this until I was most of the way through the tests. My
intention was to run the test with a data set small enough to fit into
the L3 cache, so as to reduce overhead from memory access and isolate
the actual differences in the algorithm.
When I get this all cleaned up, I'll release another patch set with the
test code added. (This also automates correctness tests for find,
insert, find_near and insert_near). Also, I'm going to experiment with
branch prediction to see if I can squeeze a little more performance out
of the older compilers.
Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists