Message-ID: <4AF2EA2D.6040301@gmail.com>
Date: Thu, 05 Nov 2009 16:07:25 +0100
From: Eric Dumazet <eric.dumazet@...il.com>
To: Andi Kleen <andi@...stfloor.org>
CC: Octavian Purdila <opurdila@...acom.com>,
Lucian Adrian Grijincu <lgrijincu@...acom.com>,
netdev@...r.kernel.org
Subject: Re: [RFC] [PATCH] udp: optimize lookup of UDP sockets by including
destination address in the hash key
Andi Kleen wrote:
>> I assume the cache is cold or even on another CPU (worst case), when dealing with
>> 100,000+ sockets or so...
>
> A cache hit in another CPU's cache is actually typically significantly
> faster than a DRAM access (unless you're talking about a very large NUMA
> system with a remote CPU far away)
Even if the data is dirty in the remote CPU's cache?
I'm not talking about shared data. (If data is shared, the workload mostly fits in the caches.)
>> If the workload fits in one CPU's cache/registers, we don't mind taking one
>> or two cache lines per object, obviously.
>
> It's more like part of your workload needs to fit.
>
> For example if you use a tree and the higher levels fit into
> the cache, having a few levels in the tree is (approximately) free.
>
> That's why I'm not always fond of large hash tables. They pretty
> much guarantee a lot of cache misses under high load, because
> they have little locality.
We already had this discussion, Andi, and you know some servers handle 1,000,000+
sockets and 100,000+ frames per second spread over XX,XXX different flows. A binary tree
over a million entries means about log2(1,000,000) ~= 20 accesses before reaching the
target, and only the first 5 or 6 levels stay in cache. The machine is barely usable.
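
To put rough numbers on that (a back-of-the-envelope sketch in plain C; the socket
count, the number of cached tree levels and the slot count below are assumptions
taken from this discussion, not measurements):

/* Rough cost model for the figures discussed in this thread.
 * All numbers are assumptions for illustration, not measurements. */
#include <math.h>
#include <stdio.h>

int main(void)
{
	double sockets     = 1000000.0;   /* ~1,000,000 sockets (assumed)            */
	double cached_lvls = 6.0;         /* top tree levels that stay cached (assumed) */
	double slots       = 2000000.0;   /* hash table with 2,000,000 slots         */

	/* Balanced binary tree: depth ~ log2(N) ~ 20 levels,
	 * of which only the first few are cache hot. */
	double tree_depth  = ceil(log2(sockets));
	double tree_misses = tree_depth - cached_lvls;

	/* Hash table: one read for the bucket head plus the expected
	 * chain walk at load factor N/slots. */
	double load        = sockets / slots;
	double hash_misses = 1.0 + load;

	printf("tree: ~%.0f levels, ~%.0f cold cache lines per lookup\n",
	       tree_depth, tree_misses);
	printf("hash: load factor %.2f, ~%.1f cold cache lines per lookup\n",
	       load, hash_misses);
	return 0;
}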
A hash table with 2,000,000 slots gives one or two accesses before reaching the target,
and RCU is trivial with hash tables.
Btrees are fine for general-purpose workloads, but RCU is more complex with them.
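
For illustration, a minimal user-space sketch of that kind of lookup, with the
destination address mixed into the hash key so sockets sharing a port spread across
buckets. Every name and the mixing function below are hypothetical; the in-kernel
code uses jhash-based helpers and RCU-protected lists instead.

/* Minimal user-space sketch of a UDP-style lookup where the
 * destination address is part of the hash key.  Structure and
 * function names are hypothetical, for illustration only. */
#include <stdint.h>
#include <stdio.h>

#define HASH_SLOTS (1u << 21)           /* ~2,000,000 buckets */

struct sock_entry {
	uint32_t daddr;                 /* local (destination) address */
	uint16_t dport;                 /* local (destination) port    */
	struct sock_entry *next;        /* chain within a bucket       */
};

static struct sock_entry *buckets[HASH_SLOTS];

/* Mix address and port so sockets sharing a port still spread across
 * buckets (simple multiplicative mix, not the kernel's jhash). */
static uint32_t hash_addr_port(uint32_t daddr, uint16_t dport)
{
	uint32_t h = daddr * 2654435761u;  /* Knuth multiplicative hash */

	h ^= dport;
	return h & (HASH_SLOTS - 1);
}

static struct sock_entry *lookup(uint32_t daddr, uint16_t dport)
{
	/* One read for the bucket head, then a (short) chain walk:
	 * the "one or two accesses" case when slots >= sockets. */
	struct sock_entry *e = buckets[hash_addr_port(daddr, dport)];

	for (; e; e = e->next)
		if (e->daddr == daddr && e->dport == dport)
			return e;
	return NULL;
}

int main(void)
{
	struct sock_entry s = { .daddr = 0x0a000001, .dport = 53, .next = NULL };

	buckets[hash_addr_port(s.daddr, s.dport)] = &s;
	printf("found: %p\n", (void *)lookup(s.daddr, s.dport));
	return 0;
}

With the slot count on the order of the socket count, chains stay around one entry,
which is where the one-or-two-accesses figure comes from.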