Message-ID: <4BA27F72.40901@iki.fi>
Date: Thu, 18 Mar 2010 21:30:58 +0200
From: Timo Teräs <timo.teras@....fi>
To: Herbert Xu <herbert@...dor.apana.org.au>
CC: netdev@...r.kernel.org
Subject: Re: [PATCH] xfrm: cache bundle lookup results in flow cache
Timo Teräs wrote:
> Herbert Xu wrote:
>> On Wed, Mar 17, 2010 at 04:16:21PM +0200, Timo Teräs wrote:
>>> The problem is if I have multipoint gre1 and policy that says
>>> "encrypt all gre in transport mode".
>>>
>>> Thus for each public address, I get one bundle. But the
>>> xfrm_lookup() is called for each packet because ipgre_tunnel_xmit()
>>> calls ip_route_output_key() on per-packet basis.
>>>
>>> For my use-case it makes a huge difference.
>>
>> But if your traffic switches between those tunnels on each packet,
>> we're back to square one, right?
>
> Not to my understanding. Why would it change?

Here's how things go, to my understanding.

When we are forwarding packets, each packet goes through
__xfrm_route_forward(). Or if we are sending from ip_gre (like in
my case), the packets go through ip_route_output_key().
Both of these call xfrm_lookup() to get the xfrm_dst instead
of the real rtable dst. This is done on a per-packet basis.

Basically, the xfrm_dst is never kept referenced directly
on either code path. Instead the xfrm_lookup() has to be redone
for every packet.
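
In toy terms, both paths do something like this for every single
packet (a user-space sketch; all names here are made up for
illustration, not the real kernel types or call signatures):

/* Toy sketch of the per-packet output path. */
struct toy_flow   { unsigned int daddr; };
struct toy_rtable { unsigned int gw; };
struct toy_xdst   { struct toy_rtable rt; };

static struct toy_rtable toy_route_lookup(const struct toy_flow *fl)
{
        /* ~ ip_route_output_key(): plain rtable dst for the flow */
        struct toy_rtable rt = { .gw = fl->daddr };
        return rt;
}

static struct toy_xdst toy_xfrm_lookup(struct toy_rtable rt)
{
        /* ~ xfrm_lookup(): wraps the route into an xfrm bundle */
        struct toy_xdst xdst = { .rt = rt };
        return xdst;
}

static void toy_send_packet(const struct toy_flow *fl)
{
        /* repeated from scratch for every packet; the resulting
         * bundle is not kept referenced anywhere between packets */
        struct toy_xdst xdst = toy_xfrm_lookup(toy_route_lookup(fl));
        (void)xdst;
}
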
Now, these created xfrm_dst's get cached in policy->bundles with a
ref count of zero, and they are not deleted until garbage
collection is required. That's how xfrm_find_bundle() tries to
reuse them (but it does an O(n) lookup).
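
To make the cost concrete, the reuse path is essentially a linear
list walk, roughly like the following (same kind of toy sketch;
the struct layout is invented, only the O(n) shape matters):

#include <stddef.h>

/* Invented stand-ins for the cached bundle list; not the real
 * xfrm_dst/xfrm_policy layout. */
struct toy_bundle {
        struct toy_bundle *next;        /* ~ policy->bundles list */
        unsigned int daddr;             /* destination it was built for */
        int stale;                      /* ~ stale_bundle() said "dead" */
};

struct toy_policy {
        struct toy_bundle *bundles;
};

/* ~ xfrm_find_bundle(): walks every cached bundle until one matches,
 * so the cost grows linearly with the number of destinations. */
static struct toy_bundle *toy_find_bundle(struct toy_policy *pol,
                                          unsigned int daddr)
{
        struct toy_bundle *b;

        for (b = pol->bundles; b != NULL; b = b->next) {
                if (!b->stale && b->daddr == daddr)
                        return b;
        }
        return NULL;                    /* miss: a new bundle is built */
}
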
In the gre+esp case (this should apply to any forward path too)
the caching of bundles speeds up the output path considerably,
as there can be a lot of xfrm_dst's. Especially if it's a
wildcard transport mode policy, which basically gets a bundle
for each destination. So there can be a lot of xfrm_dst's, all
valid, since they refer to a unique xfrm_state on a
per-destination basis.

On top of that, since an xfrm_dst needs to be regenerated when
the underlying ipv4 rtable entry has expired, there can be even
more bundles. So the linear search is a major performance
killer.
Btw, it looks like the xfrm_dst garbage collection is broken.
Bundles are only garbage collected when a network device goes
down, or when dst_alloc() calls the collector after the gc
threshold is exceeded. Since the gc threshold became dynamic not
long ago, it can be very big. This causes stale xfrm_dst's to be
kept alive, and what is worse, their inner rtable dst is kept
referenced. And since those inner dst's were expired and
dst_free'd while still referenced, the dst core's "free still
referenced dst's" garbage loop walks that list over and over
again and kills the whole system's performance when the list
grows long.

I think as a minimum we should run a stale_bundle() check on all
xfrm_dst's every n minutes or so.
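
In toy terms (reusing the structs from the sketch above; the
helper and the "every n minutes" trigger are invented), the sweep
I have in mind would look roughly like:

#include <stdlib.h>

/* Periodic sweep, to be run from a timer every n minutes or so:
 * unlink and free every bundle that went stale instead of waiting
 * for device-down events or dst_alloc() pressure.  In the real
 * code this is also where the inner rtable dst reference would
 * finally be dropped. */
static void toy_bundle_gc(struct toy_policy *pol)
{
        struct toy_bundle **bp = &pol->bundles;

        while (*bp != NULL) {
                struct toy_bundle *b = *bp;

                if (b->stale) {
                        *bp = b->next;
                        free(b);
                } else {
                        bp = &b->next;
                }
        }
}
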
>>> Then we cannot maintain policy use time. But if it's not a
>>> requirement, we could drop the policy from cache.
>>
>> I don't see why we can't maintain the policy use time if we did
>> this, all you need is a back-pointer from the top xfrm_dst.
>
> Sure.

Actually no. As the pmtu case showed, it's more likely that the
xfrm_dst needs to be regenerated while the policy stays the same,
since the policy db isn't touched that often. If we cache them
separately, we can avoid doing the policy lookup (which is also
O(n)) most of the time. Also, the cache entry validation
currently needs to check the policy's bundles_genid before it is
allowed to touch the xfrm_dst. Otherwise we would have to keep a
global bundle_genid, and we'd lose the parent pointer on a cache
miss.
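
Roughly what I mean, with invented field names (the real code
would of course use the flow cache entry and xfrm structs, this
only shows the check):

/* Invented cache entry layout; only the genid comparison and the
 * separate policy back-pointer matter here. */
struct toy_cached_policy {
        unsigned int bundles_genid;     /* bumped when bundles are flushed */
        unsigned long use_time;
};

struct toy_cache_entry {
        struct toy_cached_policy *pol;  /* kept even when the bundle dies */
        struct toy_bundle *xdst;        /* cached bundle, may go stale */
        unsigned int genid;             /* pol->bundles_genid at insert */
};

/* Touch the cached bundle only if the policy's bundles have not
 * been flushed since the entry was made; on a miss we still have
 * pol, so the use time can be updated and the O(n) policy lookup
 * skipped. */
static struct toy_bundle *toy_cache_get_bundle(struct toy_cache_entry *e)
{
        if (e->xdst != NULL &&
            e->genid == e->pol->bundles_genid &&
            !e->xdst->stale)
                return e->xdst;
        return NULL;
}
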
Caching bundles would be another win too: if we do cache entries
like this, we could track how many xfrm_dst cache misses we've
had and use that to decide when to trigger the xfrm_dst garbage
collector instead of (or in addition to) a fixed timer.
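
As a rough sketch on top of the same toy code (the threshold is
invented):

#define TOY_MISS_GC_THRESHOLD  64      /* made-up trigger point */

static unsigned int toy_bundle_misses;

/* Called whenever a flow cache entry had to rebuild its bundle;
 * enough misses suggest a pile of dead bundles worth sweeping. */
static void toy_note_bundle_miss(struct toy_policy *pol)
{
        if (++toy_bundle_misses >= TOY_MISS_GC_THRESHOLD) {
                toy_bundle_misses = 0;
                toy_bundle_gc(pol);     /* the sweep sketched above */
        }
}
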
- Timo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html