Message-ID: <20090416144748.GB6924@linux.vnet.ibm.com>
Date: Thu, 16 Apr 2009 07:47:48 -0700
From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To: Patrick McHardy <kaber@...sh.net>
Cc: Eric Dumazet <dada1@...mosbay.com>,
Stephen Hemminger <shemminger@...tta.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
David Miller <davem@...emloft.net>, jeff.chua.linux@...il.com,
paulus@...ba.org, mingo@...e.hu, laijs@...fujitsu.com,
jengelh@...ozas.de, r000n@...0n.net, linux-kernel@...r.kernel.org,
netfilter-devel@...r.kernel.org, netdev@...r.kernel.org,
benh@...nel.crashing.org
Subject: Re: [PATCH] netfilter: use per-cpu spinlock and RCU (v5)
On Thu, Apr 16, 2009 at 03:53:15PM +0200, Patrick McHardy wrote:
> Eric Dumazet wrote:
>> Stephen Hemminger a écrit :
>>> This is an alternative version of ip/ip6/arp tables locking using
>>> per-cpu locks. This avoids the overhead of synchronize_net() during
>>> update but still removes the expensive rwlock in earlier versions.
>>>
>>> The idea for this came from an earlier version done by Eric Dumazet.
>>> Locking is done per-cpu, the fast path locks on the current cpu
>>> and updates counters. The slow case involves acquiring the locks on
>>> all cpu's. This version uses RCU for the table->base reference
>>> but per-cpu-lock for counters.
>
>> This version is a regression over 2.6.2[0-9], because of two points:
>> 1) Many more atomic ops:
>> Because of the additional
>>> + spin_lock(&__get_cpu_var(ip_tables_lock));
>>> ADD_COUNTER(e->counters, ntohs(ip->tot_len), 1);
>>> + spin_unlock(&__get_cpu_var(ip_tables_lock));
>> added on each counter update.
>> On many setups, each packet coming in or out of the machine has
>> to update between 2 and 20 rule counters. So to avoid *one* atomic op
>> in read_unlock(), this v4 version adds 2 to 20 atomic ops...
>
> I agree, this seems to be a step backwards.
>
>> I still don't see the problem with going from the previous version
>> (2.6.2[0-8]), which had a central rwlock that hurt performance on SMP
>> because of cache-line ping-pong, to a solution with one rwlock per cpu.
>> We wanted to reduce the cache-line ping-pong first. This *is* the
>> hurting point, by an order of magnitude.
>
> Dave doesn't seem to like the rwlock approach.
Well, we don't really need an rwlock, especially given that we really
don't want two "readers" incrementing the same counter concurrently.
A safer approach would be to maintain a flag in the task structure
tracking which (if any) of the per-CPU locks you hold. Also maintain
a recursion-depth counter. If the flag says you don't already hold
the lock, set it and acquire the lock. Either way, increment the
recursion-depth counter:
	if (current->netfilter_lock_held != cur_cpu) {
		BUG_ON(current->netfilter_lock_held != CPU_NONE);
		spin_lock(per_cpu(..., cur_cpu));
		current->netfilter_lock_held = cur_cpu;
	}
	current->netfilter_lock_nesting++;
And reverse the process to unlock:
	if (--current->netfilter_lock_nesting == 0) {
		spin_unlock(per_cpu(..., cur_cpu));
		current->netfilter_lock_held = CPU_NONE;
	}
> I don't see a way to
> do anything asynchronously like call_rcu() to improve this, so to
> bring up one of Stephen's suggestions again:
>
>>> > * use on_each_cpu() somehow to do grace periood?
>
> We could use this to replace the counters, presuming it is
> indeed faster than waiting for a RCU grace period.
One way to accomplish this is to take Mathieu Desnoyers's user-level
RCU implementation and drop it into the kernel, replacing the POSIX
signal handling with on_each_cpu(), smp_call_function(), or whatever.
>> 2) Second problem: potential OOM
>> About freeing old rules with call_rcu() and/or schedule_work(): this is
>> going to OOM pretty fast on small appliances with basic firewall setups
>> loading rules one by one, as done by the original topic reporter.
>> We had reports from guys using Linux with 4MB of available RAM (French
>> provider free.fr on their appliance box), and we had to use the
>> SLAB_DESTROY_BY_RCU thing on conntrack to avoid OOM for their setups.
>> We don't want to use call_rcu() and queue 100 or 200 vfree().
>
> Agreed.
This is not a real problem -- it can be handled by doing a
synchronize_rcu() every so often, as noted in a prior email elsewhere in
this thread:
	call_rcu(...);
	if (++count > 50) {
		synchronize_rcu();
		count = 0;
	}
This choice of constant would reduce the grace-period pain to 2% of the
full effect, which should be acceptable, at least if I remember the
original problem report of 0.2 seconds growing to 6.0 seconds -- this
would give you:
	(6.0 - 0.2)/50 + 0.2 = 0.316 seconds
I would argue that 100 milliseconds is an OK penalty for a deprecated
feature. But of course the per-CPU lock approach should avoid even that
penalty, albeit at some per-packet penalty. However, my guess is that
this per-packet penalty is not measurable at the system level.
And if the penalty of a single uncontended lock -is- measurable, I will
be very quick to offer my congratulations, at least once I get my jaw
off my keyboard. ;-)
Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/