[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090411041533.GB6822@linux.vnet.ibm.com>
Date: Fri, 10 Apr 2009 21:15:33 -0700
From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: David Miller <davem@...emloft.net>, Ingo Molnar <mingo@...e.hu>,
Lai Jiangshan <laijs@...fujitsu.com>, shemminger@...tta.com,
jeff.chua.linux@...il.com, dada1@...mosbay.com, jengelh@...ozas.de,
kaber@...sh.net, r000n@...0n.net,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
netfilter-devel@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: iptables very slow after commit
784544739a25c30637397ace5489eeb6e15d7d49
On Fri, Apr 10, 2009 at 06:39:18PM -0700, Linus Torvalds wrote:
>
>
> On Fri, 10 Apr 2009, David Miller wrote:
> >
> > [ CC:'ing netfilter-devel and netdev... ]
>
> I wonder if we should bring in the RCU people too, for them to tell you
> that the networking people are beign silly, and should not synchronize
> with the very heavy-handed
>
> synchronize_net()
>
> but instead of doing synchronization (which is probably why adding a few
> hundred rules then takes several seconds - each synchronizes and that
> takes a timer tick or so), add the rules to be free'd on some rcu-freeing
> list for later freeing.
>
> Or whatever. Paul? synchronize_net() just calls synchronize_rcu(), and
> with that knowledge and a simple
>
> git show 784544739a25c30637397ace5489eeb6e15d7d49
>
> I bet you can already tell people how to fix their performance issue.
Well, I am certainly happy to demonstrate my ignorance of the networking
code by throwing out a few suggestions.
So, Dave and Steve, you might want to get out your barf bag before
reading further. You have been warned! ;-)
1. Assuming that the synchronize_net() is intended to guarantee
that the new rules will be in effect before returning to
user space:
a. Split this functionality, so that there is a new
user-space primitive that installs a new rule, but
without waiting. They provide an additional user-space
primitive that waits for the rules to take effect.
Then, when loading a long list of rules, load them
using the non-waiting primitive, and wait at the end.
b. As above, but provide a flag that says whether or not
to wait. Same general effect.
But I am not seeing the direct connection between this patch
and netfilter, so...
2. For the xt_replace_table() case, it would be necessary to add an
rcu_head to the xt_table_info, and replace each caller's direct
calls to xt_free_table_info() with call_rcu().
Now this has an issue in that the caller wants to return the
final counter values. My assumption is that these values do
not in fact need to be exact. If I am wrong about that, then
my suggestion would lose the counts from late readers.
I must defer to the networking guys as to whether this is
acceptable or not. If not, more head-scratching would be
required. (But it looks to me that the rule is being trashed,
so who cares about the extra counts?)
In addition, a malicious user might be able to force this to
happen extremely frequently, running the system out of memory.
One way to fix this is to invoke synchronize_net() one out of
20 times or some such.
3. For the alloc_counters() case, the comments indicate that we
really truly do want an atomic sampling of the counters.
The counters are 64-bit entities, which is a bit inconvenient.
Though people using this functionality are no doubt quite happy
to never have to worry about overflow, I hasten to add!
I will nevertheless suggest the following egregious hack to
get a consistent sample of one counter for some other CPU:
a. Disable interrupts
b. Atomically exchange the bottom 32 bits of the
counter with the value zero.
c. Atomically exchange the top 32 bits of the counter
with the value zero.
d. Concatenate the values obtained in (b) and (c), which
is the snapshot value.
e. Re-enable interrupts. Yes, for each counter. Do it
for the honor of the -rt patchset. ;-)
Disabling interrupts should make it impossible for
the low-order 32 bits of the counter to overflow before
we get around to zeroing the upper 32 bits. Yes, this
is horribly paranoid, but please keep in mind that even
my level of paranoia is not always sufficient to keep
RCU working correctly. :-/
Architectures with 64-bit atomics can simply do a 64-bit
exchange (or cmpxchg(), for that matter).
Now we still have the possibility that the other CPU is still
hammering away on the counter that we just zeroed from a
long-running RCU read-side critical section.
So, we also need to add an rcu_head somewhere, perhaps reuse
the one in xt_table_info, create a second one, or squirrel one
away somewhere else. As long as there is a way to get to the
old counter values. And a flag to indicate that the rcu_head
is in use. It is socially irresponsible to pass a given
rcu_head to call_rcu() before it has been invoked after the
previous time it was passed to call_rcu(). But you guys all
knew that already.
We replace the synchronize_net() with call_rcu(), more or less.
The call_rcu() probably needs to be under the lock -- or at the
very least, setting the flag saying that it is in use needs to
be under the lock.
The RCU callback function traverses the old counters one last
time, adding their values to the new set of counters. No
atomic exchange tricks are required this time, since all the
RCU readers that could possibly have held a reference to the
old set of counters must now be done. We now clear the flag,
allowing the next counter snapshot to proceed.
OK, OK, Dave and Steve, I should have suggested that you get two
barf bags. Maybe three. ;-)
Additional caveat: coward that I am, I looked only at the IPv4 code.
There might well be additional complications in the arp and IPv6 code.
However, I do believe that something like this might actually work.
Thoughts?
Thanx, Paul
> Linus
>
> ---
> > > On Fri, 10 Apr 2009 17:15:52 +0800 (SGT)
> > > Jeff Chua <jeff.chua.linux@...il.com> wrote:
> > >>
> > >> Adding 200 records in iptables took 6.0sec in 2.6.30-rc1 compared to
> > >> 0.2sec in 2.6.29. I've bisected down this commit.
> > >>
> > >> There are a few patches on top of the original patch. When I reverted the
> > >> original commit + changing rcu_read() to rcu_read_bh(), it speeds up the
> > >> inserts back to .2sec again.
> > >>
> > >> I'm loading all the firewall rules during boot-up and this 6 secs slowness
> > >> is really not very nice to wait for.
> > >
> > > The performance benefit during operation is more important. The load
> > > time is fixable. The problem is probably generic to any set of rules,
> > > but could you post some info about your configuration (like the rule
> > > set), and the system configuration (# of cpu's, config etc).
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > > the body of a message to majordomo@...r.kernel.org
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > Please read the FAQ at http://www.tux.org/lkml/
> >
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists