Message-ID: <20090203132220.21a16ea1@extreme>
Date: Tue, 3 Feb 2009 13:22:20 -0800
From: Stephen Hemminger <shemminger@...tta.com>
To: paulmck@...ux.vnet.ibm.com
Cc: Eric Dumazet <dada1@...mosbay.com>,
David Miller <davem@...emloft.net>, netdev@...r.kernel.org
Subject: Re: [PATCH 3/3] iptables: lock free counters (alternate version)
On Tue, 3 Feb 2009 13:10:00 -0800
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com> wrote:
> On Tue, Feb 03, 2009 at 09:20:04PM +0100, Eric Dumazet wrote:
> > > Paul E. McKenney wrote:
> > > On Tue, Feb 03, 2009 at 08:00:16PM +0100, Eric Dumazet wrote:
> > >> Stephen Hemminger wrote:
> > >>> This is an alternative to the earlier RCU/seqcount_t version of the counters.
> > >>> The counters are updated as usual without locking, but when the counters are
> > >>> read, the per-CPU entry sets are rotated. RCU is used in two ways: first to
> > >>> handle the counter rotation, second to handle table replacement.
> > >> Is it a working patch or just a prototype?
> > >>
> > >>> Signed-off-by: Stephen Hemminger <shemminger@...tta.com>
> > >>>
> > >>> ---
> > >>> include/linux/netfilter/x_tables.h | 10 +++-
> > >>> net/ipv4/netfilter/arp_tables.c | 73 +++++++++++++++++++++++++++---------
> > >>> net/ipv4/netfilter/ip_tables.c | 68 ++++++++++++++++++++++++---------
> > >>> net/ipv6/netfilter/ip6_tables.c | 75 ++++++++++++++++++++++++++-----------
> > >>> net/netfilter/x_tables.c | 43 +++++++++++++++------
> > >>> 5 files changed, 197 insertions(+), 72 deletions(-)
> > >>>
> > >>> --- a/include/linux/netfilter/x_tables.h 2009-02-02 15:06:39.893751845 -0800
> > >>> +++ b/include/linux/netfilter/x_tables.h 2009-02-02 15:28:10.022574005 -0800
> > >>> @@ -353,7 +353,7 @@ struct xt_table
> > >>> unsigned int valid_hooks;
> > >>>
> > >>> /* Lock for the curtain */
> > >>> - rwlock_t lock;
> > >>> + struct mutex lock;
> > >>>
> > >>> /* Man behind the curtain... */
> > >>> struct xt_table_info *private;
> > >>> @@ -383,9 +383,15 @@ struct xt_table_info
> > >>> unsigned int hook_entry[NF_INET_NUMHOOKS];
> > >>> unsigned int underflow[NF_INET_NUMHOOKS];
> > >>>
> > >>> + /* For the dustman... */
> > >>> + union {
> > >>> + struct rcu_head rcu;
> > >>> + struct work_struct work;
> > >>> + };
> > >>> +
> > >>> /* ipt_entry tables: one per CPU */
> > >>> /* Note : this field MUST be the last one, see XT_TABLE_INFO_SZ */
> > >>> - char *entries[1];
> > >>> + void *entries[1];
> > >>> };
> > >>>
> > >>> #define XT_TABLE_INFO_SZ (offsetof(struct xt_table_info, entries) \
> > >>> --- a/net/ipv4/netfilter/ip_tables.c 2009-02-02 15:06:29.684249364 -0800
> > >>> +++ b/net/ipv4/netfilter/ip_tables.c 2009-02-02 15:14:13.256499021 -0800
> > >>> @@ -347,10 +347,12 @@ ipt_do_table(struct sk_buff *skb,
> > >>> mtpar.family = tgpar.family = NFPROTO_IPV4;
> > >>> tgpar.hooknum = hook;
> > >>>
> > >>> - read_lock_bh(&table->lock);
> > >>> IP_NF_ASSERT(table->valid_hooks & (1 << hook));
> > >>> - private = table->private;
> > >>> - table_base = (void *)private->entries[smp_processor_id()];
> > >>> +
> > >>> + rcu_read_lock_bh();
> > >>> + private = rcu_dereference(table->private);
> > >>> + table_base = rcu_dereference(private->entries[smp_processor_id()]);
> > >>> +
> > >>> e = get_entry(table_base, private->hook_entry[hook]);
> > >>>
> > >>> /* For return from builtin chain */
> > >>> @@ -445,7 +447,7 @@ ipt_do_table(struct sk_buff *skb,
> > >>> }
> > >>> } while (!hotdrop);
> > >>>
> > >>> - read_unlock_bh(&table->lock);
> > >>> + rcu_read_unlock_bh();
> > >>>
> > >>> #ifdef DEBUG_ALLOW_ALL
> > >>> return NF_ACCEPT;
> > >>> @@ -892,45 +894,73 @@ set_entry_to_counter(const struct ipt_en
> > >>> return 0;
> > >>> }
> > >>>
> > >>> +static inline int
> > >>> +set_counter_to_entry(struct ipt_entry *e,
> > >>> + const struct ipt_counters total[],
> > >>> + unsigned int *i)
> > >>> +{
> > >>> + SET_COUNTER(e->counters, total[*i].bcnt, total[*i].pcnt);
> > >>> +
> > >>> + (*i)++;
> > >>> + return 0;
> > >>> +}
> > >>> +
> > >>> +
> > >>> static void
> > >>> -get_counters(const struct xt_table_info *t,
> > >>> +get_counters(struct xt_table_info *t,
> > >>> struct xt_counters counters[])
> > >>> {
> > >>> unsigned int cpu;
> > >>> unsigned int i;
> > >>> unsigned int curcpu;
> > >>> + struct ipt_entry *e;
> > >>>
> > >>> - /* Instead of clearing (by a previous call to memset())
> > >>> - * the counters and using adds, we set the counters
> > >>> - * with data used by 'current' CPU
> > >>> - * We dont care about preemption here.
> > >>> - */
> > >>> + preempt_disable();
> > >>> curcpu = raw_smp_processor_id();
> > >>> -
> > >>> + e = t->entries[curcpu];
> > >>> i = 0;
> > >>> - IPT_ENTRY_ITERATE(t->entries[curcpu],
> > >>> + IPT_ENTRY_ITERATE(e,
> > >>> t->size,
> > >>> set_entry_to_counter,
> > >> Hmm, the current CPU might be interrupted by the NIC here, since you only
> > >> disabled preemption: set_entry_to_counter() might read garbage.
> > >> I believe I already mentioned this :)
> > >>
> > >>> counters,
> > >>> &i);
> > >>>
> > >>> for_each_possible_cpu(cpu) {
> > >>> + void *p;
> > >>> +
> > >>> if (cpu == curcpu)
> > >>> continue;
> > >>> +
> > >>> + /* Swizzle the values and wait */
> > >>> + e->counters = ((struct xt_counters) { 0, 0 });
> > >> I don't see what you want to do here...
> > >>
> > >> e->counters is the counter associated with rule #0.
> > >>
> > >>> + p = t->entries[cpu];
> > >>> + rcu_assign_pointer(t->entries[cpu], e);
> > >>> + synchronize_net();
> > >>
> > >> Oh well, not this synchronize_net() :)
> > >>
> > >> This won't provide atomic sampling of the counters across all CPUs, and it
> > >> introduces large delays on big machines (NR_CPUS >= 64).
> > >
> > > Why would this not provide the moral equivalent of atomic sampling?
> > > The code above switches to another counter set, and waits for a grace
> > > period. Shouldn't this mean that all CPUs that were incrementing the
> > > old set of counters have finished doing so, so that the aggregate count
> > > covers all CPUs that started their increments before the pointer switch?
> > > Same as acquiring a write lock, which would wait for all CPUs that
> > > started their increments before starting the write-lock acquisition.
> > > CPUs that started their increments after starting the write acquisition
> > > would not be accounted for in the total, same as the RCU approach.
> > >
> > > Steve's approach does delay reading out the counters, but it avoids
> > > delaying any CPU trying to increment the counters.
> >
> > I see your point, but this is not what Stephen implemented.
> >
> > So... which counters will a CPU increment, if it is not delayed?
>
> The new set installed by the rcu_assign_pointer().
>
> > How will the counters be synced again after our 'iptables -L' has finished?
>
> The usual approach would be to have three sets of counters: one currently
> being incremented, one just removed from service, and the last one holding
> the cumulative value.  After a synchronize_net() following the removal of
> a set from service, you add in the values from that just-removed set.
> Then you keep the new set for the next 'iptables -L'.
>
> > "iptable -L" is not supposed to miss some counters updates (only some packets
> > might be droped at NIC level because we spend time in the collection).
> > If packets matches some rules, we really want up2date counters.
>
> No counter updates would be lost using the above method.
>
> > Maybe we need an extra "cpu" for this collection, to gather all the
> > increments that were done while the CPUs were directed to a
> > "secondary table/counters".
>
> It should be easier to maintain a third set of counters that holds the
> accumulated counts from the earlier instances of 'iptables -L'.
>
> > > So what am I missing here?
> >
> > Well, I saw one synchronize_net() inside the for_each_possible_cpu(cpu) loop.
> > Say we have NR_CPUS=4096: how long will it take to perform "iptables -L"?
>
> Good point, the for_each_possible_cpu() was cut out -- I should have
> gone back and looked at the original patch.
>
> Seems like it should be possible to do a single synchronize_net()
> after swizzling all the counters...
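Something along these lines, perhaps (sketch only; the old[] scratch
array and new_entries[] are hypothetical, add_entry_to_counter() is the
existing helper in ip_tables.c):

	/* Swizzle every CPU's entries first... */
	for_each_possible_cpu(cpu) {
		old[cpu] = t->entries[cpu];
		rcu_assign_pointer(t->entries[cpu], new_entries[cpu]);
	}

	/* ...then a single grace period covers all of them. */
	synchronize_net();

	for_each_possible_cpu(cpu) {
		i = 0;
		IPT_ENTRY_ITERATE(old[cpu], t->size,
				  add_entry_to_counter, counters, &i);
	}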
>
> > The general/intuitive idea would be:
> >
> > switch the pointers to a newly allocated table (with zeroed counters)
> > wait one RCU grace period
> > collect/sum all counters of the "old" table (all CPUs) into the user-provided table
> > restore the previous table
> > wait one RCU grace period
> > disable_bh()
> > collect/sum all counters of the "new and temporary" table (all CPUs) and
> > reinject them into the local CPU's table (this CPU should not be interrupted)
> > enable_bh()
> >
> > This way, "iptables -L" is not too expensive and doesn't block packet processing at all.
>
> My thought would be:
>
> o Acquire some sort of mutex.
>
> o Switch the counters to a newly allocated (and zeroed) table (T1).
> The old table being switched out is T2.
>
> o Wait one RCU grace period.
>
> o Sum T2 into a single global counter (G).
>
> o Free T2.
>
> o Copy G to a local variable.
>
> o Release the mutex.
>
> o Return the value of the local variable.
>
> Then you can repeat, allocating a new table again and using the new
> value of G.
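In code, that sequence would be roughly (sketch; the mutex, G, and the
alloc/sum/copy helpers are all hypothetical):

	mutex_lock(&counters_mutex);

	t1 = alloc_zeroed_counters(nrules);	/* new live set */
	t2 = table->counters;			/* set being retired */
	rcu_assign_pointer(table->counters, t1);

	synchronize_net();			/* t2's writers are done */

	sum_counters(G, t2, nrules);		/* fold t2 into G */
	free_counters(t2);
	copy_counters(snapshot, G, nrules);	/* local copy to return */

	mutex_unlock(&counters_mutex);

	return snapshot;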
>
> Which may well be what you are saying above.
>
I was using the current CPU counter as the global counter G.
> Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html