Date:	Tue, 24 Mar 2009 13:07:16 +0100
From:	Eric Dumazet <dada1@...mosbay.com>
To:	Joakim Tjernlund <Joakim.Tjernlund@...nsmode.se>
CC:	avorontsov@...mvista.com, Patrick McHardy <kaber@...sh.net>,
	netdev@...r.kernel.org,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Subject: [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()

Joakim Tjernlund wrote:
> Eric Dumazet <dada1@...mosbay.com> wrote on 24/03/2009 10:12:53:
>> Joakim Tjernlund wrote:
>>> Patrick McHardy <kaber@...sh.net> wrote on 23/03/2009 18:49:15:
>>>> Joakim Tjernlund wrote:
>>>>> Patrick McHardy <kaber@...sh.net> wrote on 23/03/2009 13:29:33:
>>>>>
>>>>>
>>>>>>> There is no /proc/net/netfilter/nf_conntrack. There is a
>>>>>>> /proc/net/nf_conntrack though and it is empty. If I telnet
>>>>>>> to the board I see:
>>>>>>>
>>>>>> That means that something is leaking conntrack references, most likely
>>>>>> by leaking skbs. Since I haven't seen any other reports, my guess would
>>>>>> be the ucc_geth driver.
>>>>>>
>>>>> Mucking around with the ucc_geth driver I found that if I:
>>>>>  - Move TX from IRQ to NAPI context
>>>>>  - double the weight.
>>>>>  - after booting up, wait a few mins until the JFFS2 GC kernel thread
>>>>>    has stopped scanning the FS
>>>>>
>>>>> Then the "nf_conntrack: table full, dropping packet." msgs stop.
>>>>> Does this seem right to you guys?
>>>> No. As I said, something seems to be leaking packets. You should be
>>>> able to confirm that by checking the sk_buff slabs in /proc/slabinfo.
>>>> If that *doesn't* show any signs of a leak, please run "conntrack -E"
>>>> to capture the conntrack events before the "table full" message
>>>> appears and post the output.
>>> skbuff does not differ much, but others do
>>>
>>> Before ping:
>>>   skbuff_fclone_cache    0      0    352   11    1 : tunables   54 27 0 : slabdata      0      0      0
>>>   skbuff_head_cache     20     20    192   20    1 : tunables  120 60 0 : slabdata      1      1      0
>>>   size-64              731    767     64   59    1 : tunables  120 60 0 : slabdata     13     13      0
>>>   nf_conntrack          10     19    208   19    1 : tunables  120 60 0 : slabdata      1      1      0
>>>
>>> During ping:
>>>   skbuff_fclone_cache    0      0    352   11    1 : tunables   54 27 0 : slabdata      0      0      0
>>>   skbuff_head_cache     40     40    192   20    1 : tunables  120 60 0 : slabdata      2      2      0
>>>   size-64             8909   8909     64   59    1 : tunables  120 60 0 : slabdata    151    151      0
>>>   nf_conntrack        5111   5111    208   19    1 : tunables  120 60 0 : slabdata    269    269      0
>>>
>>> This feels more like the freeing of conntrack objects is delayed and
>>> builds up when ping flooding.
>>>
>>> Don't have "conntrack -E" for my embedded board so that will have to
>>> wait a bit longer.
>> I don't understand how your ping can use so many conntrack entries...
>>
>> Then, as I said yesterday, I believe you have an RCU delay, because of
>> a misbehaving driver or something...
>>
>> grep RCU .config
> grep RCU .config
> # RCU Subsystem
> CONFIG_CLASSIC_RCU=y
> # CONFIG_TREE_RCU is not set
> # CONFIG_PREEMPT_RCU is not set
> # CONFIG_TREE_RCU_TRACE is not set
> # CONFIG_PREEMPT_RCU_TRACE is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> # CONFIG_RCU_CPU_STALL_DETECTOR is not set
> 
>> grep CONFIG_SMP .config
> grep CONFIG_SMP .config
> # CONFIG_SMP is not set
> 
>> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c (line 80)
>> as a workaround. It should force a quiescent state after 1000 freed conntracks.
> 
> right, doing this almost killed all conntrack messages, had to stress it
> pretty hard before I saw a handful of "nf_conntrack: table full, dropping
> packet" messages
> 
> RCU is not my cup of tea, do you have any ideas where to look?

In a stress situation, you feed deleted conntracks to call_rcu() faster than
blimit allows them to be freed (10 real freeings per RCU softirq invocation).

So with the default qhimark of 10000, this means about 10000 conntracks
can sit in RCU queues (per CPU) before being really freed.

Only when hitting 10000 does RCU enter a special mode that frees all queued
items instead of a small batch of 10.
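
To illustrate the dynamic, here is a small standalone C sketch (not kernel
code): BLIMIT and QHIMARK mirror the classic RCU defaults mentioned above,
while the 50 call_rcu() calls per pass are an arbitrary "flood" assumption.

/*
 * Standalone illustration of the queueing behaviour described above.
 * BLIMIT and QHIMARK mirror the classic RCU defaults; the 50 deleted
 * conntracks per pass are an arbitrary "flood" assumption.
 */
#include <stdio.h>

#define BLIMIT	10	/* callbacks really freed per RCU softirq pass */
#define QHIMARK	10000	/* queue length above which the limit is ignored */

int main(void)
{
	long queued = 0;	/* conntracks sitting in the RCU queue */
	int pass;

	for (pass = 1; pass <= 500; pass++) {
		long freed;

		queued += 50;	/* deleted conntracks handed to call_rcu() */

		/* normally only BLIMIT are freed; past QHIMARK, drain all */
		freed = (queued > QHIMARK) ? queued : BLIMIT;
		if (freed > queued)
			freed = queued;
		queued -= freed;

		if (pass % 50 == 0)
			printf("pass %3d: %5ld conntracks awaiting real freeing\n",
			       pass, queued);
	}
	return 0;
}

Until the queue crosses QHIMARK, the pending count keeps climbing, which is
why the conntrack limit ("table full") can be hit even though most of those
entries are already dead.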

To solve your problem we can:

1) Reduce qhimark from 10000 to 1000 (for example); see the sketch after
   this list. This should probably be done anyway, to reduce spikes in the
   RCU code when it frees a whole batch of 10000 elements...
OR
2) Change the conntrack tunable (max conntrack entries on your machine).
OR
3) Change net/netfilter/nf_conntrack_core.c to decrement net->ct.count
   in nf_conntrack_free() instead of in the RCU callback.
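
For reference, option 1 is a one-line change along these lines; the exact
declaration and its location in kernel/rcuclassic.c may differ in your tree,
so treat this as a sketch rather than an applicable patch:

--- a/kernel/rcuclassic.c
+++ b/kernel/rcuclassic.c
-static int qhimark = 10000;
+static int qhimark = 1000;

Option 3 is what the patch below implements.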

[PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()

We use RCU to defer freeing of conntrack structures. In a DoS situation, RCU
might accumulate about 10,000 elements per CPU in its internal queues. To get
an accurate conntrack count (at the expense of slightly more RAM used), we can
make the conntrack counter ignore elements that are about to be freed and are
still waiting in RCU queues. We thus decrement it in nf_conntrack_free(), not
in the RCU callback.

Signed-off-by: Eric Dumazet <dada1@...mosbay.com>


diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index f4935e3..6478dc7 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -516,16 +516,17 @@ EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
 static void nf_conntrack_free_rcu(struct rcu_head *head)
 {
 	struct nf_conn *ct = container_of(head, struct nf_conn, rcu);
-	struct net *net = nf_ct_net(ct);
 
 	nf_ct_ext_free(ct);
 	kmem_cache_free(nf_conntrack_cachep, ct);
-	atomic_dec(&net->ct.count);
 }
 
 void nf_conntrack_free(struct nf_conn *ct)
 {
+	struct net *net = nf_ct_net(ct);
+
 	nf_ct_ext_destroy(ct);
+	atomic_dec(&net->ct.count);
 	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_free);



