[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140108201838.GI9894@breakpoint.cc>
Date: Wed, 8 Jan 2014 21:18:38 +0100
From: Florian Westphal <fw@...len.de>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: Florian Westphal <fw@...len.de>, Andrey Vagin <avagin@...nvz.org>,
netfilter-devel@...r.kernel.org, netfilter@...r.kernel.org,
coreteam@...filter.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, vvs@...nvz.org,
Pablo Neira Ayuso <pablo@...filter.org>,
Patrick McHardy <kaber@...sh.net>,
Jozsef Kadlecsik <kadlec@...ckhole.kfki.hu>,
"David S. Miller" <davem@...emloft.net>,
Cyrill Gorcunov <gorcunov@...nvz.org>
Subject: Re: [PATCH] netfilter: nf_conntrack: fix RCU race in
nf_conntrack_find_get
Eric Dumazet <eric.dumazet@...il.com> wrote:
> > The confirmed bit should always be set here.
>
> So why are you testing it ?
To detect ct object recycling when tuple is identical.
This is my understanding of how we can end up with two
cpus thinking they have exclusive ownership of the same ct:
A cpu0: starts lookup: find ct for tuple t
B cpu1: starts lookup: find ct for tuple t
C cpu0: finds ct c for tuple t, no refcnt taken yet
cpu1: finds ct c for tuple t, no refcnt taken yet
cpu2: destroys ct c, removes from hash table, calls ->destroy function
D cpu0: tries to increment refcnt; fails since its 0: lookup ends
E cpu0: allocates a new ct object since no acceptable ct was found for t
F cpu0: allocator gives us just-freed ct c
G cpu0: initialises ct, sets refcnt to 1
H cpu0: adds extensions, ct object is put on unconfirmed list and
assigned to skb->nfct
I cpu0: skb continues through network stack
J cpu1: tries to increment refcnt, ok
K cpu1: checks if ct matches requested tuple t: it does
L cpu0: sets refcnt conntrack tuple, allocates extensions, etc.
cpu1: sets skb->nfct to ct, skb continues through network stack
-> both cpu0 and cpu1 reference a ct object that was not in hash table
cpu0 and cpu1 will then race, for example in
net/ipv4/netfilter/iptable_nat.c:nf_nat_rule_find():
if (!nf_nat_initialized(ct, HOOK2MANIP(hooknum)))
ret = alloc_null_binding(ct, hooknum);
[ Its possible that I misunderstand and that there is something that
precents this from happening. Usually its the 'tuple equal' test
that is performed post-atomic-inc-not-zero that detects the recycling,
so step K above would fail ]
The idea of the 'confirmed bit test' is that when its not set then the
conntrack was recycled and should not be used before the cpu that
currently 'owns' that entry has put it into the hash table again.
> I did this RCU conversion, so I think I know what I am talking about.
Yes, I know. If you have any suggestions on how to fix it, I'd be very
interested to hear about them.
> The entry should not be put into hash table (or refcnt set to 1),
> if its not ready. It is that simple.
I understand what you're saying, but I don't see how we can do it.
I think the assumption that we have a refcnt on skb->nfct is everywhere.
If I understand you correctly we'd have to differentiate between
'can be used fully (e.g. nf_nat_setup_info done for both directions)'
and 'a new conntrack was created (extensions may not be ready yet)'.
But currently in both cases the ct is assigned to the skb, and in both
cases a refcnt is taken. I am out of ideas, except perhaps using
ct->lock to serialize the nat setup (but I don't like that since
I'm not sure that the nat race is the only one).
> We need to address this problem without adding an extra test and
> possible loop for the lookups.
Agreed. I don't like the extra test either.
Many thanks for looking into this Eric!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists