netdev - Re: Fw: [Bug 13339] New: rtable leak in ipv4/route.c

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090519192450.GF28034@hmsreliant.think-freely.org>
Date:	Tue, 19 May 2009 15:24:50 -0400
From:	Neil Horman <nhorman@...driver.com>
To:	Eric Dumazet <dada1@...mosbay.com>
Cc:	Jarek Poplawski <jarkao2@...il.com>, lav@....ru,
	Stephen Hemminger <shemminger@...ux-foundation.org>,
	netdev@...r.kernel.org
Subject: Re: Fw: [Bug 13339] New: rtable leak in ipv4/route.c

On Tue, May 19, 2009 at 08:47:54PM +0200, Eric Dumazet wrote:
> Neil Horman a écrit :
> > On Tue, May 19, 2009 at 05:32:29PM +0200, Eric Dumazet wrote:
> >> Jarek Poplawski a écrit :
> >>> On 19-05-2009 04:35, Stephen Hemminger wrote:
> >>>> Begin forwarded message:
> >>>>
> >>>> Date: Mon, 18 May 2009 14:10:20 GMT
> >>>> From: bugzilla-daemon@...zilla.kernel.org
> >>>> To: shemminger@...ux-foundation.org
> >>>> Subject: [Bug 13339] New: rtable leak in ipv4/route.c
> >>>>
> >>>>
> >>>> http://bugzilla.kernel.org/show_bug.cgi?id=13339
> >>> ...
> >>>> 2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
> >>>> patch has at least one critical flaw, and another problem.
> >>>>
> >>>> rt_intern_hash calculates rthi pointer, which is later used for new entry
> >>>> insertion. The same loop calculates cand pointer which is used to clean the
> >>>> list. If the pointers are the same, rtable leak occurs, as first the cand is
> >>>> removed then the new entry is appended to it.
> >>>>
> >>>> This leak leads to unregister_netdevice problem (usage count > 0).
> >>>>
> >>>> Another problem of the patch is that it tries to insert the entries in certain
> >>>> order, to facilitate counting of entries distinct by all but QoS parameters.
> >>>> Unfortunately, referencing an existing rtable entry moves it to list beginning,
> >>>> to speed up further lookups, so the carefully built order is destroyed.
> >> We could change rt_check_expire() to be smarter and handle any order in chains.
> >>
> >> This would let rt_intern_hash() be simpler.
> >>
> >> As its a more performance critical path, all would be good :)
> >>
> >>>> For the first problem the simplest patch it to set rthi=0 when rthi==cand, but
> >>>> it will also destroy the ordering.
> >>> I think fixing this bug fast is more important than this
> >>> ordering or counting. Could you send your patch proposal?
> >>>
> > 
> > I was thinking something along these lines (also completely untested).  The
> > extra check in rt_intern hash prevents us from identifying a 'group leader' that
> > is about to be deleted, and sets rthi to point at the group leader rather than
> > the last entry in the group as it previously did, which we then mark with the
> > new flag in the dst_entry (updating it on rt_free of course).  This lets us
> > identify the boundaries of a group, so when we do a move to front heruistic, we
> > can move the entire group rather than just the single entry.    I prefer this to
> > Erics approach since, even though rt_check_expire isn't in a performance
> > critical path, the addition of the extra list search when the list is unordered
> > makes rt_check_expire run in O(n^2) rather than O(n), which I think will have a
> > significant impact on high traffic systems.  Since the ordered list creation was
> > done at almost zero performance cost in rt_intern_hash, I'd like to keep it if
> > at all possible.
> > 
> > There is of course a shortcomming in my patch in that it doesn't actually do
> > anything currently with DST_GRPLDR other than set/clear it appropriately, I've
> > yet to identify where the route cache move to front heruistic is implemented.
> > If someone could show me where that is, I'd be happy to finish this patch.
> > 
> 
> O(n^2) ? Not exactly ...
> 
you're right, on closer examination, its not O(n^2), but its not what you have
below either I don't think
 
> N * (Y + x)  instead of N * (Y),  where x <= Y/100, and we can make x being 0
> in fact with proper prefetching.
> 
>From the way I read you patch (which is actually pretty clever as I look at it),
you add an extra loop that searches from the start of the given chain to the
point at which the outer loop is set looking for a duplicate, so that each
rtable entry which varies only by non-hash relevant values is counted only once.
Thats good.  But that means for a chain of n nodes in which you are checking the
ith node for duplicates you need to visit at worst i other nodes, so to process
the entire chain we need to visit SUM(i=1..n)(i) == (n(n-1)/2) nodes rather than
the simply linear search requireing the inspection of only n nodes.  I see how
with prefetching the cost of the additional node visitation can be minimized,
but thats still a significantly larger set of nodes to check per expiration run.

> Real cost is bringing into CPU cache the information needed, by two
> orders of magnitude. (one cache line miss : more than 200 cycles,
> and we need at least two cache lines per rtable.)

> Adding 'group' notion to dst entries is overengineering an already
> too complex ipv4/route.c file, that is almost unreadable these days,
> since even you can not spot the point where we move the entry at chain
> front.
> 
> Hint : search for /* Put it first */ in rt_intern_hash()
> 
Ah! So its not actually on rtable access (i.e. route lookup) that we do a
move-to-front, its only when we're hashing in new entries.  That explains why I
didn't see it, I was thinking route lookups moved the entry to the front, not
inserts.

> Moving whole group in front would defeat the purpose of move, actually,
> since rank in chain is used to decay the timeout in garbage collector.
> (search for tmo >>= 1; )
> 
Argh, so the list is implicitly ordered by expiration time.  That really defeats
the entire purpose of doing grouping in the ilst at all.  If thats the case,
then I agree, its probably better to to take the additional visitation hit in in
check_expire above than to try and preserve ordering. 

Thanks
Neil

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html