netdev - Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Wed, 11 Oct 2017 13:10:07 +0200
From:   Phil Sutter <phil@....cc>
To:     Stephen Hemminger <stephen@...workplumber.org>
Cc:     Michal Kubecek <mkubecek@...e.cz>, Hangbin Liu <haliu@...hat.com>,
        netdev@...r.kernel.org, Hangbin Liu <liuhangbin@...il.com>
Subject: Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to
 support malloc buff at run time

On Tue, Oct 10, 2017 at 09:47:43AM -0700, Stephen Hemminger wrote:
> On Tue, 10 Oct 2017 08:41:17 +0200
> Michal Kubecek <mkubecek@...e.cz> wrote:
> 
> > On Mon, Oct 09, 2017 at 10:25:25PM +0200, Phil Sutter wrote:
> > > Hi Stephen,
> > > 
> > > On Mon, Oct 02, 2017 at 10:37:08AM -0700, Stephen Hemminger wrote:  
> > > > On Thu, 28 Sep 2017 21:33:46 +0800
> > > > Hangbin Liu <haliu@...hat.com> wrote:
> > > >   
> > > > > From: Hangbin Liu <liuhangbin@...il.com>
> > > > > 
> > > > > This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
> > > > > iplink_get()"). After update, we will not need to double the buffer size
> > > > > every time when VFs number increased.
> > > > > 
> > > > > With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
> > > > > length parameter.
> > > > > 
> > > > > With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
> > > > > answer to avoid overwrite data in nlh, because it may has more info after
> > > > > nlh. also this will avoid nlh buffer not enough issue.
> > > > > 
> > > > > We need to free answer after using.
> > > > > 
> > > > > Signed-off-by: Hangbin Liu <liuhangbin@...il.com>
> > > > > Signed-off-by: Phil Sutter <phil@....cc>
> > > > > ---  
> > > > 
> > > > Most of the uses of rtnl_talk() don't need to this peek and dynamic sizing.
> > > > Can only those places that need that be targeted?  
> > > 
> > > We could probably do that, by having a buffer on stack in __rtnl_talk()
> > > which will be used instead of the allocated one if 'answer' is NULL. Or
> > > maybe even introduce a dedicated API call for the dynamically allocated
> > > receive buffer. But I really doubt that's feasible: AFAICT, that stack
> > > buffer still needs to be reasonably sized since the reply might be
> > > larger than the request (reusing the request buffer would be the most
> > > simple way to tackle this), also there is support for extack which may
> > > bloat the response to arbitrary size. Hangbin has shown in his benchmark
> > > that the overhead of the second syscall is negligible, so why care about
> > > that and increase code complexity even further?
> > > 
> > > Not saying it's not possible, but I just doubt it's worth the effort.  
> > 
> > Agreed. Current code is based on the assumption that we can estimate the
> > maximum reply length in advance and the reason for this series is that
> > this assumption turned out to be wrong. I'm afraid that if we replace
> > it by an assumption that we can estimate the maximum reply length for
> > most requests with only few exceptions, it's only matter of time for us
> > to be proven wrong again.
> > 
> > Michal Kubecek
> > 
> 
> For query responses, yes the response may be large. But for the common cases of
> add address or add route, the response should just be ack or error.

And with extack, error is comprised of the original request plus an
arbitrarily sized error message, so we can't just reuse the request
buffer and are back to "guessing" the right length again.

To get an idea of what we're talking about, I wrote a simple benchmark
which adds 256 * 254 (= 65024) addresses to an interface, then removes
them again one by one and measured the time that takes for binaries with
and without Hangbin's patches:

OP	Vanilla		Hangbin		Delta
--------------------------------------------------------
add	real 2m16.244s	real 2m27.964s	+11.72s	(108.6%)
	user 0m15.241s	user 0m17.295s	+2.054s	(113.5%)
	sys  1m40.229s	sys  1m48.239s	+8.01s	(108.0%)

remove	real 1m44.950s	real 1m47.044s	+2.094s	(102.0%)
	user 0m13.899s	user 0m14.723s	+0.824s (105.9%)
	sys  1m30.798s	sys  1m31.938s	+1.140s (101.3%)

So the overhead of the second syscall and dynamic memory allocation is
less than 10% overall. Given the short time a single call to 'ip'
typically takes, I don't think the difference is noticeable even in
highly performance critical applications.

Cheers, Phil