[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1389739537.31367.273.camel@edumazet-glaptop2.roam.corp.google.com>
Date: Tue, 14 Jan 2014 14:45:37 -0800
From: Eric Dumazet <eric.dumazet@...il.com>
To: Austin S Hemmelgarn <ahferroin7@...il.com>
Cc: Hannes Frederic Sowa <hannes@...essinduktion.org>,
netdev@...r.kernel.org, dborkman@...hat.com,
linux-kernel@...r.kernel.org, darkjames-ws@...kjames.pl
Subject: Re: [PATCH RFC] reciprocal_divide: correction/update of the
algorithm
On Tue, 2014-01-14 at 15:53 -0500, Austin S Hemmelgarn wrote:
> On 2014-01-14 14:50, Eric Dumazet wrote:
> > On Tue, 2014-01-14 at 14:22 -0500, Austin S Hemmelgarn wrote:
> >
> >> I disagree with the statement that current CPU's have reasonably fast
> >> dividers. A lot of embedded processors and many low-end x86 CPU's do
> >> not in-fact have any hardware divider, and usually provide it using
> >> microcode based emulation if they provide it at all. The AMD Jaguar
> >> micro-architecture in particular comes to mind, it uses an iterative
> >> division algorithm provided by the microcode that only produces 2 bits
> >> of quotient per cycle, even in the best case (2 8-bit integers and an
> >> integral 8-bit quotient) this still takes 4 cycles, which is twice as
> >> slow as any other math operation on the same processor.
> >
> > I doubt you run any BPF filter with a divide instruction in it on these
> > platform.
> >
> > Get real, do not over optimize things where it does not matter.
> >
> Actually, I have three Jaguar based routers, and use BPF regularly as
> part of their iptables rules to log certain packet types.
1) Have you enabled BPF JIT
2) Do you have divide instructions in a BPF filter,
if yes, I would like to have an example.
(current code works well for small divisors anyway)
3) How much time is spent in BPF compared to the rest of the stack,
especially if you run iptables...
Spending 2 or 3 days of work to save ~7 cycles for a divide that
probably can be replaced by a shift anyway, while spending 5000 cycles
per packet is what I call not a wise optimization, especially
if dealing with old hardware.
Even on a Jaguar, the proposed alternative
+ u32 t = (u32)(((u64)a * R.m) >> 32);
+ return (t + ((a - t) >> R.sh1)) >> R.sh2;
will have a similar cost.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists