Message-ID: <CACKH++afaAaa7a6ViYjo_PjpF1bXYtOuJaa-4umEOSVgW1+g3w@mail.gmail.com>
Date: Fri, 29 Jul 2011 22:09:36 -0700
From: Rui Ueyama <rui314@...il.com>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: netdev@...r.kernel.org
Subject: Re: [PATCH] net: filter: Convert the BPF VM to threaded code
The benchmark results look good. A simple benchmark that sends 10M UDP
packets to lo took 76.24 seconds on average on a Core 2 Duo L7500@...GHz while
tcpdump was running. With this patch it took 75.41 seconds, which means we save
about 80ns per packet on that processor.
I think converting the VM to threaded code is low-hanging fruit, even if we
have JIT compilers for popular architectures. Most of the lines in my patch
are indentation changes, so the actual change is not big.
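For readers unfamiliar with the technique, here is a minimal sketch of the
difference between switch dispatch and threaded dispatch. It is not the patch
itself: the three opcodes, struct insn layout, and the function names
run_switch and run_threaded are hypothetical and chosen only for illustration.
It assumes GCC or Clang (labels-as-values is a GNU extension) and, like the
patch, assumes the program has been validated beforehand so that every opcode
is in range and the last instruction is a RET.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes; these are NOT the kernel's BPF opcodes. */
enum { OP_LOAD_IMM, OP_ADD_IMM, OP_RET, OP_COUNT };

struct insn {
    uint16_t code;  /* opcode, assumed already validated to be < OP_COUNT */
    uint32_t k;     /* immediate operand */
};

/* Classic interpreter: a for loop around a switch statement.
 * Every instruction pays for the loop condition and the switch dispatch. */
static uint32_t run_switch(const struct insn *prog, size_t len)
{
    uint32_t acc = 0;

    for (size_t pc = 0; pc < len; pc++) {
        const struct insn *in = &prog[pc];

        switch (in->code) {
        case OP_LOAD_IMM: acc = in->k;  break;
        case OP_ADD_IMM:  acc += in->k; break;
        case OP_RET:      return acc;
        default:          return 0;
        }
    }
    return 0;
}

/* Threaded version: each handler ends with an indirect jump to the next
 * handler. There is no loop condition; termination relies on the program
 * ending with OP_RET. */
static uint32_t run_threaded(const struct insn *prog)
{
    static void *const jump_table[OP_COUNT] = {
        [OP_LOAD_IMM] = &&do_load_imm,
        [OP_ADD_IMM]  = &&do_add_imm,
        [OP_RET]      = &&do_ret,
    };
    const struct insn *in = prog;
    uint32_t acc = 0;

#define NEXT goto *jump_table[(++in)->code]

    assert(prog != NULL);
    /* Dispatch the first instruction */
    goto *jump_table[in->code];

do_load_imm:
    acc = in->k;
    NEXT;
do_add_imm:
    acc += in->k;
    NEXT;
do_ret:
    return acc;

#undef NEXT
}
```

Both functions should return the same value for a well-formed program, for
example a load of 40 followed by an add of 2 and a RET. Whether the threaded
form is actually faster depends on how well the CPU predicts the indirect
jumps, which is why the measurements below matter.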
Vanilla kernel:
(without tcpdump)
ruiu@...e:~$ time ./udpflood 10000000
real 0m57.909s
user 0m1.368s
sys 0m56.484s
ruiu@...e:~$ time ./udpflood 10000000
real 0m57.686s
user 0m1.360s
sys 0m56.288s
ruiu@...e:~$ time ./udpflood 10000000
real 0m58.457s
user 0m1.300s
sys 0m57.116s
(with tcpdump)
ruiu@...e:~$ time ./udpflood 10000000
real 1m16.025s
user 0m1.464s
sys 1m14.505s
ruiu@...e:~$ time ./udpflood 10000000
real 1m15.860s
user 0m1.232s
sys 1m14.573s
ruiu@...e:~$ time ./udpflood 10000000
real 1m16.861s
user 0m1.504s
sys 1m15.301s
Kernel with the patch:
(without tcpdump)
ruiu@...e:~$ time ./udpflood 10000000
real 0m59.272s
user 0m1.308s
sys 0m57.924s
ruiu@...e:~$ time ./udpflood 10000000
real 0m59.624s
user 0m1.336s
sys 0m58.244s
ruiu@...e:~$ time ./udpflood 10000000
real 0m59.340s
user 0m1.240s
sys 0m58.056s
(with tcpdump)
ruiu@...e:~$ time ./udpflood 10000000
real 1m15.392s
user 0m1.372s
sys 1m13.965s
ruiu@...e:~$ time ./udpflood 10000000
real 1m15.352s
user 0m1.452s
sys 1m13.845s
ruiu@...e:~$ time ./udpflood 10000000
real 1m15.508s
user 0m1.464s
sys 1m13.989s
The tcpdump command I used was: tcpdump -p -n -s -i lo net 192.168.2.0/24
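The per-packet figure can be re-derived from the "real" times of the three
with-tcpdump runs on each kernel listed above. The sketch below does that
arithmetic; the helper names mean_seconds and saving_ns_per_packet are made up
for this example. With these six samples the difference works out to roughly
83ns per packet, in line with the "about 80ns" estimate.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n wall-clock samples, in seconds. */
static double mean_seconds(const double *samples, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Time saved per packet, in nanoseconds, given two total run times
 * in seconds and the number of packets sent in each run. */
static double saving_ns_per_packet(double before_s, double after_s,
                                   double packets)
{
    return (before_s - after_s) * 1e9 / packets;
}

/* "real" times of the with-tcpdump runs, converted to seconds. */
static const double vanilla_tcpdump[] = { 76.025, 75.860, 76.861 };
static const double patched_tcpdump[] = { 75.392, 75.352, 75.508 };
```

Calling saving_ns_per_packet(mean_seconds(vanilla_tcpdump, 3),
mean_seconds(patched_tcpdump, 3), 1e7) reproduces the estimate. Three runs
per configuration is a small sample, so the result is best read as an
order-of-magnitude figure.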
On Fri, Jul 29, 2011 at 2:30 AM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> On Friday, 29 July 2011 at 01:10 -0700, Rui Ueyama wrote:
>> Convert the BPF VM to threaded code to improve performance.
>>
>> The BPF VM is basically a big for loop containing a switch statement. That is
>> slow because for each instruction it checks the for loop condition and does the
>> conditional branch of the switch statement.
>>
>> This patch eliminates the conditional branch by replacing it with a jump table
>> using GCC's labels-as-values feature. The for loop condition check can also be
>> removed, because the filter code always ends with a RET instruction.
>>
>
> Well...
>
>
>> +#define NEXT goto *jump_table[(++fentry)->code]
>> +
>> + /* Dispatch the first instruction */
>> + goto *jump_table[fentry->code];
>
> This is the killer, as this cannot be predicted by the cpu.
>
> Do you have benchmark results to provide?
>
> We now have BPF JIT on x86_64 and powerpc, and possibly on MIPS and ARM
> in the near future.
>
>
>
>