netdev - Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter


Message-ID: <20160405112905.66b84e13@redhat.com>
Date:	Tue, 5 Apr 2016 11:29:05 +0200
From:	Jesper Dangaard Brouer <brouer@...hat.com>
To:	Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:	Brenden Blanco <bblanco@...mgrid.com>,
	John Fastabend <john.fastabend@...il.com>,
	Tom Herbert <tom@...bertland.com>,
	Daniel Borkmann <daniel@...earbox.net>,
	"David S. Miller" <davem@...emloft.net>,
	Linux Kernel Network Developers <netdev@...r.kernel.org>,
	ogerlitz@...lanox.com, brouer@...hat.com
Subject: Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver
 filter


On Mon, 4 Apr 2016 13:00:34 -0700 Alexei Starovoitov <alexei.starovoitov@...il.com> wrote:

> As seen in 'perf report' from patch 5:
>   3.32%  ksoftirqd/1    [kernel.vmlinux]  [k] sk_load_byte_positive_offset
> this is 14Mpps and 4 assembler instructions in the above function
> are consuming 3% of the cpu.

At this level we also need to take into account the cost/overhead of a
function call, which I have measured at 5-7 cycles as part of my
time_bench_sample[1] test.
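
A rough userspace sketch of this kind of measurement, for illustration
only.  It is not the kernel module in [1]: the names callee(),
ns_per_call() and ns_to_cycles() are made up here, the timing includes
loop overhead, and the clock frequency used to convert to cycles is an
assumption the caller has to supply.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() under strict -std= modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* noinline (GCC/Clang attribute) forces a real call/ret pair per iteration. */
__attribute__((noinline)) static uint64_t callee(uint64_t x)
{
	/* Empty asm with a memory clobber: stops the compiler from treating
	 * the function as a pure constant expression and eliding the call. */
	__asm__ volatile("" ::: "memory");
	return x + 1;
}

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Average nanoseconds per iteration; this is call cost PLUS loop overhead. */
static double ns_per_call(uint64_t loops)
{
	volatile uint64_t sink;
	uint64_t acc = 0, start, stop, i;

	assert(loops > 0);
	start = now_ns();
	for (i = 0; i < loops; i++)
		acc = callee(acc);	/* each call depends on the previous result */
	stop = now_ns();
	sink = acc;			/* keep the result observable */
	(void)sink;
	return (double)(stop - start) / (double)loops;
}

/* Convert ns to cycles for an ASSUMED fixed clock, e.g. ghz = 3.0. */
static double ns_to_cycles(double ns, double ghz)
{
	return ns * ghz;
}

#ifdef BENCH_DEMO_MAIN
int main(void)
{
	double ns = ns_per_call(100000000ULL);

	/* 3.0 GHz is a placeholder; use the real frequency of the test box. */
	printf("%.2f ns/iter, ~%.1f cycles at an assumed 3.0 GHz\n",
	       ns, ns_to_cycles(ns, 3.0));
	return 0;
}
#endif
```

Build with something like "gcc -O2 -DBENCH_DEMO_MAIN".  The numbers
will vary with CPU, frequency scaling and compiler, so they are not
directly comparable to the in-kernel figure above.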

> Making new_load_byte to be single  x86 insn would be really cool.
> 
> Of course, there are other pieces to accelerate:
>  12.71%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_alloc_frags
>   6.87%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_free_frag
>   4.20%  ksoftirqd/1    [kernel.vmlinux]  [k] get_page_from_freelist
>   4.09%  swapper        [mlx4_en]         [k] mlx4_en_process_rx_cq
> and I think Jesper's work on batch allocation is going help that a lot.

Actually, it looks like all of this "overhead" comes from the page
alloc/free (+ dma unmap/map).  We would need a page-pool recycle
mechanism to remove this overhead.  For the early-drop case we might be
able to hack it by recycling the page directly in the driver (and so
also avoid the dma_unmap/map cycle).
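
To make the recycle idea concrete, here is a minimal userspace model,
assuming a small per-RX-ring cache of pages.  Everything in it
(struct rx_page_cache, rx_page_get(), rx_page_drop(), the slot count)
is hypothetical and is not mlx4_en or any existing kernel API; malloc()
and free() merely stand in for the page allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SZ		4096	/* stand-in for a real page */
#define CACHE_SLOTS	8	/* arbitrary size for this sketch */

/* Hypothetical per-RX-ring cache of pages that were dropped early. */
struct rx_page_cache {
	void *slot[CACHE_SLOTS];
	size_t count;
	unsigned long slow_allocs;	/* trips to the (modelled) page allocator */
	unsigned long recycled;		/* pages kept instead of freed */
};

/* RX refill: prefer a recycled page, fall back to the allocator. */
static void *rx_page_get(struct rx_page_cache *c)
{
	if (c->count > 0)
		return c->slot[--c->count];
	c->slow_allocs++;
	return malloc(PAGE_SZ);
}

/*
 * Early drop: keep the page for the next refill instead of freeing it.
 * In a real driver the page could stay DMA-mapped here, which is where
 * the dma_unmap/map saving would come from (assumption, not modelled).
 */
static void rx_page_drop(struct rx_page_cache *c, void *page)
{
	if (c->count < CACHE_SLOTS) {
		c->slot[c->count++] = page;
		c->recycled++;
		return;
	}
	free(page);	/* cache full: give the page back */
}

/* Release everything still held in the cache. */
static void rx_page_cache_destroy(struct rx_page_cache *c)
{
	while (c->count > 0)
		free(c->slot[--c->count]);
}

#ifdef RECYCLE_DEMO_MAIN
int main(void)
{
	struct rx_page_cache c = {0};
	int i;

	/* Simulate a stream of packets that are all dropped early. */
	for (i = 0; i < 1000; i++) {
		void *page = rx_page_get(&c);

		if (!page)
			return 1;
		rx_page_drop(&c, page);
	}
	printf("slow_allocs=%lu recycled=%lu\n", c.slow_allocs, c.recycled);
	rx_page_cache_destroy(&c);
	return 0;
}
#endif
```

In this model a drop-everything workload only reaches the allocator
once, which is the effect a driver-local recycle would be aiming for.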


[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer