Message-ID: <CALDO+Sb=8yTdEofBB5Nav-Ea+T-bzqm6eM6_1LLb46etMz+ULA@mail.gmail.com>
Date: Tue, 27 Mar 2018 17:06:50 -0700
From: William Tu <u9012063@...il.com>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: Björn Töpel <bjorn.topel@...il.com>,
magnus.karlsson@...el.com,
Alexander Duyck <alexander.h.duyck@...el.com>,
Alexander Duyck <alexander.duyck@...il.com>,
John Fastabend <john.fastabend@...il.com>,
Alexei Starovoitov <ast@...com>,
willemdebruijn.kernel@...il.com,
Daniel Borkmann <daniel@...earbox.net>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
Björn Töpel <bjorn.topel@...el.com>,
michael.lundkvist@...csson.com, jesse.brandeburg@...el.com,
anjali.singhai@...el.com, jeffrey.b.shaw@...el.com,
ferruh.yigit@...el.com, qi.z.zhang@...el.com
Subject: Re: [RFC PATCH 00/24] Introducing AF_XDP support
On Tue, Mar 27, 2018 at 2:37 AM, Jesper Dangaard Brouer
<brouer@...hat.com> wrote:
> On Mon, 26 Mar 2018 14:58:02 -0700
> William Tu <u9012063@...il.com> wrote:
>
>> > Again high count for NMI ?!?
>> >
>> > Maybe you just forgot to tell perf that you want it to decode the
>> > bpf_prog correctly?
>> >
>> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
>> >
>> > Enable via:
>> > $ sysctl net/core/bpf_jit_kallsyms=1
>> >
>> > And use perf report (while BPF is STILL LOADED):
>> >
>> > $ perf report --kallsyms=/proc/kallsyms
>> >
>> > E.g. for emailing this you can use this command:
>> >
>> > $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
>> >
>>
>> Thanks, I followed the steps; here is the result for l2fwd:
>> # Total Lost Samples: 119
>> #
>> # Samples: 2K of event 'cycles:ppp'
>> # Event count (approx.): 25675705627
>> #
>> # Overhead CPU Command Shared Object Symbol
>> # ........ ... ....... .................. ..................................
>> #
>> 10.48% 013 xdpsock xdpsock [.] main
>> 9.77% 013 xdpsock [kernel.vmlinux] [k] clflush_cache_range
>> 8.45% 013 xdpsock [kernel.vmlinux] [k] nmi
>> 8.07% 013 xdpsock [kernel.vmlinux] [k] xsk_sendmsg
>> 7.81% 013 xdpsock [kernel.vmlinux] [k] __domain_mapping
>> 4.95% 013 xdpsock [kernel.vmlinux] [k] ixgbe_xmit_frame_ring
>> 4.66% 013 xdpsock [kernel.vmlinux] [k] skb_store_bits
>> 4.39% 013 xdpsock [kernel.vmlinux] [k] syscall_return_via_sysret
>> 3.93% 013 xdpsock [kernel.vmlinux] [k] pfn_to_dma_pte
>> 2.62% 013 xdpsock [kernel.vmlinux] [k] __intel_map_single
>> 2.53% 013 xdpsock [kernel.vmlinux] [k] __alloc_skb
>> 2.36% 013 xdpsock [kernel.vmlinux] [k] iommu_no_mapping
>> 2.21% 013 xdpsock [kernel.vmlinux] [k] alloc_skb_with_frags
>> 2.07% 013 xdpsock [kernel.vmlinux] [k] skb_set_owner_w
>> 1.98% 013 xdpsock [kernel.vmlinux] [k] __kmalloc_node_track_caller
>> 1.94% 013 xdpsock [kernel.vmlinux] [k] ksize
>> 1.84% 013 xdpsock [kernel.vmlinux] [k] validate_xmit_skb_list
>> 1.62% 013 xdpsock [kernel.vmlinux] [k] kmem_cache_alloc_node
>> 1.48% 013 xdpsock [kernel.vmlinux] [k] __kmalloc_reserve.isra.37
>> 1.21% 013 xdpsock xdpsock [.] xq_enq
>> 1.08% 013 xdpsock [kernel.vmlinux] [k] intel_alloc_iova
>>
>
> You did use net/core/bpf_jit_kallsyms=1 and the correct perf command for decoding
> bpf_prog, so the perf top #3 'nmi' is likely a real NMI call... which looks wrong.
>
Thanks, you're right. Let me dig more into this NMI behavior.
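(One thing I will check first, just an idea on my side rather than something
from this thread, is whether those samples come from the NMI watchdog or from
perf's own PMU interrupt, e.g.:

$ cat /proc/sys/kernel/nmi_watchdog
$ sudo perf stat -C 6 -e nmi:nmi_handler -- sleep 10

with -C pointing at whichever CPU xdpsock is pinned to, and then comparing the
handler count against the number of perf samples.)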
>
>> And l2fwd under "perf stat" looks OK to me. There are few context
>> switches, the CPU is fully utilized, and 1.17 insn per cycle seems OK.
>>
>> Performance counter stats for 'CPU(s) 6':
>> 10000.787420 cpu-clock (msec) # 1.000 CPUs utilized
>> 24 context-switches # 0.002 K/sec
>> 0 cpu-migrations # 0.000 K/sec
>> 0 page-faults # 0.000 K/sec
>> 22,361,333,647 cycles # 2.236 GHz
>> 13,458,442,838 stalled-cycles-frontend # 60.19% frontend cycles idle
>> 26,251,003,067 instructions # 1.17 insn per cycle
>> # 0.51 stalled cycles per insn
>> 4,938,921,868 branches # 493.853 M/sec
>> 7,591,739 branch-misses # 0.15% of all branches
>> 10.000835769 seconds time elapsed
>
> This perf stat also indicates something is wrong.
>
> The 1.17 insn per cycle is NOT okay, it is too low (compared to what
> I usually see, e.g. 2.36 insn per cycle).
>
> It clearly says you have 'stalled-cycles-frontend' and '60.19% frontend
> cycles idle'. This means your CPU has an issue/bottleneck fetching
> instructions, as explained by Andi Kleen here [1].
>
> [1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
>
Thanks for the link!
It's definitely weird that my frontend-cycle (fetch and decode) stall
rate is so high.
I assume the xdpsock code is small and should fit entirely in the icache.
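(The icache side should be easy to check directly, with something along these
lines, where -C is whichever CPU xdpsock is pinned to:

$ sudo perf stat -C 6 -e instructions,L1-icache-load-misses -- sleep 10
)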
However, doing another perf stat on xdpsock l2fwd shows
    13,720,109,581      stalled-cycles-frontend   #   60.01% frontend cycles idle      (23.82%)
   <not supported>      stalled-cycles-backend
         7,994,837      branch-misses             #    0.16% of all branches           (23.80%)
       996,874,424      bus-cycles                #   99.679 M/sec                     (23.80%)
    18,942,220,445      ref-cycles                # 1894.067 M/sec                     (28.56%)
       100,983,226      LLC-loads                 #   10.097 M/sec                     (23.80%)
         4,897,089      LLC-load-misses           #    4.85% of all LL-cache hits      (23.80%)
        66,659,889      LLC-stores                #    6.665 M/sec                     (9.52%)
             8,373      LLC-store-misses          #    0.837 K/sec                     (9.52%)
       158,178,410      LLC-prefetches            #   15.817 M/sec                     (9.52%)
         3,011,180      LLC-prefetch-misses       #    0.301 M/sec                     (9.52%)
     8,190,383,109      dTLB-loads                #  818.971 M/sec                     (9.52%)
        20,432,204      dTLB-load-misses          #    0.25% of all dTLB cache hits    (9.52%)
     3,729,504,674      dTLB-stores               #  372.920 M/sec                     (9.52%)
           992,231      dTLB-store-misses         #    0.099 M/sec                     (9.52%)
   <not supported>      dTLB-prefetches
   <not supported>      dTLB-prefetch-misses
            11,619      iTLB-loads                #    0.001 M/sec                     (9.52%)
         1,874,756      iTLB-load-misses          # 16135.26% of all iTLB cache hits   (14.28%)
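(For reference, a counter set like the above comes out of perf stat's detailed
mode, roughly something like:

$ sudo perf stat -C 6 -ddd -- sleep 10

with -C again being the CPU xdpsock runs on.)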
I have super high iTLB-load-misses. This is probably the cause of the high
frontend stalls.
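(Rough math, reusing the ~26.3G instruction count from the earlier run:
1,874,756 iTLB-load-misses works out to only about 0.07 misses per 1000
instructions, but each of those misses, if I read the event right, means a
page-table walk before the frontend can fetch again; the 16135% figure itself
looks strange mainly because the iTLB-loads event only counted 11,619 events,
far fewer than the misses, so that ratio is not very meaningful.)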
Do you know of any way to improve the iTLB hit rate?
Thanks
William