Message-ID: <20180328100136.6202448b@redhat.com>
Date: Wed, 28 Mar 2018 10:01:36 +0200
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: William Tu <u9012063@...il.com>
Cc: Björn Töpel <bjorn.topel@...il.com>,
magnus.karlsson@...el.com,
Alexander Duyck <alexander.h.duyck@...el.com>,
Alexander Duyck <alexander.duyck@...il.com>,
John Fastabend <john.fastabend@...il.com>,
Alexei Starovoitov <ast@...com>,
willemdebruijn.kernel@...il.com,
Daniel Borkmann <daniel@...earbox.net>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
Björn Töpel <bjorn.topel@...el.com>, michael.lundkvist@...csson.com,
jesse.brandeburg@...el.com, anjali.singhai@...el.com,
jeffrey.b.shaw@...el.com, ferruh.yigit@...el.com,
qi.z.zhang@...el.com, brouer@...hat.com, dendibakh@...il.com
Subject: Re: [RFC PATCH 00/24] Introducing AF_XDP support
On Tue, 27 Mar 2018 17:06:50 -0700
William Tu <u9012063@...il.com> wrote:
> On Tue, Mar 27, 2018 at 2:37 AM, Jesper Dangaard Brouer
> <brouer@...hat.com> wrote:
> > On Mon, 26 Mar 2018 14:58:02 -0700
> > William Tu <u9012063@...il.com> wrote:
> >
> >> > Again high count for NMI ?!?
> >> >
> >> > Maybe you just forgot to tell perf that you want it to decode the
> >> > bpf_prog correctly?
> >> >
> >> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
> >> >
> >> > Enable via:
> >> > $ sysctl net/core/bpf_jit_kallsyms=1
> >> >
> >> > And use perf report (while BPF is STILL LOADED):
> >> >
> >> > $ perf report --kallsyms=/proc/kallsyms
> >> >
> >> > E.g. for emailing this you can use this command:
> >> >
> >> > $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
> >> >
> >>
> >> Thanks, I followed the steps; here is the result for l2fwd:
> >> # Total Lost Samples: 119
> >> #
> >> # Samples: 2K of event 'cycles:ppp'
> >> # Event count (approx.): 25675705627
> >> #
> >> # Overhead CPU Command Shared Object Symbol
> >> # ........ ... ....... .................. ..................................
> >> #
> >> 10.48% 013 xdpsock xdpsock [.] main
> >> 9.77% 013 xdpsock [kernel.vmlinux] [k] clflush_cache_range
> >> 8.45% 013 xdpsock [kernel.vmlinux] [k] nmi
> >> 8.07% 013 xdpsock [kernel.vmlinux] [k] xsk_sendmsg
> >> 7.81% 013 xdpsock [kernel.vmlinux] [k] __domain_mapping
> >> 4.95% 013 xdpsock [kernel.vmlinux] [k] ixgbe_xmit_frame_ring
> >> 4.66% 013 xdpsock [kernel.vmlinux] [k] skb_store_bits
> >> 4.39% 013 xdpsock [kernel.vmlinux] [k] syscall_return_via_sysret
> >> 3.93% 013 xdpsock [kernel.vmlinux] [k] pfn_to_dma_pte
> >> 2.62% 013 xdpsock [kernel.vmlinux] [k] __intel_map_single
> >> 2.53% 013 xdpsock [kernel.vmlinux] [k] __alloc_skb
> >> 2.36% 013 xdpsock [kernel.vmlinux] [k] iommu_no_mapping
> >> 2.21% 013 xdpsock [kernel.vmlinux] [k] alloc_skb_with_frags
> >> 2.07% 013 xdpsock [kernel.vmlinux] [k] skb_set_owner_w
> >> 1.98% 013 xdpsock [kernel.vmlinux] [k] __kmalloc_node_track_caller
> >> 1.94% 013 xdpsock [kernel.vmlinux] [k] ksize
> >> 1.84% 013 xdpsock [kernel.vmlinux] [k] validate_xmit_skb_list
> >> 1.62% 013 xdpsock [kernel.vmlinux] [k] kmem_cache_alloc_node
> >> 1.48% 013 xdpsock [kernel.vmlinux] [k] __kmalloc_reserve.isra.37
> >> 1.21% 013 xdpsock xdpsock [.] xq_enq
> >> 1.08% 013 xdpsock [kernel.vmlinux] [k] intel_alloc_iova
> >>
> >
> > You did use net/core/bpf_jit_kallsyms=1 and the correct perf commands to
> > decode bpf_prog, so the perf top#3 'nmi' is likely a real NMI call...
> > which looks wrong.
> >
> Thanks, you're right. Let me dig more on this NMI behavior.
>
> >
> >> And l2fwd under "perf stat" looks OK to me. There are few context
> >> switches, the CPU is fully utilized, and 1.17 insn per cycle seems OK.
> >>
> >> Performance counter stats for 'CPU(s) 6':
> >> 10000.787420 cpu-clock (msec) # 1.000 CPUs utilized
> >> 24 context-switches # 0.002 K/sec
> >> 0 cpu-migrations # 0.000 K/sec
> >> 0 page-faults # 0.000 K/sec
> >> 22,361,333,647 cycles # 2.236 GHz
> >> 13,458,442,838 stalled-cycles-frontend # 60.19% frontend cycles idle
> >> 26,251,003,067 instructions # 1.17 insn per cycle
> >> # 0.51 stalled cycles per insn
> >> 4,938,921,868 branches # 493.853 M/sec
> >> 7,591,739 branch-misses # 0.15% of all branches
> >> 10.000835769 seconds time elapsed
> >
> > This perf stat also indicates something is wrong.
> >
> > The 1.17 insn per cycle is NOT okay, it is too low (compared to what I
> > usually see, e.g. 2.36 insn per cycle).
> >
> > It clearly says you have 'stalled-cycles-frontend' and '60.19% frontend
> > cycles idle'. This means your CPU has a bottleneck fetching
> > instructions, as explained by Andi Kleen here [1]:
> >
> > [1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
> >
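For reference, the ratios perf prints can be recomputed directly from the
counters in the perf stat output quoted above (a quick shell sanity check;
the counter values are taken verbatim from that run on CPU 6):

```shell
# Counters from the quoted 'perf stat' run (10 s on CPU 6):
cycles=22361333647
instructions=26251003067
stalled_frontend=13458442838
# Integer math, scaled by 100: 117 means 1.17 insn per cycle
echo "IPC x100          = $(( instructions * 100 / cycles ))"        # -> 117
echo "frontend idle (%) = $(( stalled_frontend * 100 / cycles ))"   # -> 60
```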
> thanks for the link!
>
> It's definitely weird that my frontend-cycle (fetch and decode) stall
> count is so high.
>
> I assume this xdpsock code is small and should all fit into the icache.
> However, doing another perf stat on xdpsock l2fwd shows
>
> 13,720,109,581 stalled-cycles-frontend # 60.01% frontend cycles idle (23.82%)
> <not supported> stalled-cycles-backend
> 7,994,837 branch-misses # 0.16% of all branches (23.80%)
> 996,874,424 bus-cycles # 99.679 M/sec (23.80%)
> 18,942,220,445 ref-cycles # 1894.067 M/sec (28.56%)
> 100,983,226 LLC-loads # 10.097 M/sec (23.80%)
> 4,897,089 LLC-load-misses # 4.85% of all LL-cache hits (23.80%)
> 66,659,889 LLC-stores # 6.665 M/sec (9.52%)
> 8,373 LLC-store-misses # 0.837 K/sec (9.52%)
> 158,178,410 LLC-prefetches # 15.817 M/sec (9.52%)
> 3,011,180 LLC-prefetch-misses # 0.301 M/sec (9.52%)
> 8,190,383,109 dTLB-loads # 818.971 M/sec (9.52%)
> 20,432,204 dTLB-load-misses # 0.25% of all dTLB cache hits (9.52%)
> 3,729,504,674 dTLB-stores # 372.920 M/sec (9.52%)
> 992,231 dTLB-store-misses # 0.099 M/sec (9.52%)
> <not supported> dTLB-prefetches
> <not supported> dTLB-prefetch-misses
> 11,619 iTLB-loads # 0.001 M/sec (9.52%)
> 1,874,756 iTLB-load-misses # 16135.26% of all iTLB cache hits (14.28%)
What was the sample period for this perf stat?
> I have super high iTLB-load-misses. This is probably the cause of the
> high frontend stalls.
It looks very strange that your iTLB-loads are 11,619, while the
iTLB-load-misses are much, much higher at 1,874,756.
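A quick sanity check on those two counters (a shell sketch; the grouped
perf invocation in the comment is a suggestion, with CPU 6 assumed from
the thread):

```shell
# Counters as reported above: misses exceed loads by ~160x, which
# normally points at event multiplexing or miscounting rather than
# a real miss rate (misses cannot truly exceed loads).
itlb_loads=11619
itlb_misses=1874756
echo "miss/load ratio (%) = $(( itlb_misses * 100 / itlb_loads ))"   # -> 16135
# Re-measuring both events in one group avoids multiplexing skew:
#   perf stat -e '{iTLB-loads,iTLB-load-misses}' -C 6 sleep 10
```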
> Do you know any way to improve iTLB hit rate?
The xdpsock code should be small enough to fit in the iCache, but it
might be laid out in memory in an unfortunate way. You could try
rearranging the C code (look at the objdump alignments).
If you want to know the details about code alignment issues, and how to
troubleshoot them, you should read this VERY excellent blog post by
Denis Bakhvalov:

https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer