Message-ID: <CALDO+Sb=8yTdEofBB5Nav-Ea+T-bzqm6eM6_1LLb46etMz+ULA@mail.gmail.com>
Date: Tue, 27 Mar 2018 17:06:50 -0700
From: William Tu <u9012063@...il.com>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: Björn Töpel <bjorn.topel@...il.com>,
magnus.karlsson@...el.com,
Alexander Duyck <alexander.h.duyck@...el.com>,
Alexander Duyck <alexander.duyck@...il.com>,
John Fastabend <john.fastabend@...il.com>,
Alexei Starovoitov <ast@...com>,
willemdebruijn.kernel@...il.com,
Daniel Borkmann <daniel@...earbox.net>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
Björn Töpel <bjorn.topel@...el.com>,
michael.lundkvist@...csson.com, jesse.brandeburg@...el.com,
anjali.singhai@...el.com, jeffrey.b.shaw@...el.com,
ferruh.yigit@...el.com, qi.z.zhang@...el.com
Subject: Re: [RFC PATCH 00/24] Introducing AF_XDP support
On Tue, Mar 27, 2018 at 2:37 AM, Jesper Dangaard Brouer
<brouer@...hat.com> wrote:
> On Mon, 26 Mar 2018 14:58:02 -0700
> William Tu <u9012063@...il.com> wrote:
>
>> > Again high count for NMI ?!?
>> >
>> > Maybe you just forgot to tell perf that you want it to decode the
>> > bpf_prog correctly?
>> >
>> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
>> >
>> > Enable via:
>> > $ sysctl net/core/bpf_jit_kallsyms=1
>> >
>> > And use perf report (while BPF is STILL LOADED):
>> >
>> > $ perf report --kallsyms=/proc/kallsyms
>> >
>> > E.g. for emailing this you can use this command:
>> >
>> > $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
>> >
>>
>> Thanks, I followed the steps; here is the result for l2fwd:
>> # Total Lost Samples: 119
>> #
>> # Samples: 2K of event 'cycles:ppp'
>> # Event count (approx.): 25675705627
>> #
>> # Overhead CPU Command Shared Object Symbol
>> # ........ ... ....... .................. ..................................
>> #
>> 10.48% 013 xdpsock xdpsock [.] main
>> 9.77% 013 xdpsock [kernel.vmlinux] [k] clflush_cache_range
>> 8.45% 013 xdpsock [kernel.vmlinux] [k] nmi
>> 8.07% 013 xdpsock [kernel.vmlinux] [k] xsk_sendmsg
>> 7.81% 013 xdpsock [kernel.vmlinux] [k] __domain_mapping
>> 4.95% 013 xdpsock [kernel.vmlinux] [k] ixgbe_xmit_frame_ring
>> 4.66% 013 xdpsock [kernel.vmlinux] [k] skb_store_bits
>> 4.39% 013 xdpsock [kernel.vmlinux] [k] syscall_return_via_sysret
>> 3.93% 013 xdpsock [kernel.vmlinux] [k] pfn_to_dma_pte
>> 2.62% 013 xdpsock [kernel.vmlinux] [k] __intel_map_single
>> 2.53% 013 xdpsock [kernel.vmlinux] [k] __alloc_skb
>> 2.36% 013 xdpsock [kernel.vmlinux] [k] iommu_no_mapping
>> 2.21% 013 xdpsock [kernel.vmlinux] [k] alloc_skb_with_frags
>> 2.07% 013 xdpsock [kernel.vmlinux] [k] skb_set_owner_w
>> 1.98% 013 xdpsock [kernel.vmlinux] [k] __kmalloc_node_track_caller
>> 1.94% 013 xdpsock [kernel.vmlinux] [k] ksize
>> 1.84% 013 xdpsock [kernel.vmlinux] [k] validate_xmit_skb_list
>> 1.62% 013 xdpsock [kernel.vmlinux] [k] kmem_cache_alloc_node
>> 1.48% 013 xdpsock [kernel.vmlinux] [k] __kmalloc_reserve.isra.37
>> 1.21% 013 xdpsock xdpsock [.] xq_enq
>> 1.08% 013 xdpsock [kernel.vmlinux] [k] intel_alloc_iova
>>
>
> You did use net/core/bpf_jit_kallsyms=1 and the correct perf command for decoding
> bpf_prog, so the perf top #3 'nmi' is likely a real NMI call... which looks wrong.
>
Thanks, you're right. Let me dig more into this NMI behavior.
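(One thing I will check first, just an idea on my side rather than something
from this thread, is whether those samples come from the NMI watchdog or from
perf's own PMU interrupt, e.g.:

$ cat /proc/sys/kernel/nmi_watchdog
$ sudo perf stat -C 6 -e nmi:nmi_handler -- sleep 10

with -C pointing at whichever CPU xdpsock is pinned to, and then comparing the
handler count against the number of perf samples.)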
>
>> And l2fwd under "perf stat" looks OK to me. There are few context
>> switches, the CPU is fully utilized, and 1.17 insn per cycle seems OK.
>>
>> Performance counter stats for 'CPU(s) 6':
>> 10000.787420 cpu-clock (msec) # 1.000 CPUs utilized
>> 24 context-switches # 0.002 K/sec
>> 0 cpu-migrations # 0.000 K/sec
>> 0 page-faults # 0.000 K/sec
>> 22,361,333,647 cycles # 2.236 GHz
>> 13,458,442,838 stalled-cycles-frontend # 60.19% frontend cycles idle
>> 26,251,003,067 instructions # 1.17 insn per cycle
>> # 0.51 stalled cycles per insn
>> 4,938,921,868 branches # 493.853 M/sec
>> 7,591,739 branch-misses # 0.15% of all branches
>> 10.000835769 seconds time elapsed
>
> This perf stat also indicates something is wrong.
>
> The 1.17 insn per cycle is NOT okay, it is too low (compared to what
> I usually see, e.g. 2.36 insn per cycle).
>
> It clearly says you have 'stalled-cycles-frontend' and '60.19% frontend
> cycles idle'. This means your CPU has an issue/bottleneck fetching
> instructions, as explained by Andi Kleen here [1].
>
> [1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
>
Thanks for the link!
It's definitely weird that my frontend-cycle (fetch and decode) stall
rate is so high.
I assume the xdpsock code is small and should fit entirely in the icache.
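(The icache side should be easy to check directly, with something along these
lines, where -C is whichever CPU xdpsock is pinned to:

$ sudo perf stat -C 6 -e instructions,L1-icache-load-misses -- sleep 10
)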
However, doing another perf stat on xdpsock l2fwd shows
    13,720,109,581      stalled-cycles-frontend   #   60.01% frontend cycles idle      (23.82%)
   <not supported>      stalled-cycles-backend
         7,994,837      branch-misses             #    0.16% of all branches           (23.80%)
       996,874,424      bus-cycles                #   99.679 M/sec                     (23.80%)
    18,942,220,445      ref-cycles                # 1894.067 M/sec                     (28.56%)
       100,983,226      LLC-loads                 #   10.097 M/sec                     (23.80%)
         4,897,089      LLC-load-misses           #    4.85% of all LL-cache hits      (23.80%)
        66,659,889      LLC-stores                #    6.665 M/sec                     (9.52%)
             8,373      LLC-store-misses          #    0.837 K/sec                     (9.52%)
       158,178,410      LLC-prefetches            #   15.817 M/sec                     (9.52%)
         3,011,180      LLC-prefetch-misses       #    0.301 M/sec                     (9.52%)
     8,190,383,109      dTLB-loads                #  818.971 M/sec                     (9.52%)
        20,432,204      dTLB-load-misses          #    0.25% of all dTLB cache hits    (9.52%)
     3,729,504,674      dTLB-stores               #  372.920 M/sec                     (9.52%)
           992,231      dTLB-store-misses         #    0.099 M/sec                     (9.52%)
   <not supported>      dTLB-prefetches
   <not supported>      dTLB-prefetch-misses
            11,619      iTLB-loads                #    0.001 M/sec                     (9.52%)
         1,874,756      iTLB-load-misses          # 16135.26% of all iTLB cache hits   (14.28%)
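(For reference, a counter set like the above comes out of perf stat's detailed
mode, roughly something like:

$ sudo perf stat -C 6 -ddd -- sleep 10

with -C again being the CPU xdpsock runs on.)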
I have super high iTLB-load-misses. This is probably the cause of the high
frontend stalls.
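(Rough math, reusing the ~26.3G instruction count from the earlier run:
1,874,756 iTLB-load-misses works out to only about 0.07 misses per 1000
instructions, but each of those misses, if I read the event right, means a
page-table walk before the frontend can fetch again; the 16135% figure itself
looks strange mainly because the iTLB-loads event only counted 11,619 events,
far fewer than the misses, so that ratio is not very meaningful.)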
Do you know of any way to improve the iTLB hit rate?
Thanks
William