Message-ID: <YfO1q52G7GKl+P40@myrica>
Date:   Fri, 28 Jan 2022 09:21:47 +0000
From:   Jean-Philippe Brucker <jean-philippe@...aro.org>
To:     Yunsheng Lin <linyunsheng@...wei.com>
Cc:     linuxarm@...neuler.org, ilias.apalodimas@...aro.org,
        salil.mehta@...wei.com, netdev@...r.kernel.org,
        moyufeng@...wei.com, alexanderduyck@...com, brouer@...hat.com,
        kuba@...nel.org
Subject: Re: [PATCH net-next v2 4/4] net: hns3: support skb's frag page
 recycling based on page pool

On Fri, Jan 28, 2022 at 12:00:35PM +0800, Yunsheng Lin wrote:
> On 2022/1/26 22:30, Jean-Philippe Brucker wrote:
> > Hi,
> > 
> > On Fri, Aug 06, 2021 at 10:46:22AM +0800, Yunsheng Lin wrote:
> >> This patch adds skb's frag page recycling support based on
> >> the frag page support in page pool.
> >>
> >> The performance improves by 10~20% for a single-thread iperf TCP
> >> flow with the IOMMU disabled, when the iperf server and irq/NAPI
> >> run on different CPUs.
> >>
> >> The performance improves by about 135% (14Gbit to 33Gbit) for a
> >> single-thread iperf TCP flow when the IOMMU is in strict mode and
> >> the iperf server shares the same CPU with irq/NAPI.
> >>
> >> Signed-off-by: Yunsheng Lin <linyunsheng@...wei.com>
> > 
> > This commit is giving me some trouble, but I haven't managed to pinpoint
> > the exact problem.
> 
> Hi,
> Thanks for reporting the problem.
> 
> We also hit a similar problem during internal CI testing, but we were
> not able to trigger it manually, so we have not found the root cause yet.
> 
> Is your test case more likely to trigger the problem?

The problem shows up shortly after boot for me, when I'm not doing
anything special, just ssh'ing into the server. I did manage to trigger it
faster with a "netperf -t TCP_MAERTS" job. Maybe I have something enabled
in my config that makes it easier to trigger?  I've attached the .config
to this reply; I think it corresponds pretty closely to Debian's config.

> > Symptoms are:
> > * A page gets unmapped twice from page_pool_release_page(). The second
> >   time, dma-iommu.c warns about the empty PTE [1]
> > * The rx ring still accesses the page after the first unmap, causing SMMU
> >   translation faults [2]
> > * That leads to APEI errors and reset of the device, at which time
> >   page_pool_inflight() complains about "Negative(-x) inflight packet-pages".
> > 
> > After some debugging, it looks like the page gets released three times
> > instead of two:
> > 
> > (1) first in page_pool_drain_frag():
> > 
> >         page_pool_alloc_frag+0x1fc/0x248
> >         hns3_alloc_and_map_buffer+0x30/0x170
> >         hns3_nic_alloc_rx_buffers+0x9c/0x170
> >         hns3_clean_rx_ring+0x854/0x950
> >         hns3_nic_common_poll+0xa0/0x218
> >         __napi_poll+0x38/0x1b0
> >         net_rx_action+0xe8/0x248
> >         __do_softirq+0x120/0x284
> > 
> > (2) Then later by page_pool_return_skb_page(), which (I guess) unmaps the
> >     page:
> > 
> >         page_pool_put_page+0x214/0x308
> >         page_pool_return_skb_page+0x48/0x60
> >         skb_release_data+0x168/0x188
> >         skb_release_all+0x28/0x38
> >         kfree_skb+0x30/0x90
> >         packet_rcv+0x4c/0x410
> >         __netif_receive_skb_list_core+0x1f4/0x218
> >         netif_receive_skb_list_internal+0x18c/0x2a8
> > 
> > (3) And finally, soon after, by clean_rx_ring() which causes pp_frag_count
> >     underflow (seen after removing the optimization in
> >     page_pool_atomic_sub_frag_count_return):
> > 
> >         page_pool_put_page+0x2a0/0x308
> >           page_pool_put_full_page
> >           hns3_alloc_skb
> >           hns3_handle_rx_bd
> >         hns3_clean_rx_ring+0x744/0x950
> >         hns3_nic_common_poll+0xa0/0x218
> >         __napi_poll+0x38/0x1b0
> >         net_rx_action+0xe8/0x248
> > 
> > So I'm guessing (2) happens too early while the RX ring is still using the
> > page, but I don't know more. I'd be happy to add more debug and to test
> 
> If the reference count or pp_frag_count of the page is manipulated
> correctly, I think steps 2 and 3 have no dependency on each other.
> 
> > fixes if you have any suggestions.
> 
> My initial thinking is to check whether the reference count or pp_frag_count
> of the page is manipulated correctly.

It looks like pp_frag_count is dropped one time too many: after (1),
pp_frag_count holds only 1 reference, so (2) drops it to 0 and (3)
underflows. To confirm this, I turned page_pool_atomic_sub_frag_count_return()
into a plain "atomic_long_sub_return(nr, &page->pp_frag_count)", since the
atomic_long_read() fast path normally hides the underflow. I wasn't
entirely sure whether this is expected behavior, though.

Thanks,
Jean

> 
> Perhaps the newly added reference-count tracking infrastructure could help?
> I will look into using it for the above problem.
> 
> > 
> > Thanks,
> > Jean
> > 
> > 
> > [1] ------------[ cut here ]------------
> >      WARNING: CPU: 71 PID: 0 at drivers/iommu/dma-iommu.c:848 iommu_dma_unmap_page+0xbc/0xd8
> >      Modules linked in: fuse overlay ipmi_si hisi_hpre hisi_zip ecdh_generic hisi_trng_v2 ecc ipmi_d>
> >      CPU: 71 PID: 0 Comm: swapper/71 Not tainted 5.16.0-g3813c61fbaad #22
> >      Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B133.01 03/25/2021
> >      pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >      pc : iommu_dma_unmap_page+0xbc/0xd8
> >      lr : iommu_dma_unmap_page+0x38/0xd8
> >      sp : ffff800010abb8d0
> >      x29: ffff800010abb8d0 x28: ffff20200ee80000 x27: 0000000000000042
> >      x26: ffff20201a7ed800 x25: ffff20200be7a5c0 x24: 0000000000000002
> >      x23: 0000000000000020 x22: 0000000000001000 x21: 0000000000000000
> >      x20: 0000000000000002 x19: ffff002086b730c8 x18: 0000000000000001
> >      x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> >      x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000008
> >      x11: 000000000000ffff x10: 0000000000000001 x9 : 0000000000000004
> >      x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffff202006274800
> >      x5 : 0000000000000009 x4 : 0000000000000001 x3 : 000000000000001e
> >      x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
> >      Call trace:
> >       iommu_dma_unmap_page+0xbc/0xd8
> >       dma_unmap_page_attrs+0x30/0x1d0
> >       page_pool_release_page+0x40/0x88
> >       page_pool_return_page+0x18/0x80
> >       page_pool_put_page+0x248/0x288
> >       hns3_clean_rx_ring+0x744/0x950
> >       hns3_nic_common_poll+0xa0/0x218
> >       __napi_poll+0x38/0x1b0
> >       net_rx_action+0xe8/0x248
> >       __do_softirq+0x120/0x284
> >       irq_exit_rcu+0xe0/0x100
> >       el1_interrupt+0x3c/0x88
> >       el1h_64_irq_handler+0x18/0x28
> >       el1h_64_irq+0x78/0x7c
> >       arch_cpu_idle+0x18/0x28
> >       default_idle_call+0x20/0x68
> >       do_idle+0x214/0x260
> >       cpu_startup_entry+0x28/0x70
> >       secondary_start_kernel+0x160/0x170
> >       __secondary_switched+0x90/0x94
> >      ---[ end trace 432d1737b4b96ed9 ]---
> > 
> >     (please ignore the kernel version, I can reproduce this with v5.14 and
> >     v5.17-rc1, and bisected to this commit.)
> > 
> > [2] arm-smmu-v3 arm-smmu-v3.6.auto: event 0x10 received:
> >     arm-smmu-v3 arm-smmu-v3.6.auto: 	0x0000bd0000000010
> >     arm-smmu-v3 arm-smmu-v3.6.auto: 	0x000012000000007c
> >     arm-smmu-v3 arm-smmu-v3.6.auto: 	0x00000000ff905800
> >     arm-smmu-v3 arm-smmu-v3.6.auto: 	0x00000000ff905000


View attachment "config" of type "text/plain" (249004 bytes)
