netdev - Re: [PATCH net-next v1 08/11] xdp: tracking page

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190618171951.17128ed8@carbon>
Date:   Tue, 18 Jun 2019 17:19:51 +0200
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Ivan Khoronzhuk <ivan.khoronzhuk@...aro.org>
Cc:     Tariq Toukan <tariqt@...lanox.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Toke Høiland-Jørgensen <toke@...e.dk>,
        "toshiaki.makita1@...il.com" <toshiaki.makita1@...il.com>,
        "grygorii.strashko@...com" <grygorii.strashko@...com>,
        "mcroce@...hat.com" <mcroce@...hat.com>, brouer@...hat.com
Subject: Re: [PATCH net-next v1 08/11] xdp: tracking page_pool resources and
 safe removal


On Tue, 18 Jun 2019 15:54:33 +0300 Ivan Khoronzhuk <ivan.khoronzhuk@...aro.org> wrote:

> On Sun, Jun 16, 2019 at 10:56:25AM +0000, Tariq Toukan wrote:
> >
> >On 6/15/2019 12:33 PM, Ivan Khoronzhuk wrote:  
> >> On Thu, Jun 13, 2019 at 08:28:42PM +0200, Jesper Dangaard Brouer wrote:
[...]
> >>
> >> What would you recommend to do for the following situation:
> >>
> >> Same receive queue is shared between 2 network devices. The receive ring is
> >> filled by pages from page_pool, but you don't know the actual port (ndev)
> >> filling this ring, because a device is recognized only after packet is
> >> received.
> >>
> >> The API is so that xdp rxq is bind to network device, each frame has
> >> reference
> >> on it, so rxq ndev must be static. That means each netdev has it's own rxq
> >> instance even no need in it. Thus, after your changes, page must be
> >> returned to
> >> the pool it was taken from, or released from old pool and recycled in
> >> new one
> >> somehow.
> >>
> >> And that is inconvenience at least. It's hard to move pages between
> >> pools w/o performance penalty. No way to use common pool either,
> >> as unreg_rxq now drops the pool and 2 rxqa can't reference same
> >> pool. 
> >
> >Within the single netdev, separate page_pool instances are anyway
> >created for different RX rings, working under different NAPI's.  
> 
> The circumstances are so that same RX ring is shared between 2
> netdevs... and netdev can be known only after descriptor/packet is
> received. Thus, while filling RX ring, there is no actual device,
> but when packet is received it has to be recycled to appropriate
> net device pool. Before this change there were no difference from
> which pool the page was allocated to fill RX ring, as there were no
> owner. After this change there is owner - netdev page pool.

It not really a dependency added in this patchset.  A page_pool is
strictly bound to a single RX-queue, for performance, as this allow us
a NAPI fast-path return used for early drop (XDP_DROP).

I can see that the API xdp_rxq_info_reg_mem_model() make it possible to
call it on different xdp_rxq_info structs with the same page_pool
pointer.  But it was never intended to be used like that, and I
consider it an API usage violation.  I originally wanted to add the
allocator pointer to xdp_rxq_info_reg() call, but the API was extended
in different versions, so I didn't want to break users.  I've actually
tried hard to catch when drivers use the API wrong, via WARN(), but I
guess you found a loop hole.

Besides, we already have a dependency from the RX-queue to the netdev
in the xdp_rxq_info structure.  E.g. the xdp_rxq_info->dev is sort of
central, and dereferenced by BPF-code to read xdp_md->ingress_ifindex,
and also used by cpumap when creating SKBs.


> For cpsw the dma unmap is common for both netdevs and no difference
> who is freeing the page, but there is difference which pool it's
> freed to.
> 
> So that, while filling RX ring the page is taken from page pool of
> ndev1, but packet is received for ndev2, it has to be later
> returned/recycled to page pool of ndev1, but when xdp buffer is
> handed over to xdp prog the xdp_rxq_info has reference on ndev2 ...
>
> And no way to predict the final ndev before packet is received, so no
> way to choose appropriate page pool as now it becomes page owner.
> 
> So, while RX ring filling, the page/dma recycling is needed but should
> be some way to identify page owner only after receiving packet.
> 
> Roughly speaking, something like:
> 
> pool->pages_state_hold_cnt++;
> 
> outside of page allocation API, after packet is received.

Don't EVER manipulate the internal state outside of page allocation
API.  That kills the purpose of defining any API.

> and free of the counter while allocation (w/o owing the page).

You use-case of two netdev's sharing the same RX-queue sounds dubious,
and very hardware specific.  I'm not sure why we want to bend the APIs
to support this?
 If we had to allow page_pool to be registered twice, via
xdp_rxq_info_reg_mem_model() then I guess we could extend page_pool
with a usage/users reference count, and then only really free the
page_pool when refcnt reach zero.  But it just seems and looks wrong
(in the code) as the hole trick to get performance is to only have one
user.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer