Date: Thu, 14 Dec 2023 12:03:15 -0800
From: Mina Almasry <almasrymina@...gle.com>
To: Pavel Begunkov <asml.silence@...il.com>
Cc: Shailend Chand <shailend@...gle.com>, netdev@...r.kernel.org, 
	linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org, 
	linux-arch@...r.kernel.org, linux-kselftest@...r.kernel.org, 
	bpf@...r.kernel.org, linux-media@...r.kernel.org, 
	dri-devel@...ts.freedesktop.org, "David S. Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	Jonathan Corbet <corbet@....net>, Jeroen de Borst <jeroendb@...gle.com>, 
	Praveen Kaligineedi <pkaligineedi@...gle.com>, Jesper Dangaard Brouer <hawk@...nel.org>, 
	Ilias Apalodimas <ilias.apalodimas@...aro.org>, Arnd Bergmann <arnd@...db.de>, 
	David Ahern <dsahern@...nel.org>, Willem de Bruijn <willemdebruijn.kernel@...il.com>, 
	Shuah Khan <shuah@...nel.org>, Sumit Semwal <sumit.semwal@...aro.org>, 
	Christian König <christian.koenig@....com>, 
	Yunsheng Lin <linyunsheng@...wei.com>, Harshitha Ramamurthy <hramamurthy@...gle.com>, 
	Shakeel Butt <shakeelb@...gle.com>, Willem de Bruijn <willemb@...gle.com>, 
	Kaiyuan Zhang <kaiyuanz@...gle.com>
Subject: Re: [net-next v1 08/16] memory-provider: dmabuf devmem memory provider

On Mon, Dec 11, 2023 at 12:37 PM Pavel Begunkov <asml.silence@...il.com> wrote:
...
> >> If you remove the branch, let it fall into ->release and rely
> >> on refcounting there, then the callback could also fix up
> >> release_cnt or ask pp to do it, like in the patch I linked above
> >>
> >
> > Sadly I don't think this is possible due to the reasons I mention in
> > the commit message of that patch. Prematurely releasing ppiov and not
> > having them be candidates for recycling shows me a 4-5x degradation in
> > performance.
>
> I don't think I follow. The concept is to only recycle a buffer (i.e.
> make it available for allocation) when its refs drop to zero, which is
> IMHO the only way it can work, and IIUC what this patchset is doing.
>
> That's also what I suggest doing, but through a slightly different path.
> Let's say at some moment there are 2 refs (e.g. 1 for an skb and
> 1 for userspace/xarray).
>
> Say it first puts the skb:
>
> napi_pp_put_page()
>    -> page_pool_return_page()
>      -> mp_ops->release_page()
>         -> need_to_free = put_buf()
>            // not last ref, need_to_free==false,
>            // don't recycle, don't increase release_cnt
>
> Then you put the last ref:
>
> page_pool_iov_put_many()
>    -> page_pool_return_page()
>      -> mp_ops->release_page()
>         -> need_to_free = put_buf()
>            // last ref, need_to_free==true,
>            // recycle and release_cnt++
>
> And that last put can even be recycled right into the
> pp / ptr_ring, in which case it doesn't need to touch
> release_cnt. Does it make sense? I don't see where
> 4-5x degradation would come from
>
>
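
For readers following the thread, the two call chains quoted above can be modelled in userspace C roughly as follows. This is only a sketch: struct buf, struct pool and both functions are hypothetical stand-ins, not the kernel's page_pool or ppiov code, and the only thing taken from the quoted text is the rule that release_cnt moves on the final put and never before.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a devmem buffer (ppiov); not the kernel struct. */
struct buf {
	atomic_int refs;
};

/* Hypothetical stand-in for the pool's bookkeeping. */
struct pool {
	unsigned int release_cnt;
};

/* Drop one reference; returns true only when the last reference goes away. */
static bool put_buf(struct buf *b)
{
	int old = atomic_fetch_sub(&b->refs, 1);

	assert(old > 0);	/* putting a buffer with no refs is a bug */
	return old == 1;
}

/* Models the mp_ops->release_page() step from the quoted call chains. */
static bool release_page(struct pool *pp, struct buf *b)
{
	bool need_to_free = put_buf(b);

	/* Not the last ref: don't recycle, don't touch release_cnt. */
	if (need_to_free)
		pp->release_cnt++;
	return need_to_free;
}
```

With a buffer that starts at 2 refs (one for the skb, one for userspace/xarray), the first release_page() call returns false and leaves release_cnt alone; the second returns true and bumps it once.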

Sorry for the late reply, I have been working on this locally.

What you're saying makes sense, and I'm no longer sure why I was
seeing a perf degradation without '[net-next v1 10/16] page_pool:
don't release iov on elevanted refcount'. However, even though what
you're saying is technically correct, AFAIU it's actually semantically
wrong. When a page is released by the page_pool, we should call
page_pool_clear_pp_info() and completely disconnect the page from the
pool. If we call release_page() on a page and then the page pool sees
it again in page_pool_return_page(), I think that is considered a bug.
In fact I think what you're proposing is the result of a bug, because
we don't call a page_pool_clear_pp_info() equivalent on releasing
ppiov.

However, I'm reasonably confident I figured out the right thing to do
here. The page_pool uses page->pp_frag_count for its refcounting
(pp_frag_count is a misnomer; it's being renamed to pp_ref_count in
Liang's series [1]). In this series I used a get_page/put_page
equivalent for refcounting. Once I transitioned to using
pp_[frag|ref]_count for refcounting inside the page_pool, the issue
went away, and I no longer need the patch 'page_pool: don't release
iov on elevanted refcount'.

There is an additional upside: since pages and ppiovs are both being
refcounted using pp_[frag|ref]_count, we get some unified handling for
ppiov and we reduce the checks around ppiov. This should be fixed
properly in the next series.
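
To make the idea concrete, here is a rough userspace sketch of what a single refcount shared by both buffer kinds could look like. Everything in it is an assumption for illustration: struct pp_buf, enum buf_kind, pp_buf_put() and pool_release() are made-up names and not the code in my tree. It also models only the release path; a real pool would normally try to recycle the buffer first.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pool bookkeeping. */
struct pool {
	unsigned int release_cnt;
};

enum buf_kind { BUF_PAGE, BUF_IOV };

/* Hypothetical common header: both kinds share one counter. */
struct pp_buf {
	enum buf_kind kind;
	atomic_long pp_ref_count;
	struct pool *pp;	/* NULL once disconnected from the pool */
};

static void pp_buf_init(struct pp_buf *b, enum buf_kind kind,
			struct pool *pp, long refs)
{
	b->kind = kind;
	atomic_init(&b->pp_ref_count, refs);
	b->pp = pp;
}

/* One put path for pages and iovs; true when the last ref was dropped. */
static bool pp_buf_put(struct pp_buf *b)
{
	long old = atomic_fetch_sub(&b->pp_ref_count, 1);

	assert(old > 0);
	return old == 1;
}

/* Release is only valid at zero refs and disconnects the buffer. */
static void pool_release(struct pp_buf *b)
{
	assert(atomic_load(&b->pp_ref_count) == 0);
	assert(b->pp != NULL);	/* the pool must never see a released buffer again */
	b->pp->release_cnt++;
	b->pp = NULL;		/* stands in for page_pool_clear_pp_info() */
}
```

Note that nothing in the put path branches on kind, which is the reduction in ppiov checks I mean. Clearing pp on release is what turns a second release of the same buffer into an assertion failure instead of silent double accounting.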

I still need to do some work (~1 week) before I upload the next
version as there is a new requirement from MM that we transition to a
new type and not re-use page*, but I uploaded my changes to GitHub with
the refcounting issues resolved in case they're useful to you. Sorry
for the churn:

https://github.com/mina/linux/commits/tcpdevmem-v1.5/

[1] https://patchwork.kernel.org/project/netdevbpf/list/?series=809049&state=*

-- 
Thanks,
Mina
