lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 19 Dec 2023 23:55:01 +0000
From: Pavel Begunkov <asml.silence@...il.com>
To: Mina Almasry <almasrymina@...gle.com>
Cc: Shailend Chand <shailend@...gle.com>, netdev@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
 linux-arch@...r.kernel.org, linux-kselftest@...r.kernel.org,
 bpf@...r.kernel.org, linux-media@...r.kernel.org,
 dri-devel@...ts.freedesktop.org, "David S. Miller" <davem@...emloft.net>,
 Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>, Jonathan Corbet <corbet@....net>,
 Jeroen de Borst <jeroendb@...gle.com>,
 Praveen Kaligineedi <pkaligineedi@...gle.com>,
 Jesper Dangaard Brouer <hawk@...nel.org>,
 Ilias Apalodimas <ilias.apalodimas@...aro.org>, Arnd Bergmann
 <arnd@...db.de>, David Ahern <dsahern@...nel.org>,
 Willem de Bruijn <willemdebruijn.kernel@...il.com>,
 Shuah Khan <shuah@...nel.org>, Sumit Semwal <sumit.semwal@...aro.org>,
 Christian König <christian.koenig@....com>,
 Yunsheng Lin <linyunsheng@...wei.com>,
 Harshitha Ramamurthy <hramamurthy@...gle.com>,
 Shakeel Butt <shakeelb@...gle.com>, Willem de Bruijn <willemb@...gle.com>,
 Kaiyuan Zhang <kaiyuanz@...gle.com>
Subject: Re: [net-next v1 08/16] memory-provider: dmabuf devmem memory
 provider

On 12/14/23 20:03, Mina Almasry wrote:
> On Mon, Dec 11, 2023 at 12:37 PM Pavel Begunkov <asml.silence@...il.com> wrote:
> ...
>>>> If you remove the branch, let it fall into ->release and rely
>>>> on refcounting there, then the callback could also fix up
>>>> release_cnt or ask pp to do it, like in the patch I linked above
>>>>
>>>
>>> Sadly I don't think this is possible due to the reasons I mention in
>>> the commit message of that patch. Prematurely releasing ppiov and not
>>> having them be candidates for recycling shows me a 4-5x degradation in
>>> performance.
>>
>> I don't think I follow. The concept is to only recycle a buffer (i.e.
>> make it available for allocation) when its refs drop to zero, which is
>> IMHO the only way it can work, and IIUC what this patchset is doing.
>>
>> That's also I suggest to do, but through a slightly different path.
>> Let's say at some moment there are 2 refs (e.g. 1 for an skb and
>> 1 for userspace/xarray).
>>
>> Say it first puts the skb:
>>
>> napi_pp_put_page()
>>     -> page_pool_return_page()
>>       -> mp_ops->release_page()
>>          -> need_to_free = put_buf()
>>             // not last ref, need_to_free==false,
>>             // don't recycle, don't increase release_cnt
>>
>> Then you put the last ref:
>>
>> page_pool_iov_put_many()
>>     -> page_pool_return_page()
>>       -> mp_ops->release_page()
>>          -> need_to_free = put_buf()
>>             // last ref, need_to_free==true,
>>             // recycle and release_cnt++
>>
>> And that last put can even be recycled right into the
>> pp / ptr_ring, in which case it doesn't need to touch
>> release_cnt. Does it make sense? I don't see where
>> 4-5x degradation would come from
>>
>>
> 
> Sorry for the late reply, I have been working on this locally.
> 
> What you're saying makes sense, and I'm no longer sure why I was
> seeing a perf degradation without '[net-next v1 10/16] page_pool:
> don't release iov on elevanted refcount'. However, even though what
> you're saying is technically correct, AFAIU it's actually semantically
> wrong. When a page is released by the page_pool, we should call
> page_pool_clear_pp_info() and completely disconnect the page from the
> pool. If we call release_page() on a page and then the page pool sees
> it again in page_pool_return_page(), I think that is considered a bug.

You're adding a new feature the semantics of which is already
different from what is in there, you can extend it any way as long
as it makes sense and agreed on. IMHO, it does. But well, if
there is a better solution I'm all for it.

> In fact I think what you're proposing is as a result of a bug because
> we don't call a page_pool_clear_pp_info() equivalent on releasing
> ppiov.

I don't get it, what bug? page_pool_clear_pp_info() is not called
for ppiov because it doesn't make sense to call it for ppiov,
there is no reason to clear ppiov->pp, nor there is any pp_magic.


> However, I'm reasonably confident I figured out the right thing to do
> here. The page_pool uses page->pp_frag_count for its refcounting.
> pp_frag_count is a misnomer, it's being renamed to pp_ref_count in
> Liang's series[1]). In this series I used a get_page/put_page
> equivalent for refcounting. Once I transitioned to using
> pp_[frag|ref]_count for refcounting inside the page_pool, the issue
> went away, and I no longer need the patch 'page_pool: don't release
> iov on elevanted refcount'.

Lovely, I'll take a look later! (also assuming it's in v5)


> There is an additional upside, since pages and ppiovs are both being
> refcounted using pp_[frag|ref]_count, we get some unified handling for
> ppiov and we reduce the checks around ppiov. This should be fixed
> properly in the next series.
> 
> I still need to do some work (~1 week) before I upload the next
> version as there is a new requirement from MM that we transition to a
> new type and not re-use page*, but I uploaded my changes github with
> the refcounting issues resolved in case they're useful to you. Sorry
> for the churn:
> 
> https://github.com/mina/linux/commits/tcpdevmem-v1.5/
> 
> [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=809049&state=*
> 

-- 
Pavel Begunkov

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ