Message-ID: <1644b9d0-27a5-0c2b-c530-bcaa347f73c2@intel.com>
Date:   Tue, 1 Aug 2023 15:42:59 +0200
From:   Alexander Lobakin <aleksander.lobakin@...el.com>
To:     Yunsheng Lin <linyunsheng@...wei.com>
CC:     "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
        Larysa Zaremba <larysa.zaremba@...el.com>,
        Alexander Duyck <alexanderduyck@...com>,
        Jesper Dangaard Brouer <hawk@...nel.org>,
        "Ilias Apalodimas" <ilias.apalodimas@...aro.org>,
        Simon Horman <simon.horman@...igine.com>,
        <netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net-next 6/9] page_pool: avoid calling no-op externals
 when possible

From: Yunsheng Lin <linyunsheng@...wei.com>
Date: Sat, 29 Jul 2023 20:46:22 +0800

> On 2023/7/28 22:14, Alexander Lobakin wrote:
>> From: Yunsheng Lin <linyunsheng@...wei.com>
>> Date: Fri, 28 Jul 2023 20:39:24 +0800
>>
>>> On 2023/7/27 22:43, Alexander Lobakin wrote:
>>>> Turned out page_pool_put{,_full}_page() can burn quite a bunch of cycles
>>>> even on DMA-coherent platforms (like x86) with no active IOMMU or
>>>> swiotlb, just for the call ladder.
>>>> Indeed, it's
>>>>
>>>> page_pool_put_page()
>>>>   page_pool_put_defragged_page()                  <- external
>>>>     __page_pool_put_page()
>>>>       page_pool_dma_sync_for_device()             <- non-inline
>>>>         dma_sync_single_range_for_device()
>>>>           dma_sync_single_for_device()            <- external
>>>>             dma_direct_sync_single_for_device()
>>>>               dev_is_dma_coherent()               <- exit
>>>>
>>>> For the inline functions, there's no guarantee the compiler won't
>>>> uninline them (they're clearly not one-liners, and sometimes compilers
>>>> uninline even 2 + 2). The first external call is necessary, but the
>>>> remaining 2+ are done for nothing each time, plus a bunch of checks
>>>> here and there.
>>>> Since Page Pool mappings are long-term and for one "device + addr" pair
>>>> dma_need_sync() will always return the same value (basically, whether it
>>>> belongs to an swiotlb pool), addresses can be tested once right after
>>>> they're obtained and the result can be reused until the page is unmapped.
>>>> Define the new PP DMA sync operation type, which will mean "do DMA syncs
>>>> for the device, but only when needed" and turn it on by default when the
>>>> driver asks to sync pages. When a page is mapped, check whether it needs
>>>> syncs and, if so, switch that "sync when needed" back to "always do
>>>> syncs" globally for the whole pool (better safe than sorry). As long as
>>>> the pool has no pages requiring DMA syncs, this cuts off a good piece
>>>> of calls and checks. When at least one page required it, the pool
>>>> conservatively falls back to "always call sync functions", no per-page
>>>> verdicts. It's a fairly rare case anyway that only a few pages would
>>>> require syncing.
>>>> On my x86_64, this gives from 2% to 5% performance benefit with no
>>>> negative impact for cases when IOMMU is on and the shortcut can't be
>>>> used.
>>>>
>>>
>>> It seems other subsystems may have a similar problem as page_pool.
>>> Is it possible to implement this kind of trick in the DMA subsystem
>>> instead of every subsystem inventing its own trick?
>>
>> In the ladder I described above most of overhead comes from jumping
>> between Page Pool functions, not the generic DMA ones. Let's say I do
>> this shortcut in dma_sync_single_range_for_device(), that is too late
>> already to count on some good CPU saves.
> 
> We can force-inline the page_pool_dma_sync_for_device() function if that
> is where 'the good CPU saves' you mentioned above come from.
> 
>> Plus, DMA sync API operates with dma_addr_t, not struct page. IOW it's
>> not clear to me where to store this "we can shortcut" bit in that case.
> 
> It seems we only need one bit in 'struct device' to do the 'shortcut',
> and there seems to be an available bit at the end of 'struct device'?

dma_need_sync() can return different results for two different DMA
addresses within the same device.

> 
> Is it possible that we do something like this patch does in
> dma_sync_single_range_for_device()?
> 
> One thing to note is that there may be multiple concurrent callers of
> dma_sync_single_range_for_device(), which seems different from the
> atomic context of page_pool_dma_map(), so the state change may need
> some atomic operation if we want to implement it in a 'generic' way.
> 
>>
>> From "other subsystem" I remember only XDP sockets. There, they also
>> avoid calling their own non-inline functions in the first place, not the
>> generic DMA ones. So I'd say both cases (PP and XSk) can't be solved via
>> some "generic" solution.
> 
> If PP and XSk both have a similar trick, isn't that a clearer sign
> that it may be solved via some "generic" solution?

Both shortcut their own functions in the first place, so I don't know
what generic solution could be to optimize non-generic functions.

> 
> Is there any reason there is no similar trick for sync-for-CPU in XSk,
> as the code below indicates?
> https://elixir.free-electrons.com/linux/v6.4-rc6/source/include/net/xsk_buff_pool.h#L152
> 
>>
>> Thanks,
>> Olek
>>
>>

Thanks,
Olek
