lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Thu, 27 Jul 2023 16:43:33 +0200
From: Alexander Lobakin <aleksander.lobakin@...el.com>
To: "David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>,
	Paolo Abeni <pabeni@...hat.com>
Cc: Alexander Lobakin <aleksander.lobakin@...el.com>,
	Maciej Fijalkowski <maciej.fijalkowski@...el.com>,
	Larysa Zaremba <larysa.zaremba@...el.com>,
	Yunsheng Lin <linyunsheng@...wei.com>,
	Alexander Duyck <alexanderduyck@...com>,
	Jesper Dangaard Brouer <hawk@...nel.org>,
	Ilias Apalodimas <ilias.apalodimas@...aro.org>,
	Simon Horman <simon.horman@...igine.com>,
	netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: [PATCH net-next 6/9] page_pool: avoid calling no-op externals when possible

Turned out page_pool_put{,_full}_page() can burn quite a bunch of cycles
even when on DMA-coherent platforms (like x86) with no active IOMMU or
swiotlb, just for the call ladder.
Indeed, it's

page_pool_put_page()
  page_pool_put_defragged_page()                  <- external
    __page_pool_put_page()
      page_pool_dma_sync_for_device()             <- non-inline
        dma_sync_single_range_for_device()
          dma_sync_single_for_device()            <- external
            dma_direct_sync_single_for_device()
              dev_is_dma_coherent()               <- exit

For the inline functions, no guarantees the compiler won't uninline them
(they're clearly not one-liners and sometimes compilers uninline even
2 + 2). The first external call is necessary, but the rest 2+ are done
for nothing each time, plus a bunch of checks here and there.
Since Page Pool mappings are long-term and for one "device + addr" pair
dma_need_sync() will always return the same value (basically, whether it
belongs to an swiotlb pool), addresses can be tested once right after
they're obtained and the result can be reused until the page is unmapped.
Define the new PP DMA sync operation type, which will mean "do DMA syncs
for the device, but only when needed" and turn it on by default when the
driver asks to sync pages. When a page is mapped, check whether it needs
syncs and if so, replace that "sync when needed" back to "always do
syncs" globally for the whole pool (better safe than sorry). As long as
the pool has no pages requiring DMA syncs, this cuts off a good piece
of calls and checks. When at least one page required it, the pool
conservatively falls back to "always call sync functions", no per-page
verdicts. It's a fairly rare case anyway that only a few pages would
require syncing.
On my x86_64, this gives from 2% to 5% performance benefit with no
negative impact for cases when IOMMU is on and the shortcut can't be
used.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@...el.com>
---
 include/net/page_pool/types.h | 3 ++-
 net/core/page_pool.c          | 7 ++++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index dd26f4b2b66c..9134fb458cb2 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -97,8 +97,9 @@ struct page_pool {
 	bool dma_map:1;				/* Perform DMA mapping */
 	enum {
 		PP_DMA_SYNC_ACT_DISABLED = 0,	/* Driver didn't ask to sync */
+		PP_DMA_SYNC_ACT_SKIP,		/* Syncs can be skipped */
 		PP_DMA_SYNC_ACT_DO,		/* Perform DMA sync ops */
-	} dma_sync_act:1;
+	} dma_sync_act:2;
 	bool page_frag:1;			/* Allow page fragments */
 
 	long frag_users;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 6a8f105e2df5..529e4b41e9eb 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -197,7 +197,8 @@ static int page_pool_init(struct page_pool *pool,
 		if (!pool->p.max_len)
 			return -EINVAL;
 
-		pool->dma_sync_act = PP_DMA_SYNC_ACT_DO;
+		/* Try to avoid calling no-op syncs */
+		pool->dma_sync_act = PP_DMA_SYNC_ACT_SKIP;
 
 		/* pool->p.offset has to be set according to the address
 		 * offset used by the DMA engine to start copying rx data
@@ -350,6 +351,10 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 
 	page_pool_set_dma_addr(page, dma);
 
+	if (pool->dma_sync_act == PP_DMA_SYNC_ACT_SKIP &&
+	    dma_need_sync(pool->p.dev, dma))
+		pool->dma_sync_act = PP_DMA_SYNC_ACT_DO;
+
 	if (pool->dma_sync_act == PP_DMA_SYNC_ACT_DO)
 		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
 
-- 
2.41.0


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ