Message-ID: <1721282f-7ec8-68bd-6d52-b4ef209f047b@redhat.com>
Date: Tue, 11 Jul 2023 17:49:19 +0200
From: Jesper Dangaard Brouer <jbrouer@...hat.com>
To: Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org
Cc: brouer@...hat.com, almasrymina@...gle.com, hawk@...nel.org,
 ilias.apalodimas@...aro.org, edumazet@...gle.com, dsahern@...il.com,
 michael.chan@...adcom.com, willemb@...gle.com
Subject: Re: [RFC 00/12] net: huge page backed page_pool



On 07/07/2023 20.39, Jakub Kicinski wrote:
> Hi!
> 
> This is an "early PoC" at best. It seems to work for a basic
> traffic test but there's no uAPI and a lot more general polish
> is needed.
> 
> The problem we're seeing is that performance of some older NICs
> degrades quite a bit when IOMMU is used (in non-passthru mode).
> There is a long tail of old NICs deployed, especially in PoPs/
> /on edge. From a conversation I had with Eric a few months
> ago it sounded like others may have similar issues. So I thought

Using page_pool on systems with an IOMMU is a big performance gain in
itself, as it removes the DMA map+unmap calls that need to change the
IOMMU setup/tables.
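
For reference, a minimal sketch of letting page_pool own the DMA mapping
(PP_FLAG_DMA_MAP), so mapping happens once when a page enters the pool and
recycled pages skip the map+unmap entirely.  The helper name, pool_size and
offsets are illustrative, not taken from this series:

  #include <net/page_pool.h>

  static struct page_pool *rxq_create_pool(struct device *dev)
  {
          struct page_pool_params pp = {
                  .flags     = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
                  .order     = 0,
                  .pool_size = 1024,            /* illustrative */
                  .nid       = NUMA_NO_NODE,
                  .dev       = dev,             /* device doing the DMA */
                  .dma_dir   = DMA_FROM_DEVICE,
                  .offset    = 0,
                  .max_len   = PAGE_SIZE,       /* RX sync length */
          };

          /* Pages get DMA-mapped when first pulled into the pool;
           * recycled pages keep their mapping, so the IOMMU tables are
           * not touched on the per-packet fast path. */
          return page_pool_create(&pp);
  }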

> I'd take a swing at getting page pool to feed drivers huge pages.
> 1G pages require hooking into early init via CMA but it works
> just fine.
> 
> I haven't tested this with a real workload, because I'm still
> waiting to get my hands on the right machine. But the experiment
> with bnxt shows a ~90% reduction in IOTLB misses (670k -> 70k).
> 

I see you have discovered that the next bottleneck is IOTLB misses.
One of the techniques for reducing IOTLB misses is using huge pages,
called "super-pages" in the article (below), which reports that this
trick doesn't work on AMD (Pacifica arch).
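
To put rough numbers on why huge pages help here (the IOTLB entry count
below is just an illustrative figure, real sizes vary per IOMMU
implementation), the DMA reach of a fixed number of entries grows with
the page size:

  64 entries * 4 KiB pages =  256 KiB of DMA reach
  64 entries * 2 MiB pages =  128 MiB of DMA reach
  64 entries * 1 GiB pages =   64 GiB of DMA reach

so with 2M/1G mappings a whole RX ring's buffers can plausibly stay
covered by the IOTLB.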

I think you have convinced me that the pp_provider idea makes sense for
*this* use-case, because it feels natural to extend PP with
mitigations for IOTLB misses. (But I'm not 100% sure it fits Mina's
use-case.)

What is your page refcnt strategy for these huge pages? I assume this
relies on the PP frag scheme, e.g. using page->pp_frag_count.
Is that correctly understood?

Generally, the pp_providers will have to use the refcnt schemes
supported by page_pool.  (Which is why I'm not 100% sure this fits
Mina's use-case.)
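
For reference, the driver-facing side of that existing frag scheme looks
roughly like below (pool created with PP_FLAG_PAGE_FRAG; the helper name
and the 2048-byte buffer size are just illustrative).  Whether the
huge-page provider reuses this path underneath is exactly the question
above:

  /* One page is carved into several RX buffers; page->pp_frag_count
   * tracks the outstanding fragments before the page can be recycled. */
  static struct page *rxq_alloc_buf(struct page_pool *pool,
                                    dma_addr_t *dma, unsigned int *offset)
  {
          struct page *page;

          page = page_pool_alloc_frag(pool, offset, 2048, GFP_ATOMIC);
          if (!page)
                  return NULL;

          *dma = page_pool_get_dma_addr(page) + *offset;
          return page;
  }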


[IOTLB details]:

As mentioned in [RFC 08/12], there are other techniques for reducing
IOTLB misses, described in:
  IOMMU: Strategies for Mitigating the IOTLB Bottleneck
   - https://inria.hal.science/inria-00493752/document

I took a deeper look and also discovered Intel's documentation:
  - Intel Virtualization Technology for Directed I/O, architecture spec
    https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html

One problem that is interesting to notice is how NICs access packets
via the ring queue, which is likely larger than the number of IOTLB
entries.  Thus, a high chance of IOTLB misses.  They suggest marking
pages with Eviction Hints (EH), which causes pages to be treated as
Transient Mappings (TM), allowing the IOMMU to evict these faster
(making room for others), and then combining this with prefetching.

In this context of how fast a page is reused by the NIC and spatial
locality, it is worth remembering that PP has two recycling schemes:
(1) the fast alloc cache, which in certain cases can recycle pages
directly (and is based on a stack approach), and (2) normal recycling
via the ptr_ring, where more time passes before a page gets reused.
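
From the driver side these two paths roughly map to the allow_direct
hint on release; a simplified sketch (helper name is made up, error
handling omitted):

  /* allow_direct=true (NAPI/softirq context): page goes onto the small
   * per-pool alloc cache (stack), so the NIC reuses it quickly while
   * the IOTLB entry is likely still warm.
   * allow_direct=false: page goes via the ptr_ring, so more time
   * passes before it is handed back to the NIC. */
  static void rxq_reuse_page(struct page_pool *pool, struct page *page,
                             bool in_napi)
  {
          page_pool_put_full_page(pool, page, in_napi /* allow_direct */);
  }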


--Jesper

[RFC 08/12] 
https://lore.kernel.org/all/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.com/

