Date: Thu, 13 Jul 2023 12:07:06 +0200
From: Jesper Dangaard Brouer <jbrouer@...hat.com>
To: Jakub Kicinski <kuba@...nel.org>,
 Jesper Dangaard Brouer <jbrouer@...hat.com>
Cc: brouer@...hat.com, netdev@...r.kernel.org, almasrymina@...gle.com,
 hawk@...nel.org, ilias.apalodimas@...aro.org, edumazet@...gle.com,
 dsahern@...il.com, michael.chan@...adcom.com, willemb@...gle.com,
 Ulrich Drepper <drepper@...hat.com>, Luigi Rizzo <lrizzo@...gle.com>,
 Luigi Rizzo <rizzo@....unipi.it>, farshin@....se
Subject: Re: [RFC 00/12] net: huge page backed page_pool



On 12/07/2023 19.19, Jakub Kicinski wrote:
> On Wed, 12 Jul 2023 16:00:46 +0200 Jesper Dangaard Brouer wrote:
>> On 12/07/2023 02.08, Jakub Kicinski wrote:
>>>> Generally the pp_provider's will have to use the refcnt schemes
>>>> supported by page_pool.  (Which is why I'm not 100% sure this fits
>>>> Mina's use-case).
>>>>
>>>> [IOTLB details]:
>>>>
>>>> As mentioned on [RFC 08/12] there are other techniques for reducing
>>>> IOTLB misses, described in:
>>>>     IOMMU: Strategies for Mitigating the IOTLB Bottleneck
>>>>      - https://inria.hal.science/inria-00493752/document
>>>>
>>>> I took a deeper look and also discovered Intel's documentation:
>>>>     - Intel virtualization technology for directed I/O, arch spec
>>>>     -
>>>> https://www.intel.com/content/www/us/en/content-details/774206/intel-virtualization-technology-for-directed-i-o-architecture-specification.html
>>>>
>>>> One problem that is interesting to notice is how NICs access packets
>>>> via the ring-queue, which is likely larger than the number of IOTLB
>>>> entries.  Thus, there is a high chance of IOTLB misses.  They suggest
>>>> marking pages with Eviction Hints (EH) that cause pages to be marked
>>>> as Transient Mappings (TM), which allows the IOMMU to evict these
>>>> faster (making room for others), and then combining this with
>>>> prefetching.
>>>
>>> Interesting, didn't know about EH.
>>
>> I was looking for a way to set this Eviction Hint (EH) the article
>> talked about, but I'm at a loss.
> 
> Could possibly be something that the NIC has to set inside the PCIe
> transaction headers? Like the old cache hints that predated DDIO?
> 

Yes, perhaps it is outdated?

>>>> In this context of how fast a page is reused by the NIC and spatial
>>>> locality, it is worth remembering that PP has two schemes: (1) the fast
>>>> alloc cache that in certain cases can recycle pages (and is based on a
>>>> stack approach), and (2) normal recycling via the ptr_ring, which takes
>>>> longer before a page gets reused.
>>>
>>> I read somewhere that Intel IOTLB can be as small as 256 entries.
>>
>> Is the IOTLB hardware different from the TLB hardware block?

Does anyone know this?

>> I can find data on TLB sizes, which says there are two levels on Intel,
>> quote from "248966-Software-Optimization-Manual-R047.pdf":
>>
>>    Nehalem microarchitecture implements two levels of translation
>> lookaside buffer (TLB). The first level consists of separate TLBs for
>> data and code. DTLB0 handles address translation for data accesses, it
>> provides 64 entries to support 4KB pages and 32 entries for large pages.
>> The ITLB provides 64 entries (per thread) for 4KB pages and 7 entries
>> (per thread) for large pages.
>>
>>    The second level TLB (STLB) handles both code and data accesses for
>> 4KB pages. It supports 4KB page translation operations that missed DTLB0
>> or ITLB. All entries are 4-way associative. Here is a list of entries
>> in each DTLB:
>>
>>    • STLB for 4-KByte pages: 512 entries (services both data and
>> instruction look-ups).
>>    • DTLB0 for large pages: 32 entries.
>>    • DTLB0 for 4-KByte pages: 64 entries.
>>
>>    A DTLB0 miss and STLB hit causes a penalty of 7 cycles. Software only
>> pays this penalty if the DTLB0 is used in some dispatch cases. The
>> delays associated with a miss to the STLB and PMH are largely nonblocking.
> 
> No idea :( This is an old paper from Rolf in his Netronome days which
> says ~Sandy Bridge had only 64 IOTLB entries:
> 
> https://dl.acm.org/doi/pdf/10.1145/3230543.3230560
> 
Title: "Understanding PCIe performance for end host networking"

> But it's a pretty old paper.

I *HIGHLY* recommend this paper, and I've recommended it before [1].

  [1] 
https://lore.kernel.org/all/b8fa06c4-1074-7b48-6868-4be6fecb4791@redhat.com/

There is a very new (May 2023) publication [2]:
  - Title: "Overcoming the IOTLB wall for multi-100-Gbps Linux-based networking"
  - By Luigi Rizzo (netmap inventor) and Alireza Farshin (author of the
previous paper).
  - [2] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10280580/

They actually benchmark page_pool and suggest using 2MiB huge-pages,
which is what you just implemented for page_pool.
P.s. pretty cool to see my page_pool design being described in such
detail and with a picture [3].

  [3] 
https://www.ncbi.nlm.nih.gov/core/lw/2.0/html/tileshop_pmc/tileshop_pmc_inline.html?title=Click%20on%20image%20to%20zoom&p=PMC3&id=10280580_peerj-cs-09-1385-g020.jpg
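
To put rough numbers on the IOTLB "reach" argument, here is a tiny
user-space sketch (the 64/256 entry counts are just the figures quoted
in this thread, not something I have verified for any specific IOMMU):

/* Back-of-the-envelope IOTLB "reach": how much DMA-mapped memory a
 * given number of IOTLB entries can cover at different IOMMU page
 * sizes.  Entry counts are only the figures mentioned in this thread.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long sizes[]   = { 4UL << 10, 2UL << 20, 1UL << 30 };
	const char *names[]           = { "4KiB", "2MiB", "1GiB" };
	const unsigned int entries[]  = { 64, 256 };

	for (size_t e = 0; e < sizeof(entries) / sizeof(entries[0]); e++)
		for (size_t p = 0; p < sizeof(sizes) / sizeof(sizes[0]); p++)
			printf("%3u entries x %s pages => %10.2f MiB reach\n",
			       entries[e], names[p],
			       (double)entries[e] * sizes[p] / (1 << 20));

	/* Compare with the ~128 MiB sitting on the rings (1k descriptors
	 * x 32 rings x 4KiB): 4KiB mappings reach at most 1 MiB, while
	 * 2MiB mappings already reach 128-512 MiB.
	 */
	return 0;
}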

> 
>>> So it seems pretty much impossible for it to cache accesses to 4k
>>> pages thru recycling. I thought that even 2M pages will start to
>>> be problematic for multi queue devices (1k entries on each ring x
>>> 32 rings == 128MB just sitting on the ring, let alone circulation).
>>>    
>>
>> Yes, I'm also worried about how badly these NIC rings and the PP
>> ptr_ring affect the IOTLB's ability to cache entries.  That is why I
>> suggested testing out the Eviction Hint (EH), but I have not found a
>> way to use/enable these as a quick test in your environment.
> 
> FWIW the first version of the code I wrote actually had the coherent
> ring memory also use the huge pages - the MEP allocator underlying the
> page pool can be used by the driver directly to allocate memory for
> other uses than the page pool.
> 
> But I figured that's going to be a nightmare to upstream, and Alex said
> that even on x86 coherent DMA memory is just write combining not cached
> (which frankly IDK why, possibly yet another thing we could consider
> optimizing?!)
> 
> So I created two allocators, one for coherent (backed by 2M pages) and
> one for non-coherent (backed by 1G pages).

I think it is called Coherent vs Streaming DMA.
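To spell out that terminology, a minimal, untested sketch (the
mydev_dma_example() function and the buffer sizes are made-up
placeholders; only the dma_*()/page-allocation calls are the regular
kernel DMA API):

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/sizes.h>

static int mydev_dma_example(struct device *dev)
{
	dma_addr_t ring_dma, buf_dma;
	struct page *page;
	void *ring;

	/* Coherent DMA: one long-lived mapping shared by CPU and device
	 * for the whole driver lifetime, e.g. descriptor rings.  On x86
	 * this memory ends up write-combining, as Alex noted above.
	 */
	ring = dma_alloc_coherent(dev, SZ_64K, &ring_dma, GFP_KERNEL);
	if (!ring)
		return -ENOMEM;

	/* Streaming DMA: per-buffer mappings for packet data; this is
	 * what page_pool keeps cached when PP_FLAG_DMA_MAP is set, and
	 * every live mapping competes for IOTLB entries.
	 */
	page = alloc_page(GFP_KERNEL);
	if (!page)
		goto err_ring;

	buf_dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, buf_dma))
		goto err_page;

	/* ... hand buf_dma to the NIC; before the CPU reads the data: */
	dma_sync_single_for_cpu(dev, buf_dma, PAGE_SIZE, DMA_FROM_DEVICE);

	dma_unmap_page(dev, buf_dma, PAGE_SIZE, DMA_FROM_DEVICE);
	__free_page(page);
	dma_free_coherent(dev, SZ_64K, ring, ring_dma);
	return 0;

err_page:
	__free_page(page);
err_ring:
	dma_free_coherent(dev, SZ_64K, ring, ring_dma);
	return -ENOMEM;
}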

> 
> For the ptr_ring I was considering bumping the refcount of pages
> allocated from outside the 1G pool, so that they do not get recycled.
> I deferred optimizing that until I can get some production results.
> The extra CPU cost of loss of recycling could outweigh the IOTLB win.
> 
> All very exciting stuff, I wish the days were slightly longer :)
> 

I know the problem of too few (human) cycles in a day.
Rizzo's article describes a lot of experiments that might save us/you
some time.

--Jesper


