netdev - Re: Memory providers multiplexing (Was: [PATCH net-next v4 4/5] page_pool: remove PP_FLAG_PAGE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a2569132-393e-0149-f76c-f6de282e1c96@redhat.com>
Date: Mon, 24 Jul 2023 16:56:27 +0200
From: Jesper Dangaard Brouer <jbrouer@...hat.com>
To: Mina Almasry <almasrymina@...gle.com>, Jason Gunthorpe <jgg@...pe.ca>
Cc: brouer@...hat.com, Christian König
 <christian.koenig@....com>, Hari Ramakrishnan <rharix@...gle.com>,
 David Ahern <dsahern@...nel.org>, Samiullah Khawaja <skhawaja@...gle.com>,
 Willem de Bruijn <willemb@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
 Christoph Hellwig <hch@....de>, John Hubbard <jhubbard@...dia.com>,
 Dan Williams <dan.j.williams@...el.com>,
 Jesper Dangaard Brouer <jbrouer@...hat.com>,
 Alexander Duyck <alexander.duyck@...il.com>,
 Yunsheng Lin <linyunsheng@...wei.com>, davem@...emloft.net,
 pabeni@...hat.com, netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
 Lorenzo Bianconi <lorenzo@...nel.org>, Yisen Zhuang
 <yisen.zhuang@...wei.com>, Salil Mehta <salil.mehta@...wei.com>,
 Eric Dumazet <edumazet@...gle.com>, Sunil Goutham <sgoutham@...vell.com>,
 Geetha sowjanya <gakula@...vell.com>, Subbaraya Sundeep
 <sbhatta@...vell.com>, hariprasad <hkelam@...vell.com>,
 Saeed Mahameed <saeedm@...dia.com>, Leon Romanovsky <leon@...nel.org>,
 Felix Fietkau <nbd@....name>, Ryder Lee <ryder.lee@...iatek.com>,
 Shayne Chen <shayne.chen@...iatek.com>, Sean Wang <sean.wang@...iatek.com>,
 Kalle Valo <kvalo@...nel.org>, Matthias Brugger <matthias.bgg@...il.com>,
 AngeloGioacchino Del Regno <angelogioacchino.delregno@...labora.com>,
 Jesper Dangaard Brouer <hawk@...nel.org>,
 Ilias Apalodimas <ilias.apalodimas@...aro.org>, linux-rdma@...r.kernel.org,
 linux-wireless@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
 linux-mediatek@...ts.infradead.org, Jonathan Lemon
 <jonathan.lemon@...il.com>, logang@...tatee.com,
 Bjorn Helgaas <bhelgaas@...gle.com>
Subject: Re: Memory providers multiplexing (Was: [PATCH net-next v4 4/5]
 page_pool: remove PP_FLAG_PAGE_FRAG flag)



On 17/07/2023 03.53, Mina Almasry wrote:
> On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe <jgg@...pe.ca> wrote:
>>
>> On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote:
>>
>>> Once the skb frags with struct new_abstraction are in the TCP stack,
>>> they will need some special handling in code accessing the frags. But
>>> my RFC already addressed that somewhat because the frags were
>>> inaccessible in that case. In this case the frags will be both
>>> inaccessible and will not be struct pages at all (things like
>>> get_page() will not work), so more special handling will be required,
>>> maybe.
>>
>> It seems sort of reasonable, though there will be interesting concerns
>> about coherence and synchronization with generial purpose DMABUFs that
>> will need tackling.
>>
>> Still it is such a lot of churn and weridness in the netdev side, I
>> think you'd do well to present an actual full application as
>> justification.
>>
>> Yes, you showed you can stick unordered TCP data frags into GPU memory
>> sort of quickly, but have you gone further with this to actually show
>> it is useful for a real world GPU centric application?
>>
>> BTW your cover letter said 96% utilization, the usual server
>> configuation is one NIC per GPU, so you were able to hit 1500Gb/sec of
>> TCP BW with this?
>>
> 
> I do notice that the number of NICs is missing from our public
> documentation so far, so I will refrain from specifying how many NICs
> are on those A3 VMs until the information is public. But I think I can
> confirm that your general thinking is correct, the perf that we're
> getting is 96.6% line rate of each GPU/NIC pair, 

What do you mean by 96.6% "line rate".
Is is the Ethernet line-rate?

Is the measured throughput the measured TCP data "goodput"?
Assuming
  - MTU 1500 bytes (1514 on wire).
  - Ethernet header 14 bytes
  - IP header 20 bytes
  - TCP header 20 bytes

Due to header overhead the goodput will be approx 96.4%.
  - (1514-(14+20+20))/1514 = 0.9643
  - (Not taking Ethernet interframe gap into account).

Thus, maybe you have hit Ethernet wire line-rate already?

> and scales linearly
> for each NIC/GPU pair we've tested with so far. Line rate of each
> NIC/GPU pair is 200 Gb/sec.
> 
> So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec.

Lets keep our units straight.
Here you mean 1545 Gbit/sec, which is 193 GBytes/s

> If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec

Here you mean 384 Gbit/sec, which is 48 GBytes/sec.

> ...
> etc.
> 

These massive throughput numbers are important, because they *exceed*
the physical host RAM/DIMM memory speeds.

This is the *real argument* why software cannot afford to do a single
copy of the data from host-RAM into GPU-memory, because the CPU memory
throughput to DRAM/DIMM are insufficient.

My testlab CPU E5-1650 have 4 DIMM slots DDR4
  - Data Width: 64 bits (= 8 bytes)
  - Configured Memory Speed: 2400 MT/s
  - Theoretical maximum memory bandwidth: 76.8 GBytes/s (2400*8*4)

Even the theoretical max 76.8 GBytes/s (614 Gbit/s) is not enough for
the 193 GBytes/s or 1545 Gbit/s (8 NIC/GPU pairs).

When testing this with lmbench tool bw_mem, the results (below
signature) are in the area 14.8 GBytes/sec (118 Gbit/s), as soon as
exceeding L3 cache size.  In practice it looks like main memory is
limited to reading 118 Gbit/s *once*. (Mina's NICs run at 200 Gbit/s)

Given DDIO can deliver network packets into L3, I also tried to figure
out what the L3 read bandwidth, which I measured to be 42.4 GBits/sec
(339 Gbit/s), in hopes that it would be enough, but it was not.


--Jesper
(data below signature)

CPU under test:

  $ cat /proc/cpuinfo | egrep -e 'model name|cache size' | head -2
  model name	: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
  cache size	: 15360 KB


Providing some cmdline outputs from lmbench "bw_mem" tool.
(Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second)

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rd
256.00 14924.50

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M wr
256.00 9895.25

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rdwr
256.00 9737.54

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bcopy
256.00 12462.88

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bzero
256.00 14869.89


Next output shows reducing size below L3 cache size, which shows an
increase in speed, likely the L3 bandwidth.

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 64M rd
64.00 14840.58

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 32M rd
32.00 14823.97

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 16M rd
16.00 24743.86

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 8M rd
8.00 40852.26

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 4M rd
4.00 42545.65

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 2M rd
2.00 42447.82

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 1M rd
1.00 42447.82