Message-ID: <b8b7818a-e44b-45f5-91c2-d5eceaa5dd5b@kernel.org>
Date: Wed, 6 Nov 2024 16:57:25 +0100
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Yunsheng Lin <linyunsheng@...wei.com>,
 Toke Høiland-Jørgensen <toke@...hat.com>,
 davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com
Cc: zhangkun09@...wei.com, fanghaiqing@...wei.com, liuyonglong@...wei.com,
 Robin Murphy <robin.murphy@....com>,
 Alexander Duyck <alexander.duyck@...il.com>, IOMMU <iommu@...ts.linux.dev>,
 Andrew Morton <akpm@...ux-foundation.org>, Eric Dumazet
 <edumazet@...gle.com>, Ilias Apalodimas <ilias.apalodimas@...aro.org>,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org, netdev@...r.kernel.org,
 kernel-team <kernel-team@...udflare.com>, Viktor Malik <vmalik@...hat.com>
Subject: Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver
 has already unbound



On 06/11/2024 14.25, Jesper Dangaard Brouer wrote:
> 
> On 26/10/2024 09.33, Yunsheng Lin wrote:
>> On 2024/10/25 22:07, Jesper Dangaard Brouer wrote:
>>
>> ...
>>
>>>
>>>>> You and Jesper seem to be suggesting that there might be 'hundreds of
>>>>> gigs of memory' needed for inflight pages. It would be nice to provide
>>>>> more info or reasoning about why 'hundreds of gigs of memory' is
>>>>> needed here, so that we don't over-design this to support recording an
>>>>> unlimited number of in-flight pages, if stalling the driver unbind
>>>>> turns out to be impossible and the in-flight pages do need to be
>>>>> recorded.
>>>>
>>>> I don't have a concrete example of a use case that will blow the limit
>>>> you are setting (but maybe Jesper does); I am simply objecting to
>>>> arbitrarily imposing any limit at all. It smells a lot like "640k ought
>>>> to be enough for anyone".
>>>>
>>>
>>> As I wrote before: in *production* I'm seeing TCP memory reach 24 GiB
>>> (on machines with 384 GiB of memory). I have attached a grafana
>>> screenshot to prove what I'm saying.
>>>
>>> As my co-worker Mike Freemon has explained to me (with more details in
>>> his blog post[1]), it is no coincidence that the graph has a strange
>>> "ceiling" close to 24 GiB (on machines with 384 GiB total memory).  This
>>> is because the TCP network stack goes into a memory "under pressure"
>>> state when 6.25% of total memory is used by the TCP stack. (Detail: the
>>> system will stay in that mode until allocated TCP memory falls below
>>> 4.68% of total memory.)
>>>
>>>   [1] https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
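
To put those percentages into concrete numbers (back-of-the-envelope
arithmetic on the figures quoted above):

  0.0625 * 384 GiB  = 24 GiB   -> TCP enters the "under pressure" state,
                                  matching the ~24 GiB ceiling in the graph
  0.0468 * 384 GiB ~= 18 GiB   -> TCP stays under pressure until usage
                                  falls back below roughly this point
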
>>
>> Thanks for the info.
> 
> Some more info from production servers.
> 
> (I'm amazed at what we can do with a simple bpftrace script; Cc Viktor.)
> 
> In the bpftrace one-liner below I'm extracting the inflight count for
> all page_pool's in the system and storing it in a histogram.
> 
> sudo bpftrace -e '
>   rawtracepoint:page_pool_state_release { @cnt[probe]=count();
>    @cnt_total[probe]=count();
>    $pool=(struct page_pool*)arg0;
>    $release_cnt=(uint32)arg2;
>    $hold_cnt=$pool->pages_state_hold_cnt;
>    $inflight_cnt=(int32)($hold_cnt - $release_cnt);
>    @inflight=hist($inflight_cnt);
>   }
>   interval:s:1 {time("\n%H:%M:%S\n");
>    print(@cnt); clear(@cnt);
>    print(@inflight);
>    print(@cnt_total);
>   }'
> 
> The page_pool behavior depends on how the NIC driver uses it, so I've run
> this on two prod servers with the bnxt and mlx5 drivers, on a 6.6.51 kernel.
> 
> Driver: bnxt_en
>   - Kernel: 6.6.51
> 
> @cnt[rawtracepoint:page_pool_state_release]: 8447
> @inflight:
> [0]             507 |                                        |
> [1]             275 |                                        |
> [2, 4)          261 |                                        |
> [4, 8)          215 |                                        |
> [8, 16)         259 |                                        |
> [16, 32)        361 |                                        |
> [32, 64)        933 |                                        |
> [64, 128)      1966 |                                        |
> [128, 256)   937052 |@@@@@@@@@                               |
> [256, 512)  5178744 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [512, 1K)     73908 |                                        |
> [1K, 2K)    1220128 |@@@@@@@@@@@@                            |
> [2K, 4K)    1532724 |@@@@@@@@@@@@@@@                         |
> [4K, 8K)    1849062 |@@@@@@@@@@@@@@@@@@                      |
> [8K, 16K)   1466424 |@@@@@@@@@@@@@@                          |
> [16K, 32K)   858585 |@@@@@@@@                                |
> [32K, 64K)   693893 |@@@@@@                                  |
> [64K, 128K)  170625 |@                                       |
> 
> Driver: mlx5_core
>   - Kernel: 6.6.51
> 
> @cnt[rawtracepoint:page_pool_state_release]: 1975
> @inflight:
> [128, 256)         28293 |@@@@                               |
> [256, 512)        184312 |@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
> [512, 1K)              0 |                                   |
> [1K, 2K)            4671 |                                   |
> [2K, 4K)          342571 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4K, 8K)          180520 |@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
> [8K, 16K)          96483 |@@@@@@@@@@@@@@                     |
> [16K, 32K)         25133 |@@@                                |
> [32K, 64K)          8274 |@                                  |
> 
> 
> The key thing to notice is that we have up to 128,000 pages in flight on
> these random production servers. The NICs have 64 RX queues configured,
> and thus also 64 page_pool objects.
> 

I realized that we primarily want to know the maximum number of in-flight
pages.

So, I modified the bpftrace one-liner to track the max for each page_pool
in the system.

sudo bpftrace -e '
  rawtracepoint:page_pool_state_release { @cnt[probe]=count();
   @cnt_total[probe]=count();
   $pool=(struct page_pool*)arg0;
   $release_cnt=(uint32)arg2;
   $hold_cnt=$pool->pages_state_hold_cnt;
   $inflight_cnt=(int32)($hold_cnt - $release_cnt);
   $cur=@inflight_max[$pool];
   if ($inflight_cnt > $cur) {
     @inflight_max[$pool]=$inflight_cnt;}
  }
  interval:s:1 {time("\n%H:%M:%S\n");
   print(@cnt); clear(@cnt);
   print(@inflight_max);
   print(@cnt_total);
  }'
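
Side note: the same max tracking can probably be written more compactly
with bpftrace's built-in max() map aggregation. Untested sketch, using the
same tracepoint and arguments as above:

sudo bpftrace -e '
  rawtracepoint:page_pool_state_release {
   $pool=(struct page_pool*)arg0;
   $release_cnt=(uint32)arg2;
   $hold_cnt=$pool->pages_state_hold_cnt;
   /* hold/release are u32 counters; the int32 cast keeps the difference
      correct even after the counters wrap around */
   $inflight_cnt=(int32)($hold_cnt - $release_cnt);
   @inflight_max[$pool]=max($inflight_cnt);
  }
  interval:s:1 {time("\n%H:%M:%S\n");
   print(@inflight_max);
  }'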

I've attached the output from the script.
For some unknown reason, this system had 199 page_pool objects.
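(That count can be cross-checked against the attached file; assuming it
holds one final @inflight_max line per page_pool, something like
"grep -c inflight_max out02.inflight-max" should report 199.)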

The top 20 users:

$ cat out02.inflight-max | grep inflight_max | tail -n 20
@inflight_max[0xffff88829133d800]: 26473
@inflight_max[0xffff888293c3e000]: 27042
@inflight_max[0xffff888293c3b000]: 27709
@inflight_max[0xffff8881076f2800]: 29400
@inflight_max[0xffff88818386e000]: 29690
@inflight_max[0xffff8882190b1800]: 29813
@inflight_max[0xffff88819ee83800]: 30067
@inflight_max[0xffff8881076f4800]: 30086
@inflight_max[0xffff88818386b000]: 31116
@inflight_max[0xffff88816598f800]: 36970
@inflight_max[0xffff8882190b7800]: 37336
@inflight_max[0xffff888293c38800]: 39265
@inflight_max[0xffff888293c3c800]: 39632
@inflight_max[0xffff888293c3b800]: 43461
@inflight_max[0xffff888293c3f000]: 43787
@inflight_max[0xffff88816598f000]: 44557
@inflight_max[0xffff888132ce9000]: 45037
@inflight_max[0xffff888293c3f800]: 51843
@inflight_max[0xffff888183869800]: 62612
@inflight_max[0xffff888113d08000]: 73203

Adding all values together:

  grep inflight_max out02.inflight-max | \
    awk 'BEGIN {tot=0} {tot+=$2; printf "total:" tot "\n"}' | tail -n 1

total:1707129

Worst case, we need a data structure holding 1,707,129 pages.
Fortunately, we don't need a single data structure, as this will be split
between the 199 page_pool's.
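
For a rough sense of scale (back-of-the-envelope numbers, assuming 4 KiB
order-0 pages, so larger-order pages would only increase this):

  1,707,129 / 199    ~=  8,580 in-flight pages per page_pool on average
                         (the single worst pool above peaked at 73,203)
  1,707,129 * 4 KiB  ~=  6.5 GiB of memory referenced by in-flight pages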

--Jesper

