Message-ID: <30ab6359-2ad6-4be0-bf73-59ae454811a9@huawei.com>
Date: Thu, 7 Nov 2024 19:09:52 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>,
Toke Høiland-Jørgensen <toke@...hat.com>,
<davem@...emloft.net>, <kuba@...nel.org>, <pabeni@...hat.com>
CC: <zhangkun09@...wei.com>, <fanghaiqing@...wei.com>,
<liuyonglong@...wei.com>, Robin Murphy <robin.murphy@....com>, Alexander
Duyck <alexander.duyck@...il.com>, IOMMU <iommu@...ts.linux.dev>, Andrew
Morton <akpm@...ux-foundation.org>, Eric Dumazet <edumazet@...gle.com>, Ilias
Apalodimas <ilias.apalodimas@...aro.org>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>, <netdev@...r.kernel.org>, kernel-team
<kernel-team@...udflare.com>, Viktor Malik <vmalik@...hat.com>
Subject: Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver
has already unbound
On 2024/11/6 23:57, Jesper Dangaard Brouer wrote:
...
>>
>> Some more info from production servers.
>>
>> (I'm amazed what we can do with a simple bpftrace script, Cc Viktor)
>>
>> In the bpftrace script/oneliner below I'm extracting the inflight count for
>> all page_pools in the system, and storing that in a histogram hash.
>>
>> sudo bpftrace -e '
>> rawtracepoint:page_pool_state_release { @cnt[probe]=count();
>> @cnt_total[probe]=count();
>> $pool=(struct page_pool*)arg0;
>> $release_cnt=(uint32)arg2;
>> $hold_cnt=$pool->pages_state_hold_cnt;
>> $inflight_cnt=(int32)($hold_cnt - $release_cnt);
>> @inflight=hist($inflight_cnt);
>> }
>> interval:s:1 {time("\n%H:%M:%S\n");
>> print(@cnt); clear(@cnt);
>> print(@inflight);
>> print(@cnt_total);
>> }'
>>
>> The page_pool behavior depends on how the NIC driver uses it, so I've run this on two prod servers with drivers bnxt and mlx5, on a 6.6.51 kernel.
>>
>> Driver: bnxt_en
>> - kernel 6.6.51
>>
>> @cnt[rawtracepoint:page_pool_state_release]: 8447
>> @inflight:
>> [0] 507 | |
>> [1] 275 | |
>> [2, 4) 261 | |
>> [4, 8) 215 | |
>> [8, 16) 259 | |
>> [16, 32) 361 | |
>> [32, 64) 933 | |
>> [64, 128) 1966 | |
>> [128, 256) 937052 |@@@@@@@@@ |
>> [256, 512) 5178744 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [512, 1K) 73908 | |
>> [1K, 2K) 1220128 |@@@@@@@@@@@@ |
>> [2K, 4K) 1532724 |@@@@@@@@@@@@@@@ |
>> [4K, 8K) 1849062 |@@@@@@@@@@@@@@@@@@ |
>> [8K, 16K) 1466424 |@@@@@@@@@@@@@@ |
>> [16K, 32K) 858585 |@@@@@@@@ |
>> [32K, 64K) 693893 |@@@@@@ |
>> [64K, 128K) 170625 |@ |
>>
>> Driver: mlx5_core
>> - Kernel: 6.6.51
>>
>> @cnt[rawtracepoint:page_pool_state_release]: 1975
>> @inflight:
>> [128, 256) 28293 |@@@@ |
>> [256, 512) 184312 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> [512, 1K) 0 | |
>> [1K, 2K) 4671 | |
>> [2K, 4K) 342571 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [4K, 8K) 180520 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> [8K, 16K) 96483 |@@@@@@@@@@@@@@ |
>> [16K, 32K) 25133 |@@@ |
>> [32K, 64K) 8274 |@ |
>>
>>
>> The key thing to notice is that we have up to 128,000 pages in flight on
>> these random production servers. The NICs have 64 RX queues configured,
>> and thus also 64 page_pool objects.
>>
>
> I realized that we primarily want to know the maximum in-flight pages.
>
> So, I modified the bpftrace oneliner to track the max for each page_pool in the system.
>
> sudo bpftrace -e '
> rawtracepoint:page_pool_state_release { @cnt[probe]=count();
> @cnt_total[probe]=count();
> $pool=(struct page_pool*)arg0;
> $release_cnt=(uint32)arg2;
> $hold_cnt=$pool->pages_state_hold_cnt;
> $inflight_cnt=(int32)($hold_cnt - $release_cnt);
> $cur=@inflight_max[$pool];
> if ($inflight_cnt > $cur) {
> @inflight_max[$pool]=$inflight_cnt;}
> }
> interval:s:1 {time("\n%H:%M:%S\n");
> print(@cnt); clear(@cnt);
> print(@inflight_max);
> print(@cnt_total);
> }'
>
> I've attached the output from the script.
> For an unknown reason this system had 199 page_pool objects.
Perhaps some of those page_pool objects are per_cpu page_pool
objects from net_page_pool_create()?
It would be good if the pool_size for those page_pool objects
were printed too.
>
> The 20 top users:
>
> $ cat out02.inflight-max | grep inflight_max | tail -n 20
> @inflight_max[0xffff88829133d800]: 26473
> @inflight_max[0xffff888293c3e000]: 27042
> @inflight_max[0xffff888293c3b000]: 27709
> @inflight_max[0xffff8881076f2800]: 29400
> @inflight_max[0xffff88818386e000]: 29690
> @inflight_max[0xffff8882190b1800]: 29813
> @inflight_max[0xffff88819ee83800]: 30067
> @inflight_max[0xffff8881076f4800]: 30086
> @inflight_max[0xffff88818386b000]: 31116
> @inflight_max[0xffff88816598f800]: 36970
> @inflight_max[0xffff8882190b7800]: 37336
> @inflight_max[0xffff888293c38800]: 39265
> @inflight_max[0xffff888293c3c800]: 39632
> @inflight_max[0xffff888293c3b800]: 43461
> @inflight_max[0xffff888293c3f000]: 43787
> @inflight_max[0xffff88816598f000]: 44557
> @inflight_max[0xffff888132ce9000]: 45037
> @inflight_max[0xffff888293c3f800]: 51843
> @inflight_max[0xffff888183869800]: 62612
> @inflight_max[0xffff888113d08000]: 73203
>
> Adding all values together:
>
> grep inflight_max out02.inflight-max | awk 'BEGIN {tot=0} {tot+=$2; printf "total:" tot "\n"}' | tail -n 1
>
> total:1707129
>
> Worst case we need a data structure holding 1,707,129 pages.
On a 64-bit system, that means about 27MB of memory overhead for tracking those
inflight pages if 16 bytes of metadata are needed for each page. I guess
that is ok for those large systems.
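
As a rough sanity check of that estimate, here is a minimal Python sketch of the
arithmetic. The page total and pool count are the figures measured in this
thread; the per-page record sizes (16 and 32 bytes) are assumptions, and the
32-byte row is only there to show how the total scales.

```python
# Worst-case sum of the per-pool in-flight maxima and the number of
# page_pool objects, both taken from the measurements quoted in this thread.
TOTAL_INFLIGHT_PAGES = 1_707_129
NUM_POOLS = 199


def tracking_overhead_bytes(pages: int, meta_bytes_per_page: int) -> int:
    """Memory needed to keep one fixed-size record per in-flight page."""
    return pages * meta_bytes_per_page


if __name__ == "__main__":
    for meta in (16, 32):  # assumed record sizes in bytes
        total = tracking_overhead_bytes(TOTAL_INFLIGHT_PAGES, meta)
        per_pool = total / NUM_POOLS  # average share of one page_pool
        print(f"{meta} B/page: {total / 1e6:.1f} MB total, "
              f"{per_pool / 1e3:.1f} kB per pool on average")
```

The per-pool figure is only an average over the 199 pools. The busiest pool in
the list above peaked at 73203 pages, so a per-pool structure would need to
cope with a much larger share than the average.
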
> Fortunately, we don't need a single data structure as this will be split
> between 199 page_pool's.
It would be good to have an average value for the number of inflight pages,
so that we might be able to use statically allocated memory to cover the
most common case, and fall back to dynamically allocated memory if/when
necessary.
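
One rough way to get such a number from data already in this thread is to
estimate it from the log2 histogram. The Python sketch below uses the mlx5_core
@inflight buckets quoted earlier and assumes every sample sits at the midpoint
of its bucket, which is crude. The samples are taken at release events across
all pools, so the result is weighted by release events rather than by time or
by pool. The function names are hypothetical.

```python
# (low, high, count) buckets copied from the mlx5_core @inflight histogram
# quoted earlier in this thread; each range is [low, high).
MLX5_INFLIGHT_HIST = [
    (128, 256, 28293),
    (256, 512, 184312),
    (512, 1024, 0),
    (1024, 2048, 4671),
    (2048, 4096, 342571),
    (4096, 8192, 180520),
    (8192, 16384, 96483),
    (16384, 32768, 25133),
    (32768, 65536, 8274),
]


def approx_mean(hist):
    """Estimate the mean, treating each sample as sitting at its bucket midpoint."""
    total = sum(count for _, _, count in hist)
    if total == 0:
        raise ValueError("empty histogram")
    weighted = sum((low + high) / 2 * count for low, high, count in hist)
    return weighted / total


def approx_percentile_bound(hist, pct):
    """Upper bound of the bucket in which the cumulative count reaches pct percent."""
    total = sum(count for _, _, count in hist)
    if total == 0:
        raise ValueError("empty histogram")
    threshold = total * pct / 100
    running = 0
    for _, high, count in hist:
        running += count
        if running >= threshold:
            return high
    return hist[-1][1]


if __name__ == "__main__":
    print(f"approximate mean in-flight: {approx_mean(MLX5_INFLIGHT_HIST):.0f} pages")
    for pct in (50, 90, 99):
        bound = approx_percentile_bound(MLX5_INFLIGHT_HIST, pct)
        print(f"{pct}% of samples are below {bound} pages")
```

A percentile bound like this might be a more useful guide than the mean for
sizing a static allocation, since it says directly how often the dynamic
fallback would be needed.
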
>
> --Jesper