Message-ID: <55854a94-681e-4142-9160-98b22fa64d61@kernel.org>
Date: Mon, 6 May 2024 14:03:47 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Waiman Long <longman@...hat.com>, tj@...nel.org, hannes@...xchg.org,
lizefan.x@...edance.com, cgroups@...r.kernel.org, yosryahmed@...gle.com,
netdev@...r.kernel.org, linux-mm@...ck.org, kernel-team@...udflare.com,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Daniel Dao <dqminh@...udflare.com>, Ivan Babrou <ivan@...udflare.com>,
jr@...udflare.com
Subject: Re: [PATCH v1] cgroup/rstat: add cgroup_rstat_cpu_lock helpers and
tracepoints
On 03/05/2024 21.18, Shakeel Butt wrote:
> On Fri, May 03, 2024 at 04:00:20PM +0200, Jesper Dangaard Brouer wrote:
>>
>>
> [...]
>>>
>>> I may have mistakenly been thinking the lock hold time refers to just the
>>> cpu_lock. Your reported times here are about the cgroup_rstat_lock.
>>> Right? If so, the numbers make sense to me.
>>>
>>
>> True, my reported numbers here are about the cgroup_rstat_lock.
>> Glad to hear we are more aligned, then :-)
>>
>> Given I just got some prod machines online with this patch's
>> cgroup_rstat_cpu_lock tracepoints, I can give you some early results
>> about hold time for the cgroup_rstat_cpu_lock.
>
> Oh you have already shared the preliminary data.
>
>>
>> From this one-liner bpftrace command:
>>
>> sudo bpftrace -e '
>> tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
>> @start[tid]=nsecs; @cnt[probe]=count()}
>> tracepoint:cgroup:cgroup_rstat_cpu_locked {
>> $now=nsecs;
>> if (args->contended) {
>> @wait_per_cpu_ns=hist($now-@start[tid]); delete(@start[tid]);}
>> @cnt[probe]=count(); @locked[tid]=$now}
>> tracepoint:cgroup:cgroup_rstat_cpu_unlock {
>> $now=nsecs;
>> @locked_per_cpu_ns=hist($now-@locked[tid]); delete(@locked[tid]);
>> @cnt[probe]=count()}
>> interval:s:1 {time("%H:%M:%S "); print(@wait_per_cpu_ns);
>> print(@locked_per_cpu_ns); print(@cnt); clear(@cnt);}'
>>
>> Results from a single 1 sec period:
>>
>> 13:39:55 @wait_per_cpu_ns:
>> [512, 1K) 3 | |
>> [1K, 2K) 12 |@ |
>> [2K, 4K) 390 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [4K, 8K) 70 |@@@@@@@@@ |
>> [8K, 16K) 24 |@@@ |
>> [16K, 32K) 183 |@@@@@@@@@@@@@@@@@@@@@@@@ |
>> [32K, 64K) 11 |@ |
>>
>> @locked_per_cpu_ns:
>> [256, 512) 75592 |@ |
>> [512, 1K) 2537357 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [1K, 2K) 528615 |@@@@@@@@@@ |
>> [2K, 4K) 168519 |@@@ |
>> [4K, 8K) 162039 |@@@ |
>> [8K, 16K) 100730 |@@ |
>> [16K, 32K) 42276 | |
>> [32K, 64K) 1423 | |
>> [64K, 128K) 89 | |
>>
>> @cnt[tracepoint:cgroup:cgroup_rstat_cpu_lock_contended]: 3 /sec
>> @cnt[tracepoint:cgroup:cgroup_rstat_cpu_unlock]: 3200 /sec
>> @cnt[tracepoint:cgroup:cgroup_rstat_cpu_locked]: 3200 /sec
>>
>>
>> So, we see the "flush-code-path" per-CPU hold time (@locked_per_cpu_ns)
>> doesn't exceed 128 usec.
>
> Hmm 128 usec is actually unexpectedly high.
> What does the cgroup hierarchy on your system look like?
I didn't design this, so hopefully my co-workers (@Daniel or @Jon) can help
me out here.
My low-level view is that there are 17 top-level directories in
/sys/fs/cgroup/.
There are 649 cgroups (counting occurrences of memory.stat).
Two directories contain the major part:
- /sys/fs/cgroup/system.slice = 379
- /sys/fs/cgroup/production.slice = 233
  - (production.slice has two directory levels)
- remaining = 37
We are open to changing this if you have any advice.
(@Daniel and @Jon are actually working on restructuring this.)
> How many cgroups have actual workloads running?
Do you have a command line trick to determine this?
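(The closest command-line trick I could come up with myself, assuming cgroup v2
and counting cgroups that have processes directly attached, is something like:

  sudo sh -c 'find /sys/fs/cgroup -name cgroup.procs | xargs grep -l . 2>/dev/null | wc -l'

but I'm not sure that captures what you mean by "actual workloads".)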
> Can the network softirqs run on any cpus or a smaller set of cpus? I am
> assuming these softirqs are processing packets from any or all cgroups and
> thus have a larger cgroup update tree.
Softirqs, and specifically NET_RX, run on half of the cores (i.e. 64).
(I'm looking at restructuring this allocation.)
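(One quick way to inspect how the NIC RX interrupts, and thereby the NET_RX
softirq work, are currently spread across CPUs is simply:

  grep . /proc/irq/*/smp_affinity_list

filtered down to the NIC's IRQ numbers; that's what I'll start from when
restructuring.)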
> I wonder if you could comment out the MEMCG_SOCK stat update and check
> whether you still see the same holding time.
>
>
It doesn't look like MEMCG_SOCK is used.
I deduce you are asking:
- What is the update count for different types of mod_memcg_state() calls?
// Dumped via BTF info
enum memcg_stat_item {
        MEMCG_SWAP = 43,
        MEMCG_SOCK = 44,
        MEMCG_PERCPU_B = 45,
        MEMCG_VMALLOC = 46,
        MEMCG_KMEM = 47,
        MEMCG_ZSWAP_B = 48,
        MEMCG_ZSWAPPED = 49,
        MEMCG_NR_STAT = 50,
};
sudo bpftrace -e 'kfunc:vmlinux:__mod_memcg_state{@[args->idx]=count()}
END{printf("\nEND time elapsed: %d sec\n", elapsed / 1000000000);}'
Attaching 2 probes...
^C
END time elapsed: 99 sec
@[45]: 17996
@[46]: 18603
@[43]: 61858
@[47]: 21398919
It seems clear that MEMCG_KMEM = 47 is the main "user".
- 21398919 / 99 sec ≈ 216150 calls per sec
Could someone explain to me what this MEMCG_KMEM is used for?
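(To figure out where these MEMCG_KMEM updates originate, I plan to run
something along these lines, reusing the idx value 47 from the BTF dump
above; take it as a rough sketch:

  sudo bpftrace -e 'kfunc:vmlinux:__mod_memcg_state /args->idx == 47/ {
        @[kstack(5)] = count(); }'

which should group the updates by their top call sites.)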
>>
>> My latency requirement, to avoid RX-queue overflow with 1024 slots running
>> at 25 Gbit/s, is 27.6 usec with small packets and 500 usec (0.5 ms) with
>> MTU-size packets. The times reported above are very close to these latency
>> requirements.
>>
>> --Jesper
>>
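For reference, the rough back-of-the-envelope behind those budget numbers,
assuming ~84 bytes on the wire for a minimum-size frame and 1538 bytes for an
MTU-size frame (both including Ethernet overhead):

  # Time to fill 1024 RX-ring slots at 25 Gbit/s.
  # Dividing bits by Gbit/s gives the answer in nanoseconds.
  echo '1024 * 84   * 8 / 25' | bc -l   # ~27525 ns  = ~27.5 usec (small frames)
  echo '1024 * 1538 * 8 / 25' | bc -l   # ~503972 ns = ~504 usec  (~0.5 ms, MTU)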