[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2212f172-def9-3ec7-b3d7-732c2b2c365e@redhat.com>
Date: Mon, 6 Nov 2023 15:37:16 -0500
From: Waiman Long <longman@...hat.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Tejun Heo <tj@...nel.org>, Zefan Li <lizefan.x@...edance.com>,
Johannes Weiner <hannes@...xchg.org>, cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org, Joe Mario <jmario@...hat.com>,
Sebastian Jug <sejug@...hat.com>
Subject: Re: [PATCH v3 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()
On 11/6/23 15:07, Yosry Ahmed wrote:
> On Fri, Nov 3, 2023 at 8:13 PM Waiman Long <longman@...hat.com> wrote:
>> The current design of cgroup_rstat_cpu_pop_updated() is to traverse
>> the updated tree in a way to pop out the leaf nodes first before
>> their parents. This can cause traversal of multiple nodes before a
>> leaf node can be found and popped out. IOW, a given node in the tree
>> can be visited multiple times before the whole operation is done. So
>> it is not very efficient and the code can be hard to read.
>>
>> With the introduction of cgroup_rstat_updated_list() to build a list
>> of cgroups to be flushed first before any flushing operation is being
>> done, we can optimize the way the updated tree nodes are being popped
>> by pushing the parents first to the tail end of the list before their
>> children. In this way, most updated tree nodes will be visited only
>> once with the exception of the subtree root as we still need to go
>> back to its parent and popped it out of its updated_children list.
>> This also makes the code easier to read.
>>
>> A parallel kernel build on a 2-socket x86-64 server is used as the
>> benchmarking tool for measuring the lock hold time. Below were the lock
>> hold time frequency distribution before and after the patch:
>>
>> Hold time Before patch After patch
>> --------- ------------ -----------
>> 0-01 us 13,738,708 14,594,545
>> 01-05 us 1,177,194 439,926
>> 05-10 us 4,984 5,960
>> 10-15 us 3,562 3,543
>> 15-20 us 1,314 1,397
>> 20-25 us 18 25
>> 25-30 us 12 12
>>
>> It can be seen that the patch pushes the lock hold time towards the
>> lower end.
>>
>> Signed-off-by: Waiman Long <longman@...hat.com>
>> ---
> I don't know why git decided to show this diff in the most confusing
> way possible.
I agree. The diff is really hard to look at. It will be easier to apply
the patch & looks at the actual rstat.c file.
>
>> kernel/cgroup/rstat.c | 132 ++++++++++++++++++++++--------------------
>> 1 file changed, 70 insertions(+), 62 deletions(-)
>>
>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>> index 1f300bf4dc40..d2b709cfeb2a 100644
>> --- a/kernel/cgroup/rstat.c
>> +++ b/kernel/cgroup/rstat.c
>> @@ -74,64 +74,90 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
>> }
>>
>> /**
>> - * cgroup_rstat_cpu_pop_updated - iterate and dismantle rstat_cpu updated tree
>> - * @pos: current position
>> - * @root: root of the tree to traversal
>> + * cgroup_rstat_push_children - push children cgroups into the given list
>> + * @head: current head of the list (= parent cgroup)
>> + * @prstatc: cgroup_rstat_cpu of the parent cgroup
>> * @cpu: target cpu
>> + * Return: A new singly linked list of cgroups to be flush
>> *
>> - * Walks the updated rstat_cpu tree on @cpu from @root. %NULL @pos starts
>> - * the traversal and %NULL return indicates the end. During traversal,
>> - * each returned cgroup is unlinked from the tree. Must be called with the
>> - * matching cgroup_rstat_cpu_lock held.
>> + * Recursively traverse down the cgroup_rstat_cpu updated tree and push
>> + * parent first before its children. The parent is pushed by the caller.
> I think it might be useful here (and elsewhere in the patch) where
> "push" is being used to elaborate that we push to the beginning in a
> stack-like fashion.
Right, I am thinking about a stack when I use the word "push". I will
clarify that in the comment.
>
>> + * The recursion depth is the depth of the current updated tree.
>> + */
>> +static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
>> + struct cgroup_rstat_cpu *prstatc, int cpu)
>> +{
>> + struct cgroup *child, *parent;
>> + struct cgroup_rstat_cpu *crstatc;
>> +
>> + parent = head;
>> + child = prstatc->updated_children;
>> + prstatc->updated_children = parent;
>> +
>> + /* updated_next is parent cgroup terminated */
>> + while (child != parent) {
>> + child->rstat_flush_next = head;
>> + head = child;
>> + crstatc = cgroup_rstat_cpu(child, cpu);
>> + if (crstatc->updated_children != parent)
> I think cgroup->updated_children is set to the cgroup itself if it's
> empty, right? Shouldn't this be crstatc->updated_children != child?
My mistake. Will fix it in the next version.
>
>> + head = cgroup_rstat_push_children(head, crstatc, cpu);
>> + child = crstatc->updated_next;
>> + crstatc->updated_next = NULL;
>> + }
>> + return head;
>> +}
>> +
>> +/**
>> + * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
>> + * @root: root of the cgroup subtree to traverse
>> + * @cpu: target cpu
>> + * Return: A singly linked list of cgroups to be flushed
>> + *
>> + * Walks the updated rstat_cpu tree on @cpu from @root. During traversal,
>> + * each returned cgroup is unlinked from the updated tree. Must be called
>> + * with the matching cgroup_rstat_cpu_lock held.
> This function takes care of holding the lock actually. I think that
> sentence should be applied to cgroup_rstat_push_children() above?
It is left over from before this patch. Will remove that.
>
>> *
>> * The only ordering guarantee is that, for a parent and a child pair
>> - * covered by a given traversal, if a child is visited, its parent is
>> - * guaranteed to be visited afterwards.
>> + * covered by a given traversal, the child is before its parent in
>> + * the list.
>> + *
>> + * Note that updated_children is self terminated while updated_next is
>> + * parent cgroup terminated except the cgroup root which can be self
>> + * terminated.
> IIUC updated_children and updated_next is the same list.
> updated_children is the head, and updated_next is how the list items
> are linked. This comment makes it seem like they are two different
> lists.
Thanks for the comment. I will rework the comment to clarify that a bit
more.
>
> I am actually wondering if it's worth using the singly linked list
> here. We are saving 8 bytes percpu, but the semantics are fairly
> confusing. Wouldn't this be easier to reason about if you just use
> list_head?
>
> updated_children would be replaced with LIST_HEAD (or similar), and
> the list would be NULL terminated instead of terminated by self/parent
> cgroup. IIUC the reason it's not NULL-terminated now is because we use
> cgroup->updated_next to check quickly if a cgroup is on the list or
> not. If we use list_heads, we can just use list_emtpy() IIUC.
>
> We can also simplify the semantics of unlinking @root from the updated
> tree below, it would just be list_del() IIUC, which is actually more
> performant as well. It seems like overall we would simplify a lot of
> things. When forming the updated_list, we can just walk the tree and
> splice the lists in the correct order.
>
> It seems to me that saving 8 bytes percpu is not worth the complexity
> of the custom list semantics here. Am I missing something here?
It will cost an additional 16 bytes of percpu memory if converted to
list_heads. Like other lists, there will be sibling and children
list_heads. There are also 2 pointers to update instead of one. Anyway,
I don't have an objection to convert them to list_heads if agreed by Tejun.
Cheers,
Longman
Powered by blists - more mailing lists