[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJD7tkYmSAg_T289jRczARsXu2sCW0GrR9VPyL04fQRKzCK0hg@mail.gmail.com>
Date: Mon, 6 Nov 2023 13:17:14 -0800
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Waiman Long <longman@...hat.com>
Cc: Tejun Heo <tj@...nel.org>, Zefan Li <lizefan.x@...edance.com>,
Johannes Weiner <hannes@...xchg.org>, cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org, Joe Mario <jmario@...hat.com>,
Sebastian Jug <sejug@...hat.com>
Subject: Re: [PATCH v3 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()
On Mon, Nov 6, 2023 at 12:37 PM Waiman Long <longman@...hat.com> wrote:
>
> On 11/6/23 15:07, Yosry Ahmed wrote:
> > On Fri, Nov 3, 2023 at 8:13 PM Waiman Long <longman@...hat.com> wrote:
> >> The current design of cgroup_rstat_cpu_pop_updated() is to traverse
> >> the updated tree in a way to pop out the leaf nodes first before
> >> their parents. This can cause traversal of multiple nodes before a
> >> leaf node can be found and popped out. IOW, a given node in the tree
> >> can be visited multiple times before the whole operation is done. So
> >> it is not very efficient and the code can be hard to read.
> >>
> >> With the introduction of cgroup_rstat_updated_list() to build a list
> >> of cgroups to be flushed first before any flushing operation is being
> >> done, we can optimize the way the updated tree nodes are being popped
> >> by pushing the parents first to the tail end of the list before their
> >> children. In this way, most updated tree nodes will be visited only
> >> once with the exception of the subtree root as we still need to go
> >> back to its parent and popped it out of its updated_children list.
> >> This also makes the code easier to read.
> >>
> >> A parallel kernel build on a 2-socket x86-64 server is used as the
> >> benchmarking tool for measuring the lock hold time. Below were the lock
> >> hold time frequency distribution before and after the patch:
> >>
> >> Hold time Before patch After patch
> >> --------- ------------ -----------
> >> 0-01 us 13,738,708 14,594,545
> >> 01-05 us 1,177,194 439,926
> >> 05-10 us 4,984 5,960
> >> 10-15 us 3,562 3,543
> >> 15-20 us 1,314 1,397
> >> 20-25 us 18 25
> >> 25-30 us 12 12
> >>
> >> It can be seen that the patch pushes the lock hold time towards the
> >> lower end.
> >>
> >> Signed-off-by: Waiman Long <longman@...hat.com>
> >> ---
> > I don't know why git decided to show this diff in the most confusing
> > way possible.
> I agree. The diff is really hard to look at. It will be easier to apply
> the patch & looks at the actual rstat.c file.
Would the diff be simpler if patches 1 & 2 were squashed?
[..]
> >
> >> *
> >> * The only ordering guarantee is that, for a parent and a child pair
> >> - * covered by a given traversal, if a child is visited, its parent is
> >> - * guaranteed to be visited afterwards.
> >> + * covered by a given traversal, the child is before its parent in
> >> + * the list.
> >> + *
> >> + * Note that updated_children is self terminated while updated_next is
> >> + * parent cgroup terminated except the cgroup root which can be self
> >> + * terminated.
> > IIUC updated_children and updated_next is the same list.
> > updated_children is the head, and updated_next is how the list items
> > are linked. This comment makes it seem like they are two different
> > lists.
> Thanks for the comment. I will rework the comment to clarify that a bit
> more.
> >
> > I am actually wondering if it's worth using the singly linked list
> > here. We are saving 8 bytes percpu, but the semantics are fairly
> > confusing. Wouldn't this be easier to reason about if you just use
> > list_head?
> >
> > updated_children would be replaced with LIST_HEAD (or similar), and
> > the list would be NULL terminated instead of terminated by self/parent
> > cgroup. IIUC the reason it's not NULL-terminated now is because we use
> > cgroup->updated_next to check quickly if a cgroup is on the list or
> > not. If we use list_heads, we can just use list_emtpy() IIUC.
> >
> > We can also simplify the semantics of unlinking @root from the updated
> > tree below, it would just be list_del() IIUC, which is actually more
> > performant as well. It seems like overall we would simplify a lot of
> > things. When forming the updated_list, we can just walk the tree and
> > splice the lists in the correct order.
> >
> > It seems to me that saving 8 bytes percpu is not worth the complexity
> > of the custom list semantics here. Am I missing something here?
>
> It will cost an additional 16 bytes of percpu memory if converted to
> list_heads. Like other lists, there will be sibling and children
> list_heads. There are also 2 pointers to update instead of one. Anyway,
> I don't have an objection to convert them to list_heads if agreed by Tejun.
Yes you are right. It's definitely not free, but it's also not super
costly. It's just that every time I look at the rstat code I need to
remind myself of how updated_next and updated_children work. I will
let Tejun decide.
Powered by blists - more mailing lists