Date:   Tue, 7 Nov 2023 09:28:14 -0500
From:   Waiman Long <longman@...hat.com>
To:     Yosry Ahmed <yosryahmed@...gle.com>
Cc:     Tejun Heo <tj@...nel.org>, Zefan Li <lizefan.x@...edance.com>,
        Johannes Weiner <hannes@...xchg.org>, cgroups@...r.kernel.org,
        linux-kernel@...r.kernel.org, Joe Mario <jmario@...hat.com>,
        Sebastian Jug <sejug@...hat.com>
Subject: Re: [PATCH v3 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()

On 11/6/23 16:17, Yosry Ahmed wrote:
> On Mon, Nov 6, 2023 at 12:37 PM Waiman Long <longman@...hat.com> wrote:
>> On 11/6/23 15:07, Yosry Ahmed wrote:
>>> On Fri, Nov 3, 2023 at 8:13 PM Waiman Long <longman@...hat.com> wrote:
>>>> The current design of cgroup_rstat_cpu_pop_updated() is to traverse
>>>> the updated tree in a way that pops out the leaf nodes before their
>>>> parents. This can cause multiple nodes to be traversed before a leaf
>>>> node can be found and popped out. IOW, a given node in the tree can
>>>> be visited multiple times before the whole operation is done, so it
>>>> is not very efficient and the code can be hard to read.
>>>>
>>>> With the introduction of cgroup_rstat_updated_list() to build the
>>>> list of cgroups to be flushed before any flushing operation is done,
>>>> we can optimize the way the updated tree nodes are popped by pushing
>>>> the parents to the tail end of the list before their children. In
>>>> this way, most updated tree nodes will be visited only once, with the
>>>> exception of the subtree root, as we still need to go back to its
>>>> parent and pop it out of its updated_children list. This also makes
>>>> the code easier to read.
>>>>
>>>> A parallel kernel build on a 2-socket x86-64 server is used as the
>>>> benchmark workload for measuring the lock hold time. Below is the
>>>> lock hold time frequency distribution before and after the patch:
>>>>
>>>>        Hold time        Before patch       After patch
>>>>        ---------        ------------       -----------
>>>>          0-01 us        13,738,708         14,594,545
>>>>         01-05 us         1,177,194            439,926
>>>>         05-10 us             4,984              5,960
>>>>         10-15 us             3,562              3,543
>>>>         15-20 us             1,314              1,397
>>>>         20-25 us                18                 25
>>>>         25-30 us                12                 12
>>>>
>>>> It can be seen that the patch pushes the lock hold time towards the
>>>> lower end.
>>>>
>>>> Signed-off-by: Waiman Long <longman@...hat.com>
>>>> ---
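
(For illustration of the list construction described above: a minimal,
self-contained toy, not the actual rstat.c code. The names node,
flush_next and build_flush_list() are invented; the toy sibling list is
NULL-terminated for simplicity, unlike the parent-terminated
updated_next discussed later in this thread, and recursion is used only
to keep the sketch short. The point is just that each node is pushed
onto the head of the flush list before its children, so the children end
up in front of it and every node is visited exactly once.)

#include <stdio.h>

/* Toy stand-in for a cgroup node in a per-cpu updated tree. */
struct node {
        const char *name;
        struct node *updated_children;  /* first updated child                    */
        struct node *updated_next;      /* next updated sibling (NULL-terminated) */
        struct node *flush_next;        /* linkage in the resulting flush list    */
};

/*
 * Push @cg onto the head of the flush list, then push each of its
 * updated children the same way.  The children are prepended after
 * their parent, so they land in front of it: a walk from the returned
 * head sees every child before its parent, touching each node once.
 */
static struct node *build_flush_list(struct node *cg, struct node *head)
{
        struct node *child;

        cg->flush_next = head;
        head = cg;
        for (child = cg->updated_children; child; child = child->updated_next)
                head = build_flush_list(child, head);
        return head;
}

int main(void)
{
        struct node a = { "A" }, b = { "B" }, c = { "C" }, d = { "D" };
        struct node *n;

        /* A is the subtree root; B and C are its children; D is B's child. */
        a.updated_children = &b;
        b.updated_next = &c;
        b.updated_children = &d;

        for (n = build_flush_list(&a, NULL); n; n = n->flush_next)
                printf("%s ", n->name);   /* prints "C D B A" - children before parents */
        printf("\n");
        return 0;
}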
>>> I don't know why git decided to show this diff in the most confusing
>>> way possible.
>> I agree. The diff is really hard to look at. It will be easier to apply
>> the patch & look at the actual rstat.c file.
> Would the diff be simpler if patches 1 & 2 were squashed?
Maybe, but I prefer to keep them separate for now.
>
> [..]
>>>>     *
>>>>     * The only ordering guarantee is that, for a parent and a child pair
>>>> - * covered by a given traversal, if a child is visited, its parent is
>>>> - * guaranteed to be visited afterwards.
>>>> + * covered by a given traversal, the child is before its parent in
>>>> + * the list.
>>>> + *
>>>> + * Note that updated_children is self terminated while updated_next is
>>>> + * parent cgroup terminated except the cgroup root which can be self
>>>> + * terminated.
>>> IIUC updated_children and updated_next are the same list.
>>> updated_children is the head, and updated_next is how the list items
>>> are linked. This comment makes it seem like they are two different
>>> lists.
>> Thanks for the comment. I will rework the comment to clarify that a bit
>> more.
>>> I am actually wondering if it's worth using the singly linked list
>>> here. We are saving 8 bytes percpu, but the semantics are fairly
>>> confusing. Wouldn't this be easier to reason about if you just use
>>> list_head?
>>>
>>> updated_children would be replaced with LIST_HEAD (or similar), and
>>> the list would be NULL terminated instead of terminated by self/parent
>>> cgroup. IIUC the reason it's not NULL-terminated now is because we use
>>> cgroup->updated_next to check quickly if a cgroup is on the list or
>>> not. If we use list_heads, we can just use list_empty() IIUC.
>>>
>>> We can also simplify the semantics of unlinking @root from the updated
>>> tree below, it would just be list_del() IIUC, which is actually more
>>> performant as well. It seems like overall we would simplify a lot of
>>> things. When forming the updated_list, we can just walk the tree and
>>> splice the lists in the correct order.
>>>
>>> It seems to me that saving 8 bytes percpu is not worth the complexity
>>> of the custom list semantics here. Am I missing something here?
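
(A rough sketch of what that suggestion could look like, purely for
illustration: kernel-style code, not compilable on its own and not the
actual cgroup_rstat_cpu definition. The struct and field names are
invented, and the per-cpu placement problem raised further down the
thread is ignored here.)

#include <linux/list.h>

struct rstat_cpu_listhead {
        /* ... existing per-cpu rstat counters ... */
        struct list_head updated_children;  /* head: this cgroup's updated children  */
        struct list_head updated_link;      /* node: our place on the parent's list  */
};

/* Both list_heads must be INIT_LIST_HEAD()-ed when the per-cpu data is set up. */

/* Queue @rcpu on @parent_rcpu's updated list (idempotent). */
static void rstat_cpu_mark_updated(struct rstat_cpu_listhead *rcpu,
                                   struct rstat_cpu_listhead *parent_rcpu)
{
        /* list_empty() on our own link replaces the updated_next != NULL check */
        if (list_empty(&rcpu->updated_link))
                list_add_tail(&rcpu->updated_link,
                              &parent_rcpu->updated_children);
}

/* Unlink @rcpu from whatever updated list it is on. */
static void rstat_cpu_unlink(struct rstat_cpu_listhead *rcpu)
{
        list_del_init(&rcpu->updated_link);  /* keeps list_empty() valid afterwards */
}

With list_del_init(), removal is constant time regardless of position,
which is presumably the performance point being made above.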
>> It will cost an additional 16 bytes of percpu memory if converted to
>> list_heads. Like other lists, there will be sibling and children
>> list_heads, and there are also 2 pointers to update instead of one.
>> Anyway, I don't have an objection to converting them to list_heads if
>> Tejun agrees.
> Yes you are right. It's definitely not free, but it's also not super
> costly. It's just that every time I look at the rstat code I need to
> remind myself of how updated_next and updated_children work. I will
> let Tejun decide.

After further thought, changing it to list_head may not be possible with
the current design, since the actual linkage is like:

updated_next -> cgroup + cpu --> updated_next --> ...

So unless we change the design to link the cgroup_rstat_cpu structures
directly to each other for a given CPU, not via a cgroup intermediary, we
will not be able to use list_head and the associated list_add() &
list_del() helpers. Direct linkage, however, requires a cgroup back
pointer, so the real cost will be 24 bytes instead.
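
(To make the byte accounting concrete, a sketch of just the linkage
fields, assuming 64-bit pointers; the struct names are invented and
these are not the actual definitions.)

/*
 * Today: two pointers per cgroup per cpu, 16 bytes.  The "list" hops
 * through the owning cgroup: updated_next names a cgroup, and the next
 * per-cpu node is found via cgroup_rstat_cpu(next_cgroup, cpu).
 */
struct rstat_cpu_fields_now {
        struct cgroup *updated_children;    /*  8 bytes: first updated child  */
        struct cgroup *updated_next;        /*  8 bytes: next updated sibling */
};

/*
 * list_head variant: the per-cpu structures would have to link to each
 * other directly, so a back pointer to the owning cgroup becomes
 * necessary - 40 bytes per cgroup per cpu, i.e. 24 bytes more.
 */
struct rstat_cpu_fields_listhead {
        struct list_head updated_children;  /* 16 bytes: list head              */
        struct list_head updated_link;      /* 16 bytes: our node on the parent */
        struct cgroup *cgroup;              /*  8 bytes: back pointer           */
};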

Cheers,
Longman
