[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <152f65f1-5b6d-1795-1bc9-3ea318f0bae0@bytedance.com>
Date: Fri, 14 Jul 2023 20:32:55 +0800
From: Peng Zhang <zhangpeng.00@...edance.com>
To: "Liam R. Howlett" <Liam.Howlett@...cle.com>
Cc: Peng Zhang <zhangpeng.00@...edance.com>,
maple-tree@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: Maple Tree Work
在 2023/7/13 23:36, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@...edance.com> [230713 04:05]:
>>
>>
>> 在 2023/7/12 23:27, Liam R. Howlett 写道:
>>> Dropping Danilo from the Cc..
>>>
>>> I was asked to add linux-mm to the list, so I did that as well.
>>>
>>> If anyone else is interested in seeing the full list, it's on lkml [1] and
>>> the maple-tree list [2].
>>>
>>> Thank you Peng for looking at the list and taking time to think about
>>> the items.
>>>
>>> * Peng Zhang <zhangpeng.00@...edance.com> [230712 07:49]:
>>>>
>>>>
>>>> 在 2023/7/8 00:38, Liam R. Howlett 写道:
>>>>> - Fork & Dup tree + Delete DONT_COPY
>>>>> This is to optimize dup_mmap() in kernel/fork.c, but other
>>>>> users may want faster duplications of a tree.
>>>>> This should be faster than building a tree in bulk mode. The
>>>>> idea is to just copy each node and replace the pointers in,
>>>>> probably, a BFS order. Once the leaves are reached, the VMAs
>>>>> will be replaced by the copies made in fork, unless DONT_COPY is
>>>>> set, in which case the VMA will be deleted from the copied tree.
>>>>> DONT_COPY is not common and since the tree isn't visible, all
>>>>> nodes should be available for reuse (no RCU worries).
>>>> If DONT_COPY is set, this method will be complicated, because the gaps
>>>> adjacent to it need to be merged, and the gaps of all ancestor nodes need to
>>>> be updated.
>>>
>>> My understanding is that this is a rare event; there aren't many VMAs
>>> marked this way. The store operation already does all the necessary
>>> work for deleting an entry. The gap tracking is also updated, and that
>>> would only happen if the new gap is larger. Are you concerned about the
>>> performance/runtime of handling the DONT_COPY in this way?
>> Yes, if no DONT_COPY is set, copying all nodes and replacing the
>> pointers must be the fastest way. I was just thinking if there is
>> a faster way if DONT_COPY exists. Using the store operation to
>> delete unnecessary entries will cause additional overhead sometimes.
>> To give an example:
>>
>> Suppose a leaf node layout of the maple tree is as follows:
>> [VMA1][VMA2][NULL][VMA3]
>>
>> If VMA2 has DONT_COPY set, we need to change the node layout as
>> follows to delete it:
>> [VMA1'][NULL][VMA3']
>>
>> At least we need to copy this node to replace it to make this
>> delete operation, even need to rebalance. However, this is a
>> rare case. In most cases, there is no gap between VMAs, so it
>> will not cause changes in the node layout.
>
> Remember that at this point, there is no readers so we could edit the
> node without a copy. It would require new code, but it's just moving
> data left.. The bigger worry is a rebalance, as you said, and that can
> get complicated. We know we have (more than) enough room to store the
> data, but editing in place isn't done in this version of the code. We
> do allow for node re-use by pushing back onto the mas->alloc, so the
> data requirements won't be too high. Without any readers, the parent
> pivots could be changed directly between two leaves.
>
>>>
>>>>
>>>> I have another idea to build a tree, if inserted in order, we only
>>>> insert at the leaf node. All leaf nodes are connected using a linked
>>>> list. In the end we get a linked list with only leaf nodes. Then we
>>>> construct non-leaf nodes layer by layer from bottom to top. I think
>>>> this is also faster than bulk mode. Another advantage of this method
>>>> is that we are applicable to more scenarios, do not need the original
>>>> tree, only need to know the ranges inserted in order. I don't know
>>>> how fast this method is, so we can discuss it.
>>>
>>> What is the advantage of a linked list over just building the tree as
>>> you go? Considering the non-leaf nodes are just a list of nodes
>>> already, and you will have to do the same work of setting pivots,
>>> allocating nodes, and filling them after you have the linked list.
>>>
>>> What work do you avoid that would make a linked list faster than bulk
>>> insert or a tree copy/replace VMAs? I was thinking that we could avoid
>>> a lot of the work involved with splitting/pushing and the parent
>>> construction by using memcpy of each node, replace each slot (and
>>> parent) with a new memcpy of the mirrored tree, then have a minimum
>>> amount of modifications to delete the DONT_COPY during the VMA
>>> replacement. BFS copy would ensure we wouldn't modify the source tree
>>> during VMA replacement and deletion (DONT_COPY). So the rebalancing (in
>>> bulk insert), pivot calculations, metadata, and gaps are (mostly) saved
>>> by using memcpy.
>> Your analysis is correct.
>>>
>>> From what I understand from your linked list idea, we would need to
>>> construct the child node by examining each entry and know that a certain
>>> entry is a DONT_COPY (ie: VMA logic would be needed in the maple tree
>>> code or a callback?). We really can't have VMA logic in the maple tree
>>> code, so we could do some sort of loop through the VMAs to add the entry
>>> to the list if desired.
>>>
>>> Once we have a linked list, we still need to figure out each pivot for
>>> the parent (I guess we won't use the last slot so there is a pivot to
>>> check?), and each max gap in each child to construct the upper layers
>>> of the tree. Is this correct?
>>>
>>> I guess we would still need to adjust the last node to ensure sufficient
>>> entries as well, so as we add items we may need to rebalance the last
>>> leaf with the second-last leaf.
>> Yes, the last two leaves need to check to make sure they have enough
>> items.
>>>
>>> The bulk store currently adjusts the split to favour splitting
>>> left, could you get the same result by strictly filling the nodes? This
>>> would have to have special handling to rebalance the last one - which we
>>> have a pretty good idea of when it's going to happen as we have a count
>>> (and the DONT_COPY is rare).
>>>
>>> Are you thinking you could compact the tree at the same time to gain
>>> efficiency?
>>>
>>> What would you consider a sufficient packed tree? It's best to keep
>>> some space around for expansion/contraction. This works out since, I
>>> think, we would need to keep that last slot free so we have a pivot to
>>> check in your linked list plan. Initial development with strict
>>> split/join rules resulted in a 'jittering' of nodes as the number of
>>> entries in a node shifted just above/below the threshold so we relaxed
>>> the merging of nodes to avoid such a scenario.
>> We can control the number of entries of nodes, for example, let this
>> number be (min + max)/2, so as to avoid making a node too 'dense' or
>> too 'loose'.
>
> By the way, in the VMA case, we also know the number of VMAs in the
> tree. Unfortunately, we don't know how many are DONT_COPY VMAs. I
> wonder if it would be worth while to balance each VMA with its neighbour
> during this operation, at least within one tree level? It would reduce
> the possibility of a larger rebalance on a DONT_COPY. It's probably not
> worth it because it would slow down our fast path.
>
>>>
>>> Let me know if you would like me to put your name beside the Fork & Dup
>>> Tree item in the list of work.
>> You can put my name on this one and I'll do it.
>
> Sounds good, thanks!
>
>> I use the method of copying all nodes, so I will implement an interface
>> to get a mirrored tree.
>>
>> But I can't think of a good way to replace old VMAs, it can't be done
>> during the copy tree process, because maybe some VMAs have DONT_COPY
>> flag.
>
> I think it could be done during the copy, instead of a 'replace' it
> would be a 'delete'. I think this is why we need to use a BFS-like
> duplication. So once we reach the leaves, then we can modify the tree
> knowing that the above state has already been copied and so it's going
> to be altering a copy of the data and so we are at a point where it can
> be mutated. You could detect that the lock-step next/insert is out of
> sync and delete the range between the old_mas and the new_mas.
>
>> It seems that we can only scan all VMAs in the source tree again
>> to update the new tree. We have already traversed the source tree once
>> in the process of copying the tree. Is there any way to avoid traversing
>> it again?
>
> Well, we haven't visited the VMAs in the copy, but we have visited the
> leaf nodes with all the pointers to the VMAs. I get what you are
> saying thought, we will have to duplicate the leaves then re-visit the
> leaves to replace the VMAs. I am not sure we can avoid this since a
> rebalance may occur, and it would be very tricky to rebalance with old
> and new data in the tree - it's probably best to just revisit, at least
> to begin with.
>
> Depending on how you implement it, you could make the copying of the
> tree end on the first leaf node by using the height of the tree to
> figure out which way to sweep (left to right or right to left) on the
> first level. Not strictly BFS, but you could end the maple state in the
> correct location to start replacing the VMAs. Would that work?
>
> We also have a reverse iterator, so we could just run through the tree
> from the right-most node to the start.
>
> I was thinking that we would make the first 'duplication store' detect
> an empty tree and duplicate the tree, ending on the left-most leaf and
> then replace the first entry (and possibly delete up to the first
> store). Each subsequent store would do the same. We would need a
> 'duplication complete' that would remove any entries beyond the last
> store and rebalance, if necessary.
>
> Feel free to use some or none of the ideas. I hope some of this helps
> with what you are trying to work out. Let me know if you have any more
> questions.
Thank you for providing so much information, I am doing this feature.
Thanks,
Peng
>
> Thanks,
> Liam
>
Powered by blists - more mailing lists