linux-kernel - Re: [PATCH drm-next v2 05/16] drm: manager to keep track of GPUs VA mappings

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20230320191633.wf3qdjtpy3xxtyz7@revolver>
Date:   Mon, 20 Mar 2023 15:16:33 -0400
From:   "Liam R. Howlett" <Liam.Howlett@...cle.com>
To:     Danilo Krummrich <dakr@...hat.com>
Cc:     airlied@...il.com, daniel@...ll.ch, tzimmermann@...e.de,
        mripard@...nel.org, corbet@....net, christian.koenig@....com,
        bskeggs@...hat.com, matthew.brost@...el.com,
        boris.brezillon@...labora.com, alexdeucher@...il.com,
        ogabbay@...nel.org, bagasdotme@...il.com, willy@...radead.org,
        jason@...kstrand.net, dri-devel@...ts.freedesktop.org,
        nouveau@...ts.freedesktop.org, linux-doc@...r.kernel.org,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Dave Airlie <airlied@...hat.com>
Subject: Re: [PATCH drm-next v2 05/16] drm: manager to keep track of GPUs VA
 mappings

* Danilo Krummrich <dakr@...hat.com> [230313 19:46]:
> On 3/7/23 23:43, Liam R. Howlett wrote:
> > * Danilo Krummrich <dakr@...hat.com> [230306 10:46]:
> > > On 3/2/23 03:38, Liam R. Howlett wrote:
> > > > * Danilo Krummrich <dakr@...hat.com> [230227 08:17]:
> > > > 
> > > > ...
> > > > > > > Would this variant be significantly more efficient?
> > > > > > 
> > > > > > Well, what you are doing is walking the tree to see if there's anything
> > > > > > there... then re-walking the tree to store it.  So, yes, it's much more
> > > > > > efficient..  However, writing is heavier.  How much of the time is spent
> > > > > > walking vs writing depends on the size of the tree, but it's rather easy
> > > > > > to do this in a single walk of the tree so why wouldn't you?
> > > > > 
> > > > > I will, I was just curious about how much of an impact it has.
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > Also, would this also work while already walking the tree?
> > > > > > 
> > > > > > Yes, to an extent.  If you are at the correct location in the tree, you
> > > > > > can write to that location.  If you are not in the correct location and
> > > > > > try to write to the tree then things will go poorly..  In this scenario,
> > > > > > we are very much walking the tree and writing to it in two steps.
> > > > > > 
> > > > > > > 
> > > > > > > To remove an entry while walking the tree I have a separate function
> > > > > > > drm_gpuva_iter_remove(). Would I need something similar for inserting
> > > > > > > entries?
> > > > > > 
> > > > > > I saw that.  Your remove function uses the erase operation which is
> > > > > > implemented as a walk to that location and a store of a null over the
> > > > > > range that is returned.  You do not need a function to insert an entry
> > > > > > if the maple state is at the correct location, and that doesn't just
> > > > > > mean setting mas.index/mas.last to the correct value.  There is a node &
> > > > > > offset saved in the maple state that needs to be in the correct
> > > > > > location.  If you store to that node then the node may be replaced, so
> > > > > > other iterators that you have may become stale, but the one you used
> > > > > > execute the store operation will now point to the new node with the new
> > > > > > entry.
> > > > > > 
> > > > > > > 
> > > > > > > I already provided this example in a separate mail thread, but it may makes
> > > > > > > sense to move this to the mailing list:
> > > > > > > 
> > > > > > > In __drm_gpuva_sm_map() we're iterating a given range of the tree, where the
> > > > > > > given range is the size of the newly requested mapping. __drm_gpuva_sm_map()
> > > > > > > invokes a callback for each sub-operation that needs to be taken in order to
> > > > > > > fulfill this mapping request. In most cases such a callback just creates a
> > > > > > > drm_gpuva_op object and stores it in a list.
> > > > > > > 
> > > > > > > However, drivers can also implement the callback, such that they directly
> > > > > > > execute this operation within the callback.
> > > > > > > 
> > > > > > > Let's have a look at the following example:
> > > > > > > 
> > > > > > >         0     a     2
> > > > > > > old: |-----------|       (bo_offset=n)
> > > > > > > 
> > > > > > >               1     b     3
> > > > > > > req:       |-----------| (bo_offset=m)
> > > > > > > 
> > > > > > >         0  a' 1     b     3
> > > > > > > new: |-----|-----------| (a.bo_offset=n,b.bo_offset=m)
> > > > > > > 
> > > > > > > This would result in the following operations.
> > > > > > > 
> > > > > > > __drm_gpuva_sm_map() finds entry "a" and calls back into the driver
> > > > > > > suggesting to re-map "a" with the new size. The driver removes entry "a"
> > > > > > > from the tree and adds "a'"
> > > > > > 
> > > > > > What you have here won't work.  The driver will cause your iterators
> > > > > > maple state to point to memory that is freed.  You will either need to
> > > > > > pass through your iterator so that the modifications can occur with that
> > > > > > maple state so it remains valid, or you will need to invalidate the
> > > > > > iterator on every modification by the driver.
> > > > > > 
> > > > > > I'm sure the first idea you have will be to invalidate the iterator, but
> > > > > > that is probably not the way to proceed.  Even ignoring the unclear
> > > > > > locking of two maple states trying to modify the tree, this is rather
> > > > > > inefficient - each invalidation means a re-walk of the tree.  You may as
> > > > > > well not use an iterator in this case.
> > > > > > 
> > > > > > Depending on how/when the lookups occur, you could still iterate over
> > > > > > the tree and let the driver modify the ending of "a", but leave the tree
> > > > > > alone and just store b over whatever - but the failure scenarios may
> > > > > > cause you grief.
> > > > > > 
> > > > > > If you pass the iterator through, then you can just use it to do your
> > > > > > writes and keep iterating as if nothing changed.
> > > > > 
> > > > > Passing through the iterater clearly seems to be the way to go.
> > > > > 
> > > > > I assume that if the entry to insert isn't at the location of the iterator
> > > > > (as in the following example) we can just keep walking to this location my
> > > > > changing the index of the mas and calling mas_walk()?
> > > > 
> > > > no.  You have to mas_set() to the value and walk from the top of the
> > > > tree.  mas_walk() walks down, not from side to side - well, it does go
> > > > forward within a node (increasing offset), but if you hit the node limit
> > > > then you have gotten yourself in trouble.
> > > > 
> > > > > This would also imply
> > > > > that the "outer" tree walk continues after the entry we just inserted,
> > > > > right?
> > > > 
> > > > I don't understand the "outer" tree walk statement.
> > > 
> > > I think I could have phrased this better. I just mean "my" iterator walking
> > > each tree entry rather than an internal tree walk, as it happens in e.g.
> > > mas_walk() or mas_find().
> > > 
> > > > 
> > > > > 
> > > > >              1     a     3
> > > > > old:       |-----------| (bo_offset=n)
> > > > > 
> > > > >        0     b     2
> > > > > req: |-----------|       (bo_offset=m)
> > > > > 
> > > > >        0     b     2  a' 3
> > > > > new: |-----------|-----| (b.bo_offset=m,a.bo_offset=n+2)
> > > > > 
> > > > > Again, after finding "a", we want to remove it and insert "a'" instead.
> > > > 
> > > > Ah, so you could walk to 0, see that it's NULL from 0 - 1, call
> > > > mas_next() and get "a" from 1 - 3, write "a'" from 2 - 3:
> > > > 
> > > >           0     1  a   2  a' 3
> > > > broken: |-----|------|-----| (a is broken in this 1/2 step)
> > > > 
> > > > mas_set_range(&mas, 0, 2); /* Resets the tree location to MAS_START */
> > > > mas_store(&mas, b);
> > > >           0     b     2  a' 3
> > > > new:    |-----------|-----| (b.bo_offset=m,a.bo_offset=n+2)
> > > > 
> > > > 
> > > > You can *probably* also get away with this:
> > > > 
> > > > walk to 0, see that it's NULL from 0 - 1, call mas_next() and get "a"
> > > > from 1 - 3, write "a'" from 2 - 3:
> > > > 
> > > >           0     1  a   2  a' 3
> > > > broken: |-----|------|-----| (a is broken in this 1/2 step)
> > > > 
> > > > mas_prev(&mas, 0); /* Looking at broken a from 1-2.
> > > > mas_store(&mas, NULL); /* NULL is expanded on write to 0-2.
> > > >               0    NULL   2  a' 3
> > > > broken':    |-----------|-----| (b.bo_offset=m,a.bo_offset=n+2)
> > > > 
> > > > mas_store(&mas, b);
> > > >           0     b     2  a' 3
> > > > new:    |-----------|-----| (b.bo_offset=m,a.bo_offset=n+2)
> > > > 
> > > > You may want to iterate backwards and do the writes as you go until you
> > > > have enough room.. it really depends how you want to go about doing
> > > > things.
> > > 
> > > I see, again thanks for explaining.
> > > 
> > > I think I would prefer to either (1) have generic insert() function with a
> > > similar behavior as when iterating through a list or (2) have a function
> > > dedicated to the "split" use case.
> > > 
> > > 1) When iterating the tree inserting entries at arbitrary locations should
> > > not influence the next iteration step. Unless the new entry really is the
> > > next entry, but that'd be optional. I don't see a use case for that.
> > > 
> > > 2) Similar to how you broke it down above I could imagine a function
> > > dedicated to the split operation. This would be similar to what you mention
> > > for mmap below. However, it wouldn't be a single operation.
> > > 
> > > The GPUVA manager provides sub-operations to the driver for a single mapping
> > > request. Those can be an arbitrary amount of unmaps (for mappings "in the
> > > way", as you say below), one or two remaps (for splits at the beginning or
> > > end or both) and exactly one map (which is the last sub-operation adding the
> > > newly requested mapping).
> > > 
> > > Remaps consist out of the mapping to unmap and one or two new mappings to
> > > map. The only case where a remap sub-op has two new mappings to map is when
> > > the newly requested mapping is enclosed by a single existing mapping. If we
> > > overlap a mapping at the beginning and another one at the end this would be
> > > two separate remap sub-ops. Of course, between the two remaps there could be
> > > an arbitrary amount of unmap sub-ops.
> > > 
> > > Unmap sub-ops are simple, I just need to remove a single entry in the tree.
> > > drm_gpuva_iter_remove() should be fine for that.
> > > 
> > > For remap sub-ops, I would need a function that removes an entry and then
> > > adds one or two new entries within the range of the removed one. The next
> > > loop iteration should then continue at the entry (is any) after the range of
> > > the removed one.
> > > 
> > > However, I'm unsure how to implement this. Would I need to just do a
> > > mas_store() of the new entry/entries (since the nodes should already be
> > > allocated) and then clean up the nodes that are left with mas_erase()?
> > > 
> > > Let's say there is an entry A = [0 - 5] and I want to replace it with B = [0
> > > - 1] and C = [4 - 5].
> > > 
> > > Could I just store B and C and then somehow clean up the range [2 - 3]?
> > 
> > The most efficient way:
> > mas_set(&mas, 0);
> > // Walk down to 0
> > mas_walk(&mas);
> > // We are now pointing at A (index = 0, last = 5)
> > mas.last = 1;
> > // No walk here.
> > mas_store(&mas, B);
> > // Going to the next entry is very fast.
> > mas_next(&mas)
> > // We are now pointing at a fragment of A (index = 2, last = 5)
> > mas.last = 3;
> > // No walk here.
> > mas_store(&mas, NULL);
> > // Going to the next entry is very fast
> > mas_next(&mas);
> > // We are now pointing at a fragment of A (index = 4, last = 5)
> > mas_store(&mas, C);
> > 
> > Less efficient, but still fine:
> > // Walk down to 0 and store
> > mas_set_range(&mas, 0, 1);
> > mas_store(&mas, B);
> > // Reset to the top of the tree
> > mas_set_range(&mas, 4, 5);
> > // Walk down to 4 and store
> > mas_store(&mas, C);
> > // Reset to the top of the tree
> > mas_set_range(&mas, 2, 3);
> > // Walk down to 2 and store
> > mas_store(&mas, NULL);
> > 
> > 
> > > 
> > > Maybe 1) would be the most flexible way, however, if 2) can be implemented
> > > more efficiently that's perfectly fine too.
> > 
> > You can do anything you want, but the more you can use the same maple
> > state and save walking from the top the more efficient it will be.
> > Every level is another dereference down the tree..  We do have a
> > branching factor of 16 here, so I don't know the size of your tree and
> > how worth the effort it is for you.
> 
> I think it could be worth taking the first approach and providing functions
> that are tied specifically to the use cases of the GPUVA manager, rather
> than generalizing them too much and re-walk the tree more than necessary. I
> think the size of the tree can be up to a couple 100k.

A couple 100k VMAs?  As in 2 trees of 100k VMAs or 200k VMAs in a single
tree?  So that's 5 dereferences to walk from the root to the VMA.

> 
> Since some operations may be executed from dma-fence signalling critical
> sections I have a use case for mas_preallocate(). I was wondering if I can
> ignore the "entry" argument of mas_preallocate() and just pass NULL, since
> it's actually never used. What's the purpose of this argument? Or is it bug?

It existed to optimize the preallocations, but that functionality was
never completed.  It is slated to be dropped by a patch [1] in the
mm-unstable branch.  I am not sure it's worth doing the optimization
after the zeroing fix [2] of the maple nodes.  If you find the
preallocations are too large and causing issues, we can revisit.. but
with a 5 level tree, we will allocate 16 nodes and almost always have
extras - we get 16 nodes per page.

If you have sparse data, then I would start to get concerned after ~524K
VMAs, then we'd be looking for 2 pages.  More compact data can run up to
~1.04M before needing 2 pages.  Then again, two pages doesn't seem like
a lot for such a large task.

How sparse is your data, on average?

[1] https://lore.kernel.org/all/20230110154211.1758562-1-vernon2gm@gmail.com/T/#u
[2] https://lore.kernel.org/all/20230105160427.2988454-1-Liam.Howlett@oracle.com/ 

Thanks,
Liam

...