Message-ID: <bfdef75d-4343-2734-2723-d8546df37c69@oracle.com>
Date: Wed, 16 Feb 2022 15:57:09 +1100
From: Imran Khan <imran.f.khan@...cle.com>
To: Al Viro <viro@...iv.linux.org.uk>
Cc: tj@...nel.org, gregkh@...uxfoundation.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v6 6/7] kernfs: Introduce hashed rw-sem to replace per-fs
kernfs_rwsem.
Hello Al Viro,
Thanks for reviewing this.
On 16/2/22 12:46 pm, Al Viro wrote:
> On Mon, Feb 14, 2022 at 11:03:21PM +1100, Imran Khan wrote:
>
>> +static inline void down_write_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
>> + struct kernfs_node *kn2)
>> +{
>> + struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
>> + struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
>> +
>> + if (rwsem1 == rwsem2)
>> + down_write_nested(rwsem1, 0);
>> + else {
>> + if (rwsem1 < rwsem2) {
>> + down_write_nested(rwsem1, 0);
>> + down_write_nested(rwsem2, 1);
>> + } else {
>> + down_write_nested(rwsem2, 0);
>> + down_write_nested(rwsem1, 1);
>> + }
>> + }
>> + kernfs_get(kn1);
>> + kernfs_get(kn2);
>
> What's the last bit for? Do you expect the caller to grab that,
> then drop references, then pass the same references to up_write_.....?
> Explain, please. Is there any caller that would *not* have the matching
> up_write_... in the same function, BTW?
Every caller of down_write_ has a matching up_write_. I am grabbing the
references for cases where a node gets deleted between down_write_ and
up_write_. For example, kernfs_remove currently invokes
down_write/up_write on the per-fs rwsem. This change uses the rwsem
corresponding to the kernfs_node being removed, so we may get a situation
where kernfs_remove --> __kernfs_remove --> kernfs_put frees the
kernfs_node, and the subsequent up_write_ from kernfs_remove then
operates on a freed kernfs_node.
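To make the lifetime issue concrete, the matching unlock helper has to be
able to locate the same rwsems, which requires the nodes to still be
alive. Roughly like this (a simplified sketch, not the exact code from the
patch; it assumes kernfs_rwsem_ptr() from this series):

static inline void up_write_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
						       struct kernfs_node *kn2)
{
	/* kn1/kn2 must still be valid here to find their hashed rwsems */
	struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
	struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);

	up_write(rwsem1);
	if (rwsem2 != rwsem1)
		up_write(rwsem2);

	/* drop the references taken in the down_write_ helper */
	kernfs_put(kn1);
	kernfs_put(kn2);
}

Without the kernfs_get() calls in the down_write_ helper, __kernfs_remove()
could drop the last reference and the unlock above would touch freed
memory.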
>
> Just in case - "defensive programming" is not a good answer (to just about any
> question, TBH)...
>
>> +static inline void down_write_kernfs_rwsem(struct kernfs_node *kn,
>> + enum kernfs_rwsem_lock_pattern ptrn)
>> +{
>> + struct rw_semaphore *p_rwsem = NULL;
>> + struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
>> + int lock_parent = 0;
>> +
>> + if (ptrn == KERNFS_RWSEM_LOCK_SELF_AND_PARENT && kn->parent)
>> + lock_parent = 1;
>> +
>> + if (lock_parent)
>> + p_rwsem = kernfs_rwsem_ptr(kn->parent);
>
> Er... And what's to prevent ->parent changing right under you? For that
> matter, why bother with that at all - what's wrong with your "lock two
> nodes" primitive? AFAICS, your current series has only one caller passing
> that flag, so...
>
Agree. "lock two nodes" primitive could have been used here to make sure
that I work with same parent. I will have this change in next version.
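For reference, something like this for that caller (a rough sketch; the
key point is that the same parent pointer is used for both lock and
unlock, so they always match, and the caller's node is assumed to have a
parent):

	struct kernfs_node *parent = kn->parent;

	down_write_kernfs_rwsem_for_two_nodes(kn, parent);
	/* ... both kn's and parent's hashed rwsems are held here ... */
	up_write_kernfs_rwsem_for_two_nodes(kn, parent);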
> Note that ->unlock_parent thing is also dubious - the matching up_write_...
> for that sole caller is in the same function, so it could bloody well
> choose between single-up and double-up on its own. Would make the whole
> thing simpler...
Tejun raised a similar concern in his last feedback, and I am working
towards using the "double node locking" primitives for cases where both
the parent and the node need locking.
>
>> +/**
>> + * down_read_kernfs_rwsem() - Acquire hashed rwsem
>> + *
>> + * @kn: kernfs_node for which hashed rwsem needs to be taken
>> + * @ptrn: locking pattern i.e whether to lock only given node or to lock
>> + * node and its parent as well
>> + *
>> + * In case of nested locking, rwsem with lower address is acquired first.
>> + *
>> + * Return: void
>> + */
>
> ... and this one, AFAICS, doesn't need ptrn at all - no callers that would ask to
> lock parent.
Sure. I will get rid of this in the next version.
>
>> +static inline void kernfs_swap_rwsems(struct rw_semaphore **array, int i, int j)
[...]
>> +}
>> +
>> +/**
>> + * down_write_kernfs_rwsem_rename_ns() - take hashed rwsem during
>> + * rename or similar operations.
>> + *
>> + * @kn: kernfs_node of interest
>> + * @current_parent: current parent of kernfs_node of interest
>> + * @new_parent: about to become new parent of kernfs_node
>> + *
>> + * During rename or similar operations the parent of a node changes,
>> + * and this means we will see different parents of a kernfs_node at
>> + * the time of taking and releasing its or its parent's hashed rwsem.
>> + * This function separately takes locks corresponding to node, and
>> + * corresponding to its current and future parents (if needed).
>> + *
>> + * Return: void
>> + */
>> +static inline void down_write_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
>> + struct kernfs_node *current_parent,
>> + struct kernfs_node *new_parent)
>> +{
>> + struct rw_semaphore *array[3];
>> +
>> + array[0] = kernfs_rwsem_ptr(kn);
>> + array[1] = kernfs_rwsem_ptr(current_parent);
>> + array[2] = kernfs_rwsem_ptr(new_parent);
>> +
>> + if (array[0] == array[1] && array[0] == array[2]) {
>> + /* All 3 nodes hash to same rwsem */
>> + down_write_nested(array[0], 0);
>> + } else {
>> + /**
>> + * All 3 nodes are not hashing to the same rwsem, so sort the
>> + * array.
>> + */
>> + kernfs_sort_rwsems(array);
>> +
>> + if (array[0] == array[1] || array[1] == array[2]) {
>> + /**
>> + * Two nodes hash to same rwsem, and these
>> + * will occupy consecutive places in array after
>> + * sorting.
>> + */
>> + down_write_nested(array[0], 0);
>> + down_write_nested(array[2], 1);
>> + } else {
>> + /* All 3 nodes hashe to different rwsems */
>> + down_write_nested(array[0], 0);
>> + down_write_nested(array[1], 1);
>> + down_write_nested(array[2], 2);
>> + }
>> + }
>> +
>> + kernfs_get(kn);
>> + kernfs_get(current_parent);
>> + kernfs_get(new_parent);
>> +}
>
> Holy shit... And _that_ is supposed to be inlined? Complete with sorting the
> array, etc.?
>
Sorry about this. I will correct it in the next version. I have got some
suggestions from Tejun about how to make these interfaces more compact,
and I will make the multi-node interfaces non-inline as well.
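For example, the rename variant could become an out-of-line helper that
sorts the three rwsems and skips duplicates instead of open-coding all the
cases. An untested sketch, using the kernel's sort() from <linux/sort.h>
and kernfs_rwsem_ptr() from this series:

static int kernfs_rwsem_cmp(const void *a, const void *b)
{
	const struct rw_semaphore *ra = *(struct rw_semaphore * const *)a;
	const struct rw_semaphore *rb = *(struct rw_semaphore * const *)b;

	if (ra == rb)
		return 0;
	return ra < rb ? -1 : 1;
}

void down_write_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
				       struct kernfs_node *current_parent,
				       struct kernfs_node *new_parent)
{
	struct rw_semaphore *array[3];
	int i, level = 0;

	array[0] = kernfs_rwsem_ptr(kn);
	array[1] = kernfs_rwsem_ptr(current_parent);
	array[2] = kernfs_rwsem_ptr(new_parent);

	/* acquire in address order; a duplicate rwsem is taken only once */
	sort(array, 3, sizeof(*array), kernfs_rwsem_cmp, NULL);
	for (i = 0; i < 3; i++) {
		if (i && array[i] == array[i - 1])
			continue;
		down_write_nested(array[i], level++);
	}

	/* pin the nodes so the matching up_write_ can find the rwsems */
	kernfs_get(kn);
	kernfs_get(current_parent);
	kernfs_get(new_parent);
}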
> <looks at the callers in the next patch>
>
> - root = kernfs_root(kn);
> - down_write(&root->kernfs_rwsem);
> + down_write_kernfs_rwsem_rename_ns(kn, kn->parent, new_parent);
>
> This is rename. The very thing that changes kn->parent. Just what would
> stop *ANOTHER* rename from modifying said kn->parent while you'd been
> sleeping waiting for that rwsem? It's not even a narrow race window...
>
> I could believe that similar question about stability of ->parent in
> case of KERNFS_RWSEM_LOCK_SELF_AND_PARENT handling might be dealt with
> by something along the lines of "nothing is supposed to rename a node
> while it's fed to kernfs_remove_self(), which is the only user of that",
> but here we definitely *can* have racing renames.
>
Agreed. I missed the possibility of the parent changing while waiting for
the rwsem. Could you please let me know if the following approach is
acceptable in this case:
1. Note kn->parent.
2. Invoke down_write_.
3. After acquiring the rwsems, check whether the current kn->parent is the
same as the one noted earlier. If so, proceed; otherwise drop the rwsems
and invoke down_write_ again as per the current kn->parent.
Once we have taken the rwsems and found that kn->parent did not change
during the wait, we can proceed safely until the corresponding up_write_.
A rough sketch of this retry follows.
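Something along these lines (a rough sketch; up_write_kernfs_rwsem_rename_ns
stands for the matching unlock helper):

	struct kernfs_node *old_parent;

retry:
	old_parent = kn->parent;
	down_write_kernfs_rwsem_rename_ns(kn, old_parent, new_parent);
	if (kn->parent != old_parent) {
		/* kn was re-parented while we slept on the rwsems; retry */
		up_write_kernfs_rwsem_rename_ns(kn, old_parent, new_parent);
		goto retry;
	}
	/* kn->parent is now stable until the matching up_write_ */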
Regarding kernfs_remove_self, yes, that was the idea, but as mentioned
earlier, in the next version I will use the "double node locking"
primitives.
>> +/**
>> + * down_read_kernfs_rwsem_rename_ns() - take hashed rwsem during
>> + * rename or similar operations.
>> + *
>> + * @kn: kernfs_node of interest
>> + * @current_parent: current parent of kernfs_node of interest
>> + * @new_parent: about to become new parent of kernfs_node
>> + *
>> + * During rename or similar operations the parent of a node changes,
>> + * and this means we will see different parents of a kernfs_node at
>> + * the time of taking and releasing its or its parent's hashed rwsem.
>> + * This function separately takes locks corresponding to node, and
>> + * corresponding to its current and future parents (if needed).
>> + *
>> + * Return: void
>> + */
>> +static inline void down_read_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
>> + struct kernfs_node *current_parent,
>> + struct kernfs_node *new_parent)
>
> <tries to guess what use could *that* possibly have>
> <fails>
> <looks for callers (in a different mail, damnit)>
> <none>
> ...
>
> TBH, this is rather uninspiring. I can easily believe that you've got
> contention visibly reduced on a benchmark. That'd only be interesting
> if we had a clear proof of correctness. And you don't even bother to
> discuss your locking rules anywhere I can see in the series, cover letter
> included...
>
I have run my tests with lockdep enabled as well. Could you please let me
know if there is something more that can be done as proof of correctness?
For sanity of the patches, my current tests include the LTP sysfs tests,
CPU hotplugging, and cgroup access/creation/removal. I am also booting
Oracle Linux, which involves cgroupfs access (via systemd). If you can
suggest other tests, I can execute those as well.
Also, regarding the locking rules, I am not sure where to document them.
If I put accompanying comments at all the places where a hashed rwsem is
taken, explaining why the lock corresponding to a node or to its parent is
being acquired, would that be enough as a description of the locking
rules?
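For example, a summary comment along the following lines (a draft based
on what the series currently does, to be refined) could sit next to the
locking helpers:

/*
 * kernfs locking rules (summary):
 *
 * - Each kernfs_node hashes to one rw_semaphore via kernfs_rwsem_ptr();
 *   operations on a single node take that rwsem.
 * - Operations that need both a node and its parent use the double-node
 *   primitives, which acquire the two rwsems in address order.
 * - Rename takes the rwsems of the node, its current parent and its new
 *   parent, sorted by address, taking a duplicate rwsem only once.
 * - Every down_ helper pins the involved nodes with kernfs_get() so that
 *   the matching up_ helper can locate the same rwsems; the up_ helper
 *   drops those references.
 */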
> I agree that sysfs locking had always been an atrocity - "serialize
> everything, we can't be arsed to think of that" approach had been there
> from the day one. However, that's precisely the reason why replacement
> needs a very careful analysis. For all I know, you might've done just
> that, but you are asking reviewers to reproduce that work independently.
> Or to hope that you've gotten it right and nod through the entire thing.
> Pardon me, but I know how easy it is to miss something while doing that
> kind of work - been there, done that.
I understand your concern, but I am not asking reviewers to reproduce
this work independently :). If I can learn what else can or should be
tested in this regard, I will run those tests.
Since the start of this work I have also been running my tests with
lockdep and KASAN enabled, and those runs have flagged some issues which
I have addressed.
I will certainly address your review comments in the next version, but in
the meantime, if you have suggestions regarding testing or the description
of the locking rules, please let me know.
Thanks a lot for having a look.
-- Imran