Message-ID: <BANLkTinLH=s419RVQXSUTjBSdH0XjdpZWg@mail.gmail.com>
Date: Tue, 5 Apr 2011 20:46:41 +0800
From: huang ying <huang.ying.caritas@...il.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Huang Ying <ying.huang@...el.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Andi Kleen <andi@...stfloor.org>,
"lenb@...nel.org" <lenb@...nel.org>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: About lock-less data structure patches
On Tue, Apr 5, 2011 at 12:42 PM, Mathieu Desnoyers wrote:
[snip]
>> >> >>> Because this singly-linked list works like a stack (with a "push"
>> >> >>> operation for llist_add and a "pop" operation for llist_del_first), I
>> >> >>> would recommend renaming it accordingly (as a stack rather than a "list"). If
>> >> >>> we think about other possible users of this kind of lock-free list, such
>> >> >>> as call_rcu(), a "queue" would be rather more appropriate than a "stack"
>> >> >>> (with enqueue/dequeue operations). So at the very least I would like to
>> >> >>> make sure this API keeps room for lock-free queue implementations that
>> >> >>> won't be confused with this stack API. It would also be important to
>> >> >>> figure out if what we really want is a stack or a queue. Some naming
>> >> >>> ideas follow (maybe they are a bit verbose, comments are welcome).
>> >> >>>
>> >> >>> We should note that this list implements "lock-free" push and pop
>> >> >>> operations (cmpxchg loops), and a "wait-free" "llist_del_all" operation
>> >> >>> (using xchg) (only really true for architectures with "true" xchg
>> >> >>> operation though, not those using LL/SC). We should think about the real
>> >> >>> use-case requirements put on this lockless stack to decide which variant
>> >> >>> is most appropriate. We can either have, with the implementation you
>> >> >>> propose:
>> >> >>>
>> >> >>> - Lock-free push
>> >> >>> - Pop protected by mutex
>> >> >>> - Wait-free pop all
>> >> >>>
>> >> >>> Or, as an example of an alternative structure (as Paul and I implemented
>> >> >>> in the userspace RCU library):
>> >> >>>
>> >> >>> - Wait-free push (stronger real-time guarantees provided by xchg())
>> >> >>> - Blocking pop/pop all (use cmpxchg and busy-wait for very short time
>> >> >>> periods)
>> >> >>>
>> >> >>> (there are others, with e.g. lock-free push, lock-free pop, lock-free
>> >> >>> pop all, but this one requires an RCU read lock across the pop/pop-all
>> >> >>> operations, and that memory reclaim of the nodes be performed only after
>> >> >>> an RCU grace period has elapsed. This deals with the ABA issues of concurrent
>> >> >>> push/pop you noticed without requiring mutexes to protect pop operations.)
>> >> >>>
>> >> >>> So it all boils down to what the constraints of the push/pop
>> >> >>> callers are. Typically, I would expect that the "push" operation has the
>> >> >>> strictest real-time constraints (and is possibly executed the most
>> >> >>> often, thus would also benefit from xchg(), which is typically slightly
>> >> >>> faster than cmpxchg()), which would argue in favor of a wait-free
>> >> >>> push/blocking pop. But maybe I'm lacking understanding of what you are
>> >> >>> trying to do with this stack. Do you ever need to pop from an NMI
>> >> >>> handler?
>> >> >>
>> >> >> In my use case, I don't need to pop in an NMI handler, just push. But
>> >> >> we need to pop in an IRQ handler, so we cannot use a blocking pop. Please
>> >> >> take a look at the use-case patches listed later.
>> >> >
>> >> > Actually, I marked that as a "blocking" in our implementation because I
>> >> > have a busy-loop in there, and because my algorithm was implemented in
>> >> > user-space, I added an adaptive delay to make sure not to busy-loop
>> >> > waiting for a preempted thread to come back.
>> >> >
>> >> > But by disabling preemption and using a real busy-loop in the kernel,
>> >> > the pop can trivially be made non-blocking, and thus OK for interrupt
>> >> > handlers.
>> >>
>> >> OK, I see. Thanks for the clarification. I took a look at your "wfs"
>> >> implementation. It seems that we need to disable interrupts on the wf push
>> >> side to make it possible to use pop in an interrupt handler. Do you think so?
>> >
>> > It would be appropriate, yes, because the irq handler doing "pop" can
>> > busy-wait during the short period of time during which it sees a "NULL"
>> > value. We don't want to have a softirq running on top of the push that
>> > would make the irq handler busy-wait for longer than the duration of
>> > interrupt handlers. In your case, the push is done by a NMI handler,
>> > which already has interrupts disabled, so no worry for this specific
>> > use-case.
>>
>> The semantics of irq_work do not require that the push be done in an NMI
>> handler; it is possible for the push to be called in process context.
>> And we are working on library code, so I don't think it is a good idea
>> to require that the priority of the push context be higher (such
>> as NMI vs IRQ) than that of the pop context.
>
> OK. Anyway, the busy-looping on the pop side is an implementation
> choice. We have the luxury of choosing between either xchg on
> push/busy-loop on pop, or a cmpxchg on push (as in your implementation),
> which removes the need for the busy-loop on pop. Maybe going for a cmpxchg
> on push might make more sense, because it would not require disabling
> interrupts.
Yes, it is a luxury to have two choices. :-)
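
To make the two choices concrete, here is a minimal sketch of both push
variants as I understand them (illustrative only, assuming a list with a
single ->first pointer; the names are not the final API):

struct llist_node {
	struct llist_node *next;
};

struct llist_head {
	struct llist_node *first;
};

/* Lock-free push: a cmpxchg loop that may retry under contention, but
 * the pop side never has to wait for a half-finished push. */
static void llist_push_cmpxchg(struct llist_node *node,
			       struct llist_head *head)
{
	struct llist_node *first;

	do {
		first = head->first;
		node->next = first;
	} while (cmpxchg(&head->first, first, node) != first);
}

/* Wait-free push: a single xchg that never retries.  There is a short
 * window where the new node is visible but its ->next link is not yet
 * written, so a concurrent pop must busy-wait on that window (and an
 * empty-list sentinel is needed to tell that window apart from the end
 * of the list, as in the userspace RCU wfstack). */
static void llist_push_xchg(struct llist_node *node,
			    struct llist_head *head)
{
	struct llist_node *old;

	node->next = NULL;	/* mark the link as not yet written */
	old = xchg(&head->first, node);
	node->next = old;	/* close the window */
}

With the cmpxchg variant the pop side needs no busy-wait at all, which
is why it looks preferable when pop may run in an interrupt handler.
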
>> >> >>> Some ideas for API identifiers:
>> >> >>>
>> >> >>> struct llist_head -> slist_stack_head
>> >> >>> struct llist_node -> slist_stack_node
>> >> >>
>> >> >> Why call it both a stack and a list? Because it is a stack implemented with
>> >> >> a singly-linked list? I think it is better to name it after usage instead of
>> >> >> implementation.
>> >> >>
>> >> >> The next question is whether it should be named a stack or a list. I
>> >> >> think from the current users' point of view, they are using a
>> >> >> list instead of a stack. There are 3 users so far, as follows.
>> >> >>
>> >> >> https://lkml.org/lkml/2011/1/17/14
>> >> >> https://lkml.org/lkml/2011/1/17/15
>> >> >> https://lkml.org/lkml/2011/2/21/16
>> >> >
>> >> > Hrm, I'm concerned about the impact of handing the irq work from one
>> >> > execution context to another in the reverse order (because a stack is
>> >> > implemented here). Can this have unwanted side effects?
>> >>
>> >> For irq work, this is not an issue. irq work users should not depend on
>> >> the order of raising. But if it is necessary, we can reverse the order
>> >> again in irq_work_run to keep the original order. After deleting all
>> >> nodes from the lock-less list, we can operate on them at will.
>> >
>> > Ouch. This seems counter-intuitive. I'd rather prefer if we can change
>> > the way this lock-less list operates to use a "queue" fifo scheme rather
>> > than "stack" (lifo). This would seem much more intuitive: the first
>> > entry added to the list is the first to be removed. And for iteration on
>> > multiple entries removed at once, rather than iterating in reverse
>> > order, it would iterate from the oldest to the newest entry, which again
>> > seems more natural.
>>
>> I don't think "lifo" is a big issue here, although "fifo" may be more
>> natural. After all, this is a very simple data structure with very
>> simple code.
>
> Yessss, although I would never apply the "simple" qualifier to lockless
> code. ;-)
>
>>
>> >> >> And if we named this data structure as list, we can still use "queue"
>> >> >> for another data structure. Do you think so?
>> >> >
>> >> > Well, the "delete first entry" is really unclear if we call all this a
>> >> > "list". Which is the first entry ? The first added to the list, or the
>> >> > last one ? The natural reflex would be to think it is the first added,
>> >> > but in this case, it's the opposite. This is why I think using stack or
>> >> > lifo (with push/pop) would be clearer than "list" (with add/del).
>> >>
>> >> The llist is an ordered list and has only one list "head", so the first
>> >> entry is the one pointed to by the "head". Is that clear?
>> >>
>> >> > So maybe "lockless stack" could be palatable ? The name "lockless lifo"
>> >> > came to my mind in the past days too.
>> >>
>> >> But what we need is just a lock-less list, not a stack. It behaves like
>> >> a stack just because we cannot implement other functions such as
>> >> llist_add_tail etc. in a lock-less way yet. And we will add these
>> >> functions when we know how to do that.
>> >
>> > Well, we do already know how to do that ;-) ;-) Please have a look at
>> > wfqueue.h and wfqueue-static.h in Userspace RCU library. In this
>> > implementation too, the "blocking" busy-wait can be replaced by an
>> > active busy-looping in pretty much the same way I explained for the
>> > stack implementation.
>>
>> I found your wfqueue implementation, that is great! I just have some
>> concerns about the performance.
>
> I guess you are referring to the need to disable interrupts. As I stated
> above, we can choose to go for a cmpxchg-based push instead, which
> removes the busy-loop from dequeue altogether.
That is great!
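
To see why, here is a rough sketch of what a cmpxchg-based enqueue could
look like, in the style of the Michael-Scott queue (my assumption, not
the actual wfqueue.h code; it assumes a dummy node so that head and tail
are never NULL, and that node reclamation is deferred, e.g. via RCU, to
avoid the ABA problem discussed earlier):

struct lfq_node {
	struct lfq_node *next;
};

struct lfq_queue {
	struct lfq_node *head;	/* points to a dummy node when empty */
	struct lfq_node *tail;
};

static void lfq_enqueue(struct lfq_queue *q, struct lfq_node *node)
{
	struct lfq_node *tail, *next;

	node->next = NULL;
	for (;;) {
		tail = q->tail;
		next = tail->next;
		if (tail != q->tail)
			continue;	/* tail moved under us, retry */
		if (next) {
			/* tail is lagging; help swing it forward */
			cmpxchg(&q->tail, tail, next);
			continue;
		}
		/* try to link the new node after the current tail */
		if (cmpxchg(&tail->next, NULL, node) == NULL)
			break;
	}
	/* swing tail to the new node; another thread may have helped */
	cmpxchg(&q->tail, tail, node);
}

Because a node is fully initialized before it becomes reachable, the
dequeue side never has to busy-wait, so the enqueue side does not need
to disable interrupts.
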
>> It seems hard to implement the
>> dequeue_all,
>
> Actually, there is already one implemented :)
>
> See urcu-call-rcu.c: call_rcu_thread()
I have not understood the whole algorithm fully yet. I will continue to study
it. I found there is a synchronize_rcu() in call_rcu_thread(). Must
"dequeue_all" use that too?
> Please note that this "dequeue all" implementation does not support
> concurrent "dequeue single node" operation (because we did not need it
> in the specific call_rcu() use-case), but it could possibly be adapted
> by handling the dummy node more carefully, although I'm not sure it's
> worth the effort (unless the combined use of dequeue one and dequeue all
> becomes an important use-case for a given list).
>
>> and IRQs may need to be turned off on the enqueue side.
>
> Not if we use a cmpxchg-based enqueue.
>
>>
>> >> >>> * For your lock-free push/pop + wait-free pop_all implementation:
>> >> >>>
>> >> >>> llist_add -> slist_stack_push_lf (lock-free)
>> >> >>> llist_del_first -> _slist_stack_pop (needs mutex protection)
>> >> >>> llist_del_all -> slist_stack_pop_all_wf (wait-free)
>> >> >>
>> >> >> Do we really need to distinguish between lock-free and wait-free in
>> >> >> the interface?
>> >> >
>> >> > I don't think it is needed, and it's probably even just more confusing
>> >> > for API users.
>> >> >
>> >> >> Will we implement both slist_stack_push_lf and
>> >> >> slist_stack_push_wf for one data structure?
>> >> >
>> >> > Those will possibly even require different data structures. It might be
>> >> > less confusing to find out clearly what our needs are, and only propose
>> >> > a single behavior, otherwise it will be confusing for users of these
>> >> > APIs without good reason.
>> >>
>> >> Yes, I agree. The semantics of my original lock-less list are already
>> >> used by irq_work and the xlist of net rds. My original implementation
>> >> can be seen as generalizing the existing implementations. How about
>> >> taking my original implementation as the first step, that is, the
>> >> generalization, and then considering how to optimize it after that?
>> >>
>> >> But if you want to compare the two semantics above, I am OK with that too.
>> >
>> > In the spirit of not providing more APIs than needed, I would argue that
>> > it would be better to settle on the question of stack vs queue before
>> > this API gets into mainline.
>> >
>> > Unless we can show that you have users that need your list to behave
>> > like a stack, I would be in favor of moving it to a queue
>> > implementation and provide the semantic of "enqueue/dequeue", with list
>> > iteration order from oldest to newest node. Trying to change this
>> > afterward could bring us surprises, so it might be better to do it right
>> > now before this gets introduced.
>> >
>> > Moreover, we're not talking about that many lines of code to port from
>> > the userspace implementation to kernel-space. It just needs very
>> > thorough testing.
>>
>> The current users don't need stack semantics; they are USING the
>> stack semantics and implementation. So I am NOT writing a new data
>> structure, I am just generalizing an existing data structure used by
>> more than one user into the lib directory.
>
> I see. That's the part of the big picture I was missing.
>
>>
>> So I think the lock-less list data structure should be done in 2 steps.
>>
>> 1. Generalize the existing lock-less list data structure into lib.
>> Because exactly the same implementation is used, there will be no big
>> issues here. And if some issue turns up in step 2, we can easily
>> revert to this point.
>>
>> 2. Change the implementation/semantics of the lock-less list data
>> structure and change all users; this can be based on your "queue"
>> implementation. It involves performance compromises and semantic
>> changes, so it will need more testing and more review.
>>
>> Do you agree?
>
> Yep. What I was missing is that you took existing implementations and
> merged them together and did not create your own implementation from
> scratch. Considering that there are already users of these
> implementations, it indeed makes sense to go through this intermediate
> step.
Thanks.
>> >> >> A mutex is needed between multiple "_slist_stack_pop" calls, but not needed
>> >> >> between slist_stack_push_lf and _slist_stack_pop. I think it is hard to
>> >> >> explain that clearly via function naming.
>> >> >
>> >> > Good point. An ascii-art table might be appropriate here, e.g.:
>> >> >
>> >> >
>> >> > M: Mutual exclusion needed to protect one from another.
>> >> > -: lockless.
>> >> >
>> >> >         | push | pop | pop_all
>> >> > push    |  -   |  -  |    -
>> >> > pop     |      |  M  |    M
>> >> > pop_all |      |     |    -
>> >> >
>> >> > How about adding this (or something prettier) to the header?
>> >>
>> >> Cool! I will add that to header.
>> >>
>> >> >>> * If we choose to go with an alternate wait-free push implementation:
>> >> >>>
>> >> >>> llist_add -> slist_stack_push_wf (wait-free)
>> >> >>> llist_del_first -> slist_stack_pop_blocking (blocking)
>> >> >>> llist_del_all -> slist_stack_pop_all_blocking (blocking)
>> >> >>
>> >> >> We need non-blocking pop, so maybe you need implement another data
>> >> >> structure which has these interface. I think there can be multiple
>> >> >> lock-less data structure in kernel.
>> >> >
>> >> > As I noted earlier, the blocking was only due to our user-level
>> >> > implementation. It can be turned in a very short-lived busy loop
>> >> > instead.
>> >> >
>> >> >>
>> >> >>>> + *
>> >> >>>> + * If there are multiple producers and multiple consumers, llist_add
>> >> >>>> + * can be used in producers and llist_del_all can be used in
>> >> >>>> + * consumers. They can work simultaneously without a lock. But
>> >> >>>> + * llist_del_first cannot be used here, because llist_del_first
>> >> >>>> + * depends on list->first->next not changing if list->first is not
>> >> >>>> + * changed during its operation; a llist_del_first, llist_add,
>> >> >>>> + * llist_add sequence in another consumer may violate that.
>> >> >>>
>> >> >>> You did not seem to define the locking rules when using both
>> >> >>>
>> >> >>> llist_del_all
>> >> >>> and
>> >> >>> llist_del_first
>> >> >>>
>> >> >>> in parallel. I expect that a mutex is needed, because a
>> >> >>>
>> >> >>> llist_del_all, llist_add, llist_add
>> >> >>>
>> >> >>> in parallel with
>> >> >>>
>> >> >>> llist_del_first
>> >> >>>
>> >> >>> could run into the same ABA problem as described above.
>> >> >>
>> >> >> OK. I will add that.
>> >> >>
>> >> >>>> + *
>> >> >>>> + * If there are multiple producers and one consumer, llist_add can be
>> >> >>>> + * used in producers and llist_del_all or llist_del_first can be used
>> >> >>>> + * in the consumer.
>> >> >>>> + *
>> >> >>>> + * The list entries deleted via llist_del_all can be traversed with
>> >> >>>> + * traversal functions such as llist_for_each etc. But the list
>> >> >>>> + * entries cannot be traversed safely before being deleted from the list.
>> >> >>>
>> >> >>> Given that this is in fact a stack, specifying the traversal order of
>> >> >>> llist_for_each and friends would be appropriate.
>> >> >>
>> >> >> Ok. I will add something like "traversing from head to tail" in the
>> >> >> comments.
>> >> >
>> >> > Traversing from last element pushed to first element pushed would
>> >> > probably be clearer.
>> >>
>> >> head and tail describe the current state, while "last/first element
>> >> pushed" describe the history. I think both are understandable.
>> >
>> > head and tail, in a list implemented as a stack (but that is not
>> > clearly advertised as such), don't convey the meaning of "we're
>> > iterating from the newest element pushed to the oldest one". This
>> > counter-intuitiveness is part of why I would really like to see this
>> > turned into a queue.
>>
>> OK. I will change the comments, adding an explanation of these semantics.
>> The user should be warned :)
>
> Yes, that makes sense. After this generalization step, if you're ok with
> this, we could aim at moving the implementation from a stack to a queue
> and provide fifo semantic rather than lifo, so that other users (e.g.
> call_rcu in the kernel) can start benefiting from it.
I think it is good to move from the stack to a queue.
I will send out the revised lock-less data structure patchset soon, and
we can continue to work on the new lock-less queue at the same time.
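
For the record, a minimal sketch of the two delete operations as I
understand the current stack-style implementation (illustrative only,
not necessarily the final code):

/* Needs a mutex between concurrent callers: a del_first, add, add
 * sequence in another consumer can leave list->first unchanged while
 * list->first->next has changed underneath us (the ABA problem
 * discussed above), so the cmpxchg could install a stale next pointer. */
static struct llist_node *llist_del_first(struct llist_head *head)
{
	struct llist_node *entry, *next;

	do {
		entry = head->first;
		if (!entry)
			return NULL;
		next = entry->next;
	} while (cmpxchg(&head->first, entry, next) != entry);

	return entry;
}

/* Wait-free: detach the whole list with a single xchg; safe against
 * concurrent llist_add without any lock. */
static struct llist_node *llist_del_all(struct llist_head *head)
{
	return xchg(&head->first, NULL);
}
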
Best Regards,
Huang Ying