Message-ID: <4e4a5c89-e9ec-421e-8245-f076044866c9@paulmck-laptop>
Date: Mon, 26 Aug 2024 09:01:54 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Kent Overstreet <kent.overstreet@...ux.dev>
Cc: rcu@...r.kernel.org, linux-kernel@...r.kernel.org, vbabka@...e.cz,
roman.gushchin@...ux.dev, 42.hyeyoo@...il.com,
akpm@...ux-foundation.org, cl@...ux.com, mhocko@...nel.org,
urezki@...il.com, neeraj.upadhyay@...nel.org
Subject: Re: [PATCH 6/9] rcu: rcu_pending
On Mon, Aug 26, 2024 at 11:17:29AM -0400, Kent Overstreet wrote:
> On Mon, Aug 26, 2024 at 07:44:12AM GMT, Paul E. McKenney wrote:
> > I have gotten requests for a variant of SRCU that dispenses with the
> > read-side memory barriers, which would allow you to dispense with RCU.
> > Maybe an srcu_read_lock_lite() and srcu_read_unlock_lite(), similar to
> > the _nmisafe() variants. There is always a price, and in this case the
> > price would be that srcu_read_lock_lite() and srcu_read_unlock_lite()
> > be restricted to code regions where RCU is watching. But from what I
> > know, fs code must already abide by that restriction.
> >
> > Would that help?
>
> I don't think that helps here, but that is something I'd be interested
> in.
>
> What I would like for this is:
> - a single #define for RCU_NR_OLDSTATES for both RCU and SRCU
>
> - a way of getting the current RCU gp sequence number without a
> function call, since we hit that on every single enqueue(). One of my
> development versions added a function to RCU and SRCU for getting a
> pointer to the internal sequence number, which we'd call at init
> time; this lets us skip the function call and a branch.
>
> Another thing that might make the code a bit easier to reason about is
> if rcu_read_lock() also worked as an srcu_read_lock() - is something the
> SRCU variant you're talking about previously would provide?
Yes, srcu_read_lock_lite() would return a value that you later pass
to the corresponding srcu_read_unlock_lite(), just as srcu_read_lock()
and srcu_read_unlock() do now.
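For illustration only, usage would mirror today's SRCU read-side API.
A minimal sketch, assuming the proposed srcu_read_lock_lite() and
srcu_read_unlock_lite() names (not yet in the tree) and a made-up
srcu_struct:

	#include <linux/srcu.h>

	DEFINE_SRCU(my_srcu);	/* hypothetical srcu_struct for this sketch */

	static void reader(void)
	{
		int idx;

		idx = srcu_read_lock_lite(&my_srcu);
		/* Read-side critical section; RCU must be watching here. */
		srcu_read_unlock_lite(&my_srcu, idx);
	}
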
And they would have the same grace-period algorithm, and thus the same
value for NUM_ACTIVE_SRCU_POLL_OLDSTATE, as requested above.
Note that get_state_synchronize_srcu() is still a function call, but the
offer of a function that checks multiple grace-period-state cookies in
one call still stands.
> In my current version of the code, we call __get_state_synchronize_rcu()
> after we've disabled interrupts; we know the rcu gp seq isn't going to
> tick while we're in the critical path. But this doesn't apply if it's
> for SRCU, and I don't want to add if (src) srcu_read_lock() branches to
> it.
Actually, disabling interrupts does *not* prevent RCU's grace-period
sequence number from changing. For example, the following really can
happen:
o RCU notices that the current grace period can end.
o A CPU disables interrupts here.
o A call to get_state_synchronize_rcu() returns a cookie
corresponding to the end of the grace period following the
current one.
o RCU ends the current grace period, therefore updating the
grace-period sequence number.
o RCU starts a new grace period, therefore updating the
grace-period sequence number once again.
o RCU cannot complete this new grace period until that CPU
re-enables interrupts, but it has already updated its grace-period
sequence number not once, but twice.
So if your code knows that RCU's grace-period sequence number cannot
change while a given CPU has interrupts disabled, that code has a bug.
A low-probability bug, perhaps, but if your code is running on enough
systems, it will make its presence known.
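In concrete terms, a pattern like the following sketch must not assume
that the cookie it captures stays current for as long as interrupts
remain disabled (the function name is made up for the example):

	#include <linux/irqflags.h>
	#include <linux/rcupdate.h>

	static unsigned long grab_gp_cookie(void)
	{
		unsigned long flags, cookie;

		local_irq_save(flags);
		cookie = get_state_synchronize_rcu();
		/*
		 * Interrupts are off, but RCU's grace-period sequence can
		 * still advance:  The current grace period may end and the
		 * next one may start.  Only *completion* of that next grace
		 * period waits for this CPU to re-enable interrupts.
		 */
		local_irq_restore(flags);
		return cookie;
	}
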
> Not at all essential, since the races that result from this are harmless,
> but if we e.g. decide it's worthwhile to only kick off a gp if it hasn't
> ticked (i.e. only kick RCU if we're still on the seq of the object being
> enqueued) it'd be nice.
Why not call get_state_synchronize_rcu(), and ask for a new grace period
if the value returned differs from that of the previous call?
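Something like the following sketch, for example (the names are
illustrative, and the caller is assumed to serialize updates to
last_cookie):

	#include <linux/rcupdate.h>

	static unsigned long last_cookie;

	static void maybe_kick_grace_period(void)
	{
		unsigned long cookie = get_state_synchronize_rcu();

		/* Ask for a new grace period only if the sequence moved on. */
		if (cookie != last_cookie) {
			last_cookie = cookie;
			(void)start_poll_synchronize_rcu();
		}
	}
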
> > > On the enqueue side, which is the fastpath, this uses a cursor - we're
> > > not walking the radix tree every time.
> > >
> > > On the processing side, where we're kfree_bulk()ing entire radix tree
> > > nodes, the radix tree is going to have better cache locality than a list
> > > of pages.
> >
> > Interesting. What testing or analysis did you do in the course of
> > arriving at this conclusion? In the meantime I will assume that the
> > analysis involves your radix-tree nodes being more than one page in size.
>
> No, the radix tree nodes are 512 bytes; that's been sufficient in my
> testing.
>
> (also, please don't refer to PAGE_SIZE outside of mm/ code without a
> _good_ reason; that's something we've been trying to clean up.)
Moving kfree_rcu() into mm/ will fix that problem. Plus we did discuss
this with the mm folks back in the day, especially about the GFP flags,
so apparently there was a sufficiently good reason. Maybe still is.
> I'll try to post some performance numbers when I have some time.
I look forward to seeing what you come up with.
> > It might or might not be reasonable, depending on what is going on in the
> > rest of the system. The current kfree_rcu() code can run the system out
> > of full pages, but it won't impede other code allocating smaller blocks
> > of memory. We could easily change it to allocate individual rcu_head
> > structures, but doing so would not necessarily be a win in OOM conditions,
> > again, depending on what else is going on.
>
> As long as the thread calling kvfree_rcu_mightsleep() can block without
> blocking memory reclaim it'll be safe. We might want to tweak the GFP
> flags so as to tell the allocator "don't try too hard, bail out so we
> can check if the gp has expired".
There can be quite a difference between "safe" and "does well in actual
workloads".
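For what it is worth, "don't try too hard" typically translates into GFP
flags along these lines.  This is just an illustration, not necessarily
what the current kfree_rcu() code uses:

	#include <linux/slab.h>

	/* Opportunistic allocation: fail fast rather than enter reclaim. */
	static void *alloc_pending_node(size_t size)
	{
		return kmalloc(size, GFP_NOWAIT | __GFP_NOWARN);
	}
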
> > I took a quick look at __rcu_pending_enqueue() and my eyes are still
> > bleeding. Concatenating linked list of pages is way simpler, way faster,
> > and way more robust.
>
> Funny, I had the same thoughts trying to read your code... :)
Amazing how much easier it is to generate new code than to understand
existing code, isn't it? ;-)
> But, most of __rcu_pending_enqueue() is slowpaths; the fastpath is just
>
> objs = get_object_radix(p, seq); /* lookup seq in p->objs */
>
> *objs->cursor++ = ptr ?: head;
> /* zero cursor if we hit the end of a radix tree node: */
> if (!(((ulong) objs->cursor) & (GENRADIX_NODE_SIZE - 1)))
> objs->cursor = NULL;
> start_gp = !objs->nr;
> objs->nr++;
>
> So I think the performance concerns are moot, and for robustness -
> memory allocation failure always turns into "use the linked lists of
> objects", which works similarly to the old code.
Bugs live on slowpaths as well as on fastpaths. In fact, they tend to
accumulate there.
> > > But yes, using it for call_rcu() would need more performance
> > > justification. I haven't seen workloads where call_rcu() performance
> > > particularly matters, so it's not something I'm pushing for - I included
> > > that because it's minimal code and other people might know of workloads
> > > where we do want it.
> >
> > Plus, unlike kfree_rcu() post-grace-period handling, call_rcu() callback
> > functions usually access the memory block passed to them, which means
> > that they are incurring that per-element cache miss in any case.
>
> True. But this would allow us to prefetch those objects (several
> iterations in advance).
I need to see a CPU on which this actually makes a significant difference
before adding this sort of complexity.
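For concreteness, the kind of prefetching in question would look
something like this sketch (PREFETCH_DISTANCE and the function name are
illustrative, not anything currently in the tree):

	#include <linux/prefetch.h>
	#include <linux/types.h>

	#define PREFETCH_DISTANCE 4	/* illustrative look-ahead */

	static void invoke_callbacks(struct callback_head **objs, unsigned int nr)
	{
		unsigned int i;

		for (i = 0; i < nr; i++) {
			/* Pull a future callback's memory into cache early. */
			if (i + PREFETCH_DISTANCE < nr)
				prefetch(objs[i + PREFETCH_DISTANCE]);
			objs[i]->func(objs[i]);
		}
	}
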
> > > > > Ideally we would be getting a callback every time a grace period
> > > > > completes for which we have objects, but that would require multiple
> > > > > rcu_heads in flight, and since the number of gp sequence numbers with
> > > > > uncompleted callbacks is not bounded, we can't do that yet.
> > > >
> > > > Doesn't the call from __rcu_pending_enqueue() to process_finished_items()
> > > > serve this purpose? After all, the case that causes trouble is the one
> > > > where lots of things are being very frequently queued.
> > >
> > > No, because that's unpredictable, and we don't process pending items in
> > > enqueue() unless we know we can sleep (i.e., we don't do it if the
> > > caller is latency sensitive).
> >
> > It is possible to do this in an extremely lightweight fashion in the
> > common case.
>
> Just processing a few items? hmm, would we want to though, when
> otherwise we'd be calling kfree_bulk()? I think icache locality means
> we'd generally prefer not to.
You might not want to yet, but you eventually would want this.
Thanx, Paul