Date:	Wed, 12 Jun 2013 20:03:11 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Kent Overstreet <koverstreet@...gle.com>
Cc:	linux-kernel@...r.kernel.org, Tejun Heo <tj@...nel.org>,
	Oleg Nesterov <oleg@...hat.com>,
	Christoph Lameter <cl@...ux-foundation.org>,
	Ingo Molnar <mingo@...hat.com>,
	Andi Kleen <andi@...stfloor.org>, Jens Axboe <axboe@...nel.dk>,
	"Nicholas A. Bellinger" <nab@...ux-iscsi.org>
Subject: Re: [PATCH] Percpu tag allocator

On Wed, 12 Jun 2013 19:05:36 -0700 Kent Overstreet <koverstreet@...gle.com> wrote:

> ...
>
> > Why can't we use ida_get_new_above()?
> > 
> >    If it is "speed" then how bad is the problem and what efforts have
> >    been made to address them within the idr code?  (A per-cpu magazine
> >    frontend to ida_get_new_above() might suit).
> > 
> >    If it is "functionality" then what efforts have been made to
> >    suitably modify the ida code?
> 
> Originally, it was because every time I opened idr.[ch] I was confronted
> with an enormous pile of little functions, with very little indication
> in the way of what they were all trying to do or which ones I might want
> to start with.
> 
> Having finally read enough of the code to maybe get a handle on what
> it's about - performance is a large part of it, but idr is also a more
> complicated api that does more than what I wanted.

They all sound like pretty crappy reasons ;) If the idr/ida interface
is nasty then it can be wrapped to provide the same interface as the
percpu tag allocator.

I could understand performance being an issue, but diligence demands
that we test that, or at least provide a convincing argument.

>
> ...
>
> > > +		remote = per_cpu_ptr(pool->tag_cpu, cpu);
> > > +
> > > +		if (remote == tags)
> > > +			continue;
> > > +
> > > +		clear_bit(cpu, pool->cpus_have_tags);
> > > +
> > > +		nr_free = xchg(&remote->nr_free, TAG_CPU_STEALING);
> > 
> > (looks to see what TAG_CPU_STEALING does)
> > 
> > OK, this looks pretty lame.  It adds a rare fallback to the central
> > allocator, which makes that path hard to test.  And it does this at
> > basically the same cost of plain old spin_lock().  I do think it would
> > be better to avoid the nondeterministic code and use plain old
> > spin_lock().  I appreciate the lock ranking concern, but you're a
> > clever chap ;)
> 
> Yeah, the cmpxchg() stuff turned out trickier than I'd hoped - it's
> actually the barriers (guarding against a race with percpu_tag_free())
> that concern me more than that fallback.
> 
> I did torture test this code quite a bit and I'm not terribly eager to
> futz with it more, but I may try switching to spinlocks for the percpu
> freelists to see how it works out - I had the idea that I might be able
> to avoid some writes to shared cachelines if I can simplify that stuff,
> which would probably make it worth it.

The nice thing about a lock per cpu is that the stealing CPU can grab
it and then steal a bunch of tags without releasing the lock: less
lock-taking.  Maybe the current code does that amortisation as well;
my intent-reverse-engineering resources got depleted.

Also, with a lock-per-cpu the stolen-from CPU just spins, so the ugly
TAG_CPU_STEALING fallback-to-global-allocator thing goes away.

> > Also, I wonder if this was all done in the incorrect order.  Why make
> > alloc_local_tag() fail?  steal_tags() could have just noodled off and
> > tried to steal from the next CPU if alloc_local_tag() was in progress
> > on this CPU?
> 
> steal_tags() can't notice that alloc_local_tag() is in progress,

Yes it can - the stolen-from CPU sets a flag in its cpu-local object
while in its critical section.  The stealing CPU sees that and skips.

I think all this could be done with test_and_set_bit() and friends,
btw.  xchg() hurts my brain.

> alloc_local_tag() just uses cmpxchg to update nr_free - there's no
> sentinel value or anything.
> 
> OTOH, if alloc_local_tag() sees TAG_CPU_STEALING - the next thing that's
> going to happen is steal_tags() is going to set nr_free to 0, and the
> only way tags are going to end up back on our cpu's freelist is if we
> fail and do something else - we've got irqs disabled so
> percpu_tag_free() can't do it.
> 
> ...
>
> > What actions can a driver take if an irq-time tag allocation attempt
> > fails?  How can the driver author test those actions?
> 
> You mean if a GFP_NOWAIT allocation fails? It's the same as any other
> allocation, I suppose.
> 
> Did you have something else in mind that could be implemented? I don't
> want to add code for a reserve to this code - IMO, if a reserve is
> needed it should be done elsewhere (like how mempools work on top of
> existing allocators).

Dunno, really - I'm just wondering what the implications of an
allocation failure will be.  I suspect it's EIO?  Which in some
circumstances could be as serious as total system failure (loss of
data), so reliability/robustness is a pretty big deal.


Another thing: CPU hot-unplug.  If it's "don't care" then the
forthcoming changelog should thoroughly explain why, please.  Otherwise
it will need a notifier to spill back any reservation.
