linux-kernel - Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070608212054.GB641@linux-os.sc.intel.com>
Date:	Fri, 8 Jun 2007 14:20:54 -0700
From:	"Keshavamurthy, Anil S" <anil.s.keshavamurthy@...el.com>
To:	Andreas Kleen <ak@...e.de>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	"Keshavamurthy, Anil S" <anil.s.keshavamurthy@...el.com>,
	linux-kernel@...r.kernel.org, gregkh@...e.de, muli@...ibm.com,
	asit.k.mallick@...el.com, suresh.b.siddha@...el.com,
	arjan@...ux.intel.com, ashok.raj@...el.com, shaohua.li@...el.com,
	davem@...emloft.net
Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, Jun 08, 2007 at 10:43:10PM +0200, Andreas Kleen wrote:
> Am Fr 08.06.2007 21:01 schrieb Andrew Morton
> <akpm@...ux-foundation.org>:
> 
> > On Fri, 8 Jun 2007 11:21:57 -0700
> > "Keshavamurthy, Anil S" <anil.s.keshavamurthy@...el.com> wrote:
> >
> > > On Thu, Jun 07, 2007 at 04:27:26PM -0700, Andrew Morton wrote:
> > > > On Wed, 06 Jun 2007 11:57:00 -0700
> > > > anil.s.keshavamurthy@...el.com wrote:
> > > >
> > > > > Signed-off-by: Anil S Keshavamurthy
> > > > > <anil.s.keshavamurthy@...el.com>
> > > >
> > > > That was a terse changelog.
> > > >
> > > > Obvious question: how does this differ from mempools, and would it
> > > > be
> > > > better to fill in any gaps in mempool functionality instead of
> > > > implementing something similar-looking?
> > >
> > > Very good question. Mempool pre-allocates the elements
> > > to the required minimum count size during its initilization time.
> > > However when mempool_alloc() is called it tries to obtain the
> > > element from OS and if that fails then it looks for the element in
> > > its pool. If there are no elements in its pool and if the gpf_t
> > > flags says it can wait then it waits untill someone puts the element
> > > back to pool, else if gpf_t flag say it can;t wait then it returns
> > > NULL.
> > > In other words, mempool acts as *emergency* pool, i.e only if the OS
> > > fails
> > > to allocate the required memory, then the pool object is used.
> > >
> > >
> > > In the IOMMU case, we need exactly opposite of what mempool
> > > provides,
> > > i.e we always want to look for the element in the pool and if the
> > > pool
> > > has no element then go to OS as a worst case. This resource pool
> > > library routines do the same. Again, this resource pools
> > > grows and shrinks automatically to maintain the minimum pool
> > > elements in the background. I am not sure whether this totally
> > > opposite functionality of mempools and resource pools can be
> > > merged.
> >
> > Confused.
> >
> > If resource pools are not designed to provide extra robustness via an
> > emergency pool, then what _are_ they designed for? (Boy this is a hard
> > way
> > to write a changelog!)
> 
> mempools are designed to manage a limited resource pool by sleeping
> if necessary until someone else frees a resource. It's basically similar
> how to main VM works with a sleeping allocation, just in a "private user
> group"
> 
> In the IOMMU case sleeping is not allowed because pci_map_* typically
> happens inside spinlocks.  But the IOMMU code might need to allocate
> new page tables and other datastructures in there.
> 
> This means mempools don't work for those (the previous version had non
> sensical
> constructs like GFP_ATOMIC mempool calls)
> 
>  I haven't looked at Anil's code, but I suspect the only really robust
> way to handle this case is to always preallocate everything. But I'm not
> sure
> why that would need new library functions; it should be just some simple
> lists that could be open coded.

Since it is practically impossible to predicit how much to preallocate,
we have a min_count+grow_count of object allocated and we always use from this
pool. If the object count goes below certain low threshold(which acts as 
emergency pool from this point), the pool grows by allocating and 
adding the newly allocated object into the pool in the
worker (keventd) thread. 
Again, once the IO pressure is over, the PCI driver
does the unmap calls and we put back the objects back to preallocate pools.
The smartness is builtin to the pool as the elements are put back to the 
pool it detects that the pool count is greater then the threshold and 
it automagically queues the work to free the objects and bring back the 
pre-allocated object count back to minimum threshold.
Thus this preallocated pool grow and shrinks based on the demand, while
acting as both pre-allocated pools and as emergency pool.

Currently I have made this as a libray functions, if that is not correct,
we can pull this and make it part of the Intel IOMMU driver itself.
Please do let me know your suggestions.

> 
> If it needs to fall back to the OS for any non pre allocation then it
> will
> likely be flakey under high load. Now that might be ok in some cases
> -- apparently block layer is much better at handling this than it used
> to be and networking has to handle it anyways, but it might be still
> a unpleasant surprise for many drivers. One generic problem is that
> there are no upcalls when such resources become avaialable again
> so the upper layers would need to poll to know when to resubmit
> a request.
> 
> It's a pretty messy problem unfortunately.
I agree, worst case if the element is not available in the 
pool we need to fall back to the OS and if OS fails then it
is tough luck.

> 
> One relatively easy way out would be to just preallocate
> a static aperture fully and always map into it. Not sure
> how much memory that would need -- when it's too large
> it might take a lot of memory for page tables always and when it's
> too small it might overflow under high load.
Yup, and since we need to use the same driver from 
desktop to servers, preallocation count for servers 
many not be suitable for desktops.

> 
> > That's what kmem_cache_alloc() is for?!?!
> 
> Tradtionally that was not allowed in block layer path. Not sure
> it is fully obsolete with the recent dirty tracking work, probably not.
> 
> Besides it would need to be GFP_ATOMIC and the default
> atomic pools are not that big.
> 
> -Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/