linux-kernel - Re: [GIT PULL] scheduler fixes

Message-ID: <20090518202031.GA26549@elte.hu>
Date:	Mon, 18 May 2009 22:20:31 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Linus Torvalds <torvalds@...ux-foundation.org>,
	"H. Peter Anvin" <hpa@...or.com>,
	Pekka Enberg <penberg@...helsinki.fi>,
	Yinghai Lu <yinghai@...nel.org>
Cc:	Jeff Garzik <jgarzik@...ox.com>,
	Alexander Viro <viro@....linux.org.uk>,
	Rusty Russell <rusty@...tcorp.com.au>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: [GIT PULL] scheduler fixes


* Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> On Mon, 18 May 2009, Ingo Molnar wrote:
> > 
> > Something like the patch below. It also fixes ->span[] which has 
> > a similar problem.
> 
> Patch looks good to me.

ok. I've queued it up for .31, with your Acked-by. (which i assume 
your reply implies?)

> > But ... i think this needs further clean-ups really. Either go 
> > fully static, or go fully dynamic.
> 
> I do agree that it would probably be good to try to avoid this 
> static allocation, and allocate these data structures dynamically. 
> However, if we end up having to use two different allocators 
> anyway (one for bootup, and one for regular uptimes), then I think 
> that would be an overall loss (compared to just the simplicity of 
> statically doing this in a couple of places), rather than an 
> overall win.
> 
> > Would be nice if bootmem_alloc() was extended with such 
> > properties - if SLAB is up (and bootmem is down) it would return 
> > kmalloc(GFP_KERNEL) memory buffers.
> 
> I would rather say the other way around: no "bootmem_alloc()" at 
> all, but just have a regular alloc() that ends up working like the 
> "SMP alternatives" code, but instead of being about SMP, it would 
> be about how early in the boot sequence it is.
> 
> That said, if there are just a couple of places like this that 
> care, I don't think it's worth it. The static allocation isn't 
> that horrible. I'd rather have a few ugly static allocations with 
> comments about _why_ they look the way they do, than try to 
> over-design things to look "clean".
> 
> Simplicity is a good thing - even if it can then end up meaning 
> special cases like this.
> 
> That said, if we could move the kmalloc initialization up some 
> more (and get at least the "boot node" data structures set up, and 
> avoid any bootmem alloc issues _entirely_, then that would be 
> good.
> 
> I hate that stupid bootmem allocator. I suspect we seriously 
> over-use it, and that we _should_ be able to do the SL*B init 
> earlier.

Hm, tempting thought - not sure how to pull it off though.

Among the biggest users of bootmem are the mem_map[] hierarchies and 
the page allocator bitmaps. Not sure we can get rid of bootmem there 
- those areas are really large, physical memory is often fragmented 
and we need good NUMA awareness for them as well.

We might also have a 22-architectures-to-fix problem before we can 
get rid of bootmem:

  $ git grep alloc_bootmem arch/ | wc -l
  168

On x86 we recently switched some (but not all) early-pagetable 
allocations to the 'early brk' method (which is an utterly simple 
early linear allocator, for limited early dynamic allocations), but 
even with that we still have ugly bootmem use - for example see the 
after_bootmem hacks in arch/x86/mm/init_64.c.
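
To illustrate what "utterly simple early linear allocator" means, 
here is a minimal userspace sketch of a bump allocator over a 
statically reserved pool. It is not the kernel's extend_brk() or 
RESERVE_BRK code: early_pool, early_pool_used, EARLY_POOL_SIZE and 
early_alloc() are hypothetical names made up for this example, and 
the pool size is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; NOT the kernel's extend_brk(). */
#define EARLY_POOL_SIZE 4096

/* Statically reserved area that the allocator hands out linearly. */
static unsigned char early_pool[EARLY_POOL_SIZE];

/* Bytes consumed so far; loosely plays the role of a "brk end". */
static size_t early_pool_used;

/*
 * Return 'size' bytes aligned to 'align' (which must be a power of
 * two), or NULL when the pool cannot satisfy the request.
 * There is no free(): the pool only ever grows.
 */
static void *early_alloc(size_t size, size_t align)
{
    uintptr_t base = (uintptr_t)early_pool;
    uintptr_t start;
    size_t offset;

    assert(align != 0 && (align & (align - 1)) == 0);

    /* Round the current end of the pool up to the alignment. */
    start = (base + early_pool_used + align - 1) & ~(uintptr_t)(align - 1);
    offset = (size_t)(start - base);

    /* Written this way so the bounds check cannot overflow. */
    if (offset > EARLY_POOL_SIZE || size > EARLY_POOL_SIZE - offset)
        return NULL;

    early_pool_used = offset + size;
    return (void *)start;
}
```

The limits are the point here: a fixed-size pool, no freeing and no 
NUMA knowledge, which is why such an allocator only suits a handful 
of small, very early allocations.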

So we have these increasingly complete layers of allocators, which 
bootstrap each other gradually:

  - static, build-time allocations

  - early-brk (see extend_brk(), RESERVE_BRK and direct use of 
    _brk_end in assembly code)

  - e820 based early allocator (reserve_early()) to bootstrap bootmem

  - bootmem - to bootstrap the page allocator [NUMA aware]

  - page allocator - to bootstrap SLAB

  - SLAB

that's 5 layers until we get to SLAB. Each layer has to be aware of 
its own limits, has to interact with pagetable setup and has to end 
up with NUMA-aware dynamic allocations as early as possible.
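
As a rough userspace sketch of the "single alloc() that depends on 
how early in the boot sequence it is" idea, the code below switches 
backends on a stage flag: a static boot pool first, malloc() later 
as a stand-in for SLAB. All names (boot_stage, boot_pool, 
unified_alloc(), unified_free(), from_boot_pool()) are hypothetical, 
and a plain runtime branch is used here rather than anything like 
the code patching that "SMP alternatives" does.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical boot stages; real bring-up has more layers. */
enum boot_stage { STAGE_EARLY, STAGE_SLAB_UP };

static enum boot_stage stage = STAGE_EARLY;

/* Static pool standing in for the early allocator layers. */
static _Alignas(max_align_t) unsigned char boot_pool[1024];
static size_t boot_used;

/* Linear allocation from the boot pool; never freed. */
static void *boot_pool_alloc(size_t size)
{
    const size_t align = _Alignof(max_align_t);
    void *p;

    if (size > sizeof(boot_pool))
        return NULL;
    /* Round up so that every returned pointer stays aligned. */
    size = (size + align - 1) & ~(align - 1);
    if (size > sizeof(boot_pool) - boot_used)
        return NULL;

    p = boot_pool + boot_used;
    boot_used += size;
    return p;
}

/* True if 'p' points into the static boot pool. */
static int from_boot_pool(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)boot_pool;

    return addr >= lo && addr < lo + sizeof(boot_pool);
}

/* One entry point; the backend depends on the boot stage. */
static void *unified_alloc(size_t size)
{
    if (stage == STAGE_EARLY)
        return boot_pool_alloc(size);
    return malloc(size); /* stand-in for kmalloc(GFP_KERNEL) */
}

/* Boot-pool memory is permanent; only later memory is freed. */
static void unified_free(void *p)
{
    if (p && !from_boot_pool(p))
        free(p);
}
```

Callers never need to know which stage they run in, but the free 
path has to recognise where a buffer came from, which is one of the 
costs of hiding several allocators behind one interface.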

And all this complexity definitely _feels_ utterly wrong, as we 
really know pretty early on what kind of memory we have and how it's 
laid out amongst nodes. In the end we really just want to have the 
page allocator and SL[AOQU]B.

Looks daunting.

	Ingo