Message-ID: <50F01EC2.1000908@redhat.com>
Date: Fri, 11 Jan 2013 09:16:34 -0500
From: Prarit Bhargava <prarit@...hat.com>
To: Rusty Russell <rusty@...tcorp.com.au>
CC: linux-kernel@...r.kernel.org, Mike Galbraith <efault@....de>,
Josh Triplett <josh@...htriplett.org>,
Tim Abbott <tabbott@...lice.com>
Subject: Re: [PATCH] module, fix percpu reserved memory exhaustion
On 01/10/2013 10:48 PM, Rusty Russell wrote:
> Prarit Bhargava <prarit@...hat.com> writes:
>> [ 15.478160] kvm: Could not allocate 304 bytes percpu data
>> [ 15.478174] PERCPU: allocation failed, size=304 align=32, alloc
>> from reserved chunk failed
> ...
>> What is happening is systemd is loading an instance of the kvm module for
>> each cpu found (see commit e9bda3b). When the module load occurs the kernel
>> currently allocates the module's percpu data area prior to checking to see
>> if the module is already loaded or is in the process of being loaded. If
>> the module is already loaded, or finishes load, the module loading code
>> releases the current instance's module's percpu data.
>
> Wow, what a cool bug! Classic unforeseen side-effect.
>
> I'd prefer not to do relocations with the module_lock held: it can be
> relatively slow. Yet we can't do relocations before the per-cpu
> allocation, obviously. Did you do boot timings before and after?
Heh ... I did! :) I had a lot of concerns about moving the mutex around, so I
put in a print at the end of boot to see how long the boot time actually was.
From stock kernel:
[ 22.893015] PRARIT: FINAL BOOT MESSAGE
From stock kernel + my patch:
[ 22.673214] PRARIT: FINAL BOOT MESSAGE
Both kernel boots showed the problem with kvm loading. A quick grep through my
bootlogs of stock kernel + my patch doesn't show anything greater than 23.539392
or less than 20.980321. Those numbers are similar to the numbers from the
stock kernel (23.569450 - 20.898321).
i.e., I don't think there's an increase due to calling the relocation under the
module mutex, and if there is, it is definitely lost within the noise of boot.
The timings were similar. I didn't see any huge delays, etc. Can the
relocations really cause a long delay? I thought we were pretty much writing
values to memory...
[I should point out that I'm booting a system with 32 physical/64 logical CPUs and 64GB of memory]
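
For reference, the arithmetic behind "lost within the noise" can be sketched as below. This is only an illustration: the timestamps are the ones quoted above, converted to integer microseconds, and the helper names (range_us, spread_us, range_within, single_run_delta_us) are made up for this example.

```c
/* Boot timestamps from this message, expressed in microseconds.
 * All names here are hypothetical helpers for the illustration. */
struct range_us {
    long long min;
    long long max;
};

/* Fastest and slowest "FINAL BOOT MESSAGE" times seen in the boot logs. */
static const struct range_us stock_range   = { 20898321LL, 23569450LL };
static const struct range_us patched_range = { 20980321LL, 23539392LL };

/* The two individual runs shown above. */
static const long long stock_run_us   = 22893015LL;
static const long long patched_run_us = 22673214LL;

/* Width of a range: how much boot time varies from run to run. */
static long long spread_us(struct range_us r)
{
    return r.max - r.min;
}

/* Returns 1 if 'inner' lies entirely inside 'outer', 0 otherwise. */
static int range_within(struct range_us inner, struct range_us outer)
{
    return inner.min >= outer.min && inner.max <= outer.max;
}

/* Difference between the two single runs (stock minus patched). */
static long long single_run_delta_us(void)
{
    return stock_run_us - patched_run_us;
}
```

The two single runs differ by about 0.22s, while the stock kernel alone varies by about 2.67s from boot to boot, and the patched range sits entirely inside the stock range.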
>
> An alternative would be to put the module into the list even earlier
> (say, just after layout_and_allocate) so we could block on concurrent
> loads at that point. But then we have to make sure no one looks in the
> module too early before it's completely set up, and that's complicated
> and error-prone too. A separate list is kind of icky.
Yeah -- that was my first attempt actually, and it got very complex very
quickly. I abandoned that approach in favor of moving the percpu allocations
under the lock. I thought that was likely the easiest approach.
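
To make the ordering issue concrete, here is a minimal userspace sketch, not the kernel code. It assumes a toy bump allocator over an otherwise empty 8k reserve and models the worst-case window in which every duplicate loader has allocated before any of them releases. The sizes (304 bytes, align=32, 8k) come from the messages in this thread; every function and type name below is hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define RESERVE_BYTES 8192 /* the 8k PERCPU_MODULE_RESERVE quoted in this thread */
#define MAX_MODULES   16

struct reserve {
    size_t used; /* bytes handed out so far */
};

struct module_list {
    const char *names[MAX_MODULES];
    size_t count;
};

/* Round size up to a multiple of align (align must be a power of two). */
static size_t align_up(size_t size, size_t align)
{
    return (size + align - 1) & ~(align - 1);
}

/* Toy bump allocator: 0 on success, -1 if the reserve is exhausted. */
static int reserve_alloc(struct reserve *r, size_t size, size_t align)
{
    size_t need = align_up(size, align);

    if (r->used + need > RESERVE_BYTES)
        return -1;
    r->used += need;
    return 0;
}

static int list_contains(const struct module_list *l, const char *name)
{
    for (size_t i = 0; i < l->count; i++)
        if (strcmp(l->names[i], name) == 0)
            return 1;
    return 0;
}

/* Old ordering, worst case: every loader allocates before any duplicate
 * check has run. Returns the number of failed allocations. */
static int simulate_alloc_first(int loaders, size_t size, size_t align)
{
    struct reserve r = { 0 };
    int failures = 0;

    for (int i = 0; i < loaders; i++)
        if (reserve_alloc(&r, size, align) != 0)
            failures++;
    return failures;
}

/* New ordering: the duplicate check (done under the lock in the real code)
 * happens before the allocation, so duplicates never touch the reserve. */
static int simulate_check_first(int loaders, const char *name,
                                size_t size, size_t align)
{
    struct reserve r = { 0 };
    struct module_list l = { .count = 0 };
    int failures = 0;

    for (int i = 0; i < loaders; i++) {
        if (list_contains(&l, name))
            continue; /* duplicate load: bail out before allocating */
        l.names[l.count++] = name;
        if (reserve_alloc(&r, size, align) != 0)
            failures++;
    }
    return failures;
}
```

With 64 loaders each asking for 304 bytes at align=32, every request rounds up to 320 bytes, so only 25 fit in the 8k reserve: simulate_alloc_first(64, 304, 32) reports 39 failures, while simulate_check_first(64, "kvm", 304, 32) makes a single allocation and reports none. The real allocator and the real timing are more forgiving than this worst case, but the shape of the problem is the same.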
>
> We currently have PERCPU_MODULE_RESERVE set at 8k: in my 32-bit
> allmodconfig build, there are only three modules with per-cpu data,
> totalling 328 bytes. So it's not reasonable to increase that number to
> paper over this.
I've been thinking about that. The problem is that at the same time the kvm
problem occurs I'm attempting to load a debug module that I've written to debug
some cpu timer issues that allocates a large amount of percpu data (~.5K/cpu).
While extending PERCPU_MODULE_RESERVE to 10k might work now, it might not work
tomorrow if I have the need to increase the size of my log buffer.
... that is ;), I prefer your approach and mine, which fix the problem rather
than paper over it.
>
> This is what a new boot state looks like (pains not to break ksplice).
> It's two patches, but I'll just post them back to back:
>
> module: add new state MODULE_STATE_UNFORMED
>
> You should never look at such a module, so it's excised from all paths
> which traverse the modules list.
>
> We add the state at the end, to avoid gratuitous ABI break (ksplice).
>
> Signed-off-by: Rusty Russell <rusty@...tcorp.com.au>
>
<snip patch>
Sure, but I'm always nervous about expanding any state machine ;). That's just
me though :).
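
For anyone following along, the idea in that patch description can be sketched roughly as below. This is a userspace illustration only, not the snipped patch: the enum mirrors the description (new state appended last so the existing values keep their numbers), while struct toy_module, toy_find_module() and its even_unformed flag are names assumed for this example.

```c
#include <stddef.h>
#include <string.h>

enum module_state {
    MODULE_STATE_LIVE,
    MODULE_STATE_COMING,
    MODULE_STATE_GOING,
    MODULE_STATE_UNFORMED, /* appended last: existing values are unchanged */
};

/* Hypothetical stand-in for a module list entry. */
struct toy_module {
    const char *name;
    enum module_state state;
    struct toy_module *next;
};

/* Walk the list and return the entry called 'name', or NULL.
 * Unformed entries are skipped unless the caller passes a nonzero
 * even_unformed, so ordinary lookups never see a half-built module. */
static struct toy_module *toy_find_module(struct toy_module *head,
                                          const char *name,
                                          int even_unformed)
{
    for (struct toy_module *m = head; m != NULL; m = m->next) {
        if (!even_unformed && m->state == MODULE_STATE_UNFORMED)
            continue;
        if (strcmp(m->name, name) == 0)
            return m;
    }
    return NULL;
}
```

In this sketch only the loader's own duplicate check would pass a nonzero even_unformed, so a second load of the same name can be caught early; every other path that walks the list keeps skipping the entry until it leaves the unformed state.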
>
> module: put modules in list much earlier.
>
> Prarit's excellent bug report:
>> In recent Fedora releases (F17 & F18) some users have reported seeing
>> messages similar to
>>
>> [ 15.478160] kvm: Could not allocate 304 bytes percpu data
>> [ 15.478174] PERCPU: allocation failed, size=304 align=32, alloc from
>> reserved chunk failed
>>
>> during system boot. In some cases, users have also reported seeing this
>> message along with a failed load of other modules.
>>
>> What is happening is systemd is loading an instance of the kvm module for
>> each cpu found (see commit e9bda3b). When the module load occurs the kernel
>> currently allocates the module's percpu data area prior to checking to see
>> if the module is already loaded or is in the process of being loaded. If
>> the module is already loaded, or finishes load, the module loading code
>> releases the current instance's module's percpu data.
>
> Now we have a new state MODULE_STATE_UNFORMED, we can insert the
> module into the list (and thus guarantee its uniqueness) before we
> allocate the per-cpu region.
>
> Reported-by: Prarit Bhargava <prarit@...hat.com>
> Signed-off-by: Rusty Russell <rusty@...tcorp.com.au>
>
<snip patch>
Tested-by: Prarit Bhargava <prarit@...hat.com>
Rusty, you can change that to an Acked-by if you prefer that. I know some
engineers prefer one over the other. I'll also continue doing some reboot
testing and will email back in a few days to let you know what the timing looks
like.
Thanks!,
P.