Message-ID: <50B824DE.40702@jp.fujitsu.com>
Date: Fri, 30 Nov 2012 12:15:42 +0900
From: Yasuaki Ishimatsu <isimatu.yasuaki@...fujitsu.com>
To: Jiang Liu <jiang.liu@...wei.com>
CC: Mel Gorman <mgorman@...e.de>, "H. Peter Anvin" <hpa@...or.com>,
"Luck, Tony" <tony.luck@...el.com>,
Tang Chen <tangchen@...fujitsu.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"rob@...dley.net" <rob@...dley.net>,
"laijs@...fujitsu.com" <laijs@...fujitsu.com>,
"wency@...fujitsu.com" <wency@...fujitsu.com>,
"linfeng@...fujitsu.com" <linfeng@...fujitsu.com>,
"yinghai@...nel.org" <yinghai@...nel.org>,
"kosaki.motohiro@...fujitsu.com" <kosaki.motohiro@...fujitsu.com>,
"minchan.kim@...il.com" <minchan.kim@...il.com>,
"rientjes@...gle.com" <rientjes@...gle.com>,
"rusty@...tcorp.com.au" <rusty@...tcorp.com.au>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
Len Brown <lenb@...nel.org>,
"Wang, Frank" <frank.wang@...el.com>
Subject: Re: [PATCH v2 0/5] Add movablecore_map boot option
Hi Jiang,
2012/11/30 11:56, Jiang Liu wrote:
> Hi Mel,
> Thanks for your great comments!
>
> On 2012-11-29 19:00, Mel Gorman wrote:
>> On Wed, Nov 28, 2012 at 01:38:47PM -0800, H. Peter Anvin wrote:
>>> On 11/28/2012 01:34 PM, Luck, Tony wrote:
>>>>>
>>>>> 2. use boot option
>>>>> This is our proposal. New boot option can specify memory range to use
>>>>> as movable memory.
>>>>
>>>> Isn't this just moving the work to the user? To pick good values for the
>>>> movable areas, they need to know how the memory lines up across
>>>> node boundaries ... because they need to make sure to allow some
>>>> non-movable memory allocations on each node so that the kernel can
>>>> take advantage of node locality.
>>>>
>>>> So the user would have to read at least the SRAT table, and perhaps
>>>> more, to figure out what to provide as arguments.
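
Just to make the shape of the proposal concrete, the option implies an
early_param handler roughly like the sketch below. The "size@start" syntax
and every name here are guesses for illustration only, not what the actual
patch does; memparse() and early_param() are the only real kernel facilities
used.

/*
 * Illustrative only: a hypothetical parser for a
 * "movablecore_map=<size>@<start>[,<size>@<start>...]" style option.
 */
#include <linux/init.h>
#include <linux/kernel.h>

struct movable_range {
        u64 start;      /* physical start address of the range */
        u64 size;       /* length of the range in bytes */
};

#define MAX_MOVABLE_RANGES 32
static struct movable_range movable_map[MAX_MOVABLE_RANGES] __initdata;
static int nr_movable_ranges __initdata;

static int __init parse_movablecore_map(char *p)
{
        while (p && *p && nr_movable_ranges < MAX_MOVABLE_RANGES) {
                char *old = p;
                u64 size = memparse(p, &p);     /* e.g. "4G", p advances */

                if (p == old || *p != '@')
                        return -EINVAL;
                movable_map[nr_movable_ranges].size = size;
                movable_map[nr_movable_ranges].start = memparse(p + 1, &p);
                nr_movable_ranges++;
                if (*p == ',')
                        p++;
        }
        return 0;
}
early_param("movablecore_map", parse_movablecore_map);

Which only underlines the point above: the addresses fed to such an option
have to come from somewhere, and in practice that means the SRAT.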
>>>>
>>>> Since this is going to be used on a dynamic system where nodes might
>>>> be added and removed, the right values for these arguments might
>>>> change from one boot to the next. So even if the user gets them right
>>>> on day 1, a month later when a new node has been added, or a broken
>>>> node removed, the values would be stale.
>>>>
>>>
>>> I gave this feedback in person at LCE: I consider the kernel
>>> configuration option to be useless for anything other than debugging.
>>> Trying to promote it as an actual solution, to be used by end users in
>>> the field, is ridiculous at best.
>>>
>>
>> I've not been paying a whole pile of attention to this because it's not an
>> area I'm active in but I agree that configuring ZONE_MOVABLE like
>> this at boot-time is going to be problematic. As awkward as it is, it
>> would probably work out better to only boot with one node by default and
>> then hot-add the nodes at runtime using either an online sysfs file or
>> an online-reserved file that hot-adds the memory to ZONE_MOVABLE. Still
>> clumsy but better than specifying addresses on the command line.
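
For what it's worth, the sysfs side of that could be quite small; something
like the sketch below, where online_pages_movable() is a hypothetical helper
standing in for whatever hot-add-to-ZONE_MOVABLE primitive gets implemented.

/*
 * Sketch only: a write-only memory-block attribute that onlines the
 * block's pages into ZONE_MOVABLE instead of the node's normal zone.
 */
#include <linux/device.h>
#include <linux/memory.h>

static ssize_t online_movable_store(struct device *dev,
                                    struct device_attribute *attr,
                                    const char *buf, size_t count)
{
        struct memory_block *mem = container_of(dev, struct memory_block, dev);
        unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
        int ret;

        /* Hypothetical helper: add these pages to ZONE_MOVABLE. */
        ret = online_pages_movable(start_pfn, PAGES_PER_SECTION);
        return ret ? ret : count;
}
static DEVICE_ATTR(online_movable, 0200, NULL, online_movable_store);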
>>
>> That said, I also find using ZONE_MOVABLE to be a problem in itself that
>> will cause problems down the road. Maybe this was discussed already but
>> just in case I'll describe the problems I see.
>>
>> If any significant percentage of memory is in ZONE_MOVABLE then the memory
>> hotplug people will have to deal with all the lowmem/highmem problems
>> that used to be faced by 32-bit x86 with PAE enabled. As a simple example,
>> metadata intensive workloads will not be able to use all of memory because
>> the kernel allocations will be confined to a subset of memory. A more
>> complex example is that page table page allocations are also restricted
>> meaning it's possible that a process will not even be able to mmap() a high
>> percentage of memory simply because it cannot allocate the page tables to
>> store the mappings. ZONE_MOVABLE works up to a *point*, but it's a hack. It
>> was a hack when it was introduced but at least then the expectation was
>> that ZONE_MOVABLE was going to be used for huge pages and there was at
>> least an expectation that it would not be available for normal usage.
>>
>> Fundamentally the reason one would want to use ZONE_MOVABLE is because
>> we cannot migrate a lot of kernel memory -- slab pages, page table pages,
>> device-allocated buffers etc. My understanding is that other OS's get around
>> this by requiring that subsystems and drivers have callbacks that allow the
>> core VM to force certain memory to be released but that may be impractical
>> for Linux. I don't know for sure though, this is just what I heard.
> As far as I know, one other OS limits immovable pages to the low end, and
> the limit increases on demand. But the drawback of this solution is a serious
> performance drop (about 10% on average) because it essentially disables NUMA
> optimization for kernel/DMA memory allocations.
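
On the callback approach Mel mentions above: if Linux ever tried it, the
contract would presumably look roughly like the sketch below. Nothing like
this exists today and all names are invented.

/*
 * Purely illustrative: a hypothetical "please give this memory back"
 * contract that subsystems and drivers would register with the core VM.
 */
#include <linux/list.h>
#include <linux/types.h>

struct vacate_ops {
        struct list_head list;
        const char *name;       /* owner, e.g. a driver or cache name */
        /*
         * Release or relocate everything in [start_pfn, start_pfn + nr_pages).
         * Return 0 on success, -EBUSY if the range cannot be vacated now.
         */
        int (*vacate)(unsigned long start_pfn, unsigned long nr_pages);
};

/* Hypothetical registration API used by slab caches, drivers, etc. */
int register_vacate_ops(struct vacate_ops *ops);
void unregister_vacate_ops(struct vacate_ops *ops);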
>
>> For Linux, the hotplug people need to start thinking about how to get
>> around this migration problem. The first problem faced is the memory model
>> and how it maps virt->phys addresses. We have a 1:1 mapping because it's
>> fast but not because it's a fundamental requirement. Start considering
>> what happens if the memory model is changed to allow some sections to have
>> fast lookup for virt_to_phys and other sections to have slow lookups. On
>> hotplug, try and empty all the sections. If the section cannot be emptied
>> because of kernel pages then the section gets marked as "offline-migrated"
>> or something. Stop the whole machine (yes, I mean stop_machine), copy
>> those unmovable pages to another location, update the kernel virt->phys
>> mapping for the section being offlined so the virt addresses point to the
>> new physical addresses and resume. Virt->phys lookups are going to be
>> a lot slower because a full section lookup will be necessary every time
>> effectively breaking SPARSE_VMEMMAP and there will be a performance penalty
>> but it should work. This will cover some slab pages where the data is only
>> accessed via the virtual address -- inode caches, dcache etc.
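
A toy sketch of that section-indirection idea, just to show the moving
parts. The translation table, helper names and locking are all made up, and
real code would also have to rewrite the direct-map page tables and flush
TLBs inside the stop_machine() window.

/*
 * Toy model: sections that have been evacuated get an entry in a
 * translation table; everything else keeps the fast 1:1 mapping.
 */
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/stop_machine.h>
#include <linux/string.h>

#define SECTION_BYTES ((phys_addr_t)1 << PA_SECTION_SHIFT)

static phys_addr_t section_phys_base[NR_MEM_SECTIONS];  /* 0 = still 1:1 */

static phys_addr_t indirect_virt_to_phys(const void *addr)
{
        phys_addr_t phys = __pa(addr);                  /* old 1:1 answer */
        unsigned long nr = pfn_to_section_nr(phys >> PAGE_SHIFT);

        if (section_phys_base[nr])
                return section_phys_base[nr] + (phys & (SECTION_BYTES - 1));
        return phys;
}

struct remap_args {
        unsigned long src_nr;   /* section being offlined */
        phys_addr_t dst_base;   /* spare physical range of the same size */
};

/* Runs with all other CPUs held in stop_machine(). */
static int __remap_section(void *data)
{
        struct remap_args *args = data;
        phys_addr_t src = (phys_addr_t)args->src_nr << PA_SECTION_SHIFT;

        memcpy(__va(args->dst_base), __va(src), SECTION_BYTES);
        section_phys_base[args->src_nr] = args->dst_base;
        /* Real code: rewrite direct-map page tables, flush TLBs. */
        return 0;
}

static int remap_section(unsigned long src_nr, phys_addr_t dst_base)
{
        struct remap_args args = { .src_nr = src_nr, .dst_base = dst_base };

        return stop_machine(__remap_section, &args, NULL);
}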
>>
>> It will not work where the physical address is used. The obvious example
>> is page table pages. For page tables, during stop_machine you will have to
>> walk all processes' page tables looking for references to the page you're
>> trying to move and update them. It is possible to just plain migrate
>> page table pages but when it was last implemented years ago there was a
>> constant performance penalty for everybody and it was not popular. Taking a
>> heavy-handed approach just during memory hot-remove might be more palatable.
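
Rough shape of that walk, assuming a generic page-table walker
(walk_page_range() with a pte_entry callback; the exact interface varies by
kernel version, and pte_pgprot() is arch-specific). Only the PTE level is
shown; huge-page mappings and the page-table pages themselves would need the
same treatment.

/*
 * Sketch: inside the stop_machine() window, rewrite every user PTE that
 * still points at the physical page being moved.
 */
#include <linux/mm.h>
#include <linux/pagewalk.h>
#include <linux/sched/signal.h>

struct pfn_fixup {
        unsigned long old_pfn;
        unsigned long new_pfn;
};

static int fixup_pte(pte_t *pte, unsigned long addr, unsigned long next,
                     struct mm_walk *walk)
{
        struct pfn_fixup *f = walk->private;

        if (pte_present(*pte) && pte_pfn(*pte) == f->old_pfn)
                set_pte_at(walk->mm, addr, pte,
                           pfn_pte(f->new_pfn, pte_pgprot(*pte)));
        return 0;
}

static const struct mm_walk_ops pfn_fixup_ops = {
        .pte_entry = fixup_pte,
};

/* Everything else is frozen by stop_machine(), so no mm changes under us. */
static void fixup_all_page_tables(struct pfn_fixup *f)
{
        struct task_struct *p;

        for_each_process(p) {
                if (p->mm)
                        walk_page_range(p->mm, 0, TASK_SIZE,
                                        &pfn_fixup_ops, f);
        }
}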
>>
>> For the remaining pages such as those that have been handed to devices
>> or are pinned for DMA then your options become more limited. You may
>> still have to restrict allocating these pages (where possible) to a
>> region that cannot be hot-removed but at least this will be relatively
>> few pages.
>>
>> The big downside of this proposal is that it's unproven, not designed,
>> would be extremely intrusive and I expect it would be a *massive* amount
>> of development effort that will be difficult to get right. The upside is
>> configuring it will be a lot easier because all you'll need is a variation
>> of kernelcore= to reserve a percentage of memory for allocations we *really*
>> cannot migrate because the physical pages are owned by a device that cannot
>> release them, potentially forever. The other upside is that it does not
>> hit crazy lowmem/highmem style problems.
>>
>> ZONE_MOVABLE at least will allow a node to be removed very quickly, but
>> because it will paint you into a corner there should be a plan on what
>> you're going to replace it with.
>
> I have some thoughts here. The basic idea is that it needs cooperation
> between OS, BIOS and hardware to implement a flexible memory hotplug
> solution.
>
> As you have mentioned, ZONE_MOVABLE is a quick but slightly dirty
> solution. It's quick because we could rely on the existing mechanism
> to configure the movable zone, and no changes to the memory model are needed.
> It's a little dirty because:
> 1) We need to handle the case of running out of immovable pages. The hotplug
> implementation shouldn't cause extra service interruption when normal zones
> are under pressure; otherwise it would be ironic for service interruptions
> to be caused by a feature meant to improve service availability.
> 2) We still can't handle normal kernel pages used by the kernel, devices, etc.
> 3) It may cause a serious performance drop if we configure all memory
> on a NUMA node as ZONE_MOVABLE.
>
> For the first issue, I think we could automatically convert pages
> from movable zones into normal zones. Congyan from Fujitsu has provided
> a patchset to manually convert pages from movable zones into normal zones;
> I think we could extend that mechanism to convert automatically when
> normal zones are under pressure by hooking into the slow page allocation
> path.
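
Something along these lines, presumably. expand_normal_zone_from_movable()
is imaginary and stands in for the conversion from that patchset; the point
is only where in the allocator the hook would sit.

/*
 * Hypothetical hook, called from __alloc_pages_slowpath() before giving up
 * (e.g. just before OOM): if an unmovable kernel allocation is failing and
 * the node still has ZONE_MOVABLE memory, convert a pageblock to
 * ZONE_NORMAL and tell the slow path to retry.
 */
#include <linux/gfp.h>
#include <linux/mmzone.h>

/* Imaginary helper: the manual movable->normal conversion, done in-kernel. */
int expand_normal_zone_from_movable(int nid, unsigned long nr_pages);

static bool try_grow_normal_zone(gfp_t gfp_mask, unsigned int order, int nid)
{
        /* Movable allocations can already use ZONE_MOVABLE; nothing to do. */
        if (gfp_mask & __GFP_MOVABLE)
                return false;

        return expand_normal_zone_from_movable(nid, pageblock_nr_pages) == 0;
}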
>
> We rely on hardware features to solve the second and third issues.
> Some new platforms provide a RAS feature called "hardware memory
> migration", which transparently migrates memory from one memory device
> to another. With hardware memory migration, we could configure one
> memory device on a NUMA node to host the normal zone, and the other memory
> devices to host the movable zone. With this configuration there is no
> performance drop, because each NUMA node still has a local normal zone.
> When trying to remove a memory device hosting a normal zone, we just
> need to find a spare memory device and use hardware memory migration
> to transparently migrate the memory contents to it. The drawback
> is a strong dependency on hardware features, so it's not a common
> solution for all architectures.
I agree with you. If the BIOS and hardware support memory hotplug, the OS
should use them. But if the OS cannot use them, we need to solve it in the
OS. I think that our proposal, which uses ZONE_MOVABLE, is a first step
toward supporting memory hotplug.
Thanks,
Yasuaki Ishimatsu
>
> Regards!
> Gerry
>
>