[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.0609010928010.25057@skynet.skynet.ie>
Date: Fri, 1 Sep 2006 09:33:19 +0100 (IST)
From: Mel Gorman <mel@....ul.ie>
To: Keith Mannthey <kmannth@...il.com>
Cc: akpm@...l.org, tony.luck@...el.com,
Linux Memory Management List <linux-mm@...ck.org>,
ak@...e.de, bob.picco@...com,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linuxppc-dev@...abs.org
Subject: Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
On Thu, 31 Aug 2006, Keith Mannthey wrote:
> On 8/31/06, Mel Gorman <mel@....ul.ie> wrote:
>> On Thu, 31 Aug 2006, Keith Mannthey wrote:
>> > On 8/31/06, Mel Gorman <mel@...net.ie> wrote:
>> >> On (30/08/06 13:57), Keith Mannthey didst pronounce:
>> >> > On 8/21/06, Mel Gorman <mel@....ul.ie> wrote:
>> >> > >
>
>> Can you confirm that happens by applying the patch I sent to you and
>> checking the output? When the reserve fails, it should print out what
>> range it actually checked. I want to be sure it's not checking the
>> addresses 0->0x1070000000
>
> See below
>
Perfect, thanks a lot. I should have enough to reproduce without a test
machine what is going on and develop the required patches.
>> >> > >@@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
>> >> > >
>> >> > > printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
>> >> > > nd->start, nd->end);
>> >> > >+ e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
>> >> > >+ nd->end >>
>> PAGE_SHIFT);
>> >> >
>> >> > A node chunk in this section of code may be a hot-pluggable zone. With
>> >> > MEMORY_HOTPLUG_SPARSE we don't want to register these regions.
>> >> >
>> >>
>> >> The ranges should not get registered as active memory by
>> >> e820_register_active_regions() unless they are marked E820_RAM. My
>> >> understanding is that the regions for hotadd would be marked "reserved"
>> >> in the e820 map. Is that wrong?
>> >
>> > This is wrong. In a mult-node system that last node add area will not
>> > be marked reserved by the e820. The e820 only defines memory <
>> > end_pfn. the last node add area is > end_pfn.
>> >
>>
>> ok, that should still be fine. As long as the ranges are not marked
>> "usable", add_active_range() will not be called and the holes should be
>> counted correctly with the patch I sent you.
>>
>> > With RESERVE based add-memory you want the add-areas repored by the
>> > srat to be setup during boot like all the other pages.
>> >
>>
>> So, do you actally expect a lot of unused mem_map to be allocated with
>> struct pages that are inactive until memory is hot-added in an
>> x86_64-specific manner? The arch-independent stuff currently will not do
>> that. It sets up memmap for where memory really exists. If that is not
>> what you expect, it will hit issues at hotadd time which is not the
>> current issue but one that can be fixed.
>
> Yes. RESERVED based is a big waste of mem_map space.
Right, it's all very clear now. At some point in the future, I'd like to
visit why SPARSEMEM-based hot-add is not always used but it's a separate
issue.
> The add areas
> are marked as RESERVED during boot and then later onlined during add.
That explains the reserve_bootmem_node()
> It might be ok. I will play with tomorrow. I might just need to
> call add_active_range in the right spot :)
>
I'll play with this from the opposite perspective - what is required for
any arch using this API to have spare memmap for reserve-based hot-add.
>> >> > > if (ma->flags.hot_pluggable && !reserve_hotadd(node, start,
>> end)
>> >> <
>> >> > > 0) {
>> >> > > /* Ignore hotadd region. Undo damage */
>> >> >
>> >> > I have but the e820_register_active_regions as a else to this
>> >> > statment the absent pages check fails.
>> >> >
>> >>
>> >> The patch below omits this change because I think
>> >> e820_register_active_regions() will still have got called by the time
>> >> you encounter a hotplug area.
>> >
>> > called but then removed in setup arch.
>>
>> By "removed", I assume you mean the active regions removed by the call
>> to remove_all_active_regions() in setup_arch(). Before reserve_hotadd() is
>> called, e820_register_active_regions() will have reregistered the active
>> regions with the NUMA node id.
>
> I see e820_register_active_regions is acting as a filter against the e820
>
Yes. A range of pfn's in given to e820_register_active_regions() and it
reads the e820 for E820_RAM sections within that range.
>> >> > Also nodes_cover_memory and alot of these check were based against
>> >> > comparing the srat data against the e820. Now all this code is
>> >> > comparing SRAT against SRAT....
>> >> >
>> >>
>> >> I don't see why. The SRAT table passes a range to
>> >> e820_register_active_regions() so should be comparing SRAT to e820
>> >
>> > let me go off and look at e820_register_active_regions() some more.
> Things get clear :)
>
> Should be ok.
>
nice.
>> > Sure thing. It is just the hot-add area I am guessing it is an off by
>> > one error of some sort.
>> >
> See below. I do my e820_register_active_area as an else to to if
> (hotplug.....!reserve) and the prink is easy to sort out.
>
yep, it should be easy to put this into a test program.
> I see your pfn are in base 10. Looks like it considers the last
> addres to be a present page. (off by one thing).
>
Probably. I'll start poking now. Thanks a lot.
> Thanks,
> Keith
>
> Output below
> disabling early console
> Linux version 2.6.18-rc4-mm3-smp (root@...3a153) (gcc version 4.1.0
> (SUSE Linux)) #6 SMP Thu Aug 31 22:06:00 EDT 2006
> Command line: root=/dev/sda3
> ip=9.47.66.153:9.47.66.169:9.47.66.1:255.255.255.0 resume=/dev/sda2
> showopts earlyprintk=ttyS0,115200 console=ttyS0,115200 console=tty0
> debug numa=hotadd=100
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 0000000000098400 (usable)
> BIOS-e820: 0000000000098400 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 000000007ff85e00 (usable)
> BIOS-e820: 000000007ff85e00 - 000000007ff98880 (ACPI data)
> BIOS-e820: 000000007ff98880 - 0000000080000000 (reserved)
> BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
> BIOS-e820: 0000000100000000 - 0000000470000000 (usable)
> BIOS-e820: 0000001070000000 - 0000001160000000 (usable)
> Entering add_active_range(0, 0, 152) 0 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
> Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
> Entering add_active_range(0, 17235968, 18219008) 3 entries of 3200 used
> end_pfn_map = 18219008
> DMI 2.3 present.
> ACPI: RSDP (v000 IBM ) @ 0x00000000000fdcf0
> ACPI: RSDT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @
> 0x000000007ff98800
> ACPI: FADT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @
> 0x000000007ff98780
> ACPI: MADT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @
> 0x000000007ff98600
> ACPI: SRAT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @
> 0x000000007ff983c0
> ACPI: HPET (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @
> 0x000000007ff98380
> ACPI: SSDT (v001 IBM VIGSSDT0 0x00001000 INTL 0x20030122) @
> 0x000000007ff90780
> ACPI: SSDT (v001 IBM VIGSSDT1 0x00001000 INTL 0x20030122) @
> 0x000000007ff88bc0
> ACPI: DSDT (v001 IBM EXA01ZEU 0x00001000 INTL 0x20030122) @
> 0x0000000000000000
> SRAT: PXM 0 -> APIC 0 -> Node 0
> SRAT: PXM 0 -> APIC 1 -> Node 0
> SRAT: PXM 0 -> APIC 2 -> Node 0
> SRAT: PXM 0 -> APIC 3 -> Node 0
> SRAT: PXM 0 -> APIC 38 -> Node 0
> SRAT: PXM 0 -> APIC 39 -> Node 0
> SRAT: PXM 0 -> APIC 36 -> Node 0
> SRAT: PXM 0 -> APIC 37 -> Node 0
> SRAT: PXM 1 -> APIC 64 -> Node 1
> SRAT: PXM 1 -> APIC 65 -> Node 1
> SRAT: PXM 1 -> APIC 66 -> Node 1
> SRAT: PXM 1 -> APIC 67 -> Node 1
> SRAT: PXM 1 -> APIC 102 -> Node 1
> SRAT: PXM 1 -> APIC 103 -> Node 1
> SRAT: PXM 1 -> APIC 100 -> Node 1
> SRAT: PXM 1 -> APIC 101 -> Node 1
> SRAT: Node 0 PXM 0 0-80000000
> Entering add_active_range(0, 0, 152) 0 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
> SRAT: Node 0 PXM 0 0-470000000
> Entering add_active_range(0, 0, 152) 2 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 2 entries of 3200 used
> Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
> SRAT: Node 0 PXM 0 0-1070000000
> reserve_hotadd called with node 0 sart 470000000 end 1070000000
> SRAT: Hotplug area has existing memory
> Entering add_active_range(0, 0, 152) 3 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 3 entries of 3200 used
> Entering add_active_range(0, 1048576, 4653056) 3 entries of 3200 used
> SRAT: Node 1 PXM 1 1070000000-1160000000
> Entering add_active_range(1, 17235968, 18219008) 3 entries of 3200 used
> SRAT: Node 1 PXM 1 1070000000-3200000000
> reserve_hotadd called with node 1 sart 1160000000 end 3200000000
> SRAT: Hotplug area has existing memory
> Entering add_active_range(1, 17235968, 18219008) 4 entries of 3200 used
> NUMA: Using 28 for the hash shift.
> Bootmem setup node 0 0000000000000000-0000001070000000
> Bootmem setup node 1 0000001070000000-0000001160000000
> Zone PFN ranges:
> DMA 0 -> 4096
> DMA32 4096 -> 1048576
> Normal 1048576 -> 18219008
> early_node_map[4] active PFN ranges
> 0: 0 -> 152
> 0: 256 -> 524165
> 0: 1048576 -> 4653056
> 1: 17235968 -> 18219008
> On node 0 totalpages: 4128541
> 0 pages used for SPARSE memmap
> 1149 pages DMA reserved
> DMA zone: 2843 pages, LIFO batch:0
> 0 pages used for SPARSE memmap
> DMA32 zone: 520069 pages, LIFO batch:31
> 0 pages used for SPARSE memmap
> Normal zone: 3604480 pages, LIFO batch:31
> On node 1 totalpages: 983040
> 0 pages used for SPARSE memmap
> 0 pages used for SPARSE memmap
> 0 pages used for SPARSE memmap
> Normal zone: 983040 pages, LIFO batch:31
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists