Date:	Tue, 7 Jul 2015 16:57:03 +0800
From:	Tang Chen <tangchen@...fujitsu.com>
To:	Yasuaki Ishimatsu <yasu.isimatu@...il.com>
CC:	Xishi Qiu <qiuxishi@...wei.com>, <tglx@...utronix.de>,
	<mingo@...hat.com>, <hpa@...or.com>, <akpm@...ux-foundation.org>,
	<tj@...nel.org>, <dyoung@...hat.com>,
	<isimatu.yasuaki@...fujitsu.com>, <lcapitulino@...hat.com>,
	<will.deacon@....com>, <tony.luck@...el.com>,
	<vladimir.murzin@....com>, <fabf@...net.be>,
	<kuleshovmail@...il.com>, <bhe@...hat.com>, <x86@...nel.org>,
	<linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>
Subject: Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.


On 07/07/2015 12:42 AM, Yasuaki Ishimatsu wrote:
> On Fri, 3 Jul 2015 09:26:05 +0800
> Tang Chen <tangchen@...fujitsu.com> wrote:
>
>> On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:
>>> Hi Tang,
>>>
>>>> On my box, if I run lscpu, the output looks like this:
>>>>
>>>> NUMA node0 CPU(s):     0-14,128-142
>>>> NUMA node1 CPU(s):     15-29,143-157
>>>> NUMA node2 CPU(s):
>>>> NUMA node3 CPU(s):
>>>> NUMA node4 CPU(s):     62-76,190-204
>>>> NUMA node5 CPU(s):     78-92,206-220
>>>>
>>>> Nodes 2 and 3 do not exist, but they are online.
>>> According to the description of your patch, nodes 4 and 5 are mistakenly
>> Not nodes 4 and 5; it is nodes 2 and 3 that are mistakenly set online.
> Please add the lscpu results before and after applying the patch to
> the description of your patch.
>
> Feel free to add my
> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@...fujitsu.com>

Thanks for reviewing. Will update the patch soon.

Thanks.

>
> Thanks,
> Yasuaki Ishimatsu
>
>>> set to online. Why does lscpu show the above result?
>> Well, actually lscpu is not the only thing that gives the strange result:
>> interfaces for nodes 2 and 3 are also created under
>> /sys/devices/system/node.
>>
>> I haven't read the lscpu code, so I'm not sure how lscpu handles nodes.
>> But obviously nodes 2 and 3 are set online, which is incorrect.
>>
>> So far, I have only found that numa_cleanup_meminfo() removes memory
>> above max_pfn but does not remove holes between nodes.
>>
>> I think libraries are not able to handle this problem, since the nodes
>> are set online in the kernel. Seen from user space, there is no hole.
>>
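
One way to confirm from user space which nodes the kernel has set online
is to read /sys/devices/system/node/online, which holds a range list.
A minimal sketch of parsing that format follows. parse_node_list() is a
hypothetical helper, not part of any library, and the sample strings in
the note after the code are assumed values for illustration, not output
captured from the box discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a sysfs-style range list such as "0-1,4-5"
 * and mark every listed ID in online[]. Returns the number of IDs marked,
 * or -1 if the input is malformed or an ID is >= max_nodes.
 */
static int parse_node_list(const char *s, bool *online, size_t max_nodes)
{
	int count = 0;

	assert(s != NULL && online != NULL);
	for (size_t i = 0; i < max_nodes; i++)
		online[i] = false;

	while (*s && *s != '\n') {
		char *end;
		unsigned long lo = strtoul(s, &end, 10);
		unsigned long hi;

		if (end == s)
			return -1;	/* no digits where a number was expected */
		hi = lo;
		s = end;

		if (*s == '-') {	/* "lo-hi" range */
			s++;
			hi = strtoul(s, &end, 10);
			if (end == s)
				return -1;
			s = end;
		}
		if (hi < lo || hi >= max_nodes)
			return -1;

		for (unsigned long n = lo; n <= hi; n++) {
			if (!online[n])
				count++;
			online[n] = true;
		}

		if (*s == ',')
			s++;		/* another entry follows */
		else if (*s && *s != '\n')
			return -1;	/* unexpected character */
	}
	return count;
}
```

With an assumed input of "0-5", nodes 2 and 3 come back as online; with
an assumed input of "0-1,4-5" they do not, which is the kind of
difference a fix for the node hole would be expected to produce.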
>> Thanks.
>>
>>> Thanks,
>>> Yasuaki Ishimatsu
>>>
>>> On Wed, 1 Jul 2015 15:55:30 +0800
>>> Tang Chen <tangchen@...fujitsu.com> wrote:
>>>
>>>> On 07/01/2015 02:25 PM, Xishi Qiu wrote:
>>>>> On 2015/7/1 11:16, Tang Chen wrote:
>>>>>
>>>>>> When parsing SRAT, all memory ranges are added into numa_meminfo.
>>>>>> In numa_init(), before entering numa_cleanup_meminfo(), all possible
>>>>>> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
>>>>>> all ranges over max_pfn or empty.
>>>>>>
>>>>>> But this only works if the nodes are contiguous. Let's have a look
>>>>>> at the following example:
>>>>>>
>>>>>> We have an SRAT like this:
>>>>>> SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
>>>>>> SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
>>>>>> SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
>>>>>> SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
>>>>>> SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
>>>>>> SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
>>>>>> SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
>>>>>> SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
>>>>>> SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug
>>>>>>
>>>>>> On boot, only nodes 0, 1, 2 and 3 exist.
>>>>>>
>>>>>> And the numa_meminfo will look like this:
>>>>>> numa_meminfo.nr_blks = 9
>>>>>> 1. on node 0: [0, 60000000]
>>>>>> 2. on node 0: [100000000, 20000000000]
>>>>>> 3. on node 1: [20000000000, 40000000000]
>>>>>> 4. on node 4: [40000000000, 60000000000]
>>>>>> 5. on node 5: [60000000000, 80000000000]
>>>>>> 6. on node 2: [80000000000, a0000000000]
>>>>>> 7. on node 3: [a0000000000, a0800000000]
>>>>>> 8. on node 6: [c0000000000, a0800000000]
>>>>>> 9. on node 7: [e0000000000, a0800000000]
>>>>>>
>>>>>> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8 and 9 because
>>>>>> their end addresses are over max_pfn, which is a0800000000. But 4 and 5
>>>>>> are not removed because their end addresses are less than max_pfn.
>>>>>> But in fact, nodes 4 and 5 don't exist.
>>>>>>
>>>>>> In short, numa_cleanup_meminfo() is not able to handle holes between nodes.
>>>>>>
>>>>>> Since the memory ranges of nodes 4 and 5 are still in numa_meminfo,
>>>>>> numa_register_memblks() will mistakenly set nodes 4 and 5 online.
>>>>>>
>>>>>> In this patch, we use memblock_overlaps_region() to check whether ranges in
>>>>>> numa_meminfo overlap with ranges in memblock.memory. Since memblock.memory
>>>>>> contains all available memory at boot time, an overlap means the range
>>>>>> exists. If there is no overlap, the range is removed from numa_meminfo.
>>>>>>
>>>>> Hi Tang Chen,
>>>>>
>>>>> What's the impact of this problem?
>>>>>
>>>>> The command "numactl --hardware" will show an empty node (no CPU and no
>>>>> memory, but a pgdat is created), right?
>>>> On my box, if I run lscpu, the output looks like this:
>>>>
>>>> NUMA node0 CPU(s):     0-14,128-142
>>>> NUMA node1 CPU(s):     15-29,143-157
>>>> NUMA node2 CPU(s):
>>>> NUMA node3 CPU(s):
>>>> NUMA node4 CPU(s):     62-76,190-204
>>>> NUMA node5 CPU(s):     78-92,206-220
>>>>
>>>> Nodes 2 and 3 do not exist, but they are online.
>>>>
>>>> Thanks.
>>>>
>>>>> Thanks,
>>>>> Xishi Qiu
>>>>>
>>>>>> Signed-off-by: Tang Chen <tangchen@...fujitsu.com>
>>>>>> ---
>>>>>>     arch/x86/mm/numa.c       | 6 ++++--
>>>>>>     include/linux/memblock.h | 2 ++
>>>>>>     mm/memblock.c            | 2 +-
>>>>>>     3 files changed, 7 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>>>>>> index 4053bb5..0c55cc5 100644
>>>>>> --- a/arch/x86/mm/numa.c
>>>>>> +++ b/arch/x86/mm/numa.c
>>>>>> @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
>>>>>>     		bi->start = max(bi->start, low);
>>>>>>     		bi->end = min(bi->end, high);
>>>>>>     
>>>>>> -		/* and there's no empty block */
>>>>>> -		if (bi->start >= bi->end)
>>>>>> +		/* and there's no empty or non-exist block */
>>>>>> +		if (bi->start >= bi->end ||
>>>>>> +		    memblock_overlaps_region(&memblock.memory,
>>>>>> +			bi->start, bi->end - bi->start) == -1)
>>>>>>     			numa_remove_memblk_from(i--, mi);
>>>>>>     	}
>>>>>>     
>>>>>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>>>>>> index 0215ffd..3bf6cc1 100644
>>>>>> --- a/include/linux/memblock.h
>>>>>> +++ b/include/linux/memblock.h
>>>>>> @@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
>>>>>>     int memblock_free(phys_addr_t base, phys_addr_t size);
>>>>>>     int memblock_reserve(phys_addr_t base, phys_addr_t size);
>>>>>>     void memblock_trim_memory(phys_addr_t align);
>>>>>> +long memblock_overlaps_region(struct memblock_type *type,
>>>>>> +			      phys_addr_t base, phys_addr_t size);
>>>>>>     int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
>>>>>>     int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
>>>>>>     int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
>>>>>> diff --git a/mm/memblock.c b/mm/memblock.c
>>>>>> index 1b444c7..55b5f9f 100644
>>>>>> --- a/mm/memblock.c
>>>>>> +++ b/mm/memblock.c
>>>>>> @@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
>>>>>>     	return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
>>>>>>     }
>>>>>>     
>>>>>> -static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
>>>>>> +long __init_memblock memblock_overlaps_region(struct memblock_type *type,
>>>>>>     					phys_addr_t base, phys_addr_t size)
>>>>>>     {
>>>>>>     	unsigned long i;
>>>>> .
>>>>>
>>> .
>>>
> .
>
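
For illustration, the check this patch adds to numa_cleanup_meminfo()
can be modelled in user space. The sketch below is not the kernel code:
struct range, struct meminfo, overlaps_region() and cleanup_meminfo()
are simplified stand-ins, the merge step is left out, and the list of
present memory is an assumption derived from the example in the patch
description (blocks 4 and 5 absent).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for a numa_memblk: half-open range [start, end). */
struct range {
	uint64_t start, end;
	int nid;
};

#define MAX_BLKS 16

/* Simplified stand-in for numa_meminfo. */
struct meminfo {
	size_t nr_blks;
	struct range blk[MAX_BLKS];
};

/* Same test as memblock_addrs_overlap() in the quoted diff. */
static bool ranges_overlap(uint64_t base1, uint64_t size1,
			   uint64_t base2, uint64_t size2)
{
	return base1 < base2 + size2 && base2 < base1 + size1;
}

/* Index of the first present region overlapping [base, base+size), or -1. */
static long overlaps_region(const struct range *present, size_t n,
			    uint64_t base, uint64_t size)
{
	for (size_t i = 0; i < n; i++)
		if (ranges_overlap(base, size, present[i].start,
				   present[i].end - present[i].start))
			return (long)i;
	return -1;
}

static void remove_blk(struct meminfo *mi, size_t idx)
{
	assert(idx < mi->nr_blks);
	mi->nr_blks--;
	memmove(&mi->blk[idx], &mi->blk[idx + 1],
		(mi->nr_blks - idx) * sizeof(mi->blk[0]));
}

/* Clamp each block to 'high', then drop empty or non-existent blocks. */
static void cleanup_meminfo(struct meminfo *mi, const struct range *present,
			    size_t n, uint64_t high)
{
	size_t i = 0;

	while (i < mi->nr_blks) {
		struct range *bi = &mi->blk[i];

		if (bi->end > high)
			bi->end = high;

		/* The size is only computed when start < end. */
		if (bi->start >= bi->end ||
		    overlaps_region(present, n, bi->start,
				    bi->end - bi->start) == -1) {
			remove_blk(mi, i);
			continue;	/* re-check the block shifted into slot i */
		}
		i++;
	}
}
```

Fed the nine example blocks with a limit of 0xa0800000000 and present
memory [0, 0x60000000), [0x100000000, 0x40000000000) and
[0x80000000000, 0xa0800000000), the sketch keeps blocks 1, 2, 3, 6 and 7
and drops 4, 5, 8 and 9.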
