[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090901142729.GA5022@balbir.in.ibm.com>
Date: Tue, 1 Sep 2009 19:57:29 +0530
From: Balbir Singh <balbir@...ux.vnet.ibm.com>
To: Ankita Garg <ankita@...ibm.com>
Cc: linuxppc-dev@...abs.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Fix fake numa on ppc
* Ankita Garg <ankita@...ibm.com> [2009-09-01 14:54:07]:
> Hi Balbir,
>
> On Tue, Sep 01, 2009 at 11:27:53AM +0530, Balbir Singh wrote:
> > * Ankita Garg <ankita@...ibm.com> [2009-09-01 10:33:16]:
> >
> > > Hello,
> > >
> > > Below is a patch to fix a couple of issues with fake numa node creation
> > > on ppc:
> > >
> > > 1) Presently, fake nodes could be created such that real numa node
> > > boundaries are not respected. So a node could have lmbs that belong to
> > > different real nodes.
> > >
> > > 2) The cpu association is broken. On a JS22 blade for example, which is
> > > a 2-node numa machine, I get the following:
> > >
> > > # cat /proc/cmdline
> > > root=/dev/sda6 numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
> > > # cat /sys/devices/system/node/node0/cpulist
> > > 0-3
> > > # cat /sys/devices/system/node/node1/cpulist
> > > 4-7
> > > # cat /sys/devices/system/node/node4/cpulist
> > >
> > > #
> > >
> > > So, though the cpus 4-7 should have been associated with node4, they
> > > still belong to node1. The patch works by recording a real numa node
> > > boundary and incrementing the fake node count. At the same time, a
> > > mapping is stored from the real numa node to the first fake node that
> > > gets created on it.
> > >
> >
> > Some details on how you tested it and results before and after would
> > be nice. Please see git commit 1daa6d08d1257aa61f376c3cc4795660877fb9e3
> > for example
> >
> >
>
> Thanks for the quick review of the patch. Here is some information on
> the testing:
>
> Tested the patch with the following commandlines:
> numa=fake=2G,4G,6G,8G,10G,12G,14G,16G
> numa=fake=3G,6G,10G,16G
> numa=fake=4G
> numa=fake=
>
> For testing if the fake nodes respect the real node boundaries, I added
> some debug printks in the node creation path. Without the patch, for the
> commandline numa=fake=2G,4G,6G,8G,10G,12G,14G,16G, this is what I got:
>
> fake id: 1 nid: 0
> fake id: 1 nid: 0
> ...
> fake id: 2 nid: 0
> fake id: 2 nid: 0
> ...
> fake id: 2 nid: 0
> created new fake_node with id 3
> fake id: 3 nid: 0
> fake id: 3 nid: 0
> ...
> fake id: 3 nid: 0
> fake id: 3 nid: 0
> fake id: 3 nid: 1
> fake id: 3 nid: 1
> ...
> created new fake_node with id 4
> fake id: 4 nid: 1
> fake id: 4 nid: 1
> ...
>
> and so on. So, fake node 3 encompasses real node 0 & 1. Also,
>
> # cat /sys/devices/system/node/node3/meminfo
> Node 0 MemTotal: 2097152 kB
> ...
> # # cat /sys/devices/system/node/node4/meminfo
> Node 0 MemTotal: 2097152 kB
> ...
>
>
> With the patch, I get:
>
> fake id: 1 nid: 0
> fake id: 1 nid: 0
> ...
> fake id: 2 nid: 0
> fake id: 2 nid: 0
> ...
> fake id: 2 nid: 0
> created new fake_node with id 3
> fake id: 3 nid: 0
> fake id: 3 nid: 0
> ...
> fake id: 3 nid: 0
> fake id: 3 nid: 0
> created new fake_node with id 4
> fake id: 4 nid: 1
> fake id: 4 nid: 1
> ...
>
> and so on. With the patch, the fake node sizes are slightly different
> from that specified by the user.
>
> # cat /sys/devices/system/node/node3/meminfo
> Node 3 MemTotal: 1638400 kB
> ...
> # cat /sys/devices/system/node/node4/meminfo
> Node 4 MemTotal: 458752 kB
> ...
>
> CPU association was tested as mentioned in the previous mail:
>
> Without the patch,
>
> # cat /proc/cmdline
> root=/dev/sda6 numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
> # cat /sys/devices/system/node/node0/cpulist
> 0-3
> # cat /sys/devices/system/node/node1/cpulist
> 4-7
> # cat /sys/devices/system/node/node4/cpulist
>
> #
>
> With the patch,
>
> # cat /proc/cmdline
> root=/dev/sda6 numa=fake=2G,4G,,6G,8G,10G,12G,14G,16G
> # cat /sys/devices/system/node/node0/cpulist
> 0-3
> # cat /sys/devices/system/node/node1/cpulist
>
Oh! interesting.. cpuless nodes :) I think we need to fix this in the
longer run and distribute cpus between fake numa nodes of a real node
using some acceptable heuristic.
> # cat /sys/devices/system/node/node4/cpulist
> 4-7
>
> > >
> > > Signed-off-by: Ankita Garg <ankita@...ibm.com>
> > >
> > > Index: linux-2.6.31-rc5/arch/powerpc/mm/numa.c
> > > ===================================================================
> > > --- linux-2.6.31-rc5.orig/arch/powerpc/mm/numa.c
> > > +++ linux-2.6.31-rc5/arch/powerpc/mm/numa.c
> > > @@ -26,6 +26,11 @@
> > > #include <asm/smp.h>
> > >
> > > static int numa_enabled = 1;
> > > +static int fake_enabled = 1;
> > > +
> > > +/* The array maps a real numa node to the first fake node that gets
> > > +created on it */
> >
> > Coding style is broken
> >
>
> Fixed.
>
> > > +int fake_numa_node_mapping[MAX_NUMNODES];
> > >
> > > static char *cmdline __initdata;
> > >
> > > @@ -49,14 +54,24 @@ static int __cpuinit fake_numa_create_ne
> > > unsigned long long mem;
> > > char *p = cmdline;
> > > static unsigned int fake_nid;
> > > + static unsigned int orig_nid = 0;
> >
> > Should we call this prev_nid?
> >
>
> Yes, makes sense.
> > > static unsigned long long curr_boundary;
> > >
> > > /*
> > > * Modify node id, iff we started creating NUMA nodes
> > > * We want to continue from where we left of the last time
> > > */
> > > - if (fake_nid)
> > > + if (fake_nid) {
> > > + if (orig_nid != *nid) {
> >
> > OK, so this is called when the real NUMA node changes - comments would
> > be nice
> >
>
> Thanks, have added the comment.
>
> > > + fake_nid++;
> > > + fake_numa_node_mapping[*nid] = fake_nid;
> > > + orig_nid = *nid;
> > > + *nid = fake_nid;
> > > + return 0;
> > > + }
> > > *nid = fake_nid;
> > > + }
> > > +
> > > /*
> > > * In case there are no more arguments to parse, the
> > > * node_id should be the same as the last fake node id
> > > @@ -440,7 +455,7 @@ static int of_drconf_to_nid_single(struc
> > > */
> > > static int __cpuinit numa_setup_cpu(unsigned long lcpu)
> > > {
> > > - int nid = 0;
> > > + int nid = 0, new_nid;
> > > struct device_node *cpu = of_get_cpu_node(lcpu, NULL);
> > >
> > > if (!cpu) {
> > > @@ -450,8 +465,15 @@ static int __cpuinit numa_setup_cpu(unsi
> > >
> > > nid = of_node_to_nid_single(cpu);
> > >
> > > + if (fake_enabled && nid) {
> > > + new_nid = fake_numa_node_mapping[nid];
> > > + if (new_nid > 0)
> > > + nid = new_nid;
> > > + }
> > > +
> > > if (nid < 0 || !node_online(nid))
> > > nid = any_online_node(NODE_MASK_ALL);
> > > +
> > > out:
> > > map_cpu_to_node(lcpu, nid);
> > >
> > > @@ -1005,8 +1027,11 @@ static int __init early_numa(char *p)
> > > numa_debug = 1;
> > >
> > > p = strstr(p, "fake=");
> > > - if (p)
> > > + if (p) {
> > > cmdline = p + strlen("fake=");
> > > + if (numa_enabled)
> > > + fake_enabled = 1;
> >
> > Have you tried passing just numa=fake= without any commandline?
> > That should enable fake_enabled, but I wonder if that negatively
> > impacts numa_setup_cpu(). I wonder if you should look at cmdline
> > to decide on fake_enabled.
> >
>
> fake_enabled does get set even for numa=fake=. However, it does not
> impact numa_setup_cpu, since fake_numa_node_mapping array would have no
> mapping stored and there is a condition there already to check for the
> value of the mapping. I confirmed this by booting with the above
> parameter as well.
>
> > > + }
> > >
> > > return 0;
> > > }
> > >
> >
> > Overall, I think this is the right thing to do, we need to move in
> > this direction.
> >
>
> Heres the updated patch:
>
> Signed-off-by: Ankita Garg <ankita@...ibm.com>
>
> Index: linux-2.6.31-rc5/arch/powerpc/mm/numa.c
> ===================================================================
> --- linux-2.6.31-rc5.orig/arch/powerpc/mm/numa.c
> +++ linux-2.6.31-rc5/arch/powerpc/mm/numa.c
> @@ -26,6 +26,13 @@
> #include <asm/smp.h>
>
> static int numa_enabled = 1;
> +static int fake_enabled = 1;
> +
> +/*
> + * The array maps a real numa node to the first fake node that gets
> + * created on it
> + */
> +int fake_numa_node_mapping[MAX_NUMNODES];
>
> static char *cmdline __initdata;
>
> @@ -49,14 +56,29 @@ static int __cpuinit fake_numa_create_ne
> unsigned long long mem;
> char *p = cmdline;
> static unsigned int fake_nid;
> + static unsigned int prev_nid = 0;
> static unsigned long long curr_boundary;
>
> /*
> * Modify node id, iff we started creating NUMA nodes
> * We want to continue from where we left of the last time
> */
> - if (fake_nid)
> + if (fake_nid) {
> + /*
> + * Moved over to the next real numa node, increment fake
> + * node number and store the mapping of the real node to
> + * the fake node
> + */
> + if (prev_nid != *nid) {
> + fake_nid++;
> + fake_numa_node_mapping[*nid] = fake_nid;
> + prev_nid = *nid;
> + *nid = fake_nid;
> + return 0;
> + }
> *nid = fake_nid;
> + }
> +
> /*
> * In case there are no more arguments to parse, the
> * node_id should be the same as the last fake node id
> @@ -440,7 +462,7 @@ static int of_drconf_to_nid_single(struc
> */
> static int __cpuinit numa_setup_cpu(unsigned long lcpu)
> {
> - int nid = 0;
> + int nid = 0, new_nid;
> struct device_node *cpu = of_get_cpu_node(lcpu, NULL);
>
> if (!cpu) {
> @@ -450,8 +472,15 @@ static int __cpuinit numa_setup_cpu(unsi
>
> nid = of_node_to_nid_single(cpu);
>
> + if (fake_enabled && nid) {
> + new_nid = fake_numa_node_mapping[nid];
> + if (new_nid > 0)
> + nid = new_nid;
> + }
> +
> if (nid < 0 || !node_online(nid))
> nid = any_online_node(NODE_MASK_ALL);
> +
> out:
> map_cpu_to_node(lcpu, nid);
>
> @@ -1005,8 +1034,12 @@ static int __init early_numa(char *p)
> numa_debug = 1;
>
> p = strstr(p, "fake=");
> - if (p)
> + if (p) {
> cmdline = p + strlen("fake=");
> + if (numa_enabled) {
> + fake_enabled = 1;
> + }
> + }
>
> return 0;
> }
>
Looks good to me
Reviewed-by: Balbir Singh <balbir@...ux.vnet.ibm.com>
--
Balbir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists