[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20100714173005.41950c92.kamezawa.hiroyu@jp.fujitsu.com>
Date: Wed, 14 Jul 2010 17:30:05 +0900
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Cc: Nathan Fontenot <nfont@...tin.ibm.com>,
linux-kernel@...r.kernel.org, linuxppc-dev@...abs.org,
Dave Hansen <dave@...ux.vnet.ibm.com>
Subject: Re: [PATCH 4/7] Allow sysfs memory directories to be split
On Wed, 14 Jul 2010 12:25:03 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com> wrote:
> On Tue, 13 Jul 2010 22:18:03 -0500
> Nathan Fontenot <nfont@...tin.ibm.com> wrote:
>
> > On 07/13/2010 07:35 PM, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 13 Jul 2010 10:51:58 -0500
> > > Nathan Fontenot <nfont@...tin.ibm.com> wrote:
> > >
> > >>>
> > >>> And for what purpose this interface is ? Does this split memory block into 2 pieces
> > >>> of the same size ?? sounds __very__ strange interface to me.
> > >>
> > >> Yes, this splits the memory_block into two blocks of the same size. This was
> > >> suggested as something we may want to do. From ppc perspective I am not sure we
> > >> would use this.
> > >>
> > >> The split functionality is not required. The main goal of the patch set is to
> > >> reduce the number of memory sysfs directories created. From a ppc perspective
> > >> the split functionality is not really needed.
> > >>
> > >
> > > Okay, this is an offer from me.
> > >
> > > 1. I think you can add an boot option as "don't create memory sysfs".
> > > please do.
> >
> > I posted a patch to do that a week or so ago, it didn't go over very well.
> >
> > >
> > > 2. I'd like to write a configfs module for handling memory hotplug even when
> > > sysfs directroy is not created.
> > > Because configfs support rmdir/mkdir, the user (ppc's daemon?) has to do
> > >
> > > When offlining section X.
> > > # insmod configfs_memory.ko
> > > # mount -t configfs none /configfs
> > > # mkdir /configfs/memoryX
> > > # echo offline > /configfs/memoryX/state
> > > # rmdir /configfs/memoryX
> > >
> > > And making this operation as the default bahavior for all arch's memory hotplug may
> > > be better...
> > >
> > > Dave, how do you think ? Because ppc guys uses "probe" interface already,
> > > this can be handled... no ?
> >
> > ppc would still require the existance of the 'probe' interface.
> >
> > Are you objecting to the 'split' functionality?
> yes.
>
> > If so I do not see any reason from ppc
> > perspective that it is needed. This was something Dave suggested, unless I am missing
> > something.
> >
> > Since ppc needs the 'probe' interface in sysfs, and for ppc having mutliple
> > memory_block_sections reside under a single memory_block makes memory hotplug
> > simpler. On ppc we do emory hotplug operations on an LMB size basis. With my
> > patches this now lets us set each memory_block to span an LMB's worth of
> > memory. Now we could do emory hotplug in a single operation instead of multiple
> > operations to offline/online all of the memory sections in an LMB.
> >
>
> Why per-section memory offlining is provided is for allowing good success-rate of
> memory offlining. Because memory-hotplug has to "migrate or free" all used page
> under a section, possibility of memory unplug depends on usage of memory.
> If a section contains unmovable page(kernel page), we can't offline sectin.
>
> For example, comparing
> 1. offlining 128MB of memory at once
> 2. offlining 8 chunks of 16MB memory
> "2" can get very good possibility and system-busy time can be much reduced.
>
> IIUC, ppc's 1st requirement is "resizing" not "hot-removing some memory device",
> "2" is much welcomed. So, some fine-grained interface to section_size is
> appreciated. So, "multiple operations" is much better than single operation.
>
> As I posted show/hide patch, I'm writing it in configfs. I think it meets IBM's
> requirements.
> _But_, it's IBM's issue not Fujitsu's. So, final decistion will depend on you guys.
>
> Anyway, I don't like a too fancy interface as "split".
>
This is a sample configfs for handling memory hotplug.
I wrote this just for my fun and study. code-duplication was not as
big as expected...most of codes are for configfs management.
you can ignore this. but please avoid changing existing interace in fancy way.
==
[root@...extal kamezawa]# mount -t configfs none /configfs/
[root@...extal kamezawa]# mkdir /configfs/memory/72
[root@...extal kamezawa]# cat /configfs/memory/72/phys_index
00000048
[root@...extal kamezawa]# cat /sys/devices/system/memory/memory72/phys_index
00000048
[root@...extal kamezawa]# echo offline > /configfs/memory/72/state
[root@...extal kamezawa]# cat /configfs/memory/72/state
offline
[root@...extal kamezawa]# cat /sys/devices/system/memory/memory72/state
offline
[root@...extal kamezawa]# echo online > /configfs/memory/72/state
[root@...extal kamezawa]# cat /sys/devices/system/memory/memory72/state
online
No sign.
---
drivers/base/Makefile | 2
drivers/base/memory.c | 87 +++++++++++++++++--
drivers/base/memory_config.c | 192 +++++++++++++++++++++++++++++++++++++++++++
include/linux/memory.h | 10 ++
mm/Kconfig | 1
5 files changed, 280 insertions(+), 12 deletions(-)
Index: mmotm-2.6.35-0701/drivers/base/memory.c
===================================================================
--- mmotm-2.6.35-0701.orig/drivers/base/memory.c
+++ mmotm-2.6.35-0701/drivers/base/memory.c
@@ -23,12 +23,15 @@
#include <linux/mutex.h>
#include <linux/stat.h>
#include <linux/slab.h>
+#include <linux/radix-tree.h>
#include <asm/atomic.h>
#include <asm/uaccess.h>
#define MEMORY_CLASS_NAME "memory"
+
+
static struct sysdev_class memory_sysdev_class = {
.name = MEMORY_CLASS_NAME,
};
@@ -104,17 +107,57 @@ unregister_memory(struct memory_block *m
sysdev_unregister(&memory->sysdev);
}
+/* routine for remember memory's status when configs is not used. */
+
+RADIX_TREE(hidden_mems, GFP_KERNEL);
+DEFINE_MUTEX(hidden_mems_mutex);
+int record_memory_state(unsigned long section_nr, int status)
+{
+ int ret = -ENOMEM;
+ long lstat = status << 8; /* for avoid radix'trees special handling */
+ mutex_lock(&hidden_mems_mutex);
+ radix_tree_delete(&hidden_mems, section_nr);
+ if (radix_tree_preload(GFP_KERNEL))
+ goto out;
+ ret = radix_tree_insert(&hidden_mems, section_nr, (void*)lstat);
+ radix_tree_preload_end();
+out:
+ mutex_unlock(&hidden_mems_mutex);
+ return ret;
+}
+
+int lookup_memory_state(unsigned long section_nr)
+{
+ void *ptr;
+ /* we already have big mutex */
+ ptr= radix_tree_lookup(&hidden_mems, section_nr);
+ /* treate not-recorded mems'state as ONLINE...? */
+ return ((long)ptr) >> 8;
+}
+
+void forget_memory_state(unsigned long section_nr)
+{
+ radix_tree_delete(&hidden_mems, section_nr);
+}
+
+
/*
* use this as the physical section index that this memsection
* uses.
*/
+ssize_t show_memoryblock_phys_index(struct memory_block *mem,
+ char *buf)
+{
+ return sprintf(buf, "%08lx\n", mem->phys_index);
+}
+
static ssize_t show_mem_phys_index(struct sys_device *dev,
struct sysdev_attribute *attr, char *buf)
{
struct memory_block *mem =
container_of(dev, struct memory_block, sysdev);
- return sprintf(buf, "%08lx\n", mem->phys_index);
+ return show_memoryblock_phys_index(mem, buf);
}
/*
@@ -136,11 +179,8 @@ static ssize_t show_mem_removable(struct
/*
* online, offline, going offline, etc.
*/
-static ssize_t show_mem_state(struct sys_device *dev,
- struct sysdev_attribute *attr, char *buf)
+ssize_t show_memoryblock_state(struct memory_block *mem, char *buf)
{
- struct memory_block *mem =
- container_of(dev, struct memory_block, sysdev);
ssize_t len = 0;
/*
@@ -167,6 +207,15 @@ static ssize_t show_mem_state(struct sys
return len;
}
+ssize_t show_mem_state(struct sys_device *dev,
+ struct sysdev_attribute *attr, char *buf)
+{
+ struct memory_block *mem =
+ container_of(dev, struct memory_block, sysdev);
+ mem->state = lookup_memory_state(mem->phys_index);
+ return show_memoryblock_state(mem, buf);
+}
+
int memory_notify(unsigned long val, void *v)
{
return blocking_notifier_call_chain(&memory_chain, val, v);
@@ -218,11 +267,14 @@ memory_block_action(struct memory_block
break;
case MEM_OFFLINE:
mem->state = MEM_GOING_OFFLINE;
+ record_memory_state(mem->phys_index, mem->state);
start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
ret = remove_memory(start_paddr,
PAGES_PER_SECTION << PAGE_SHIFT);
if (ret) {
mem->state = old_state;
+ record_memory_state(mem->phys_index,
+ mem->state);
break;
}
break;
@@ -241,6 +293,8 @@ static int memory_block_change_state(str
int ret = 0;
mutex_lock(&mem->state_mutex);
+ mem->state = lookup_memory_state(mem->phys_index);
+
if (mem->state != from_state_req) {
ret = -EINVAL;
goto out;
@@ -249,21 +303,18 @@ static int memory_block_change_state(str
ret = memory_block_action(mem, to_state);
if (!ret)
mem->state = to_state;
-
+ record_memory_state(mem->phys_index, mem->state);
out:
mutex_unlock(&mem->state_mutex);
return ret;
}
-static ssize_t
-store_mem_state(struct sys_device *dev,
- struct sysdev_attribute *attr, const char *buf, size_t count)
+ssize_t store_memoryblock_state(struct memory_block *mem,
+ const char *buf, size_t count)
{
- struct memory_block *mem;
unsigned int phys_section_nr;
int ret = -EINVAL;
- mem = container_of(dev, struct memory_block, sysdev);
phys_section_nr = mem->phys_index;
if (!present_section_nr(phys_section_nr))
@@ -279,6 +330,16 @@ out:
return count;
}
+static ssize_t
+store_mem_state(struct sys_device *dev,
+ struct sysdev_attribute *attr, const char *buf, size_t count)
+{
+ struct memory_block *mem;
+
+ mem = container_of(dev, struct memory_block, sysdev);
+ return store_memoryblock_state(mem, buf, count);
+}
+
/*
* phys_device is a bad name for this. What I really want
* is a way to differentiate between memory ranges that
@@ -451,6 +512,8 @@ static int add_memory_block(int nid, str
start_pfn = section_nr_to_pfn(mem->phys_index);
mem->phys_device = arch_get_memory_phys_device(start_pfn);
+ record_memory_state(mem->phys_index, state);
+
ret = register_memory(mem, section);
if (!ret)
ret = mem_create_simple_file(mem, phys_index);
@@ -505,6 +568,7 @@ int remove_memory_block(unsigned long no
struct memory_block *mem;
mem = find_memory_block(section);
+ forget_memory_state(mem->phys_index);
unregister_mem_sect_under_nodes(mem);
mem_remove_simple_file(mem, phys_index);
mem_remove_simple_file(mem, state);
@@ -573,3 +637,4 @@ out:
printk(KERN_ERR "%s() failed: %d\n", __func__, ret);
return ret;
}
+
Index: mmotm-2.6.35-0701/drivers/base/memory_config.c
===================================================================
--- /dev/null
+++ mmotm-2.6.35-0701/drivers/base/memory_config.c
@@ -0,0 +1,192 @@
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/memory.h>
+#include <linux/slab.h>
+#include <linux/radix-tree.h>
+#include <linux/configfs.h>
+
+
+
+struct memory_hp_block {
+ struct config_group group;
+ struct memory_block mb;
+};
+struct memory_hp_attribute {
+ struct configfs_attribute attr;
+ ssize_t (*show)(struct memory_hp_block *, char *);
+ ssize_t (*store)(struct memory_hp_block *, const char *);
+};
+
+static struct memory_hp_block *to_mhp_block(struct config_item *item)
+{
+ if (!item)
+ return NULL;
+ return container_of(to_config_group(item),
+ struct memory_hp_block, group);
+}
+
+static ssize_t
+memory_hp_phys_index_read(struct memory_hp_block *mhb, char *page)
+{
+ return show_memoryblock_phys_index(&mhb->mb, page);
+}
+
+static struct memory_hp_attribute memory_hp_phys_index = {
+ .attr = { .ca_owner = THIS_MODULE,
+ .ca_name = "phys_index",
+ .ca_mode = S_IRUGO },
+ .show = memory_hp_phys_index_read,
+};
+
+static ssize_t
+memory_hp_state_read(struct memory_hp_block *mhb, char *page)
+{
+ /* synchronize */
+ printk("lookup section %ld\n", mhb->mb.phys_index);
+ mhb->mb.state = lookup_memory_state(mhb->mb.phys_index);
+ return show_memoryblock_state(&mhb->mb, page);
+}
+
+static ssize_t
+memory_hp_state_store(struct memory_hp_block *mhb, const char *page)
+{
+ int len = strnlen(page, 8);
+ printk("length %d str %s\n", len, page);
+ if (len > 8) /* online/offline */
+ return -EINVAL;
+ /* synchronize */
+ mhb->mb.state = lookup_memory_state(mhb->mb.phys_index);
+ return store_memoryblock_state(&mhb->mb, page, len);
+}
+
+static struct memory_hp_attribute memory_hp_state = {
+ .attr = { .ca_owner = THIS_MODULE,
+ .ca_name = "state",
+ .ca_mode = S_IRUGO|S_IWUSR },
+ .show = memory_hp_state_read,
+ .store = memory_hp_state_store,
+};
+
+static struct configfs_attribute *memory_hp_attrs[] = {
+ &memory_hp_phys_index.attr,
+ &memory_hp_state.attr,
+ NULL,
+};
+
+static ssize_t memory_hp_attr_show(struct config_item *item,
+ struct configfs_attribute *attr,
+ char *page)
+{
+ struct memory_hp_block *mhb = to_mhp_block(item);
+ struct memory_hp_attribute *memhp_attr =
+ container_of(attr, struct memory_hp_attribute, attr);
+ ssize_t ret = 0;
+
+ if (memhp_attr->show)
+ ret = memhp_attr->show(mhb, page);
+ return ret;
+}
+
+static ssize_t memory_hp_attr_store(struct config_item *item,
+ struct configfs_attribute *attr,
+ const char *page, size_t count)
+{
+ struct memory_hp_block *mhb = to_mhp_block(item);
+ struct memory_hp_attribute *memhp_attr =
+ container_of(attr, struct memory_hp_attribute, attr);
+ ssize_t ret = 0;
+
+ if (memhp_attr->store)
+ ret = memhp_attr->store(mhb, page);
+ else
+ ret = -EINVAL;
+ return ret;
+}
+
+static struct configfs_item_operations memory_hp_item_ops = {
+ .show_attribute = memory_hp_attr_show,
+ .store_attribute = memory_hp_attr_store,
+};
+
+static struct config_item_type memory_hp_type = {
+ .ct_item_ops = &memory_hp_item_ops,
+ .ct_attrs = memory_hp_attrs,
+ .ct_owner = THIS_MODULE,
+};
+
+static struct config_group *
+memory_hp_make_group(struct config_group *group, const char *name)
+{
+ struct memory_hp_block *mhb;
+ unsigned long long section_id;
+
+
+ if (strict_strtoull(name, 10, §ion_id))
+ return ERR_PTR(-EINVAL);
+
+ if (!valid_section_nr(section_id))
+ return ERR_PTR(-EINVAL);
+
+ mhb = kzalloc(sizeof(*mhb), GFP_KERNEL);
+ if (!mhb)
+ return NULL;
+
+ config_group_init_type_name(&mhb->group, name, &memory_hp_type);
+
+ mhb->mb.phys_index = section_id;
+ mutex_init(&mhb->mb.state_mutex);
+ mhb->mb.state = lookup_memory_state(section_id);
+
+ return &mhb->group;
+}
+
+static void memory_hp_drop_item(struct config_group *group,
+ struct config_item *item)
+{
+ struct memory_hp_block *mhb;
+
+ mhb = container_of(group, struct memory_hp_block, group);
+ config_item_put(item);
+}
+
+static struct configfs_group_operations memory_hp_group_ops = {
+ .make_group = memory_hp_make_group,
+ .drop_item = memory_hp_drop_item,
+};
+
+static struct config_item_type memory_hp_subsys_type = {
+ .ct_group_ops = &memory_hp_group_ops,
+ .ct_owner = THIS_MODULE,
+};
+
+static struct configfs_subsystem memory_hp_subsys = {
+ .su_group = {
+ .cg_item = {
+ .ci_namebuf = "memory",
+ .ci_type = &memory_hp_subsys_type,
+ },
+ },
+};
+
+static int __init memory_config_init(void)
+{
+ int ret;
+
+ config_group_init(&memory_hp_subsys.su_group);
+ mutex_init(&memory_hp_subsys.su_mutex);
+ ret = configfs_register_subsystem(&memory_hp_subsys);
+ if (ret) {
+ printk(KERN_ERR "Error %d while registering memory configfs\n", ret);
+ return ret;
+ }
+ return 0;
+}
+
+#if 0
+static void __exit memory_config_exit(void)
+{
+ configfs_unregister_subsystem(&memory_hp_subsys);
+}
+#endif
+late_initcall(memory_config_init);
Index: mmotm-2.6.35-0701/drivers/base/Makefile
===================================================================
--- mmotm-2.6.35-0701.orig/drivers/base/Makefile
+++ mmotm-2.6.35-0701/drivers/base/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT)
obj-$(CONFIG_ISA) += isa.o
obj-$(CONFIG_FW_LOADER) += firmware_class.o
obj-$(CONFIG_NUMA) += node.o
-obj-$(CONFIG_MEMORY_HOTPLUG_SPARSE) += memory.o
+obj-$(CONFIG_MEMORY_HOTPLUG_SPARSE) += memory.o memory_config.o
obj-$(CONFIG_SMP) += topology.o
obj-$(CONFIG_IOMMU_API) += iommu.o
ifeq ($(CONFIG_SYSFS),y)
Index: mmotm-2.6.35-0701/mm/Kconfig
===================================================================
--- mmotm-2.6.35-0701.orig/mm/Kconfig
+++ mmotm-2.6.35-0701/mm/Kconfig
@@ -138,6 +138,7 @@ config MEMORY_HOTPLUG
config MEMORY_HOTPLUG_SPARSE
def_bool y
depends on SPARSEMEM && MEMORY_HOTPLUG
+ select CONFIGFS_FS
config MEMORY_HOTREMOVE
bool "Allow for memory hot remove"
Index: mmotm-2.6.35-0701/include/linux/memory.h
===================================================================
--- mmotm-2.6.35-0701.orig/include/linux/memory.h
+++ mmotm-2.6.35-0701/include/linux/memory.h
@@ -68,6 +68,16 @@ struct memory_isolate_notify {
struct notifier_block;
struct mem_section;
+
+ssize_t show_memoryblock_phys_index(struct memory_block *mb, char *buf);
+ssize_t show_memoryblock_state(struct memory_block *mb, char *buf);
+ssize_t store_memoryblock_state(struct memory_block *mb,
+ const char *buf, size_t count);
+
+int record_memory_state(unsigned long section, int state);
+int lookup_memory_state(unsigned long section);
+void forget_memory_state(unsigned long section);
+
/*
* Priorities for the hotplug memory callback routines (stored in decreasing
* order in the callback chain)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists