lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a18525a5-ea5f-5e7a-8765-a6c0e38ddd21@redhat.com>
Date:   Tue, 21 Aug 2018 15:17:10 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Oscar Salvador <osalvador@...hadventures.net>,
        akpm@...ux-foundation.org
Cc:     mhocko@...e.com, dan.j.williams@...el.com, jglisse@...hat.com,
        jonathan.cameron@...wei.com, Pavel.Tatashin@...rosoft.com,
        yasu.isimatu@...il.com, logang@...tatee.com, dave.jiang@...el.com,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Oscar Salvador <osalvador@...e.de>
Subject: Re: [RFC v2 2/2] mm/memory_hotplug: Shrink spanned pages when
 offlining memory

> add_device_memory is in charge of

I wouldn't use the terminology of onlining/offlining here. That applies
rather to memory that is exposed to the rest of the system (e.g. buddy
allocator, has underlying memory block devices). I guess it is rather a
pure setup/teardown of that device memory.

> 
> a) calling either arch_add_memory() or add_pages(), depending on whether
>    we want a linear mapping
> b) online the memory sections that correspond to the pfn range
> c) calling move_pfn_range_to_zone() being zone ZONE_DEVICE to
>    expand zone/pgdat spanned pages and initialize its pages
> 
> del_device_memory, on the other hand, is in charge of
> 
> a) offline the memory sections that correspond to the pfn range
> b) calling shrink_pages(), which shrinks node/zone spanned pages.
> c) calling either arch_remove_memory() or __remove_pages(), depending on
>    whether we need to tear down the linear mapping or not
> 
> These two functions are called from:
> 
> add_device_memory:
> 	- devm_memremap_pages()
> 	- hmm_devmem_pages_create()
> 
> del_device_memory:
> 	- devm_memremap_pages_release()
> 	- hmm_devmem_release()
> 
> I think that this will get easier as soon as [1] gets merged.
> 
> Finally, shrink_pages() is moved to offline_pages(), so now,
> all pages/zone handling is being taken care in online/offline_pages stage.
> 
> [1] https://lkml.org/lkml/2018/6/19/110
> 

[...]

> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index f90fa077b0c4..d04338ffabec 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1152,15 +1152,9 @@ int __ref arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *a
>  {
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
> -	struct page *page = pfn_to_page(start_pfn);
> -	struct zone *zone;
>  	int ret;
>  
> -	/* With altmap the first mapped page is offset from @start */
> -	if (altmap)
> -		page += vmem_altmap_offset(altmap);
> -	zone = page_zone(page);
> -	ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
> +	ret = __remove_pages(nid, start_pfn, nr_pages, altmap);

These changes certainly look nice.

[...]

> index 7a832b844f24..95df37686196 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -119,6 +119,8 @@ static void devm_memremap_pages_release(void *data)
>  	struct dev_pagemap *pgmap = data;
>  	struct device *dev = pgmap->dev;
>  	struct resource *res = &pgmap->res;
> +	struct vmem_altmap *altmap = pgmap->altmap_valid ?
> +					&pgmap->altmap : NULL;
>  	resource_size_t align_start, align_size;
>  	unsigned long pfn;
>  	int nid;
> @@ -138,8 +140,7 @@ static void devm_memremap_pages_release(void *data)
>  	nid = dev_to_node(dev);
>  
>  	mem_hotplug_begin();

I would really like to see the mem_hotplug_begin/end also getting moved
inside add_device_memory()/del_device_memory(). (just like for
add/remove_memory)

I wonder if kasan_ stuff actually requires this lock, or if it could
also be somehow moved inside add_device_memory/del_device_memory.

> -	arch_remove_memory(nid, align_start, align_size, pgmap->altmap_valid ?
> -			&pgmap->altmap : NULL);
> +	del_device_memory(nid, align_start, align_size, altmap, true);
>  	kasan_remove_zero_shadow(__va(align_start), align_size);
>  	mem_hotplug_done();
>  
> @@ -175,7 +176,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>  {
>  	resource_size_t align_start, align_size, align_end;
>  	struct vmem_altmap *altmap = pgmap->altmap_valid ?
> -			&pgmap->altmap : NULL;
> +					&pgmap->altmap : NULL;
>  	struct resource *res = &pgmap->res;
>  	unsigned long pfn, pgoff, order;
>  	pgprot_t pgprot = PAGE_KERNEL;
> @@ -249,11 +250,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>  		goto err_kasan;
>  	}
>  
> -	error = arch_add_memory(nid, align_start, align_size, altmap, false);
> -	if (!error)
> -		move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> -					align_start >> PAGE_SHIFT,
> -					align_size >> PAGE_SHIFT, altmap);
> +	error = add_device_memory(nid, align_start, align_size, altmap, true);
> +
>  	mem_hotplug_done();
>  	if (error)
>  		goto err_add_memory;
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 30e1bc68503b..8e68b5ca67ca 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1262,6 +1262,22 @@ int release_mem_region_adjustable(struct resource *parent,
>  			continue;
>  		}
>  
> +		/*
> +		 * All memory regions added from memory-hotplug path
> +		 * have the flag IORESOURCE_SYSTEM_RAM.
> +		 * IORESOURCE_SYSTEM_RAM = (IORESOURCE_MEM|IORESOURCE_SYSRAM)
> +		 * If the resource does not have this flag, we know that
> +		 * we are dealing with a resource coming from HMM/devm.
> +		 * HMM/devm use another mechanism to add/release a resource.
> +		 * This goes via devm_request_mem_region/devm_release_mem_region.
> +		 * HMM/devm take care to release their resources when they want, so
> +		 * if we are dealing with them, let us just back off nicely
> +		 */

Maybe shorten that a bit

"HMM/devm memory does not have IORESOURCE_SYSTEM_RAM set. They use
 devm_request_mem_region/devm_release_mem_region to add/release a
 resource. Just back off here."

> +		if (!(res->flags & IORESOURCE_SYSRAM)) {
> +			ret = 0;
> +			break;
> +		}
> +
>  		if (!(res->flags & IORESOURCE_MEM))
>  			break;
>  
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 21787e480b4a..23ce7fbdb651 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -996,6 +996,7 @@ static void hmm_devmem_release(struct device *dev, void *data)
>  	struct zone *zone;
>  	struct page *page;
>  	int nid;
> +	bool mapping;
>  
>  	if (percpu_ref_tryget_live(&devmem->ref)) {
>  		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
> @@ -1012,12 +1013,14 @@ static void hmm_devmem_release(struct device *dev, void *data)
>  
>  	mem_hotplug_begin();
>  	if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
> -		__remove_pages(zone, start_pfn, npages, NULL);
> +		mapping = false;
>  	else
> -		arch_remove_memory(nid, start_pfn << PAGE_SHIFT,
> -				   npages << PAGE_SHIFT, NULL);
> -	mem_hotplug_done();
> +		mapping = true;
>  
> +	del_device_memory(nid, start_pfn << PAGE_SHIFT, npages << PAGE_SHIFT,
> +									NULL,
> +									mapping);
> +	mem_hotplug_done();
>  	hmm_devmem_radix_release(resource);
>  }
>  
> @@ -1027,6 +1030,7 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>  	struct device *device = devmem->device;
>  	int ret, nid, is_ram;
>  	unsigned long pfn;
> +	bool mapping;
>  
>  	align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
>  	align_size = ALIGN(devmem->resource->start +
> @@ -1085,7 +1089,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>  	if (nid < 0)
>  		nid = numa_mem_id();
>  
> -	mem_hotplug_begin();
>  	/*
>  	 * For device private memory we call add_pages() as we only need to
>  	 * allocate and initialize struct page for the device memory. More-
> @@ -1096,21 +1099,18 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>  	 * For device public memory, which is accesible by the CPU, we do
>  	 * want the linear mapping and thus use arch_add_memory().
>  	 */
> +	mem_hotplug_begin();
>  	if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
> -		ret = arch_add_memory(nid, align_start, align_size, NULL,
> -				false);
> +		mapping = true;
>  	else
> -		ret = add_pages(nid, align_start >> PAGE_SHIFT,
> -				align_size >> PAGE_SHIFT, NULL, false);
> -	if (ret) {
> -		mem_hotplug_done();
> -		goto error_add_memory;
> -	}
> -	move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> -				align_start >> PAGE_SHIFT,
> -				align_size >> PAGE_SHIFT, NULL);
> +		mapping = false;
> +
> +	ret = add_device_memory(nid, align_start, align_size, NULL, mapping);
>  	mem_hotplug_done();
>  
> +	if (ret)
> +		goto error_add_memory;
> +
>  	for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
>  		struct page *page = pfn_to_page(pfn);
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9bd629944c91..60b67f09956e 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -320,12 +320,10 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
>  				     unsigned long start_pfn,
>  				     unsigned long end_pfn)
>  {
> -	struct mem_section *ms;
> -
>  	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
> -		ms = __pfn_to_section(start_pfn);
> +		struct mem_section *ms = __pfn_to_section(start_pfn);
>  
> -		if (unlikely(!valid_section(ms)))
> +		if (unlikely(!online_section(ms)))
>  			continue;
>  
>  		if (unlikely(pfn_to_nid(start_pfn) != nid))
> @@ -345,15 +343,14 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
>  				    unsigned long start_pfn,
>  				    unsigned long end_pfn)
>  {
> -	struct mem_section *ms;
>  	unsigned long pfn;
>  
>  	/* pfn is the end pfn of a memory section. */
>  	pfn = end_pfn - 1;
>  	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
> -		ms = __pfn_to_section(pfn);
> +		struct mem_section *ms = __pfn_to_section(pfn);
>  
> -		if (unlikely(!valid_section(ms)))
> +		if (unlikely(!online_section(ms)))
>  			continue;
>  
>  		if (unlikely(pfn_to_nid(pfn) != nid))
> @@ -415,7 +412,7 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>  	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
>  		ms = __pfn_to_section(pfn);
>  
> -		if (unlikely(!valid_section(ms)))
> +		if (unlikely(!online_section(ms)))
>  			continue;
>  
>  		if (page_zone(pfn_to_page(pfn)) != zone)
> @@ -502,23 +499,33 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
>  	pgdat->node_spanned_pages = 0;
>  }
>  
> -static void __remove_zone(struct zone *zone, unsigned long start_pfn)
> +static void shrink_pages(struct zone *zone, unsigned long start_pfn,
> +						unsigned long end_pfn,
> +						unsigned long offlined_pages)
>  {
>  	struct pglist_data *pgdat = zone->zone_pgdat;
>  	int nr_pages = PAGES_PER_SECTION;
>  	unsigned long flags;
> +	unsigned long pfn;
>  
> -	pgdat_resize_lock(zone->zone_pgdat, &flags);
> -	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
> -	shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
> -	pgdat_resize_unlock(zone->zone_pgdat, &flags);
> +	zone->present_pages -= offlined_pages;
> +
> +	clear_zone_contiguous(zone);
> +	pgdat_resize_lock(pgdat, &flags);
> +
> +	for(pfn = start_pfn; pfn < end_pfn; pfn += nr_pages) {
> +		shrink_zone_span(zone, pfn, pfn + nr_pages);
> +		shrink_pgdat_span(pgdat, pfn, pfn + nr_pages);
> +	}
> +	pgdat->node_present_pages -= offlined_pages;
> +
> +	pgdat_resize_unlock(pgdat, &flags);
> +	set_zone_contiguous(zone);
>  }
>  
> -static int __remove_section(struct zone *zone, struct mem_section *ms,
> +static int __remove_section(int nid, struct mem_section *ms,
>  		unsigned long map_offset, struct vmem_altmap *altmap)
>  {
> -	unsigned long start_pfn;
> -	int scn_nr;
>  	int ret = -EINVAL;
>  
>  	if (!valid_section(ms))
> @@ -528,17 +535,13 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
>  	if (ret)
>  		return ret;
>  
> -	scn_nr = __section_nr(ms);
> -	start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
> -	__remove_zone(zone, start_pfn);
> -
> -	sparse_remove_one_section(zone, ms, map_offset, altmap);
> +	sparse_remove_one_section(nid, ms, map_offset, altmap);
>  	return 0;
>  }
>  
>  /**
>   * __remove_pages() - remove sections of pages from a zone
> - * @zone: zone from which pages need to be removed
> + * @nid: nid from which pages belong to
>   * @phys_start_pfn: starting pageframe (must be aligned to start of a section)
>   * @nr_pages: number of pages to remove (must be multiple of section size)
>   * @altmap: alternative device page map or %NULL if default memmap is used
> @@ -548,35 +551,27 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
>   * sure that pages are marked reserved and zones are adjust properly by
>   * calling offline_pages().
>   */
> -int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
> +int __remove_pages(int nid, unsigned long phys_start_pfn,
>  		 unsigned long nr_pages, struct vmem_altmap *altmap)
>  {
>  	unsigned long i;
>  	unsigned long map_offset = 0;
>  	int sections_to_remove, ret = 0;
> +	resource_size_t start, size;
>  
> -	/* In the ZONE_DEVICE case device driver owns the memory region */
> -	if (is_dev_zone(zone)) {
> -		if (altmap)
> -			map_offset = vmem_altmap_offset(altmap);
> -	} else {
> -		resource_size_t start, size;
> -
> -		start = phys_start_pfn << PAGE_SHIFT;
> -		size = nr_pages * PAGE_SIZE;
> +	start = phys_start_pfn << PAGE_SHIFT;
> +	size = nr_pages * PAGE_SIZE;
>  
> -		ret = release_mem_region_adjustable(&iomem_resource, start,
> -					size);
> -		if (ret) {
> -			resource_size_t endres = start + size - 1;
> +	if (altmap)
> +		map_offset = vmem_altmap_offset(altmap);
>  
> -			pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
> -					&start, &endres, ret);
> -		}
> +	ret = release_mem_region_adjustable(&iomem_resource, start, size);
> +	if (ret) {
> +		resource_size_t endres = start + size - 1;
> +		pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
> +						&start, &endres, ret);
>  	}
>  
> -	clear_zone_contiguous(zone);
> -
>  	/*
>  	 * We can only remove entire sections
>  	 */
> @@ -587,15 +582,13 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
>  	for (i = 0; i < sections_to_remove; i++) {
>  		unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
>  
> -		ret = __remove_section(zone, __pfn_to_section(pfn), map_offset,
> -				altmap);
> +		ret = __remove_section(nid, __pfn_to_section(pfn), map_offset,
> +									altmap);
>  		map_offset = 0;
>  		if (ret)
>  			break;
>  	}
>  
> -	set_zone_contiguous(zone);
> -
>  	return ret;
>  }
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
> @@ -1595,7 +1588,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	unsigned long pfn, nr_pages;
>  	long offlined_pages;
>  	int ret, node;
> -	unsigned long flags;
>  	unsigned long valid_start, valid_end;
>  	struct zone *zone;
>  	struct memory_notify arg;
> @@ -1665,11 +1657,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>  	/* removal success */
>  	adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
> -	zone->present_pages -= offlined_pages;
>  
> -	pgdat_resize_lock(zone->zone_pgdat, &flags);
> -	zone->zone_pgdat->node_present_pages -= offlined_pages;
> -	pgdat_resize_unlock(zone->zone_pgdat, &flags);
> +	/* Here we will shrink zone/node's spanned/present_pages */
> +	shrink_pages(zone, valid_start, valid_end, offlined_pages);
>  
>  	init_per_zone_wmark_min();
>  
> @@ -1902,4 +1892,57 @@ void __ref remove_memory(int nid, u64 start, u64 size)
>  	mem_hotplug_done();
>  }
>  EXPORT_SYMBOL_GPL(remove_memory);
> +
> +static int __del_device_memory(int nid, unsigned long start, unsigned long size,
> +				struct vmem_altmap *altmap, bool mapping)
> +{
> +	int ret;
> +	unsigned long start_pfn = PHYS_PFN(start);
> +	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
> +
> +	offline_mem_sections(start_pfn, start_pfn + nr_pages);
> +	shrink_pages(zone, start_pfn, start_pfn + nr_pages, 0);
> +
> +	if (mapping)
> +		ret = arch_remove_memory(nid, start, size, altmap);
> +	else
> +		ret = __remove_pages(nid, start_pfn, nr_pages, altmap);
> +
> +	return ret;
> +}
> +
> +int del_device_memory(int nid, unsigned long start, unsigned long size,
> +			struct vmem_altmap *altmap, bool mapping)
> +{
> +	return __del_device_memory(nid, start, size, altmap, mapping);
> +}
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
> +
> +static int __add_device_memory(int nid, unsigned long start, unsigned long size,
> +				struct vmem_altmap *altmap, bool mapping)
> +{
> +	int ret;
> +        unsigned long start_pfn = PHYS_PFN(start);
> +        unsigned long nr_pages = size >> PAGE_SHIFT;
> +
> +	if (mapping)
> +		ret = arch_add_memory(nid, start, size, altmap, false);
> +        else
> +		ret = add_pages(nid, start_pfn, nr_pages, altmap, false);
> +
> +	if (!ret) {
> +		struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
> +
> +		online_mem_sections(start_pfn, start_pfn + nr_pages);
> +		move_pfn_range_to_zone(zone, start_pfn, nr_pages, altmap);
> +	}
> +
> +	return ret;
> +}
> +
> +int add_device_memory(int nid, unsigned long start, unsigned long size,
> +				struct vmem_altmap *altmap, bool mapping)
> +{
> +	return __add_device_memory(nid, start, size, altmap, mapping);
> +}

Any reason for these indirections?

> diff --git a/mm/sparse.c b/mm/sparse.c
> index 10b07eea9a6e..016020bd20b5 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -766,12 +766,12 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap,
>  		free_map_bootmem(memmap);
>  }


I guess for readability, this patch could be split up into several
patches. E.g. factoring out of add_device_memory/del_device_memory,
release_mem_region_adjustable change ...

-- 

Thanks,

David / dhildenb

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ