linux-kernel - Re: [PATCH 092/104] mm: fix aio performance regression for database caused by THP

Message-ID: <5249794C.5050204@profitbricks.com>
Date:	Mon, 30 Sep 2013 15:14:52 +0200
From:	Jack Wang <jinpu.wang@...fitbricks.com>
To:	Greg Kroah-Hartman <gregkh@...uxfoundation.org>
CC:	Luis Henriques <luis.henriques@...onical.com>,
	linux-kernel@...r.kernel.org, stable@...r.kernel.org,
	kernel-team@...ts.ubuntu.com, Khalid Aziz <khalid.aziz@...cle.com>,
	Pravin B Shelar <pshelar@...ira.com>,
	Christoph Lameter <cl@...ux.com>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Mel Gorman <mel@....ul.ie>, Rik van Riel <riel@...hat.com>,
	Minchan Kim <minchan@...nel.org>,
	Andi Kleen <andi@...stfloor.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH 092/104] mm: fix aio performance regression for database
 caused by THP

On 09/30/2013 12:11 PM, Luis Henriques wrote:
> 3.5.7.22 -stable review patch.  If anyone has any objections, please let me know.
> 
> ------------------
> 
> From: Khalid Aziz <khalid.aziz@...cle.com>
> 
> commit 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911 upstream.
> 
> I am working with a tool that simulates Oracle database I/O workloads.
> This tool (orion to be specific -
> <http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>)
> allocates hugetlbfs pages using shmget() with the SHM_HUGETLB flag.  It then
> does aio into these pages from flash disks using various common block
> sizes used by databases.  I am looking at performance with two of the most
> common block sizes - 1M and 64K.  aio performance with these two block
> sizes plunged after Transparent HugePages (THP) was introduced in the kernel.
> Here are performance numbers:
> 
> 		pre-THP		2.6.39		3.11-rc5
> 1M read		8384 MB/s	5629 MB/s	6501 MB/s
> 64K read	7867 MB/s	4576 MB/s	4251 MB/s
> 
> I have narrowed the performance impact down to the overheads introduced by
> THP in __get_page_tail() and put_compound_page() routines.  perf top shows
> more than 40% of cycles being spent in these two routines.  Every time direct I/O
> to hugetlbfs pages starts, kernel calls get_page() to grab a reference to
> the pages and calls put_page() when I/O completes to put the reference
> away.  THP introduced a significant amount of locking overhead to get_page()
> and put_page() when dealing with compound pages because hugepages can be
> split underneath get_page() and put_page().  It added this overhead
> irrespective of whether it is dealing with hugetlbfs pages or transparent
> hugepages.  This resulted in a 20%-45% drop in aio performance when using
> hugetlbfs pages.
> 
> Since hugetlbfs pages can not be split, there is no reason to go through
> all the locking overhead for these pages from what I can see.  I added
> code to __get_page_tail() and put_compound_page() to bypass all the
> locking code when working with hugetlbfs pages.  This improved performance
> significantly.  Performance numbers with this patch:
> 
> 		pre-THP		3.11-rc5	3.11-rc5 + Patch
> 1M read		8384 MB/s	6501 MB/s	8371 MB/s
> 64K read	7867 MB/s	4251 MB/s	6510 MB/s
> 
> Performance with 64K read is still lower than what it was before THP, but
> still a 53% improvement.  It does mean there is more work to be done but I
> will take a 53% improvement for now.
> 
> Please take a look at the following patch and let me know if it looks
> reasonable.
> 
> [akpm@...ux-foundation.org: tweak comments]
> Signed-off-by: Khalid Aziz <khalid.aziz@...cle.com>
> Cc: Pravin B Shelar <pshelar@...ira.com>
> Cc: Christoph Lameter <cl@...ux.com>
> Cc: Andrea Arcangeli <aarcange@...hat.com>
> Cc: Johannes Weiner <hannes@...xchg.org>
> Cc: Mel Gorman <mel@....ul.ie>
> Cc: Rik van Riel <riel@...hat.com>
> Cc: Minchan Kim <minchan@...nel.org>
> Cc: Andi Kleen <andi@...stfloor.org>
> Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>
> [ luis: backported to 3.5: adjusted context ]
> Signed-off-by: Luis Henriques <luis.henriques@...onical.com>
Hi Greg,

I suppose this patch is also needed for 3.4, right?

Regards,
Jack


> ---
>  mm/swap.c | 77 ++++++++++++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 52 insertions(+), 25 deletions(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 4e7e2ec..0c833e8 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -30,6 +30,7 @@
>  #include <linux/backing-dev.h>
>  #include <linux/memcontrol.h>
>  #include <linux/gfp.h>
> +#include <linux/hugetlb.h>
>  
>  #include "internal.h"
>  
> @@ -77,6 +78,19 @@ static void __put_compound_page(struct page *page)
>  
>  static void put_compound_page(struct page *page)
>  {
> +	/*
> +	 * hugetlbfs pages cannot be split from under us.  If this is a
> +	 * hugetlbfs page, check refcount on head page and release the page if
> +	 * the refcount becomes zero.
> +	 */
> +	if (PageHuge(page)) {
> +		page = compound_head(page);
> +		if (put_page_testzero(page))
> +			__put_compound_page(page);
> +
> +		return;
> +	}
> +
>  	if (unlikely(PageTail(page))) {
>  		/* __split_huge_page_refcount can run under us */
>  		struct page *page_head = compound_trans_head(page);
> @@ -180,38 +194,51 @@ bool __get_page_tail(struct page *page)
>  	 * proper PT lock that already serializes against
>  	 * split_huge_page().
>  	 */
> -	unsigned long flags;
>  	bool got = false;
> -	struct page *page_head = compound_trans_head(page);
> +	struct page *page_head;
>  
> -	if (likely(page != page_head && get_page_unless_zero(page_head))) {
> +	/*
> +	 * If this is a hugetlbfs page it cannot be split under us.  Simply
> +	 * increment refcount for the head page.
> +	 */
> +	if (PageHuge(page)) {
> +		page_head = compound_head(page);
> +		atomic_inc(&page_head->_count);
> +		got = true;
> +	} else {
> +		unsigned long flags;
> +
> +		page_head = compound_trans_head(page);
> +		if (likely(page != page_head &&
> +					get_page_unless_zero(page_head))) {
> +
> +			/* Ref to put_compound_page() comment. */
> +			if (PageSlab(page_head)) {
> +				if (likely(PageTail(page))) {
> +					__get_page_tail_foll(page, false);
> +					return true;
> +				} else {
> +					put_page(page_head);
> +					return false;
> +				}
> +			}
>  
> -		/* Ref to put_compound_page() comment. */
> -		if (PageSlab(page_head)) {
> +			/*
> +			 * page_head wasn't a dangling pointer but it
> +			 * may not be a head page anymore by the time
> +			 * we obtain the lock. That is ok as long as it
> +			 * can't be freed from under us.
> +			 */
> +			flags = compound_lock_irqsave(page_head);
> +			/* here __split_huge_page_refcount won't run anymore */
>  			if (likely(PageTail(page))) {
>  				__get_page_tail_foll(page, false);
> -				return true;
> -			} else {
> -				put_page(page_head);
> -				return false;
> +				got = true;
>  			}
> +			compound_unlock_irqrestore(page_head, flags);
> +			if (unlikely(!got))
> +				put_page(page_head);
>  		}
> -
> -		/*
> -		 * page_head wasn't a dangling pointer but it
> -		 * may not be a head page anymore by the time
> -		 * we obtain the lock. That is ok as long as it
> -		 * can't be freed from under us.
> -		 */
> -		flags = compound_lock_irqsave(page_head);
> -		/* here __split_huge_page_refcount won't run anymore */
> -		if (likely(PageTail(page))) {
> -			__get_page_tail_foll(page, false);
> -			got = true;
> -		}
> -		compound_unlock_irqrestore(page_head, flags);
> -		if (unlikely(!got))
> -			put_page(page_head);
>  	}
>  	return got;
>  }
> 
