lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20121207141102.4fda582d.akpm@linux-foundation.org>
Date:	Fri, 7 Dec 2012 14:11:02 -0800
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Xishi Qiu <qiuxishi@...wei.com>
Cc:	WuJianguo <wujianguo@...wei.com>, Liujiang <jiang.liu@...wei.com>,
	<Vyacheslav.Dubeyko@...wei.com>, Borislav Petkov <bp@...en8.de>,
	<andi@...stfloor.org>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

On Fri, 7 Dec 2012 16:48:45 +0800
Xishi Qiu <qiuxishi@...wei.com> wrote:

> On x86 platform, if we use "/sys/devices/system/memory/soft_offline_page" to offline a
> free page twice, the value of mce_bad_pages will be added twice. So this is an error,
> since the page was already marked HWPoison, we should skip the page and don't add the
> value of mce_bad_pages.
> 
> $ cat /proc/meminfo | grep HardwareCorrupted
> 
> soft_offline_page()
> 	get_any_page()
> 		atomic_long_add(1, &mce_bad_pages)
> 
> ...
>
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1582,8 +1582,11 @@ int soft_offline_page(struct page *page, int flags)
>  		return ret;
> 
>  done:
> -	atomic_long_add(1, &mce_bad_pages);
> -	SetPageHWPoison(page);
>  	/* keep elevated page count for bad page */
> +	if (!PageHWPoison(page)) {
> +		atomic_long_add(1, &mce_bad_pages);
> +		SetPageHWPoison(page);
> +	}
> +
>  	return ret;
>  }

A few things:

- soft_offline_page() already checks for this case:

	if (PageHWPoison(page)) {
		unlock_page(page);
		put_page(page);
		pr_info("soft offline: %#lx page already poisoned\n", pfn);
		return -EBUSY;
	}

  so why didn't this check work for you?

  Presumably because one of the earlier "goto done" branches was
  taken.  Which one, any why?

  This function is an utter mess.  It contains six return points
  randomly intermingled with three "goto done" return points.

  This mess is probably the cause of the bug you have observed.  Can
  we please fix it up somehow?  It *seems* that the design (lol) of
  this function is "for errors, return immediately.  For success, goto
  done".  In which case "done" should have been called "success".  But
  if you just look at the function you'll see that this approach didn't
  work.  I suggest it be converted to have two return points - one for
  the success path, one for the failure path.  Or something.

- soft_offline_huge_page() is a miniature copy of soft_offline_page()
  and might suffer the same bug.

- A cleaner, shorter and possibly faster implementation is

	if (!TestSetPageHWPoison(page))
		atomic_long_add(1, &mce_bad_pages);

- We have atomic_long_inc().  Use it?

- Why do we have a variable called "mce_bad_pages"?  MCE is an x86
  concept, and this code is in mm/.  Lights are flashing, bells are
  ringing and a loudspeaker is blaring "layering violation" at us!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ