linux-kernel - [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"

Message-Id: <20220607212015.175591-1-tony.luck@intel.com>
Date:   Tue,  7 Jun 2022 14:20:15 -0700
From:   Tony Luck <tony.luck@...el.com>
To:     Borislav Petkov <bp@...en8.de>
Cc:     x86@...nel.org, linux-kernel@...r.kernel.org,
        patches@...ts.linux.dev, Tony Luck <tony.luck@...el.com>
Subject: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"

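A large-scale study of memory errors in data centers showed that it is
best to aggressively take pages with corrected errors offline. This is
the best strategy for using corrected errors as a predictor of future
uncorrected errors.

Change the default action_threshold from COUNT_MASK to 2, so that a page
is offlined when a second corrected error is seen on it.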

Signed-off-by: Tony Luck <tony.luck@...el.com>

---
Here's the link to the study. I thought of putting it into the code
comment or the commit message, but such links often go stale when a
website is reorganised.

https://www.intel.com/content/dam/www/public/us/en/documents/intel-and-samsung-mrt-improving-memory-reliability-at-data-centers.pdf

The paper has two recommendations:
1) Change threshold to "2".
2) Do very smart platform-dependent things.

This commit only addresses the first :-)
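
To illustrate the effect of the new default, here is a minimal userspace
sketch, not the kernel code. It assumes a 4K page size, so the per-page
counter is 10 bits wide and the old default (COUNT_MASK) works out to
1023. All sketch_* names are hypothetical, and decay of old counts is
ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout, modelled on a 4K page: the low 10 bits hold the count. */
#define SKETCH_PAGE_SHIFT  12
#define SKETCH_DECAY_BITS  2
#define SKETCH_COUNT_BITS  (SKETCH_PAGE_SHIFT - SKETCH_DECAY_BITS)
#define SKETCH_COUNT_MASK  ((1ULL << SKETCH_COUNT_BITS) - 1)

/*
 * Hypothetical helper: count one corrected error against a page and
 * report whether the page has now reached the offline threshold.
 * The counter saturates at SKETCH_COUNT_MASK instead of overflowing.
 */
static bool sketch_record_error(uint64_t *count, uint64_t threshold)
{
	if (*count < SKETCH_COUNT_MASK)
		(*count)++;

	return *count >= threshold;
}

/*
 * Hypothetical helper: how many corrected errors a single page must
 * see before it is offlined. Thresholds above the counter maximum are
 * clamped so that the loop always terminates.
 */
static uint64_t sketch_errors_until_offline(uint64_t threshold)
{
	uint64_t count = 0, errors = 0;

	if (threshold > SKETCH_COUNT_MASK)
		threshold = SKETCH_COUNT_MASK;

	do {
		errors++;
	} while (!sketch_record_error(&count, threshold));

	return errors;
}
```

Under these assumptions, sketch_errors_until_offline(2) models the new
default and sketch_errors_until_offline(SKETCH_COUNT_MASK) models the
old one.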
---
 drivers/ras/cec.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 42f2fc0bc8a9..5d614c383ccf 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -125,8 +125,11 @@ static struct ce_array {
 static DEFINE_MUTEX(ce_mutex);
 static u64 dfs_pfn;
 
-/* Amount of errors after which we offline */
-static u64 action_threshold = COUNT_MASK;
+/*
+ * Number of errors after which we offline. Default is to aggressively
+ * offline the page when a second error is seen.
+ */
+static u64 action_threshold = 2;
 
 /* Each element "decays" each decay_interval which is 24hrs by default. */
 #define CEC_DECAY_DEFAULT_INTERVAL	24 * 60 * 60	/* 24 hrs */
-- 
2.35.3