linux-kernel - Re: [PATCH] x86: Prevent oops with >16 memory controllers


Message-ID: <20150216113959.GA4458@pd.tnic>
Date:	Mon, 16 Feb 2015 12:40:00 +0100
From:	Borislav Petkov <bp@...en8.de>
To:	Daniel J Blueman <daniel@...ascale.com>
Cc:	Doug Thompson <dougthompson@...ssion.com>,
	Mauro Carvalho Chehab <mchehab@....samsung.com>,
	linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org,
	Steffen Persvold <sp@...ascale.com>
Subject: Re: [PATCH] x86: Prevent oops with >16 memory controllers

On Sat, Feb 14, 2015 at 11:18:40AM +0800, Daniel J Blueman wrote:
> When ECC interrupts occur on memory controllers after EDAC_MAX_MCS (16), the

I knew this artificial limit would come back to bite us someday :-\

> kernel fatally dereferences unallocated structures [1]; this occurs on at
> least NumaConnect systems.
> 
> Minimally fix by checking if a memory controller info structure is allocated;
> candidate for stable.
> 
> Signed-off-by: Daniel J Blueman <daniel@...ascale.com>
> 
> -- [1]
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
> IP: [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
> PGD 2f8b5a3067 PUD 2f8b5a2067 PMD 0
> Oops: 0000 [#2] SMP
> Modules linked in:
> CPU: 224 PID: 11930 Comm: stream_c.exe.gn Tainted: G   D    3.19.0 #1

CPU 224?! What node is that? :)

> ---
>  drivers/edac/amd64_edac.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 17638d7..baccc0e 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -2175,7 +2175,7 @@ static void __log_bus_error(struct mem_ctl_info *mci, struct err_info *err,
>  static inline void decode_bus_error(int node_id, struct mce *m)
>  {
>  	struct mem_ctl_info *mci = mcis[node_id];
> -	struct amd64_pvt *pvt = mci->pvt_info;
> +	struct amd64_pvt *pvt;
>  	u8 ecc_type = (m->status >> 45) & 0x3;
>  	u8 xec = XEC(m->status, 0x1f);
>  	u16 ec = EC(m->status);
> @@ -2190,6 +2190,11 @@ static inline void decode_bus_error(int node_id, struct mce *m)
>  	if (xec && xec != F10_NBSL_EXT_ERR_ECC)
>  		return;
>  
> +	/* Unable to decode on memory controllers after EDAC_MAX_MCS, as no mci is allocated */
> +	if (!mci)
> +		return;
> +	pvt = mci->pvt_info;

Hmm, so we have all the facilities to fix that properly, if I'm not mistaken:
edac_mc_find(), add_mc_to_global_list() and so on.

Would looking through the list of the memory controllers help instead,
i.e. if you do:

static inline void decode_bus_error(int node_id, struct mce *m)
{
	struct mem_ctl_info *mci = edac_mc_find(node_id);
	if (!mci)
		return;

?

Then we can get rid of that local mcis dumbness and do it properly...
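
To illustrate the idea outside the kernel, here is a minimal user-space sketch of looking a controller up in a list by index instead of indexing a fixed-size array. Everything in it (struct mc_info, mc_add(), mc_find(), decode_error()) is a hypothetical stand-in invented for illustration; it is not the EDAC core's API and only mimics the shape of an edac_mc_find()-style lookup followed by the NULL check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory controller descriptor (not struct mem_ctl_info). */
struct mc_info {
	int idx;                /* node / memory controller index */
	int errors_logged;      /* counter bumped by decode_error() */
	struct mc_info *next;   /* singly linked global list */
};

/* Global list head; grows with the number of controllers, so no fixed maximum. */
static struct mc_info *mc_list;

/* Register a controller by pushing it onto the front of the list. */
static void mc_add(struct mc_info *mci)
{
	assert(mci != NULL);
	mci->next = mc_list;
	mc_list = mci;
}

/* Walk the list and return the controller with the given index, or NULL. */
static struct mc_info *mc_find(int idx)
{
	struct mc_info *mci;

	for (mci = mc_list; mci; mci = mci->next)
		if (mci->idx == idx)
			return mci;

	return NULL;
}

/*
 * Handle an error reported for node_id.
 * Returns 0 if a registered controller was found, -1 otherwise.
 * The NULL check comes before any field of the controller is touched.
 */
static int decode_error(int node_id)
{
	struct mc_info *mci = mc_find(node_id);

	if (!mci)
		return -1;

	mci->errors_logged++;
	return 0;
}
```

With controllers registered under indices 3 and 40, decode_error(40) finds its entry even though 40 is beyond a 16-slot array, while decode_error(17) returns -1 instead of dereferencing a NULL pointer. The lookup is linear in the number of controllers, which is the trade-off against direct array indexing.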

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.