Message-ID: <20120327170655.GB7937@aftab>
Date: Tue, 27 Mar 2012 19:06:55 +0200
From: Borislav Petkov <bp@...64.org>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: Mauro Carvalho Chehab <mchehab@...hat.com>,
Ingo Molnar <mingo@...e.hu>,
EDAC devel <linux-edac@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 3/3] EDAC: Convert AMD EDAC pieces to use RAS printk buffer
On Mon, Mar 12, 2012 at 07:03:59PM +0100, Borislav Petkov wrote:
> On Mon, Mar 12, 2012 at 04:59:37PM +0000, Luck, Tony wrote:
> > > Sounds better, especially the close-on-exit part. Please elaborate on
> > > the races...
> >
> > Errors are happening asynchronously to everything. Race looks like:
> >
> > Daemon exits (or is killed)
> > <<<< race begins here
> > kernel close routine called
> > close routine updates your global variable
> > <<<< race ends here
>
> Well, in that case, we're going to miss logging a single error, or log
> it incomplete.
>
> Unless, we make the global variable atomic and make the daemon zero it
> as the first action it does when it starts going away. If it is killed,
> then we probably need some sanity-checking functionality which checks
> periodically whether the daemon is still alive ...
>
> This probably needs more meditation.

Ok, hm, how about we add a timer which runs for a safe period of, say,
a couple of minutes after the error has been logged into the buffer.
Before it expires, we expect the userspace daemon to come in and consume
the information. We test for that either explicitly, by checking whether
it wrote to some file, or implicitly, by checking whether the buffer got
emptied in the meantime (the exact method is still TBD).

In any case, if we haven't received confirmation from userspace during
the safe period that the item has been consumed, we switch irreversibly
back to the kernel log buffer and reissue the decoded info through
printk.

This way we

* don't introduce a device file with a ->close
* remain race-agnostic: either the timeout has expired and userspace
  hasn't consumed the decoded data, or it worked just fine and we continue
  on with our merry error collection.
If other errors happen while the timer is running, we log them as usual
and restart the timer to give the newest error an equal chance. Error
size shouldn't overflow the buffer because we're reserving 4 pages per
CPU currently and this can easily be enlarged...
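
To make the proposal a bit more concrete, here is a rough userspace sketch
of the state machine in C. It is only an illustration under stated
assumptions, not code from the patch set: every name in it (ras_state,
ras_log_error, ras_consume, ras_tick, RAS_SAFE_PERIOD) is hypothetical, time
is passed in explicitly instead of using a real kernel timer, and the
"consumed" check is reduced to a single call because the exact method is
still TBD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical safe period in seconds ("a couple of minutes"). */
#define RAS_SAFE_PERIOD 120UL

enum ras_sink {
	RAS_SINK_BUFFER,	/* decoded errors go to the RAS buffer */
	RAS_SINK_PRINTK,	/* fallback: decoded errors go through printk */
};

struct ras_state {
	enum ras_sink sink;
	size_t pending;		/* errors logged but not yet consumed */
	bool timer_armed;
	unsigned long deadline;	/* absolute time at which the timer expires */
};

void ras_init(struct ras_state *s)
{
	s->sink = RAS_SINK_BUFFER;
	s->pending = 0;
	s->timer_armed = false;
	s->deadline = 0;
}

/* Log an error at time 'now' and (re)start the safe-period timer. */
void ras_log_error(struct ras_state *s, unsigned long now)
{
	if (s->sink == RAS_SINK_PRINTK)
		return;		/* already fell back; caller would printk directly */

	s->pending++;
	s->timer_armed = true;
	s->deadline = now + RAS_SAFE_PERIOD;	/* newest error gets a full period */
}

/* Userspace confirmed consumption (explicitly or by emptying the buffer). */
void ras_consume(struct ras_state *s)
{
	s->pending = 0;
	s->timer_armed = false;
}

/*
 * Timer check at time 'now'. Returns how many pending errors have to be
 * reissued through printk; a non-zero return means the switch to printk
 * has happened and is not undone.
 */
size_t ras_tick(struct ras_state *s, unsigned long now)
{
	size_t n;

	if (!s->timer_armed || now < s->deadline)
		return 0;

	/* The timer is only armed while at least one error is pending. */
	assert(s->pending > 0);

	n = s->pending;
	s->pending = 0;
	s->timer_armed = false;
	s->sink = RAS_SINK_PRINTK;
	return n;
}
```

In the sketch, an error logged while the timer is running simply pushes the
deadline out, and once ras_tick() reports an expiry the sink stays on printk
for good, which is the "irreversible" part above.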
Hmm, thoughts..?
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551