[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150917204343.GA14658@amd>
Date: Thu, 17 Sep 2015 22:43:43 +0200
From: Pavel Machek <pavel@....cz>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: "rjw@...ysocki.net" <rjw@...ysocki.net>,
"Brown, Len" <len.brown@...el.com>,
"linux-pm@...r.kernel.org" <linux-pm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"Zhang, Rui" <rui.zhang@...el.com>,
"jlee@...e.com" <jlee@...e.com>,
"joeyli.kernel@...il.com" <joeyli.kernel@...il.com>,
"yinghai@...nel.org" <yinghai@...nel.org>,
Ingo Molnar <mingo@...nel.org>
Subject: Re: [PATCH] [v4] PM / hibernate: Fix hibernation panic caused by
inconsistent e820 map
Hi!
> > > This is also because BIOS provides different e820 memory map
> > > before/after hibernation, and linux regards it as invalid process
> > > and refuses to resume, in order to protect against data corruption.
> > > However, this check might be too strict, consider the following scenario:
> >
> > Well... yes, the check is strict, but why is BIOS doing that? Can you
> > fix it instead?
> Humm, I sync with BIOS team, then I got the answer that,
> the e820 map is allocated dynamically each time it boots up,
> so it is poissible BIOS shows different map each time it boots up.
> Currently our BIOS team is working on it, but some problematic BIOS
> have already be released, so I think Linux should deal with this situation.
Well.. you can't really deal with the situation. That's what confuses
me. If original kernel uses memory that is "not present" now, there's
nothing you can do... but panic / fail resume.
> > > With v3 patch applied, I did 30 cycles on my problematic platform,
> > > no panic triggered anymore(50% reproducible before patched, by
> > > plugging/unplugging memory peripheral during hibernation), and it
> > > just warns of invalid pages.
> >
> > "Just warns of invalid pages". Do you want to say that you "just cause data
> > corruption"?
> >
> > If you don't have enough memory, YOU DON'T RESTORE. Disks were synced,
> > so not restoring is safe. Running with memory corruption is NOT.
> >
> Sorry, I do not quite understand this scenario, do you mean:
> "Without this patch , the checking of memory consistency is at a early stage,
> just before the actual pages restoring,so it's a safe time for system to determin
> restore or terminate.
> And with this patch applied, the checking will be put off to a later stage, which
> is not safe when memory is low?"
>
> I think in this patch, the memory size checking has been moved
> a little later than Its original place, the checking is still before the
> actual restoring image data pages:
> It happens once the last meta_page has been readed:
> prepare_image->mark_unsafe_pages (before the actual restoing of
> data pages)
Aha, so you should still see some failures in the testing.
> > > + if (!swsusp_page_is_valid(pfn_to_page(pfn))) {
> > > + pr_err(
> > > + "PM: Hibernation failed, address %#010llx to restored not
> > valid!\n",
> > > + (unsigned long long) pfn << PAGE_SHIFT);
> >
> > ...and still bad english.
> >
> Oh, will fix it:
> PM: Hibernation failed, address %#010llx to be restored is not valid!
>
> Hope to hear from you, thanks!
Yes, that's better.
But I still don't like the patch.
0) BIOS is broken, and this does not completely work around it. Users
will still see the failed hibernation when the memory that is now
unavailable was actually used.
1) It allocates bm3 even on systems that don't need the workaround
(arm, ia32)
2) If you use hibernation on 32-bit kernel on affected system, you'll
still get panic.
3) I'm not sure I understand the changelog correctly. What happens
when BIOS reports less memory on hibernation? Will you magically
remove memory from kernel at runtime? Will /proc/meminfo be invalid
after resume? Will all the memory management tuning need fixing?
Changelog is really confusing. "failor" is not a english word.
After this patch applied, the panic will be replaced with the warning:
...
according to your explanation, panic will be replaced with the resume
failure, not mere warning.
I believe we have case of "this BIOS problem can not be reasonably
worked around" here.
Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists