linux-kernel - Re: [PATCH v2] PM: hibernate: Fix a bug in copying the zero bitmap to safe pages

Message-ID: <ZSBCLE1VLxN-hdRD@FVFF77S0Q05N.cambridge.arm.com>
Date:   Fri, 6 Oct 2023 18:21:48 +0100
From:   Mark Rutland <mark.rutland@....com>
To:     "Rafael J. Wysocki" <rafael@...nel.org>
Cc:     Brian Geffon <bgeffon@...gle.com>,
        Pavankumar Kondeti <quic_pkondeti@...cinc.com>,
        Pavel Machek <pavel@....cz>, Len Brown <len.brown@...el.com>,
        kernel@...cinc.com,
        "Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
        linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] PM: hibernate: Fix a bug in copying the zero bitmap
 to safe pages

Hi Rafael,

On Wed, Oct 04, 2023 at 08:46:56PM +0200, Rafael J. Wysocki wrote:
> On Wed, Oct 4, 2023 at 2:19 PM Brian Geffon <bgeffon@...gle.com> wrote:
> >
> > On Wed, Oct 4, 2023 at 1:01 AM Pavankumar Kondeti
> > <quic_pkondeti@...cinc.com> wrote:
> > >
> > > The following crash is observed 100% of the time during resume from
> > > hibernation on an x86 QEMU system.
> > >
> > > [   12.931887]  ? __die_body+0x1a/0x60
> > > [   12.932324]  ? page_fault_oops+0x156/0x420
> > > [   12.932824]  ? search_exception_tables+0x37/0x50
> > > [   12.933389]  ? fixup_exception+0x21/0x300
> > > [   12.933889]  ? exc_page_fault+0x69/0x150
> > > [   12.934371]  ? asm_exc_page_fault+0x26/0x30
> > > [   12.934869]  ? get_buffer.constprop.0+0xac/0x100
> > > [   12.935428]  snapshot_write_next+0x7c/0x9f0
> > > [   12.935929]  ? submit_bio_noacct_nocheck+0x2c2/0x370
> > > [   12.936530]  ? submit_bio_noacct+0x44/0x2c0
> > > [   12.937035]  ? hib_submit_io+0xa5/0x110
> > > [   12.937501]  load_image+0x83/0x1a0
> > > [   12.937919]  swsusp_read+0x17f/0x1d0
> > > [   12.938355]  ? create_basic_memory_bitmaps+0x1b7/0x240
> > > [   12.938967]  load_image_and_restore+0x45/0xc0
> > > [   12.939494]  software_resume+0x13c/0x180
> > > [   12.939994]  resume_store+0xa3/0x1d0
> > >
> > > The commit being fixed introduced a bug in copying the zero bitmap
> > > to safe pages. A temporary bitmap is allocated with the PG_ANY flag in
> > > prepare_image() to make a copy of the zero bitmap after the unsafe pages
> > > are marked. Freeing this temporary bitmap with PG_UNSAFE_KEEP later
> > > results in an inconsistent state of unsafe pages. Since the free bit is
> > > left as is for this temporary bitmap after it is freed, these pages are
> > > treated as unsafe pages when they are allocated again. This results
> > > in an incorrect calculation of the number of pages pre-allocated for the
> > > image.
> > >
> > > nr_pages = (nr_zero_pages + nr_copy_pages) - nr_highmem - allocated_unsafe_pages;
> > >
> > > The allocated_unsafe_pages count is estimated to be higher than the actual
> > > value, which results in running short of pages in safe_pages_list. Hence the
> > > crash is observed in get_buffer() due to a NULL pointer access of
> > > safe_pages_list.
> > >
> > > Fix this issue by creating the temporary zero bitmap from safe pages
> > > (free bit not set) so that the corresponding free bits can be cleared while
> > > freeing this bitmap.
> > >
> > > Cc: stable <stable@...nel.org>
> > > Fixes: 005e8dddd497 ("PM: hibernate: don't store zero pages in the image file")
> > > Suggested-by: Brian Geffon <bgeffon@...gle.com>
> > > Signed-off-by: Pavankumar Kondeti <quic_pkondeti@...cinc.com>
> >
> > Reviewed-by: Brian Geffon <bgeffon@...gle.com>
> 
> Applied as 6.7 material, but without the Cc: stable tag that is (a)
> invalid (there should be vger.kernel.org in the host part) and (b)
> unnecessary AFAICS.

Just to check, did you mean as v6.6 material?

I'm consistently hitting this on real arm64 hardware with v6.6-rc*.

If this is v6.7 material, are we going to revert 005e8dddd497 for now?

I've tested the above patch atop v6.6-rc3, and it solves the problem for me, so
FWIW:

Tested-by: Mark Rutland <mark.rutland@....com>

Thanks,
Mark.
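
For readers following the quoted patch description, the accounting problem
can be illustrated with a toy model. This is only a sketch and not kernel
code: the ToyPagePool class, its methods, and simulate() are hypothetical
names, and the last-in-first-out allocator is an assumption made so that the
freed temporary page is the next one handed out. It models just the behaviour
described above: an allocated page gets its free bit set, a "keep" style free
leaves that bit set, and a later safe allocation that meets a page with the
bit set counts it as an allocated unsafe page.

```python
class ToyPagePool:
    """Toy stand-in for the page allocator plus the free-bit bookkeeping."""

    def __init__(self, n_pages, unsafe_pages):
        self.pool = list(range(n_pages))    # pages available, handed out LIFO
        self.free_bit = set(unsafe_pages)   # bit set == page treated as unsafe
        self.allocated_unsafe_pages = 0

    def get_page(self, safe_needed):
        page = self.pool.pop()
        if safe_needed:
            # Skip (and count) every page whose free bit is already set.
            while page in self.free_bit:
                self.allocated_unsafe_pages += 1
                page = self.pool.pop()
        self.free_bit.add(page)             # mark the page as taken
        return page

    def free_page(self, page, clear_bit):
        if clear_bit:                       # models a "clear" style free
            self.free_bit.discard(page)
        self.pool.append(page)              # a "keep" style free leaves the bit


def simulate(fixed):
    """Return the unsafe-page count after the temporary bitmap round trip."""
    pool = ToyPagePool(n_pages=8, unsafe_pages={0, 1})
    # Temporary bitmap page: any page before the fix, a safe page after it.
    tmp = pool.get_page(safe_needed=fixed)
    # Before the fix the bit is kept on free; after the fix it is cleared.
    pool.free_page(tmp, clear_bit=fixed)
    # A later safe allocation, as when pre-allocating pages for the image.
    pool.get_page(safe_needed=True)
    return pool.allocated_unsafe_pages


if __name__ == "__main__":
    print("before fix:", simulate(fixed=False))
    print("after fix: ", simulate(fixed=True))
```

In this model the count is 1 before the fix even though no genuinely unsafe
page was encountered, and 0 after it. An inflated count of this kind is what
makes nr_pages in the quoted formula come out too small.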