lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.0703132009570.11129@blonde.wat.veritas.com>
Date:	Tue, 13 Mar 2007 20:49:04 +0000 (GMT)
From:	Hugh Dickins <hugh@...itas.com>
To:	Martin Drab <drab@...ler.fjfi.cvut.cz>
cc:	Carsten Otte <cotte.de@...il.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Question about memory mapping mechanism

On Fri, 9 Mar 2007, Martin Drab wrote:
> On Fri, 9 Mar 2007, Martin Drab wrote:
> > On Thu, 8 Mar 2007, Martin Drab wrote:
> > > On Thu, 8 Mar 2007, Carsten Otte wrote:
> > > > On 3/8/07, Martin Drab <drab@...ler.fjfi.cvut.cz> wrote:
> > > > > 
> > > > > The thing is that I'd like to prevent kernel to swap these pages out,
> > > > > because then I may loose some data when they are not available in time
> > > > > for the next round.
> > > > 
> > > > One think you could do is grab a reference to the pages upfront.
> > > 
> > > I'm not really sure what exactly do you mean by "grab a reference 
> > > upfront"?
> > 
> > I seem to get it now. So instead of setting PG_reserved upon allocation of 
> > the buffer pages, I should increase the referrence of the pages by calling 
> > get_page() on them and that would prevent the pages to get on the lru list 
> > and thus preventing them to be swapped out. Is that it?
> 
> Well, so I tried it. Truth is that the "Bad page" messages upon munmap(2) 
> are gone. But whether the pages are really prevented from being swapped 
> out? I don't know. I don't know how to find out.

Hi Martin, sorry for joining so late.

Most of your anxieties are unfounded: the pages your driver allocates
with __get_free_pages are not put on any LRU (until they're freed),
kernel pages are never swapped out, and making them visible to user-
space does not put them in any more danger of being swapped out;
but yes, if the refcounting goes wrong, that will cause premature
freeing, "Bad page" messages, trouble.

(Of course, filesystem pagecache pages are liable to be swapped
out, and anonymous userspace pages: but you'd have to take special
steps to go down those paths, which I'm confident you're not taking:
this code used to work, you say.)

Please disregard the suggestion to look at Infiniband, it will only
confuse you further: Infiniband core/uverbs_mem.c does provide a very
good example of how to handle the much more complex opposite of what
you're trying to do.  You have a driver making its pages visible to
userspace, Infiniband shows how a driver should deal with userspace
buffers made available to it (which does involve worrying about
those issues which were concerning you).

Your problem is that PageReserved used to work for you, and now
(post-2.6.14) it doesn't: we do now rely entirely on the refcount,
and will report "Bad page state" if a pagecount goes down to zero
while still marked as PageReserved, which is what you saw.

My guess is that you're using __get_free_pages with non-0 order?
But mapping individual PAGE_SIZE pages from that into userspace
by a nopage method?  That is liable to be a problem, yes, because
the refcount for the whole is kept in the first struct page, the
later struct pages showing refcount 0: the whole is supposed to
be dealt with all together, but userspace page accounting on exit
will treat each PAGE_SIZE separately.  PageReserved used to
override the refcount, but it no longer does so.

There's a number of different solutions to that, and fiddling with
the reference counts of the constituent pages is certainly one of
them; though better is to use split_page() (see mm/page_alloc.c),
then free the constituent pages separately at the end; (and better
is to avoid >0-order allocations since they're harder to guarantee,
but presumably you don't want to reorganize your driver right now;)
but the simplest change is to __get_free_pages with the __GFP_COMP
flag set, which marks all the constituent pages as constituents of
a compound page, and thereby keeps the refcounting right - that's
the solution we used in sound/core/memalloc.c when this problem
first came up (but it won't work pre-2.6.15).

Maybe the solution you've already adopted, with additional
get_page()s, is correct: but you need to be careful when your
module is unloaded, there's a danger of leaking the memory if
you don't put_page() the first, and there's a danger of... I
forget what exactly if you don't reset the counts properly on
the subsequent constituents.  __GFP_COMP to hold the compound
page properly together, or split_page() to split it properly
apart, are preferable.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ