linux-kernel - Re: [PATCH] mm: fix up a spurious page fault whenever it happens

Message-ID: <51BE2F5C.8070408@meduna.org>
Date:	Sun, 16 Jun 2013 23:34:20 +0200
From:	Stanislav Meduna <stano@...una.org>
To:	Rik van Riel <riel@...hat.com>
CC:	"H. Peter Anvin" <hpa@...or.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"linux-rt-users@...r.kernel.org" <linux-rt-users@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	the arch/x86 maintainers <x86@...nel.org>,
	Hai Huang <hhuang@...hat.com>
Subject: Re: [PATCH] mm: fix up a spurious page fault whenever it happens

Hi all,

I was able to reproduce the page fault problem with
a relatively simple application, for now on the
Geode platform. It can be downloaded at

  http://www.meduna.org/tmp/PageFault.tar.gz

Basically the test application does:

- 4 threads that do nothing but periodically sleep
- 1 thread looping in a timerfd loop doing nothing
- 4 threads doing nonblocking TCP connects to an address
  in the local network that does not exist, i.e. all that
  happens are ARP requests.
- additionally a non-existing TCP congestion algorithm is
  requested resulting in repeated futile requests to load
  the module. This looks to be an important part in reproducing
  it, but the problem also occasionally happened with kernels
  that did not have modules enabled at all, so it is
  probably just pushing some probabilities.
- the application is statically linked - this might or might
  not be relevant, I just wanted the text-segment to be bigger

I know it is a weird mix, I was just trying to mimic what
our application did in the form that was able to trigger
the faults most often.

In my few tests this reproducibly triggered the problem within hours,
a day at most.

My feeling is that the problem is triggered best if there
is little network traffic and no other connections to the
machine, but this is only a subjective feeling.

The kernel configuration, cpuinfo, meminfo and lspci
are included in the tarball. The kernel configuration is not
very clean, it is a kernel intended to work on both Geode
and Celeron and is also a snapshot of what reproduced the
problem the best.

The environment is a current 3.4-rt with the following tweaks:

 chrt -f -p 37 <pid of ksoftirqd/0>
 chrt -o -p 0 <pid of irq/14-pata>  [because of a pata_cs5536 bug]
 renice -15 <pid of irq/14-pata>
 ulimit -s 512

Before compiling, change the CONNECT_ADDR define to an address
that is in the local LAN but is not in use.

Apart from this application, a lightweight mix of the usual Debian
processes is running. There are no servers except openssh and ntp.
A shell script that wakes every 2 seconds and does some
housekeeping is also running; it is probably what recovers the
system when it enters the page-fault loop followed by the
RT throttling.

Right now a test with the same kernel with preempt none
is running to see whether the problem also happens with this
application there (due to the timing sensitivity, only a positive
result is significant). I have not had a chance to test
on an Intel processor yet.

Thanks
-- 
                                       Stano
