Message-ID: <51BE2F5C.8070408@meduna.org>
Date:	Sun, 16 Jun 2013 23:34:20 +0200
From:	Stanislav Meduna <stano@...una.org>
To:	Rik van Riel <riel@...hat.com>
CC:	"H. Peter Anvin" <hpa@...or.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"linux-rt-users@...r.kernel.org" <linux-rt-users@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	the arch/x86 maintainers <x86@...nel.org>,
	Hai Huang <hhuang@...hat.com>
Subject: Re: [PATCH] mm: fix up a spurious page fault whenever it happens

Hi all,

I was able to reproduce the page fault problem with
a relatively simple application, so far only on the
Geode platform. It can be downloaded at

  http://www.meduna.org/tmp/PageFault.tar.gz

Basically the test application does:

- 4 threads that do nothing but periodically sleep
- 1 thread looping in a timerfd loop doing nothing
- 4 threads doing nonblocking TCP connects to an address
  in the local network that does not exist, i.e. all that
  happens are ARP requests.
- additionally a nonexistent TCP congestion algorithm is
  requested, resulting in repeated futile requests to load
  the module. This looks to be an important part of reproducing
  it, but the problem also occasionally happened with kernels
  that did not have modules enabled at all, so it is
  probably just shifting the probabilities.
- the application is statically linked - this might or might
  not be relevant, I just wanted the text-segment to be bigger

I know it is a weird mix, I was just trying to mimic what
our application did in the form that was able to trigger
the faults most often.

In my few tests this reliably triggered the problem within
hours, at most a day.

My feeling is that the problem is triggered best if there
is little network traffic and no other connections to the
machine, but this is only a subjective feeling.

The kernel configuration, cpuinfo, meminfo and lspci
are included in the tarball. The kernel configuration is not
very clean: it is a kernel intended to work on both Geode
and Celeron, and it is a snapshot of the setup that
reproduced the problem best.

The environment is a current 3.4-rt with the following tweaks:

 chrt -f -p 37 <pid of ksoftirqd/0>
 chrt -o -p 0 <pid of irq/14-pata>  [because of a pata_cs5536 bug]
 renice -15 <pid of irq/14-pata>
 ulimit -s 512

Before compiling, change the CONNECT_ADDR define to an address
that belongs to the local LAN but is not in use.

Other than this application, a lightweight mix of the usual
Debian processes is running. There are no servers except
openssh and ntp. A shell script that wakes up every 2 seconds
and does some housekeeping is also running; it is probably
what recovers the system when it enters the page-fault loop
followed by the RT throttling.

Right now a test with the same kernel built with preempt none
is running to see whether the problem also happens with this
application there (due to the timing sensitivity only a positive
result is significant). I have not had a chance to test
on an Intel processor yet.

Thanks
-- 
                                       Stano
