linux-kernel - RT thread migration (?) problem

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [day] [month] [year] [list]

Message-ID: <CAGRd_97YLsQ6MkJFgENuRKHhY30fKP4NgQmzcswgTz=abw-KRQ@mail.gmail.com>
Date:	Wed, 10 Apr 2013 08:49:48 -0400
From:	Phil Wilshire <sysdcs@...il.com>
To:	linux-kernel@...r.kernel.org
Cc:	Phil Wilshire <sysdcs@...il.com>
Subject: RT thread migration (?) problem

We are using kernel 3.2.21

It looks like we are seeing stalls or lockups when allowing rt threads to
migrate.

The problem has been much reduced by pinning threads to cpus.

However, we still see unreasonable delays in triggering a high priority rt
monitoring thread when it, alone, is allowed to migrate cpus. If we lock
this monitoring thread down to a single cpu and it all seems to work
properly.
By unreasonable, I mean a delay of over 500mS or more on a Real Time thread
expected to schedule every 250 mS.

 The kernel 3.2.21 config is attached as is the test code.
 We have tried other kernels and have seen exactly the same problem.

We  have been looking at this problem for about 3 months now.. It was
initially exposed on an intel atom D525 cpu. The system simply seemed to
stop working for a period of time, sometimes over 5 seconds. All threads
and interrupts seemed to stop until suddenly it all woke up again.
Of course when the watchdog was set to 4 seconds we would get a surprise
reboot !!
When we extended the watchdog time we started to see the system recover in
the 5 second period.

This problem  happens very rarely on "normal" system loading, say once
every three months or so.
We managed to accelerate the problem by creating some special stress test
code ( attached ) that reduced the failure time to a day or so on some
systems.

The test code loads the system heavily with a number of RT worker threads
repeatedly performing a small task. We are not worried about small missed
deadlines on the worker threads. We understand that we are subjecting the
system to an unreasonable load. We are simply trying to expose the main
problem.

In addition, there is an overall supervisor thread at the highest priority.
Under the fault condition the supervisor thread would sleep for 5 or so
seconds and then wake up and continue. The supervisor thread is in charge
of pinging the watchdog.

When running the special stress test, we hit the RT Throttling warning and
understand what that means.
I have enabled LTT tracing and ftrace. I did manage to get some
limited traces showing the system just hanging with no activity for 5
seconds. When it wakes up we have a rush of thread migration
operations as the system tries to recover.

Both trace options did not show much detail just prior to the hang.
When I enabled more tracing detail under ftrace the problem did not
seem to happen.

We originally thought this problem was associated with the ATOM processor.
The reason for the posting is that we are now seeing the same  pattern
with some larger Intel Xeon  E5-2630/2680  systems

I am ready to run any tests etc, possibly even provide hardware etc to
explore this problem.
Note the attached thread_test program currently pins the management
thread to cpu 0

TIA

Download attachment "config-t510v1-3.2.21" of type "application/octet-stream" (79515 bytes)

View attachment "thread_test.c" of type "text/x-csrc" (10328 bytes)