lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <1300746429.2837.20.camel@edumazet-laptop>
Date:	Mon, 21 Mar 2011 23:27:09 +0100
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	ben@...u.net
Cc:	Avi Kivity <avi@...hat.com>, KVM list <kvm@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	John Stultz <johnstul@...ibm.com>,
	Richard Cochran <richard.cochran@...cron.at>
Subject: [RFC] posix-timers: RCU conversion

Le lundi 21 mars 2011 à 23:57 +0545, Ben Nagy a écrit :
> On Mon, Mar 21, 2011 at 10:57 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> > Le lundi 21 mars 2011 à 19:02 +0200, Avi Kivity a écrit :
> >
> >> Any ideas on how to fix it?  We could pre-allocate IDs and batch them in
> >> per-cpu caches, but it seems like a lot of work.
> >>
> >
> > Hmm, I dont know what syscalls kvm do, but even a timer_gettime() has to
> > lock this idr_lock.
> >
> > Sounds crazy, and should be fixed in kernel code, not kvm ;)
> 
> Hi,
> 
> I'll need to work out a way I can make the perf.data available
> (~100M), but here's the summary
> 
> http://paste.ubuntu.com/583425/
> 
> And here's the summary of the summary
> 
> # Overhead          Command         Shared Object
>                 Symbol
> # ........  ...............  ....................
> ..........................................
> #
>     71.86%              kvm  [kernel.kallsyms]     [k] __ticket_spin_lock
>                         |
>                         --- __ticket_spin_lock
>                            |
>                            |--54.19%-- default_spin_lock_flags
>                            |          _raw_spin_lock_irqsave
>                            |          |
>                            |           --54.14%-- __lock_timer
>                            |                     |
>                            |                     |--31.92%-- sys_timer_gettime
>                            |                     |          system_call_fastpath
>                            |                     |
>                            |                      --22.22%-- sys_timer_settime
>                            |                                system_call_fastpath
>                            |
>                            |--15.66%-- _raw_spin_lock
> [...]
> 
> Which I guess seems to match what Eric just said.
> 
> I'll post a link to the full data tomorrow. Many thanks for the help
> so far. If it's a kernel scaling limit then I guess we just wait until
> the kernel gets better. :S I'll check it out with a linux guest
> tomorrow anyway.

Here is a quick & dirty patch to lower idr_lock use. You could try it
eventually ;)

Thanks

[RFC] posix-timers: RCU conversion

Ben Nagy reported a scalability problem with KVM/QEMU that hit very hard
a single spinlock (idr_lock) in posix-timers code.

Even on a 16 cpu machine (2x4x2), a single test can show 98% of cpu time
used in ticket_spin_lock

Switching to RCU should be quite easy, IDR being already RCU ready.

idr_lock should be locked only before an insert/delete, not a find.

Reported-by: Ben Nagy <ben@...u.net>
Signed-off-by: Eric Dumazet <eric.dumazet@...il.com>
Cc: Avi Kivity <avi@...hat.com>
Cc: Thomas Gleixner <tglx@...utronix.de>
Cc: John Stultz <johnstul@...ibm.com>
Cc: Richard Cochran <richard.cochran@...cron.at>
---
 include/linux/posix-timers.h |    1 +
 kernel/posix-timers.c        |   26 ++++++++++++++------------
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index d51243a..5dc27ca 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -81,6 +81,7 @@ struct k_itimer {
 			unsigned long expires;
 		} mmtimer;
 	} it;
+	struct rcu_head rcu;
 };
 
 struct k_clock {
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 4c01249..e2a823a 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -491,6 +491,11 @@ static struct k_itimer * alloc_posix_timer(void)
 	return tmr;
 }
 
+static void k_itimer_rcu_free(struct rcu_head *head)
+{
+	kmem_cache_free(posix_timers_cache, container_of(head, struct k_itimer, rcu));
+}
+
 #define IT_ID_SET	1
 #define IT_ID_NOT_SET	0
 static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
@@ -499,11 +504,12 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 		unsigned long flags;
 		spin_lock_irqsave(&idr_lock, flags);
 		idr_remove(&posix_timers_id, tmr->it_id);
+		tmr->it_id = -1;
 		spin_unlock_irqrestore(&idr_lock, flags);
 	}
 	put_pid(tmr->it_pid);
 	sigqueue_free(tmr->sigq);
-	kmem_cache_free(posix_timers_cache, tmr);
+	call_rcu(&tmr->rcu, k_itimer_rcu_free);
 }
 
 static struct k_clock *clockid_to_kclock(const clockid_t id)
@@ -631,22 +637,18 @@ out:
 static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
 {
 	struct k_itimer *timr;
-	/*
-	 * Watch out here.  We do a irqsave on the idr_lock and pass the
-	 * flags part over to the timer lock.  Must not let interrupts in
-	 * while we are moving the lock.
-	 */
-	spin_lock_irqsave(&idr_lock, *flags);
+
+	rcu_read_lock();
 	timr = idr_find(&posix_timers_id, (int)timer_id);
 	if (timr) {
-		spin_lock(&timr->it_lock);
-		if (timr->it_signal == current->signal) {
-			spin_unlock(&idr_lock);
+		spin_lock_irqsave(&timr->it_lock, *flags);
+		if (timr->it_id == (int)timer_id && timr->it_signal == current->signal) {
+			rcu_read_unlock();
 			return timr;
 		}
-		spin_unlock(&timr->it_lock);
+		spin_unlock_irqrestore(&timr->it_lock, *flags);
 	}
-	spin_unlock_irqrestore(&idr_lock, *flags);
+	rcu_read_unlock();
 
 	return NULL;
 }


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ