linux-kernel - Re: [PATCH 1/2] Make the TSC safe to be used by gettimeofday().

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <45592929.2000606@FreeBSD.org>
Date:	Mon, 13 Nov 2006 18:25:45 -0800
From:	Suleiman Souhlal <ssouhlal@...eBSD.org>
To:	Andi Kleen <ak@...e.de>
CC:	Linux Kernel ML <linux-kernel@...r.kernel.org>, vojtech@...e.cz,
	Jiri Bohac <jbohac@...e.cz>
Subject: Re: [PATCH 1/2] Make the TSC safe to be used by gettimeofday().

Andi Kleen wrote:
> On Tuesday 14 November 2006 02:08, Suleiman Souhlal wrote:
> 
>>This is done by a per-cpu vxtime structure that stores the last TSC and HPET
>>values.
>>
>>Whenever we switch to a userland process after a HLT instruction has been
>>executed or after the CPU frequency has changed, we force a new read of the
>>TSC, HPET and xtime so that we know the correct frequency we have to deal
>>with.
> 
> 
> Hmm, interesting approach. 
> 
> 
>>With this, we can safely use RDTSC in gettimeofday() in CPUs where the
>>TSCs are not synchronized, such as Opterons, instead of doing a very expensive
>>HPET read.
> 
> 
>>+	cpu = hard_smp_processor_id();
> 
> 
> Why not smp_processor_id ? 

Because CPUID returns the APIC ID, and I was under the impression that 
the cpu numbers
smp_processor_id() were dynamically allocated and didn't necessarily 
match the
APIC ID.

>>+
>>+	apicid = hard_smp_processor_id();
> 
> 
> The apicid <-> linux cpuid mapping should be 1:1 so i don't know
> why you do this separately. All the uses of hard_smp_processor_id
> are bogus.
> 
> 
>>+
>>+	/*
>>+	 * We will need to also do this when switching to kernel tasks if we
>>+	 * want to use the per-cpu monotonic_clock in the kernel
>>+	 */
>>+	if (vxtime.pcpu[apicid].need_update == 1 && next_p->mm != NULL)
>>+		vxtime_update_pcpu();
> 
> 
> Why? It should be really only needed on HLT/cpufreq change, or?
 >
> Anyways, I'm not very fond of adding code to the context switch critical
> path.

Yes, it's only needed on HLT and cpufreq change.
The code here is to force a "resynch" with the HPET if we've done a HLT. 
  It has to be done before we switch to any userland thread that might 
use the per-cpu vxtime. switch_to() seemed like the most natural place 
to put this.

> 
> 
>> 		seq = read_seqbegin(&xtime_lock);
>>+		preempt_disable();
>>+		cpu = hard_smp_processor_id();
> 
> 
> Again, shouldn't use hard
> 
> 
>> 
>>-		sec = xtime.tv_sec;
>>-		usec = xtime.tv_nsec / NSEC_PER_USEC;
>>+		sec = vxtime.pcpu[cpu].tv_sec;
>>+		usec = vxtime.pcpu[cpu].tv_usec;
>> 
>> 		/* i386 does some correction here to keep the clock 
>> 		   monotonous even when ntpd is fixing drift.
>>@@ -135,9 +138,13 @@ void do_gettimeofday(struct timeval *tv)
>> 		   be found. Note when you fix it here you need to do the same
>> 		   in arch/x86_64/kernel/vsyscall.c and export all needed
>> 		   variables in vmlinux.lds. -AK */ 
>>-		usec += do_gettimeoffset();
>>+		t = get_cycles_sync();
>>+		x = (((t - vxtime.pcpu[cpu].last_tsc) *
>>+		    vxtime.pcpu[cpu].tsc_nsquot) >> NS_SCALE) / NSEC_PER_USEC;
>>+		usec += x;
>> 
>> 	} while (read_seqretry(&xtime_lock, seq));
>>+	preempt_enable();
> 
> 
> If it can be implemented without races in user space, why does
> it need preempt disable in kernel space?

Because there is indeed a small race in user space (as you noted below) 
that I still haven't found how to address.

> The faster way to access all the vxtime code would be to stick
> a pointer to the per CPU vxtime data into the PDA
> 
> 
> 
>>+#define NSEC_PER_TICK (NSEC_PER_SEC / HZ)
>>+extern unsigned long hpet_tick;
>>+
>>+extern unsigned long vxtime_hz;
> 
> 
> No externs in .c files. Happened earlier too I think
> 
> 
>>+
>> static __always_inline void timeval_normalize(struct timeval * tv)
>> {
>> 	time_t __sec;
>>@@ -57,35 +67,107 @@ static __always_inline void timeval_norm
>> 	}
>> }
>> 
>>+inline int apicid(void)
>>+{
>>+	int cpu;
>>+
>>+	__asm __volatile("cpuid" : "=b" (cpu) : "a" (1) : "cx", "dx");
>>+	return (cpu >> 24);
> 
> 
> The faster way to do this is to use LSL from a magic GDT entry.

A cow-orker suggested that we use SIDT and encode the CPU number in the 
limit of the IDT, which should be even faster than LSL.

> 
>>+}
>>+
>> static __always_inline void do_vgettimeofday(struct timeval * tv)
>> {
>> 	long sequence, t;
>> 	unsigned long sec, usec;
>>+	int cpu;
>> 
>> 	do {
>> 		sequence = read_seqbegin(&__xtime_lock);
>>-		
>>-		sec = __xtime.tv_sec;
>>-		usec = __xtime.tv_nsec / 1000;
>>-
>>-		if (__vxtime.mode != VXTIME_HPET) {
>>-			t = get_cycles_sync();
>>-			if (t < __vxtime.last_tsc)
>>-				t = __vxtime.last_tsc;
>>-			usec += ((t - __vxtime.last_tsc) *
>>-				 __vxtime.tsc_quot) >> 32;
>>-			/* See comment in x86_64 do_gettimeofday. */
>>-		} else {
>>-			usec += ((readl((void __iomem *)
>>-				   fix_to_virt(VSYSCALL_HPET) + 0xf0) -
>>-				  __vxtime.last) * __vxtime.quot) >> 32;
>>-		}
>>-	} while (read_seqretry(&__xtime_lock, sequence));
>>+		cpu = apicid();
>>+
>>+		sec = __vxtime.pcpu[cpu].tv_sec;
>>+		usec = __vxtime.pcpu[cpu].tv_usec;
>>+		rdtscll(t);
>>+
>>+		usec += (((t - __vxtime.pcpu[cpu].last_tsc) *
>>+		    __vxtime.pcpu[cpu].tsc_nsquot) >> NS_SCALE) / NSEC_PER_USEC;
>>+	} while (read_seqretry(&__xtime_lock, sequence) || apicid() != cpu);
> 
> 
> Hmm, isn't there still a (unlikely, but possible) race: 
> 
>          CPU #0                          CPU #1
>          read cpu number
>  	 switch to CPU #1
>                                          read TSC 
>                                          switch back to CPU #0
>          check cpu number
>          check succeeds, but you got wrong TSC data
> 
> How do you prevent that? I suppose you could force the seqlock to retry
> in this case in the context switch, but I don't think you do that? 

I couldn't figure out how to tell if a context switch has happened from 
userland. I tried putting a per-cpu context switch count, but I couldn't 
figure out how to get it atomically along with the CPU number..

> Also I'm not sure what your context switch code is good for.
> 
> -Andi

Thanks for the feedback.

-- Suleiman
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/