Project 1: Worst Case Execution Time

One of the key task parameters used to schedule jobs is the worst-case execution time of jobs in the task. We cannot simply run the job on any arbitrary input and measure its execution time. The program implementing the job may have several paths that take varying amounts of time depending on job's input. For example, the number of iterations of a loop may vary depending on a value read from an input.

On non-pipelined processors with no cache, the WCET can be determined by counting instructions in each of a program's basic blocks and determining the longest possible path through them. For example, C source code below:

for (i = 0 ; i < n ; i++) {
        if (a[i] > 5) {
                a[i] = a[i] + 3;
        } else {
                a[i] = 3 * b[i] - 7;
        }
}

Assuming the locations of a, b, and n are constant, the above code might compile into something like:

1         movl  n, %eax            ; move n into %eax register
2         movl  $0, %ebx           ; move 0 into %ebx register (i)
3 loop:
4         cmpl  %ebx, %eax         ; while i < n 
5         jge   loop_end
6         cmp   $5, a(%ebx)        ; if (a[i] < 5) then
7         jge   else
8         addl  $3, a(%ebx)        ; a[i] = a[i] + 3
9         jmp   end_if
10 else:
11        movl  b(%ebx), %ecx      ; a[i] = 3 * b[i] - 7
12        imul  $3,  %ecx
13        subl  $7, %ecx
14        movl  %ecx, a(%ebx)
15 end_if:
16        inc   %ebx
17        jmp   loop
18 loop_end:

The two instructions in lines 1 and 2 are executed only once. The loop test and the jump, in lines 4 and 5, are executed n + 1 times, and the rest of the loop body is executed n times. Only one branch of the if statement is executed in each iteration. The if clause has two instructions, in lines 8 and 9, and the else clause has four instructions, in lines 11 through 14. Since we are concerned with the worst-case execution time, we will assume the else clause is always taken because it has more instructions. The overall the number of clock cycles to execute the job, assuming one clock cycle per instruction, is:

2 + (n+1) × 2 + n × ( 2 + 4 + 2 )
lines 1-2 lines 4-5 lines 6-7 lines11-14 lines 16-17

Determining the worst-case execution time is straightforward, as long as an upper bound on n is available.

The above analysis works well as long as the number of cycles per instructions is known. On older architectures and simple microcontrollers, determining the number of cycles per instruction is relatively simple. The number of cycles per instruction were published in the processor datasheet. Modern processors have features like pipelines, caches, branch target buffers, caches, TLB's, etc that greatly increase average execution speed, but make determining worst-case execution time much more difficult. For example, a memory access that hits the L1 cache might take two to four clock cycles, but a memory access that misses the L1 and L2 cache and the TLB can take several hundred clock cycles.

The calculated worst-case execution time for the job will be several hundred times longer than the actual worst-case execution time if we assume the worst-case execution time for every instruction (i.e., pipeline is flushed, L1 cache always missed, L2 cache always missed, TLB always missed, etc). Such an estimate is still safe, meaning that it is still an upper bound on the execution time, but it is not a tight bound. With more careful analysis we can get an estimate that is still safe and tighter than the worst-case-for-every-instruction bound. The following papers discuss tools that deal with some of the issues with modern processors:

Analyze This

The following code performs a transfer function that might be used in a real-time control system. You will analyze it to determine its worst-case execution time.

double xfer(double v)
{
	register double v4;
       	if (v < 0.0) {
                return 0;
        } else if (v > 1.0) {
                return 1.0;
        }
        v4 = v * v;
        v4 = v4 * v4;
        return (((-20.0 * v + 70.0) * v - 84.0) * v + 35.0) * v4;
}

To analyze this code you will need an assembly language version. The compiler, gcc, will generate assembly language when give the -S option. The -On option tells the compiler to use optimization level n when compiling the code. In this case, I recommend at least -O2 to reduce the amount of code you have to analyze.

Undergraduates

When analyzing the code, assume that each instruction takes exactly one cycle, and the clock runs at 495 MHz.

Graduate Students

When analyzing the code, use the pipeline structure and cache structure described in the AMD-K6-2 Code Optimization Application Note and some of the ideas from the papers listed above to determine a worst-case number of clock cycles for the code segment. Note: you do not have to write a tool like those described in the literature, as you only have to analyze this one short program segment.

Assume a clock speed of 495 MHz to determine the execution time.

Everybody

Many commodity CPU's, including the K6-2, have a tick counter, that starts counting as when the machine is first powered on. Measure the average execution time of the above code using the CPU tick counters. The following code reads the CPU tick counter an many Intel and AMD IA32 processors:

union tick_union {
			        unsigned long long      val64;
			        struct {
			                unsigned int    low32;
			                unsigned int    hi32;
			        } lowhi;
			};
			
			extern inline
			unsigned long long
			ticks(void)
			{
			        register union tick_union result;
			        register unsigned int msr = 0;
			
			        __asm __volatile(
			".byte 0xf; .byte 0x31  # Read the cycle counter\n"
			"        movl %%edx, %0  # high order bits\n"
			"        movl %%eax, %1  # low order bits\n"
			: "=g" (result.lowhi.hi32), "=g" (result.lowhi.low32)
			: "g" (msr)
			: "eax", "edx"
			                         );
			
			        return(result.val64);
			}
			
		

Use the code by assigning the return value of ticks() to a long long variable before and after the code you want to measure. The difference between the two values is the number of ticks the code used to execute. Here is an example:

long long start, end;

start = ticks();

... code to measure ...

end = ticks();
printf("code to measure took %lld ticks\n", end - start);

There are a few extra ticks added when you call ticks(), so you should measure the code in a loop and factor out the loop overhead, unless the extra overhead of calling ticks is small compared to the number of ticks required to execute the code being measured.

What to Submit

For this project, you should submit, C source code you used for analysis and testing, the assembly code you analyzed, and a report. The report should introduce the project, document how you calculated the worst-case execution time, and document any thing else interesting about the project. For example, how do the average and worst case execution times compare? Does the average change much if the CPU load is high? (I am looking for some thought here, not just "one number is bigger than the other".)

The code and documentation will be submitted through a web form on the EE 599 web site. The documentation should either be in ASCII text format (preferred) or PDF (if you need graphics). Grading reports is much more difficult when they are submitted in different formats, so reports submitted in formats other than ASCII text or PDF (like Microsoft Word) will not be read.

Resources