I am currently trying to optimize some MIPS assembler that I’ve written for a program that triangulates a 24×24 matrix. My current goal is to utilize delayed branching and manual loop unrolling to try and cut down on the cycles. Note: I am using 32-bit single precision for all the matrix arithmetic.
Part of the algorithm involves the following loop that I’m trying to unroll (N will always be 24)
...
float inv = 1/A[k][k]
for (j = k + 1; j < N; j++) {
/* divide by pivot element */
A[k][j] = A[k][j] * inv;
}
...
I want
...
float inv = 1/A[k][k]
for (j = k + 1; j < N; j +=2) {
/* divide by pivot element */
A[k][j] = A[k][j] * inv;
A[k][j + 1] = A[k][j + 1] * inv;
}
...
but it generates the incorrect result and I don’t know why. The interesting thing is that the version with loop unrolling generates the first row of matrix correctly but the remaining ones incorrect. The version without loop unrolling correctly triangulates the matrix.
Here is my attempt at doing it.
...
# No loop unrolling
loop_2:
move $a3, $t2 # column number b = j (getelem A[k][j])
jal getelem # Addr of A[k][j] in $v0 and val in $f0
addiu $t2, $t2, 1 ## j += 2
mul.s $f0, $f0, $f2 # Perform A[k][j] * inv
bltu $t2, 24, loop_2 # if j < N, jump to loop_2
swc1 $f0, 0($v0) ## Perform A[k][j] := A[k][j] * inv
# The matrix triangulates without problem with this original code.
...
...
# One loop unrolling
loop_2:
move $a3, $t2 # column number b = j (getelem A[k][j])
jal getelem # Addr of A[k][j] in $v0 and val in $f0
addiu $t2, $t2, 2 ## j += 2
lwc1 $f1, 4($v0) # $f1 <- A[k][j + 1]
mul.s $f0, $f0, $f2 # Perform A[k][j] * inv
mul.s $f1, $f1, $f2 # Perform A[k][j+1] * inv
swc1 $f0, 0($v0) # Perform A[k][j] := A[k][j] * inv
bltu $t2, 24, loop_2 # if j < N, jump to loop_2
swc1 $f1, 4($v0) ## Perform A[k][j + 1] := A[k][j + 1] * inv
# The first row in the resulting matrix is correct, but the remaining ones not when using this once unrolled loop code.
...
>Solution :
The original C is potentially buggy.
j < N; j +=2 can start the loop body with j = N-1, accessing the array at A[k][N-1] (fine) and A[k][N] (not fine).
One common method is j < N-1, or in general j < N-(unroll-1). But for unsigned N, you also have to separately check N >= unroll before starting the loop, because N-1 could wrap to a huge unsigned value.
Keeping the j < limit is generally good for C compilers vs. j + 1 < N which is a separate thing they’d have to calculate. And can also stop a compiler from proving that the loop isn’t infinite for unsigned counts (like size_t), because that’s well-defined as wrapping around, so N = UINT_MAX could lead to the condition always being true depending on the starting point. (e.g. j = UINT_MAX-2 makes UINT_MAX-1 < UINT_MAX, and j+=2 makes 0 < UINT_MAX, also true.) So it’s a similar problem to using j <= limit for unsigned counters. Compilers really like to know when a loop is potentially infinite. For some, that it disables auto-vectorization if the trip-count isn’t calculable ahead of the first iteration.
If j was starting at 0, you can get away with a sloppy condition if N was guaranteed to be a multiple of the unroll factor. But here it’s different, as Nate points out.
Re: the efficiency of your MIPS asm
generally the point of loop unrolling is performance. A non-inline call to a helper function inside the loop is kind of defeating the purpose.
jal getelem I assume does a bunch of multiplies and stuff to redo the indexing with a pointer and two integers? Notice that you’re scanning along contiguous memory in one row, so you can just increment a pointer.
Calculate an end-pointer to compare against, so your MIPS loop can look like
looptop: # do{
lwc1 $f2, 0($t0)
lwc1 $f3, 4($t0)
addiu $t0, $t0, 4*2 # p+=2 advance by 8 bytes, 2 floats
...
swc1 something, 0($t0)
swc1 something, 4($t0)
bne $t0, $t1 # }while(p!=endp)
Notice that you’re only incrementing the pointer once inside the unrolled loop, and using the 16-bit immediate in load/store instructions to offset it.