@ r0 = input array pointer
@ r1 = output array pointer
@ r2 = length of data in array
@ We can assume that the allocated array length is greater than zero, is a
@ whole number of vectors, and is greater than or equal to the length of data
@ in the array.
add r2, r2, #7 @ add (vector length - 1) to the data length
lsr r2, r2, #3 @ divide the length of the array by the length
@ of a vector, 8, to find the number of
@ vectors of data to be processed
loop:
subs r2, r2, #1 @ decrement the loop counter, and set flags
vld1.8 {d0}, [r0]! @ load eight elements from the array pointed to
@ by r0 into d0, and update r0 to point to the next vector
...
... @ process the input in d0
...
vst1.8 {d0}, [r1]! @ write eight elements to the output array, and
@ update r1 to point to next vector
bne loop @ if r2 is not equal to 0, loop
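The rounding done by the first two instructions can be checked on any host with a small C model. This is only a sketch and is not part of the original listing; the names `vectors_to_process` and `VEC_LEN` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constant: one D register holds eight 8-bit elements. */
#define VEC_LEN 8u

/* Mirrors "add r2, r2, #7" followed by "lsr r2, r2, #3":
 * round the data length up to a whole number of vectors. */
static size_t vectors_to_process(size_t data_len)
{
    assert(data_len > 0);   /* the listing assumes a non-empty array */
    return (data_len + (VEC_LEN - 1)) >> 3;
}
```

For example, 21 bytes of data need three vectors, so the array must have room for 24 bytes.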
@ r0 = input array pointer
@ r1 = output array pointer
@ r2 = length of data in array
@ We can assume that the operation is idempotent, and the array is at least
@ one vector long.
ands r3, r2, #7 @ calculate number of elements left over after
@ processing complete vectors using
@ data length & (vector length - 1)
beq loopsetup @ if the result of the ands is zero, the length
@ of the data is an integer number of vectors,
@ so there is no overlap, and processing can begin at the loop
@ handle the first vector separately
vld1.8 {d0}, [r0], r3 @ load the first eight elements from the array,
@ and update the pointer by the number of elements left over
...
... @ process the input in d0
...
vst1.8 {d0}, [r1], r3 @ write eight elements to the output array, and
@ update the pointer
@ now, set up the vector processing loop
loopsetup:
lsr r2, r2, #3 @ divide the length of the array by the length
@ of a vector, 8, to find the number of
@ vectors of data to be processed
@ the loop can now be executed as normal. the
@ first few elements of the first vector will
@ overlap with some of those processed above
loop:
subs r2, r2, #1 @ decrement the loop counter, and set flags
vld1.8 {d0}, [r0]! @ load eight elements from the array, and update
@ the pointer
...
... @ process the input in d0
...
vst1.8 {d0}, [r1]! @ write eight elements to the output array, and
@ update the pointer
bne loop @ if r2 is not equal to 0, loop
@ r0 = input array pointer
@ r1 = output array pointer
@ r2 = length of data in array
lsrs r3, r2, #3 @ calculate the number of complete vectors to be
@ processed and set flags
beq singlesetup @ if there are zero complete vectors, branch to
@ the single element handling code
@ process vector loop
vectors:
subs r3, r3, #1 @ decrement the loop counter, and set flags
vld1.8 {d0}, [r0]! @ load eight elements from the array and update
@ the pointer
...
... @ process the input in d0
...
vst1.8 {d0}, [r1]! @ write eight elements to the output array, and
@ update the pointer
bne vectors @ if r3 is not equal to zero, loop
singlesetup:
ands r3, r2, #7 @ calculate the number of single elements to process
beq exit @ if the number of single elements is zero, branch
@ to exit
@ process single element loop
singles:
subs r3, r3, #1 @ decrement the loop counter, and set flags
vld1.8 {d0[0]}, [r0]! @ load a single element into lane 0 of d0, and update the
@ pointer
...
... @ process the input in d0[0]
...
vst1.8 {d0[0]}, [r1]! @ write the single element to the output array,
@ and update the pointer
bne singles @ if r3 is not equal to zero, loop
exit:
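The control flow of the listing above can be modelled in scalar C to check how the length is split into complete vectors and single elements. This is a sketch under stated assumptions: `process_element` is a hypothetical stand-in for the elided processing, and `process_with_singles` and `VEC_LEN` are names introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: one D register holds eight 8-bit elements. */
#define VEC_LEN 8u

/* Hypothetical per-element operation standing in for the elided "...". */
static uint8_t process_element(uint8_t x)
{
    return (uint8_t)(255u - x);
}

static void process_with_singles(const uint8_t *in, uint8_t *out, size_t len)
{
    assert(in != NULL && out != NULL);

    size_t vectors = len >> 3;              /* lsrs r3, r2, #3 */
    size_t singles = len & (VEC_LEN - 1);   /* ands r3, r2, #7 */

    /* "vectors" loop: eight elements per iteration */
    for (size_t v = 0; v < vectors; v++) {
        for (size_t lane = 0; lane < VEC_LEN; lane++)
            out[lane] = process_element(in[lane]);
        in += VEC_LEN;
        out += VEC_LEN;
    }

    /* "singles" loop: one element per iteration */
    for (size_t s = 0; s < singles; s++)
        *out++ = process_element(*in++);
}
```

With a length of 11, the model runs the vector loop once and the single element loop three times, and never touches memory past the eleventh element.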
Further Considerations
Beginning or End
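The overlapping and single element techniques can be applied at either the beginning or the end of the array. If handling the leftovers at the end suits your application better, the code above can easily be changed to process the elements at the end instead.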
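As an illustration of moving the overlap to the end, the following scalar C model processes the complete vectors first and then one final vector that ends exactly at the last element. It is a sketch, not the original author's code: `process_element`, `process_vector`, `process_overlap_at_end` and `VEC_LEN` are hypothetical names, and the model assumes separate input and output buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: one D register holds eight 8-bit elements. */
#define VEC_LEN 8u

/* Hypothetical per-element operation standing in for the elided "...". */
static uint8_t process_element(uint8_t x)
{
    return (uint8_t)(255u - x);
}

/* Scalar stand-in for one vld1.8 / process / vst1.8 sequence. */
static void process_vector(const uint8_t *in, uint8_t *out)
{
    for (size_t lane = 0; lane < VEC_LEN; lane++)
        out[lane] = process_element(in[lane]);
}

/* Assumes in and out do not alias. Processing in place would
 * additionally require an idempotent operation, because the
 * overlapped elements are processed twice. */
static void process_overlap_at_end(const uint8_t *in, uint8_t *out, size_t len)
{
    assert(len >= VEC_LEN);   /* at least one vector long */

    size_t vectors  = len >> 3;
    size_t leftover = len & (VEC_LEN - 1);

    /* complete vectors, starting from the beginning of the array */
    for (size_t v = 0; v < vectors; v++)
        process_vector(in + v * VEC_LEN, out + v * VEC_LEN);

    /* one extra vector ending at the last element; its first
     * (VEC_LEN - leftover) elements overlap the previous vector */
    if (leftover != 0) {
        size_t last = len - VEC_LEN;
        process_vector(in + last, out + last);
    }
}
```

With a length of 13, the loop covers elements 0 to 7 and the final vector covers elements 5 to 12, so elements 5 to 7 are processed twice and nothing past the end of the array is read or written.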