Mahmoud Nouh - Software Engineer

I was writing a skeletal animation system for the PSP (PlayStation Portable), and once it worked I noticed that the framerate was very low. I narrowed the problem down to my vertex skinning loop:


for (U32 j = 0; j < MODEL_JOINT_INFLUENCE; j++)
{
	// skip influence slots whose joint ID is 0xFF
	if (inputVtx->jointIDs[j] == 0xFF)
		continue;

	UMatrix4 skinMat = { .fm = inModel->skel->skinningMatrices[inputVtx->jointIDs[j]] };

	// transpose the skinning matrix into a temporary (this is the slow part)
	UMatrix4 transposed;
	for (U32 c = 0; c < 4; c++)
		for (U32 r = 0; r < 4; r++)
			transposed.f[c][r] = skinMat.f[r][c];

	// transform the vertex position by this joint, then scale by the joint's weight
	Vector4 localPos = { inputVtx->pos.x, inputVtx->pos.y, inputVtx->pos.z, 1.0f };
	vfpu_transform_vector(&transposed.fm, &localPos, &localPos);
	vfpu_scale_vector(&localPos, &localPos, inputVtx->weights[j]);

	// accumulate the weighted position into the output vertex
	outVtx->pos.x += localPos.x;
	outVtx->pos.y += localPos.y;
	outVtx->pos.z += localPos.z;
}

This code brought the frametime up to 38.5ms. When I removed the transposing loop, the animation broke but the frametime dropped back to 15.9ms.
After some time, I realized that this was because transposing the matrices caused many cache misses, and the loop runs 3 times (MODEL_JOINT_INFLUENCE) per vertex.
To solve this problem I tried¹:

- Fixing the math. I couldn't do it because, at the time, I wasn't very good at linear algebra, and matrices specifically.
- Moving the matrix to the scratchpad (a memory region just as fast as L1 cache on the PSP) and transposing it there. But then the bottleneck became the memory copies between main RAM and the scratchpad.
- Transposing the matrices on the VFPU (Vector Floating Point Unit, which is like SIMD but for the PSP). This worked.

1) While writing this post, I noticed that I could also have transposed the matrices outside of the vertex loop, which would have fixed the performance problem. But I am glad I never thought of this, because otherwise I wouldn't have experimented with the VFPU.
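
The alternative from the footnote can be sketched roughly as below. This is only a minimal illustration under assumptions, not code from the project: Mat4, mat4_transpose and pretranspose_skinning_matrices are hypothetical names, and Mat4 is a stand-in for the real matrix type, indexed [row][col]. The idea is to pay for the transpose once per joint per frame instead of once per vertex per influence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the post's matrix type: 4x4 floats, indexed [row][col]. */
typedef struct { float f[4][4]; } Mat4;

/* Write the transpose of *src into *dst. The two must not alias,
   because dst is overwritten while src is still being read. */
static void mat4_transpose(Mat4 *dst, const Mat4 *src)
{
    assert(dst != src);
    for (size_t r = 0; r < 4; r++)
        for (size_t c = 0; c < 4; c++)
            dst->f[c][r] = src->f[r][c];
}

/* Transpose every joint's skinning matrix once, before the vertex loop.
   The vertex loop can then index into `out` directly and skip its own transpose. */
static void pretranspose_skinning_matrices(Mat4 *out, const Mat4 *skinning, size_t jointCount)
{
    for (size_t j = 0; j < jointCount; j++)
        mat4_transpose(&out[j], &skinning[j]);
}
```

With something like this, the vertex loop would read the already-transposed matrix for each joint ID and pass it straight to the transform function.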

I modified the vfpu_transform_vector function from libpspmath to use a column-major alias of the VFPU matrix register it operates on:


	# void vmMultMat4Vec4(const Matrix4* inMat4, const Vector4* inVec4, Vector4* ioVec4)
	.global vmMultMat4Vec4
vmMultMat4Vec4:
	# load the matrix into VFPU matrix 0 (columns C000..C030), read later through its E000 alias
	lv.q		C000, 0($a0)		# inc by 16 = 4 floats * 4 bytes
	lv.q		C010, 16($a0)
	lv.q		C020, 32($a0)
	lv.q		C030, 48($a0)

	lv.q		C100, 0($a1)		# load input vector into C100 register
	vtfm4.q		C110, E000, C100	# perform matrix-vector mult and store vec4 result in C110
#	vtfm4.q		C110, M000, C100	# this is the original line from libpspmath which resulted in transposed matrices
	sv.q		C110, 0($a2)		# store output C110 into ioVec4
	jr			$ra					# jump to $ra (return address)
	nop								# burn branch delay slot (mips-specific quirk)

The E000 VFPU register is a column-major alias of the row-major M000 matrix register: the same 16 floats, read as the transposed matrix.
The Cxy0 register is a 4-component vector representing the y-th column of the x-th matrix.
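
The effect of reading a matrix through a transposed alias can be illustrated in plain C. This is an analogy only, not a model of the exact vtfm4.q semantics: Mat4, Vec4 and both functions are hypothetical names, and Mat4 is assumed to be indexed [row][col]. The point is that multiplying by the transpose needs no transposed copy in memory, only swapped indices while reading.

```c
#include <assert.h>

/* Hypothetical stand-ins for the post's matrix and vector types. */
typedef struct { float f[4][4]; } Mat4;   /* indexed [row][col] */
typedef struct { float v[4]; } Vec4;

/* out = M * in: out[i] is the dot product of row i of M with in. */
static Vec4 mat4_mul_vec4(const Mat4 *m, const Vec4 *in)
{
    assert(m && in);
    Vec4 out;
    for (int i = 0; i < 4; i++) {
        out.v[i] = 0.0f;
        for (int k = 0; k < 4; k++)
            out.v[i] += m->f[i][k] * in->v[k];
    }
    return out;
}

/* out = transpose(M) * in: the same loop with the two indices swapped,
   so out[i] is the dot product of column i of M with in.
   No transposed copy of M is ever written. */
static Vec4 mat4_transposed_mul_vec4(const Mat4 *m, const Vec4 *in)
{
    assert(m && in);
    Vec4 out;
    for (int i = 0; i < 4; i++) {
        out.v[i] = 0.0f;
        for (int k = 0; k < 4; k++)
            out.v[i] += m->f[k][i] * in->v[k];
    }
    return out;
}
```

Both functions return the result by value, so passing the same vector as input and output (as the skinning loop does with localPos) is safe in this sketch.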

Then I changed my vertex loop to:


for (U32 j = 0; j < MODEL_JOINT_INFLUENCE; j++)
{
	[...]

	Vector4 localPos = { inputVtx->pos.x, inputVtx->pos.y, inputVtx->pos.z, 1.0f };
	vmMultMat4Vec4(&transposed.fm, &localPos, &localPos);
	vfpu_scale_vector(&localPos, &localPos, inputVtx->weights[j]);

	[...]
}

The only change from the original loop is that vfpu_transform_vector is replaced with vmMultMat4Vec4, which transposes the matrices directly on the VFPU without any repeated memory accesses. The frametime is now 15.9ms, the same as when I removed the transpose code completely at the beginning of my debugging.

Cache Coherency on the PSP