This is a great post for educational purposes, but it should be pointed out that the implementation shown is very likely (I haven't tried it) going to be significantly slower than the built-in Transpose.
The reason is that a naive implementation of a simple operation like transposition or array addition will not be as fast as a carefully optimized one. Writing it to take advantage of SIMD instructions (such as SSE) and to account for cache effects should speed it up considerably. For transposition the cache matters most: a plain double loop reads one array contiguously but writes the other with a large stride. The built-in Transpose is almost certainly written with this in mind.
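To make the cache argument concrete, here is a minimal sketch in plain C of a naive transpose next to a tiled (cache-blocked) one. The names `transpose_naive`, `transpose_blocked` and `TILE` are made up for illustration, the tile size of 32 is just a guess, and this is not a claim about how the built-in Transpose actually works. It uses no explicit SIMD; it only shows the blocking idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge length; the best value depends on the cache. */
#define TILE 32

/* Naive out-of-place transpose of a row-major rows x cols matrix.
   Reads from src are contiguous, writes to dst jump by `rows` elements. */
void transpose_naive(const double *src, double *dst,
                     size_t rows, size_t cols)
{
    assert(src != dst);  /* out-of-place only */
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            dst[j * rows + i] = src[i * cols + j];
}

/* Same result, computed tile by tile so that the part of src being read
   and the part of dst being written both stay small. */
void transpose_blocked(const double *src, double *dst,
                       size_t rows, size_t cols)
{
    assert(src != dst);  /* out-of-place only */
    for (size_t ib = 0; ib < rows; ib += TILE) {
        size_t imax = (ib + TILE < rows) ? ib + TILE : rows;
        for (size_t jb = 0; jb < cols; jb += TILE) {
            size_t jmax = (jb + TILE < cols) ? jb + TILE : cols;
            /* Transpose one tile; edge tiles may be smaller than TILE. */
            for (size_t i = ib; i < imax; ++i)
                for (size_t j = jb; j < jmax; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

Both functions produce identical output. Whether the blocked version is actually faster, and by how much, depends on the matrix size and the hardware, so it needs to be measured rather than assumed.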
The point I want to come to is that it would be great if LibraryLink provided access to some of these operations, in particular transposition. Mathematica stores matrices in row-major order (for very good reasons). Many other numerical libraries, especially those with a Fortran heritage such as BLAS and LAPACK, use column-major order. If we want to transfer a matrix between the two, it becomes necessary to transpose while copying. In a future version I would love to see a transpose-copy function for this purpose.
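For reference, this is roughly what such a transpose-copy has to do today when written by hand. It is only a sketch: `copy_row_major_to_col_major` is a hypothetical helper, not a LibraryLink function, and the `ld` (leading dimension) parameter is an assumption borrowed from the BLAS/LAPACK convention for column-major buffers. In a LibraryLink function the source pointer would typically come from `libData->MTensor_getRealData`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of LibraryLink.
   Copies a row-major rows x cols matrix into a column-major buffer.
   ld is the distance, in elements, between the starts of consecutive
   columns in the destination (ld >= rows). */
void copy_row_major_to_col_major(const double *row_major,
                                 double *col_major,
                                 size_t rows, size_t cols, size_t ld)
{
    assert(row_major != col_major);  /* out-of-place only */
    assert(ld >= rows);
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            /* element (i, j): row-major index i*cols + j,
               column-major index i + j*ld */
            col_major[i + j * ld] = row_major[i * cols + j];
}
```

With `ld == rows` this is exactly a transposition of the underlying memory layout, which is why a fast built-in transpose-copy would cover this use case.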