I do not have an answer, but I want to respect the "avoid extended discussion in comments" guideline, and in any case this is too long for a comment.
First: I get that same failure-to-converge message when I run the code in the linked .m file. I do not know what particular route it takes through LAPACK, so I'm not sure how to track it. I might be able to get a bit of insight by tracking the route of the bignum version, if it happens that our bignum code accurately mirrors the LAPACK version.
Next: "...SVD convergence-failure seemingly is exquisitely sensitive to the least-significant bits of IEEE complex numbers (which are of course precisely the bits to which no well-conditioned SVD algorithm should be sensitive)." This is true, but it also provides a hint as to the general nature of the problem. I also did some perambulation through the LAPACK sources.
http://www.netlib.org/lapack/explore-html/index.html
http://www.netlib.org/lapack/explore-html/d0/da6/group__complex16_o_t_h_e_rcomputational_ga42f492d0f5e62a073a80f9ae57a5ee62.html#ga42f492d0f5e62a073a80f9ae57a5ee62
Comments in those files suggest that there may be (uncommon) failure-to-converge states. What happens, I suspect, has to do with something as basic as trichotomy: the assumption that for real numbers (a,b), either a<b, a=b, or a>b. The fact that manifestations depend on machine-epsilon-level differences in input values, and might even be processor dependent (per discussion in one of the links), is what inclines me in this direction.
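As a concrete aside, IEEE arithmetic does permit outright trichotomy violations: with a NaN operand, all three comparisons are simultaneously false. I am not claiming NaNs are involved in this particular failure; the sketch below just shows that the trichotomy assumption is not ironclad in floating point:

```python
# With a NaN operand, a < b, a == b, and a > b are ALL false --
# trichotomy fails outright under IEEE comparison semantics.
a = float("nan")
b = 1.0
print(a < b, a == b, a > b)   # False False False
```

Any code whose loop logic implicitly assumes exactly one of the three relations must hold can misbehave when handed such a pair.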
One way to get a failure is to have an actual mismatch in the code, wherein one place uses less-than and another uses less-than-or-equal. This can lead to a loop that in effect starts to do nothing at some point, continues doing nothing for all successive steps, yet fails to recognize it has achieved convergence. The people who write and maintain LAPACK know what they are doing and most likely did not make this mistake. That said, given the thicket of code paths, it is always possible that such a mismatch exists between some particular pair of routines that use different conventions for checking convergence, e.g. IF( thetamin .LT. thresh ) vs. IF( thetamin .LE. thresh ).
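To make that mismatch concrete, here is a hypothetical Python sketch (the routine and variable names are invented, not LAPACK's): an inner step that stops reducing once the residual is less-than-or-equal to the threshold, paired with a driver that declares convergence only on strictly-less-than. When the residual lands exactly on the threshold, the loop stalls until the iteration cap trips:

```python
THRESH = 0.5

def inner_step(residual):
    # Inner routine: considers itself done once residual <= THRESH
    # (non-strict), so it performs no further reduction.
    if residual <= THRESH:
        return residual
    return residual / 2.0

def driver(residual, max_iter=50):
    # Outer routine: declares convergence only when residual < THRESH
    # (strict) -- a different convention from the inner routine's.
    for i in range(max_iter):
        if residual < THRESH:
            return i, True        # converged
        residual = inner_step(residual)
    return max_iter, False        # gave up: "failure to converge"

# 8.0 halves exactly to 0.5, which is <= THRESH but never < THRESH:
print(driver(8.0))   # (50, False): the loop stalls and hits the cap
print(driver(7.0))   # (4, True): residual skips past the threshold
```

Make both tests use the same convention and the stall disappears; whether input bits steer the residual exactly onto the threshold is what makes such a bug look epsilon-sensitive.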
Another way to run afoul of this, somewhat more subtle, is to have a pair of values that explicitly violates trichotomy. Of course that's always impossible...except when it's not. Which is to say, I myself ran afoul of this a couple of years ago. The way it happens, in machine arithmetic, is when some code twice computes the same numeric value, possibly but not necessarily in two different ways, and then compares them. They should be equal. Depending on vagaries such as optimization level, 8- vs. 16-byte vector alignment, and maybe other details, one value might or might not remain in a register that, depending on architecture, might or might not be wider than a machine double (typically 80 bits vs. 64 for machine doubles). If a computed value has not been stored in a 64-bit location but instead kept in its register, a comparison might well determine that it is strictly greater, or less, than the same value computed earlier and stored in memory.
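Python cannot exhibit the extended-register effect directly (CPython stores every intermediate as a 64-bit double), but a simpler analog shows the same moral: two algebraically equivalent routes to "the same" value can disagree in the last bit, so comparing them for equality, or for strict order, is fragile:

```python
# Two algebraically equivalent computations of the "same" value:
x = 0.1 + 0.2     # rounds up to 0.300000000000000044...
y = 0.3           # rounds down to 0.299999999999999988...
print(x == y)     # False: they differ by one ulp
print(x > y)      # True: a strict ordering between "equal" quantities
```

In C compiled for x87, the same one-ulp disagreement arises between a register-held extended-precision intermediate and its spilled 64-bit copy, which is exactly the mechanism described above.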
Such a comparison error can then lead to a loop wherein one part thinks no more work is needed for convergence, and another thinks convergence has not been attained. Is this the cause of the problems in this particular case? Obviously I don't know. All I can say is that this general type of problem is consistent with all the data I have seen describing the circumstances under which it is seen.