Message Boards Message Boards

Sporadic convergence failure: “SingularValueDecomposition::cflsvd”

GROUPS:

Update 1 Per a request from Daniel Lichtblau at Wolfram Research, a minimal example has been uploaded to Google Drive as a pure-ASCII no-format "m"-file named "oneMatrixThatFailsSVD_2017.m". The file is rather large (about 7.7 MBytes) because it encodes a convergence-failing 320x320 complex matrix as an integer byte-array of dimension 320x320x16. This bit-perfect encoding "trick" is necessary because SVD convergence-failure seemingly is exquisitely sensitive to the least-significant bits of IEEE complex numbers (which are of course precisely the bits to which no well-conditioned SVD algorithm should be sensitive).


Update 2 A comment has been added to the end of the above-mentioned text file "oneMatrixThatFailsSVD_2017.m" that provides the following (exact) pipe for flipping the least-significant-bit of complex matrices: Flatten// (* complex square matrix -> complex list *) ExportString[#,"Complex128"]&// (* complex list -> byte string *) ExportString[#,"Base64"]&// (* byte string -> Base64 string *) ByteArray[#]&//Normal// (* Base64 string -> integer list *) Partition[#,16]&// (* integer list -> {...} *) Partition[#,#//Length//Sqrt]&// (* {...} -> integer array *) Map[( {#[[1]]//If[#//EvenQ,#+1,#-1]&} ~ Join ~ (* flip LSB of real part *) #[[2;;8]] ~ Join ~ {#[[9]]//If[#//EvenQ,#+1,#-1]&} ~ Join ~ (* flip LSB of imag part *) #[[10;;16]] )&,#,{2}]&// Flatten// (* integer array -> integer list *) ExportString[#,"Byte"]&// (* integer list -> byte string *) ImportString[#,"Complex128"]&// (* byte string -> complex list *) Partition[#,#//Length//Sqrt]&; (* complex list -> complex matrix *)

It turns out that flipping the least-significant bits of a convergence-failing input matrix does reproducibly eliminate the "SingularValueDecomposition::cflsvd:" message (which is behavior that no well-conditioned SVD algorithm should exhibit).

Hopefully this fine-grained, exactly reproducible control of the "cflsvd" SVD bug will make fixing it much easier in 2017 than back in 2005, when Mathematica tools like ByteArray[__] and ExportString[_,"Complex128"] were less-developed, such that the bug was more challenging to exhibit reproducibly and diagnose reliably.


The bug in a nutshell:  for single-precision complex matrices, Mathematica's SingularValueDecomposition[] sporadically fails to converge.

The associated Mathematica-generated error message is:

SingularValueDecomposition::cflsvd: Machine-precision algorithm
  failed to converge. Arbitrary-precision algorithm is called, 
  which is slower but more accurate.

This is a followup to a long-standing Mathematica bug report (specifically Wolfram Research bug report [TS 28968], submitted way back in 2005).

This tarball provides (in a folder named "SVDfailures_2017") 25 examples of matrices whose convergence fails under Mathematica 10.2.0 for Mac OS X x86. The same tarball provides (in a folder named "SVDfailures_2005") matrices that fail under various versions of Mathematica dating back to 2005 (these files were provided with bug report [TS 28968]). The tarball is rather large (more than 100 MBytes) because it consists mostly of numerical matrices created with "DumpSave[__]").

A principal difference between the 2005 failures and the 2017 failures is that (at least some of) the matrices that failed outright back in 2005, now generate the (undocumented?) convergence-failure message "SingularValueDecomposition::cflsvd"

To anticipate some questions:

  • The arbitrary-precision evaluation does yield a correct decomposition, at the expense of a runtime that is 500-1000X longer.
  • The matrices that fail of convergence are (seemingly) unremarkable in respect to numerical condition and rank.
  • There is no reason (known to me at least) why SVDs of $\mathcal{O}(1)$-entry single-precision matrices should ever "fail to converge", and there is no linear algebra software other than Mathematica's (known to me) that exhibits a similar convergence failure.

My questions are:

  • Does "SingularValueDecomposition::cflsvd" convergence-failure occur more generally, i.e., on systems other than Mathematica 10.2.0 for Mac OS X?
  • What's the best way to report this bug (if it is a reproducible bug)?
POSTED BY: John Sidles
Answer
3 months ago

Please note, too, that an effectively unlimited number of convergence-failing matrices can be generated, as it turns out that randomly permuting rows and columns of a numerical matrix that "cflsvd"-fails, sporadically--not invariably---yields a new matrix that similarly "cflsvd"`-fails.

POSTED BY: John Sidles
Answer
3 months ago

On Mar 11, 2016, near-identical symptoms (namely, SVD convergence-failure) were reported on GitHub's "Anaconda-Issues" forum as "BUG: LinAlgError raised by scipy.linalg.svd for valid matrix #695".

By Apr 26, 2016, Anaconda users reported that:

Great news -- the Intel folks could replicate the issue with the same array 
that Anaconda currently fails on, and they report that it is fixed 
by the current pre-release version of MKL 11.3.3

Here "MKL" is an acronym for (afaict) the "Intel Kernel Library".

However, as of March 9, 2017, Intel users and engineers were still wresting with this problem, as reported in the thread "Bug in GESDD (but not GESVD)". It is not apparent that the source of the SVD failures is understood, much less that a software fix has been implemented.

Conclusion  Multiple lines of evidence seeming establish that an SVD bug exists, that is reproducibly elicited by unremarkable matrices, that has been affecting multiple users, on multiple systems (including Mathematica), for multiple years, whose algorithmic origin(s) are unknown, and for which no software fix(es) are presently available.

POSTED BY: John Sidles
Answer
3 months ago

The original question has been edited to establish that the SVD "cflsvd" bug is sensitive to the least-significant-bits of MachinePrecision input matrices, which is behavior that no well-conditioned SVD algorithm should exhibit.

POSTED BY: John Sidles
Answer
3 months ago

I do not have an answer but I want to avoid the "avoid extended discussion in comments" and anyway this is too long for comments.

First: I do get that same failure to converge message when I run the code in the linked .m file. I do not know what particular route it takes in LAPACK so I'm not sure how to track it. I might be able to get a bit of insight when I try tracking the route of the bignum version, if it happens that our bignum code accurately mirrors the LAPACK version.

Next: "...SVD convergence-failure seemingly is exquisitely sensitive to the least-significant bits of IEEE complex numbers (which are of course precisely the bits to which no well-conditioned SVD algorithm should be sensitive)." This is true but it also provides a hint as to what might be the general nature of the problem. Also I did some perambulation through LAPACK sources.

http://www.netlib.org/lapack/explore-html/index.html

http://www.netlib.org/lapack/explore-html/d0/da6/group__complex16_o_t_h_e_rcomputational_ga42f492d0f5e62a073a80f9ae57a5ee62.html#ga42f492d0f5e62a073a80f9ae57a5ee62

Comments in those files suggest that there may be (uncommon) failure-to-converge states. What happens I suspect has to do with something as basic as trichotomy: the assumption that for real numbers (a,b), either a<b or a=b or a>b. The fact that manifestations depend on machine epsilonic differences in input values, and might even be processor dependent (per discussion in one of the links), are what incline me in this direction.

One way to get a failure is to have an actual mismatch in the code, wherein one place uses less-than and another uses less-equal. This can led to a loop that in effect starts to do nothing at some point and continues this for all successive steps but fails to recognize it has achieved convergence. The people who write and maintain LAPACK know what they are doing and most likely did not make this mistake. That said, given the thicket of code paths it is always possible that such a mismatch exists between some particular pair of routines that use different conventions for checking convergence e.g. IF( thetamin .LT. thresh ) vs. IF( thetamin .LTE. thresh ).

Another way to run afoul of this, somewhat more subtle, is to have a pair of values that explicitly violates trichotomy. Of course that's always impossible...except when it's not. Which is to say, I myself ran afoul of this a couple of years ago. The way it happens, in machine arithmetic, is when some code twice computes the same numeric value, possible but not necessarily in two different ways, and then compares them. They should be equal. Depending on vagaries such as optimization level, vector 8 vs 16 byte alignment, and maybe other details, one value might or might not remain in a register that, depending on architecture, might or might not be larger than a machine double (typical is 80 bits vs 64 for machine doubles). If a computed value has not been stored in a 64 bit location but instead kept in its register, a comparison might well determine that it is strictly greater, or less, than the same value computed earlier and stored in memory.

Such a comparison error can then lead to a loop wherein one part thinks no more work is needed to get convergence, and another thinks convergence has not been attained. Is this the cause of the problems in this particular case? Obviously I don't know. All I can say is this general type of problem is consistent with all the data I have seen that describes circumstances under which is is seen.

POSTED BY: Daniel Lichtblau
Answer
3 months ago

In case it's helpful, the code below seems to trigger the same issue.

L = 5;
n = 4;
id = ConstantArray[1, L] // SparseArray // DiagonalMatrix;
D0 = SparseArray[{{i_, j_} /; Mod[j - i, L] == 1 :> 1, {i_, j_} /; Mod[j - i, L] == 0 :> -1}, {L, L}];
Dj[j_] := KroneckerProduct @@ Table[If[i == j, D0, id], {i, n}]
M = Sum[If[a != b != c != d, KroneckerProduct[Dj[b], SparseArray[{a, (c - 1) n + d} -> 1, {n, n^2}]], 0], {a, n}, {b, n}, {c, n}, {d, n}];
svd = M // Normal // N // SingularValueDecomposition;
POSTED BY: Kevin Slagle
Answer
2 months ago

Group Abstract Group Abstract