Issue in the classification optimization algorithm for large data sets?

Posted 1 month ago

I believe that I encountered a bug in Mathematica 12.

The Classify function issues error messages when both of the following hold:

  1. The training set has significantly more than $10^5$ examples.
  2. No Method option is specified, i.e. the automatic search for the optimal algorithm is active.

(When either condition is removed, the classifier training proceeds correctly.)

The problem depends on the Mathematica version; I elaborate on this below.

The following errors appear together (repeated several times):

a)

NetTrain::encgenfail2: Could not encode one or more inputs for "Output" port: supplied data was a length-64 vector of real numbers, but expected a class. The invalid inputs had indices {158629, ..., <<14>>}

b)

LibraryFunction::typerr: An error occurred in the tree_evaluation.

c)

Part::pkspec1: The expression -LibraryFunctionError[LIBRARYTYPEERROR, 1] cannot be used as a part specification.

and subsequent errors related to list and iteration indices and function domains.

Classify still returns a working classifier, but training takes a long time, and the classifier is sometimes built with a non-optimal method and performs suboptimally.

It seems that the automatic search for the classification method fails for large (but not very large!) data sets.

Training a classifier with each method specified explicitly and selecting the best one is not a satisfactory solution, among other reasons because each method has its own variants and meta-parameters that are optimized. I do not know whether, when a method is specified, optimization over its variants and meta-parameters is still performed or their default values are used. On smaller data sets, where both approaches work, classification with an explicitly specified method can perform worse, even when it is the same method that the automatic search selects as the best.
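A minimal sketch of how that comparison could be made on a small set. The data here are synthetic and purely hypothetical (makeExample is an illustrative helper, not part of the original notebook), and I am assuming that Information[classifier, "Method"] returns the method name as a string, as it does in version 12:

```mathematica
(* Hypothetical synthetic data: one nominal and three numeric features, not the original data set *)
SeedRandom[1];
makeExample[] := Module[
  {x = RandomReal[{0, 5}], y = RandomReal[{0, 60}], z = RandomReal[{0, 30}]},
  {x, RandomChoice[{"abc", "klm", "xyz"}], y, z} ->
   Which[x + y/20 < 3, "A", z < 15, "B", True, "C"]];
data = Table[makeExample[], 12000];

(* First 10000 examples for training, remaining 2000 held out for testing *)
{train, test} = TakeDrop[data, 10000];

(* Automatic search: no Method option given *)
auto = Classify[train];
chosen = Information[auto, "Method"];

(* The same method, this time requested explicitly *)
fixed = Classify[train, Method -> chosen];

(* Method name and accuracy of both classifiers on the same held-out set *)
{chosen,
 ClassifierMeasurements[auto, test, "Accuracy"],
 ClassifierMeasurements[fixed, test, "Accuracy"]}
```

If the two accuracies differ noticeably, that would suggest the automatic search tunes something that the explicit Method setting leaves at its defaults, but a single run on synthetic data is only an indication.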

In version 11.3 these errors did not show up. I do not know, however, whether the problem was absent or merely invisible, because with large sets the classification performance seemed insufficient to me and the choice of method surprising.

Does anyone have an idea how to make it work properly, or is there any hope for a patch?

To be really precise, I attach links to the notebook and the database:

4 Replies

Without the actual network, the audience here can only guess. Can you post the code etc?

The code is really simple:

classify = Classify[trainingset]

The problem is in the data set, or rather in its size, because its structure is rather ordinary and simple:

trainingset = {{1.1397, "abc", 5.76211, 26.7396} -> "A", {3.21085, "klm", 47.1485, 17.5633} -> "C", {2.57019, "xyz", 59.5656, 13.73} -> "A", ..., {1.04451, "klm", 13.9758, 1.44347} -> "B"};

where the length of the list is of the order of $n=10^6$.

When I subsample the training set to a length of $n'=10^5$, for example with classify = Classify[RandomSample[trainingset, 10^5]], the problem disappears.

The problem also disappears when I specify a method, e.g. classify = Classify[trainingset, Method -> "DecisionTree"] (or any other method).
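For anyone who wants to probe this without my data, here is a rough sketch that checks at which size the messages start to appear. makeSet and trainsCleanlyQ are hypothetical helper names, and the generated set only mimics the structure of trainingset, so it may or may not trigger the same errors. Check only detects messages that are actually issued (not ones that are switched off or suppressed internally), and the $10^6$ runs can take a long time.

```mathematica
(* Hypothetical generator mimicking the structure of trainingset; the real data are not reproduced *)
makeSet[n_Integer] := Table[
   With[{x = RandomReal[{0, 5}], y = RandomReal[{0, 60}], z = RandomReal[{0, 30}]},
    {x, RandomChoice[{"abc", "klm", "xyz"}], y, z} ->
     Which[x + y/20 < 3, "A", z < 15, "B", True, "C"]],
   n];

(* True if Classify finishes without issuing any message, False otherwise *)
trainsCleanlyQ[set_, opts___] := Check[Classify[set, opts]; True, False];

SeedRandom[1];
full = makeSet[10^6];

(* Automatic method search on subsamples of increasing size *)
Table[n -> trainsCleanlyQ[RandomSample[full, n]], {n, {10^5, 3*10^5, 10^6}}]

(* Full-size set with an explicitly specified method *)
trainsCleanlyQ[full, Method -> "DecisionTree"]
```

Running the same cells in 11.3 and 12 would show whether the size threshold really depends on the version.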

These errors did not appear in Mathematica version 11.3.

There is nothing more to specify about the code. The error messages are cited above.

To be really precise, I attach links to the notebook and the database in the edited version of the question.

The same topic is present on Mathematica StackExchange: link.
