I'm trying to cluster texts.
Currently I'm using BERT (Bidirectional Encoder Representations from Transformers) word embeddings to score the similarity between documents. The resulting similarity matrix is then passed to a clustering algorithm (HDBSCAN or spectral clustering).
This would probably work well if the subjects were dissimilar (e.g. World War 2, fruit, sustainable energy).
But all the texts are from the same overall subject.
For instance, the subject 'sustainable energy' could contain sub-subjects like 'wind energy', 'solar energy' and 'geothermal energy', which makes the texts considerably harder to separate.
Do you have any suggestions for techniques I could look into?
It doesn't have to be actual machine learning; a preprocessing, data-sorting or statistical technique would be fine too.
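For reference, here is a minimal sketch of the pipeline described above (embeddings, then a similarity matrix, then clustering). It assumes Python with NumPy and scikit-learn, which is not necessarily the tooling actually in use. The embeddings are synthetic placeholders standing in for per-document BERT vectors, built so that every document shares a large common component (the overall subject) plus a small sub-subject offset. The choice of `n_clusters=2` and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder embeddings (assumption): in the real pipeline each row would be
# one BERT vector per document. Two synthetic groups share a large common
# component plus a smaller group-specific offset and a little noise.
rng = np.random.default_rng(0)
dim = 16
shared = rng.normal(size=dim)      # stands in for the overall subject
offset_a = rng.normal(size=dim)    # stands in for sub-subject A
offset_b = rng.normal(size=dim)    # stands in for sub-subject B
embeddings = np.vstack(
    [shared + 0.5 * offset_a + 0.05 * rng.normal(size=dim) for _ in range(5)]
    + [shared + 0.5 * offset_b + 0.05 * rng.normal(size=dim) for _ in range(5)]
)

# Pairwise cosine similarity between all documents.
similarity = cosine_similarity(embeddings)

# Spectral clustering expects non-negative affinities, so clip to [0, 1].
affinity = np.clip(similarity, 0.0, 1.0)

# Cluster directly on the precomputed similarity matrix.
clusterer = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
)
labels = clusterer.fit_predict(affinity)

print(np.round(similarity, 2))
print(labels)
```

Because of the shared component, even documents from different groups have fairly high cosine similarity in this toy setup, which mirrors the difficulty with sub-subjects of a single overall subject.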
Is this a question about how to use Mathematica or the Wolfram Cloud for the task at hand? If so, it really needs more detail, for example, sample texts and an indication of a specific desired result.