GROUPS: Data Science, Software Development, Natural Language Processing

Posted 7 years ago

I'm trying to cluster texts. Currently I'm using BERT (Bidirectional Encoder Representations from Transformers) word embeddings to score the similarity between documents. The resulting similarity matrix is then run through a clustering algorithm (HDBSCAN or spectral clustering). This would probably work well if the subjects were dissimilar (World War 2, fruit, sustainable energy), but all of the texts come from the same overall subject. For instance, the subject 'sustainable energy' could contain sub-subjects like 'wind energy', 'solar energy' and 'geothermal energy', which makes them considerably harder to separate. Do you have any suggestions for techniques I could look into? It doesn't have to be machine learning as such; a preprocessing, data-sorting, or statistical technique would also be welcome.
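To make the setup concrete, here is a minimal sketch of the pipeline described above (embeddings, then a similarity matrix, then clustering), using only the Python standard library. Everything in it is an assumption for illustration: the toy vectors stand in for real BERT embeddings, their first component is a hypothetical "shared topic" direction, and a simple threshold-based connected-components grouping stands in for HDBSCAN or spectral clustering. It is not the code actually used for this question.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def similarity_matrix(vectors):
    """Pairwise cosine similarities as a list of lists."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

def threshold_clusters(sim, threshold):
    """Stand-in for HDBSCAN / spectral clustering (assumption):
    link items whose similarity is >= threshold and label the
    connected components 0, 1, 2, ... in order of first appearance."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Hypothetical toy "embeddings". The large first component plays the role
# of the shared subject (sustainable energy); the small remaining
# components play the role of the sub-subject (wind, solar, geothermal).
docs = {
    "wind_1":  [5.0, 1.0, 0.0, 0.0],
    "wind_2":  [5.0, 0.9, 0.1, 0.0],
    "solar_1": [5.0, 0.0, 1.0, 0.0],
    "solar_2": [5.0, 0.1, 0.9, 0.0],
    "geo_1":   [5.0, 0.0, 0.0, 1.0],
}
names = list(docs)
sim = similarity_matrix([docs[name] for name in names])

labels_loose = threshold_clusters(sim, 0.95)  # threshold below every pairwise similarity
labels_tight = threshold_clusters(sim, 0.99)  # threshold between within- and cross-group similarities

for name, loose, tight in zip(names, labels_loose, labels_tight):
    print(name, loose, tight)
```

In this toy data every pairwise similarity is above 0.96 because of the shared component, so the 0.95 threshold puts all five documents in one cluster, while the 0.99 threshold separates the wind, solar and geothermal documents. That narrow band of usable thresholds is the difficulty described above.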

POSTED BY: Mathias Immerkjær

Posted 7 years ago

Is this a question about how to use Mathematica or the Wolfram Cloud for the task at hand? If so, it really needs more detail, for example, sample texts and an indication of a specific desired result.

POSTED BY: Daniel Lichtblau
