Cleaning Up Messy Data & Preparing It For Classification/Predictions

Normally I use OpenRefine for cleaning up messy data. It has some useful features, but it is largely limited to spreadsheet/table formats, and almost everything I do involves large lists.

I want to do the following filtering on the text of each list item within a list (even a list of lists):

  • Strip special characters (except {"x","y","z"} )
  • Strip out non-alphanumeric characters (except {"x","y","z"} )
  • Strip out punctuation (except {"x","y","z"} )
  • Strip out list of stopwords (whole words)
  • Require a list of keepwords (if contained in any part)

Here is what I got so far:

    bigList = {"item1", "item2", "item3", "item4", "item6", "item7", "item8", "item9", "item10", "item12", "item23", "item34", "item46", "item57", "item68", "item79"};
    excludedElements = {"item1", "item2", "item5", "item6"};
    includedElements = {"item8", "item9", "item79"};
    includeElementParts = {"8", "5", "4"};

    (* Inner Select drops items containing any excluded string; *)
    (* outer Select keeps only items containing an included string *)
    Select[
      Select[bigList, StringFreeQ[#, excludedElements] &],
      ! StringFreeQ[#, includedElements] &]
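One thing worth checking in the snippet above: StringFreeQ tests for substrings, so excluding "item1" also excludes "item10" and "item12". The following is a minimal sketch of the difference between substring matching and whole-item matching, using a small hypothetical list (`sampleList` and `sampleExcluded` are illustrative names, not part of the original data):

```mathematica
(* Hypothetical sample data, for illustration only *)
sampleList = {"item1", "item10", "item8", "item79"};
sampleExcluded = {"item1"};

(* Substring test: removes every item that contains "item1" anywhere, *)
(* so both "item1" and "item10" are dropped *)
Select[sampleList, StringFreeQ[#, sampleExcluded] &]

(* Whole-item test: removes only items that are exactly equal to an excluded string, *)
(* so only "item1" is dropped *)
DeleteCases[sampleList, Alternatives @@ sampleExcluded]
```

Which of the two is appropriate depends on whether the exclusion list is meant as whole entries or as fragments.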
POSTED BY: David Johnston
2 Replies

I'm not sure your example is all that representative of your actual problem, based on your description, but for whole words you could consider DeleteCases with a combination of Alternatives and Except. For stripping out characters, try something like StringReplace[..., (Alternatives @@ characterList) -> ""]
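To make that a bit more concrete, here is a rough sketch of how those pieces might fit together for the filtering steps listed in the question. The sample strings, the word lists, and the helper names (`stripChars`, `dropStopwords`, `cleanItem`) are all assumptions made up for illustration, and StringRiffle requires version 10.1 or later:

```mathematica
(* Hypothetical sample data and word lists, chosen only for illustration *)
sampleItems = {"the quick, brown fox!", "a lazy dog?", "x-ray of the fox"};
keepChars = {"-"};               (* characters exempt from stripping *)
stopwords = {"the", "a", "of"};  (* removed only as whole words *)
keepwords = {"fox", "ray"};      (* an item is kept if it contains any of these *)

(* Delete every character that is not a letter, digit, whitespace, or in keepChars *)
allowed = Alternatives @@ Join[{LetterCharacter, DigitCharacter, WhitespaceCharacter}, keepChars];
stripChars[s_String] := StringReplace[s, Except[allowed] -> ""]

(* Split on whitespace, drop whole-word stopwords, and rejoin with single spaces *)
dropStopwords[s_String] :=
  StringRiffle[DeleteCases[StringSplit[s], Alternatives @@ stopwords]]

cleanItem[s_String] := dropStopwords[stripChars[s]]

(* Level {-1} reaches every string, so the cleaning step also works on a list of lists *)
cleaned = Map[cleanItem, sampleItems, {-1}];

(* Keep only items containing at least one keepword as a substring *)
Select[cleaned, ! StringFreeQ[#, keepwords] &]
```

The stopword match here is case-sensitive, and the final Select assumes a flat list; for nested lists you would need to map the Select over the sublists instead.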

POSTED BY: Mike Honeychurch

Yeah, that was the problem. After days of research and reading everything possible, I was still unable to get code close to fitting my goals.

POSTED BY: David Johnston