Cleaning Up Messy Data & Preparing It For Classification/Predictions

Normally I use OpenRefine for cleaning up messy data. It has some useful features, but it is largely limited to spreadsheet/table formats, and almost everything I do involves large lists.

I want to do the following filtering on the text of each list item within a list (even a list of lists):

  • Strip special characters (except {"x","y","z"} )
  • Strip out non-alphanumeric characters (except {"x","y","z"} )
  • Strip out punctuation (except {"x","y","z"} )
  • Strip out list of stopwords (whole words)
  • Require a list of keepwords (if contained in any part)

Here is what I have so far:

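    bigList = {"item1", "item2", "item3", "item4", "item6", "item7", "item8", "item9", "item10", "item12", "item23", "item34", "item46", "item57", "item68", "item79"};
    excludedElements = {"item1", "item2", "item5", "item6"};
    includedElements = {"item8", "item9", "item79"};
    includeElementParts = {"8", "5", "4"};

    (* Drop items containing any excluded string, then keep only items containing an included string. *)
    (* StringFreeQ tests substrings, so "item1" also excludes "item10" and "item12". *)
    Select[
      Select[bigList, StringFreeQ[#, excludedElements] &],
      ! StringFreeQ[#, includedElements] &]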
POSTED BY: David Johnston
I'm not sure that your example is all that representative of your actual problem, based on your description (??), but for whole words you could consider DeleteCases with a combination of Alternatives and Except. For stripping out characters, try something like StringReplace[..., (Alternatives @@ characterList) -> ""]
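
A rough sketch of how those pieces might fit together is below. It is only one reading of the goals in the question, not a tested solution for real data. The sample strings and the names items, keepChars, stopwords, keepwords, stripChars, stripStopwords and clean are all made up for illustration. It also assumes that "whole words" means whitespace-separated tokens and that matching is case-sensitive.

```wolfram
(* Hypothetical sample data and settings, for illustration only *)
items = {"the quick, brown fox!", {"a lazy-dog (sleeps)", "cats & birds: the sequel"}};
keepChars = {"-", "&"};      (* characters exempt from stripping *)
stopwords = {"the", "a"};    (* removed only when they appear as whole words *)
keepwords = {"fox", "dog"};  (* an item survives if it contains any of these *)

(* Delete every character that is not a letter, digit, whitespace, or in keepChars *)
allowed = Alternatives @@ Join[
    {LetterCharacter, DigitCharacter, WhitespaceCharacter}, keepChars];
stripChars[s_String] := StringReplace[s, Except[allowed] -> ""];

(* Split on whitespace, drop exact (case-sensitive) stopword matches, rejoin with spaces *)
stripStopwords[s_String] :=
  StringRiffle[DeleteCases[StringSplit[s], Alternatives @@ stopwords]];

clean[s_String] := stripStopwords[stripChars[s]];

(* Apply clean to every string at any depth of the nested list *)
cleaned = items /. s_String :> clean[s];

(* Remove strings, at any level, that contain none of the keepwords *)
result = DeleteCases[cleaned, s_String /; StringFreeQ[s, keepwords], Infinity]
```

Because the replacement rule and DeleteCases both work on strings at any level, the same code handles a flat list or a list of lists. If case should not matter, applying ToLowerCase to each string before cleaning is one option.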

POSTED BY: Mike Honeychurch

Yeah, that was the problem. After days of research and reading everything possible, I was still unable to get code close to fitting my goals.

POSTED BY: David Johnston
