
Cleaning Up Messy Data & Preparing It For Classification/Predictions

Normally I use OpenRefine for cleaning up messy data. It has some useful features, but it is largely limited to spreadsheet/table formats, and almost everything I do involves large lists.

I want to do the following filtering on the text of each list item within a list (even a list of lists):

  • Strip out a list of stopwords (whole words only)
  • Require a list of keepwords (matched if contained in any part of the item)

Here is what I have so far:

    bigList = {"item1", "item2", "item3", "item4", "item6", "item7", "item8", "item9", "item10", "item12", "item23", "item34", "item46", "item57", "item68", "item79"};
    excludedElements = {"item1", "item2", "item5", "item6"};
    includedElements = {"item8", "item9", "item79"};
    includeElementParts = {"8", "5", "4"};

    Select[
      Select[bigList, StringFreeQ[#, excludedElements] &],
      ! StringFreeQ[#, includedElements] &]
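
One thing to note about the attempt above: `StringFreeQ` tests for substrings, so excluding "item1" also drops "item10" and "item12", and `includeElementParts` is never used. A minimal sketch of the two stated goals, assuming "whole words" means exact matches on list items and "contained in any part" means substring matches, might look like this (the names `kept` and `result` are only illustrative):

```wolfram
bigList = {"item1", "item2", "item3", "item4", "item6", "item7", "item8",
   "item9", "item10", "item12", "item23", "item34", "item46", "item57",
   "item68", "item79"};
excludedElements = {"item1", "item2", "item5", "item6"};
includeElementParts = {"8", "5", "4"};

(* Whole-item exclusion: only exact matches are removed, so "item10" survives *)
kept = DeleteCases[bigList, Alternatives @@ excludedElements];

(* Substring inclusion: keep items that contain any of the parts *)
result = Select[kept, StringContainsQ[#, Alternatives @@ includeElementParts] &]
(* {"item4", "item8", "item34", "item46", "item57", "item68"} *)
```

Doing the exact-match step with `DeleteCases` and the substring step with `StringContainsQ` keeps the two kinds of matching separate, which is the distinction the bullet list asks for.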
    
    POSTED BY: David Johnston
    2 Replies

    I'm not sure your example is representative of your actual problem as you describe it, but for whole words you could consider DeleteCases with a combination of Alternatives and Except. For stripping out characters, try something like StringReplace[..., (Alternatives @@ characterList) -> ""].

    POSTED BY: Mike Honeychurch
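
For readers following along, here is a rough sketch of how that suggestion could be applied to free text rather than single-token items. The sample strings, the word lists, and the helper name `stripStopwords` are all hypothetical and do not come from the thread; it also assumes that words are separated by whitespace.

```wolfram
(* Hypothetical sample data, not taken from the original post *)
texts = {"the quick brown fox", "a lazy dog", "the catfish and the cat"};
stopwords = {"the", "a", "and"};
keepParts = {"cat", "fox"};

(* Split on whitespace, delete whole-word stopwords, rejoin with spaces *)
stripStopwords[s_String] :=
  StringRiffle[DeleteCases[StringSplit[s], Alternatives @@ stopwords]];

stripped = stripStopwords /@ texts;
(* {"quick brown fox", "lazy dog", "catfish cat"} *)

(* Keep only strings that contain at least one keep-part anywhere *)
Select[stripped, StringContainsQ[#, Alternatives @@ keepParts] &]
(* {"quick brown fox", "catfish cat"} *)

(* Stripping individual characters, in the form suggested in the reply *)
StringReplace["a,b;c", (Alternatives @@ {",", ";"}) -> ""]
(* "abc" *)

(* For a list of lists, map over the string leaves at level {-1} *)
Map[stripStopwords, {{"the fox", "a dog"}, {"the cat"}}, {-1}]
(* {{"fox", "dog"}, {"cat"}} *)
```

Because the stopwords are deleted after splitting into words, "a" is removed as a word but the letter "a" inside "lazy" or "catfish" is left alone.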

    Yeah, that was the problem. After days of research and reading everything possible, I was still unable to get code close to fitting my goals.

    POSTED BY: David Johnston