Cleaning Up Messy Data & Preparing It For Classification/Predictions

Posted 10 years ago

Normally I use OpenRefine for cleaning up messy data. It has some cool features. The problem is that it is largely limited to spreadsheet/table formats, and almost everything I do is with large lists. I want to apply the following filters to the text of each item within a list (even a list of lists):

- Strip special characters (except {"x","y","z"})
- Strip out non-alphanumeric characters (except {"x","y","z"})
- Strip out punctuation (except {"x","y","z"})
- Strip out a list of stopwords (whole words)
- Require a list of keepwords (if contained in any part)

Here is what I have so far:

    bigList = {"item1", "item2", "item3", "item4", "item6", "item7", "item8", "item9", "item10", "item12", "item23", "item34", "item46", "item57", "item68", "item79"};
    excludedElements = {"item1", "item2", "item5", "item6"};
    includedElements = {"item8", "item9", "item79"};
    includeElementParts = {"8", "5", "4"};
    Select[Select[bigList, StringFreeQ[#, excludedElements] &], ! StringFreeQ[#, includedElements] &]
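One thing worth noting about the snippet above: StringFreeQ tests for substrings, so excluding "item1" also drops "item10" and "item12". A minimal sketch of the difference, reusing the lists from the post and assuming that stopwords should match whole elements while includeElementParts should match anywhere in the string (the result variable names are only illustrative):

```wolfram
bigList = {"item1", "item2", "item3", "item4", "item6", "item7", "item8",
   "item9", "item10", "item12", "item23", "item34", "item46", "item57",
   "item68", "item79"};
excludedElements = {"item1", "item2", "item5", "item6"};
includeElementParts = {"8", "5", "4"};

(* Substring test, as in the post: "item1" also removes "item10" and "item12" *)
bySubstring = Select[bigList, StringFreeQ[#, excludedElements] &]
(* {"item3", "item4", "item7", "item8", "item9", "item34", "item46", "item79"} *)

(* Whole-element test: only exact matches are removed *)
byWholeElement = Select[bigList, ! MemberQ[excludedElements, #] &]
(* {"item3", "item4", "item7", "item8", "item9", "item10", "item12",
    "item23", "item34", "item46", "item57", "item68", "item79"} *)

(* Then require at least one of the parts anywhere in the string *)
kept = Select[byWholeElement, ! StringFreeQ[#, includeElementParts] &]
(* {"item4", "item8", "item34", "item46", "item57", "item68"} *)
```

Which of the two tests is appropriate depends on whether an entry such as "item10" is meant to be affected by the stopword "item1".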

POSTED BY: David Johnston

2 Replies

Posted 10 years ago

I'm not sure that your example is all that representative of your actual problem, based on your description, but for whole words you could consider DeleteCases with a combination of Alternatives and Except. For stripping out characters, try something like StringReplace[..., (Alternatives @@ characterList) -> ""]
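A minimal sketch of how that suggestion might look in practice. The sample words, stopwords, keepwords, and character lists below are hypothetical and do not come from the thread; only the name characterList is taken from the reply.

```wolfram
(* Hypothetical sample data, for illustration only *)
words = {"the", "cat", "and", "the", "hat"};
stopwords = {"the", "and"};
keepwords = {"cat", "the"};

(* Delete whole elements that match any stopword *)
noStops = DeleteCases[words, Alternatives @@ stopwords]
(* {"cat", "hat"} *)

(* Delete everything except whole elements that match a keepword *)
onlyKeeps = DeleteCases[words, Except[Alternatives @@ keepwords]]
(* {"the", "cat", "the"} *)

(* Characters to strip: a hypothetical punctuation list minus the protected ones *)
characterList = Complement[{"!", "?", ",", "-"}, {"-"}];

(* StringReplace threads over a list of strings *)
cleaned = StringReplace[{"hello, world!", "x-ray?"},
  (Alternatives @@ characterList) -> ""]
(* {"hello world", "x-ray"} *)
```

DeleteCases works at level 1 by default, so for a list of lists it would need a level specification or a Map over the sublists.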

POSTED BY: Mike Honeychurch

Posted 10 years ago

Yeah, that was the problem. After days of research and reading everything possible, I was still unable to get code close to fitting my goals.

POSTED BY: David Johnston
