Normally I use OpenRefine for cleaning up messy data. It has some useful features, but it is largely limited to spreadsheet/table formats, and almost everything I do involves large lists.
I want to do the following filtering on the text of each list item within a list (even a list of lists):
- Strip special characters (except {"x", "y", "z"})
- Strip out non-alphanumeric characters (except {"x", "y", "z"})
- Strip out punctuation (except {"x", "y", "z"})
- Strip out list of stopwords (whole words)
- Require a list of keepwords (if contained in any part)
Here is what I have so far:
bigList = {"item1", "item2", "item3", "item4", "item6", "item7", "item8", "item9", "item10", "item12", "item23", "item34", "item46", "item57", "item68", "item79"};
excludedElements = {"item1", "item2", "item5", "item6"};
includedElements = {"item8", "item9", "item79"};
includeElementParts = {"8", "5", "4"};
Select[
  Select[bigList, StringFreeQ[#, excludedElements] &],
  ! StringFreeQ[#, includedElements] &]
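One thing to note about the attempt above: `StringFreeQ` matches substrings, so excluding "item1" also drops "item10" and "item12", which is not whole-word behavior. Below is a minimal sketch of how the five filters could fit together, under a few assumptions. The helper names (`cleanString`, `dropStopwords`, `processItem`, `processList`) and the sample data (`messy`, `keepChars`, `stopwords`, `keepParts`) are hypothetical and only for illustration. A word is assumed to be a whitespace-delimited token, matching is case-sensitive, and the three character-stripping rules are collapsed into one step that keeps letters, digits, spaces, and the listed exceptions.

```mathematica
(* Keep letters, digits, spaces, and any explicitly allowed characters. *)
cleanString[s_String, keep_List] :=
  StringJoin[
    Select[Characters[s],
      LetterQ[#] || DigitQ[#] || # === " " || MemberQ[keep, #] &]];

(* Remove stopwords as whole, whitespace-delimited words (case-sensitive). *)
dropStopwords[s_String, stop_List] :=
  StringRiffle[DeleteCases[StringSplit[s], Alternatives @@ stop], " "];

(* Clean a single item: strip characters first, then drop stopwords. *)
processItem[s_String, keep_List, stop_List] :=
  dropStopwords[cleanString[s, keep], stop];

(* Apply processItem to every string at any depth, then delete the
   strings that contain none of the required parts. *)
processList[list_List, keep_List, stop_List, parts_List] :=
  DeleteCases[
    Map[processItem[#, keep, stop] &, list, {-1}],
    s_String /; StringFreeQ[s, parts],
    Infinity];

(* Illustrative data, not from the question. *)
messy = {"The quick, brown fox!",
   {"An item_8 (draft)", "nothing of note"},
   "well-known item 79?"};
keepChars = {"-", "_"};
stopwords = {"The", "An", "of"};
keepParts = {"8", "79", "fox"};

result = processList[messy, keepChars, stopwords, keepParts]
(* {"quick brown fox", {"item_8 draft"}, "well-known item 79"} *)

(* The same idea on the lists from the question: exclude whole items
   only, and require at least one of the parts. *)
bigList = {"item1", "item2", "item3", "item4", "item6", "item7",
   "item8", "item9", "item10", "item12", "item23", "item34",
   "item46", "item57", "item68", "item79"};
excludedElements = {"item1", "item2", "item5", "item6"};
includeElementParts = {"8", "5", "4"};

filtered = Select[bigList,
  ! MemberQ[excludedElements, #] && ! StringFreeQ[#, includeElementParts] &]
(* {"item4", "item8", "item34", "item46", "item57", "item68"} *)
```

`Map` at level `{-1}` reaches the strings regardless of nesting depth, and `DeleteCases` with level `Infinity` removes the failing strings while leaving the list structure intact. This assumes every leaf is a string; other atoms such as numbers would need their own handling.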