help cleaning up a list

Posted 10 years ago

Hello all, I am a new Mathematica user and I very much like the power of the platform for data analysis and visualization. I've run into a problem importing data from an instrument with limited formatting capabilities. The system places dashes in missing spaces and I need to strip them away. The problem is that the data comes in as a list of lists, and some of the lists are all dashes and need to be removed. A very simplified example would be:

{{1,2,3,---},{4,5,---},{---,---,---},{7,8,9,10,---}}

Which I would like to reduce to:

{{1,2,3},{4,5},{7,8,9,10}}

I can handle the trailing dashes in the first, second, and fourth list, but removing the set that is all dashes has me stumped. And since the actual data is composed of over 100 sets with anywhere from 10 to 1000 items, a generic automated method is a must. Any suggestions would be greatly appreciated. Thanks, Mike.

POSTED BY: Michael Marino
8 Replies

You sent me your data file and the example notebook and I took a look at them. The reason it appears that your data columns are appended is that you are looking at the result using TableForm. Remember that your original data is a rectangular array, that is, a list of lists of equal length. When you remove the dash strings using any of my methods above, the result is a non-rectangular array: the lists in the outermost list may each have a different length. When you use TableForm to display this on screen, each such list (a "row" in the TableForm display) is left-justified. This does not show the column structure; it just shows what is in each row. (In fact, by left-justifying the ragged array, TableForm is showing the correct index structure that you would need if you were using the Part function.) Since the array is no longer rectangular, the meaning of a column is ambiguous relative to the original rectangular data array, because the information on which column any item came from is lost.

Here is a simple example of a rectangular array with elements that we wish to remove.

test={{a, b, c}, {1, "XXX",2}, {e, f, g}, {"XXX",3,"XXX"}}

When one shows it using TableForm its rectangular structure is clear

TableForm[test]

[Image: TableForm display of test as a grid of 4 rows and 3 columns]

Now remove the "XXX" strings:

test1 = test /. "XXX" -> Sequence[]

Which gives the following non-rectangular array (and where the information on which column of the original array each element came from is lost):

{{a, b, c}, {1, 2}, {e, f, g}, {3}}

And here is how this looks when one uses TableForm on it

TableForm[test1]

[Image: TableForm display of test1, with the ragged rows left-justified]

For small arrays this behavior is pretty clear. But in your case you had very long rows, so the ragged right-hand side was not visible on your screen. You would have had to scroll the cell very far to the right to see this hint as to what was at the root of your concern.
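When the rows are too long to inspect by eye, the raggedness can be checked directly instead of relying on TableForm. A minimal sketch, reusing the small test and test1 arrays from above (redefined here so the block stands alone):

```wolfram
(* Same small example as above, with symbols a..g left unevaluated *)
test = {{a, b, c}, {1, "XXX", 2}, {e, f, g}, {"XXX", 3, "XXX"}};
test1 = test /. "XXX" -> Sequence[];

Length /@ test    (* {3, 3, 3, 3}: every row has the same length *)
Length /@ test1   (* {3, 2, 3, 1}: rows now differ in length *)

MatrixQ[test]     (* True: rectangular *)
MatrixQ[test1]    (* False: ragged *)
```

For a large imported data set, something like Tally[Length /@ data] would summarize how many rows have each length.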

If you want to retain the rectangular form of your array so that you can then use it in matrix calculations, the approach would be to replace the dash strings with an appropriate number rather than removing them.
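As a rough sketch of that idea, using the same small example: the choice of 0 as the stand-in value is only an assumption for illustration, and the right value depends on the calculation you intend to do.

```wolfram
test = {{a, b, c}, {1, "XXX", 2}, {e, f, g}, {"XXX", 3, "XXX"}};

(* Replace each placeholder string with a number instead of deleting it *)
padded = test /. "XXX" -> 0
(* {{a, b, c}, {1, 0, 2}, {e, f, g}, {0, 3, 0}} *)

Dimensions[padded]   (* {4, 3}: the array is still rectangular *)
```

If no number is appropriate, a symbolic marker such as Missing[] keeps the shape in the same way, at the cost of needing special handling in numerical functions.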

POSTED BY: David Reiss

Please see the function DeleteCases:

DeleteCases[{1, 2, 3, "---"}, "---"]
POSTED BY: Sean Clarke

Is the list that you obtain when it is imported into Mathematica this,

{{1,2,3,---},{4,5,---},{---,---,---},{7,8,9,10,---}}

or is it this:

{{1,2,3,"---"},{4,5,"---"},{"---","---","---"},{7,8,9,10,"---"}}

I.e., are the dashes imported as a string of dashes? Without them being strings, the expressions (of 3 dashes in a row) are not syntactically valid. So I will assume that they appear as strings. If so then the following will work for you:

In[1]:= test = {{1, 2, 3, "---"}, {4, 5, "---"}, {"---", "---", "---"}, {7, 8, 9, 10, "---"}}

In[2]:= DeleteCases[test /. "---" -> Sequence[], {}, Infinity]

Out[2]= {{1, 2, 3}, {4, 5}, {7, 8, 9, 10}}

Another way to do it might be

In[3]:= DeleteCases[DeleteCases[test, "---", Infinity], {}, Infinity]

Out[3]= {{1, 2, 3}, {4, 5}, {7, 8, 9, 10}}

And here's another....

In[4]:= (test /. "---" -> Sequence[]) /. {} -> Sequence[]

Out[4]= {{1, 2, 3}, {4, 5}, {7, 8, 9, 10}}
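For reuse across many data sets, the same two steps can be wrapped in a small function. This is only a sketch: the name stripPlaceholders and its optional marker argument are made up for illustration, and it assumes the data is a list of flat lists as in the example.

```wolfram
(* Remove the marker from each row, then drop rows that became empty *)
stripPlaceholders[data_List, marker_ : "---"] :=
  DeleteCases[DeleteCases[#, marker] & /@ data, {}]

stripPlaceholders[{{1, 2, 3, "---"}, {4, 5, "---"}, {"---", "---", "---"}, {7, 8, 9, 10, "---"}}]
(* {{1, 2, 3}, {4, 5}, {7, 8, 9, 10}} *)
```

Unlike the Infinity level specification used above, this version only touches the rows themselves, so nested lists inside a row would be left alone.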
POSTED BY: David Reiss

A straight Import of your data file may be what is causing the sub-lists to become appended. There are a few lower-level functions that give finer control over how the data is read from the file. These functions are named Read and ReadList.

To see where the sub-list confusion is coming from, start with ReadList of Record and check whether it gives you the expected data structure. If it does, then Import is getting confused. If it does not, then your instrument's recording software is omitting record separators in the data file.

Things to try:

ReadList[ file, Record]
ReadList[ file, Record, RecordSeparators-> {"\r\n", "\n", "\r"}]
ReadList[ file, Word]
ReadList[ file, Word, WordSeparators -> {", ", ",", " ", "\t"}]
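To make the Record check concrete, here is a small self-contained sketch. The file name and its comma-separated contents are invented for illustration and are not the instrument's actual format.

```wolfram
(* Write a tiny stand-in data file to the temporary directory *)
file = FileNameJoin[{$TemporaryDirectory, "instrument-sample.txt"}];
Export[file, "1,2,3,---\n4,5,---,---\n", "Text"];

(* One string per line of the file *)
records = ReadList[file, Record];

(* Split each record on commas; the items are still strings at this point *)
rows = StringSplit[#, ","] & /@ records;

(* One length per record; unexpected lengths point at a separator problem *)
Length /@ rows
```

If the lengths match what the instrument should have written, the file itself is fine and the surprise is coming from a later step.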

Read is like a microscope; it lets you pipe the data in from the file as a stream one chunk at a time.

Things to try:

str = OpenRead[ file]
Read[str, Word]
StreamPosition[str]

SetStreamPosition[str, \[Infinity]];
Read[str, Byte]
endoffile = StreamPosition[str]

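SetStreamPosition[str, 0];
(* Collect up to the first 256 bytes as characters, stopping at the end of the file *)
Reap[
  While[StreamPosition[str] < Min[256, endoffile],
   Sow[FromCharacterCode@Read[str, Byte]]]
  ][[2]]
Close[str]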
Posted 10 years ago

Of course! I feel silly for not realizing it sooner. TableForm had me thinking that the data was in columns when rows were the more accurate way to think about it. Transposing the original array before using the DeleteCases function has everything working perfectly now. Thanks again David for all your help!
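A minimal sketch of what that looks like, assuming each data set sits in a column of the imported rectangular array (the small data array here is made up for illustration):

```wolfram
(* Each column is one data set, padded with "---" to a common length *)
data = {{1, 4, "---", 7},
        {2, 5, "---", 8},
        {3, "---", "---", 9},
        {"---", "---", "---", 10}};

(* Transpose first so each data set becomes a row, then strip the dashes
   and drop any row that was all dashes *)
DeleteCases[DeleteCases[Transpose[data], "---", Infinity], {}]
(* {{1, 2, 3}, {4, 5}, {7, 8, 9, 10}} *)
```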

Chris, that is some useful information that I will definitely use in the future.

Best regards,

Mike.

POSTED BY: Michael Marino
Posted 10 years ago

Thanks, guys. I read up on DeleteCases and, using the examples provided above, was able to make some headway. David, you were correct about the strings, and I apologize for that oversight. Unfortunately, the function seems to be reorganizing the data in the sublists in an unusual way. If the first sublist contains 50 items, and a sublist further along contains 150, it is taking items 51-150 from the latter list and appending them to the end of the first sublist. The net result is a series of lists that decrease in length and have no dashes, but also bear little similarity to the original data structure. I can't seem to reproduce this with a smaller example, and several attempts at attaching the text file to this post have failed. I'm hoping that my admittedly poor description might ring a bell with someone who can point me in the right direction. I really appreciate the help. I'm trying to crack this myself, but I have to admit that I am at the limits of my current understanding. Thanks, Mike.

POSTED BY: Michael Marino

Hmmmm... it's hard to say without an example. It is possible that the import from the text file is creating a data structure that is somewhat different from what you are expecting. Does the issue happen with all 3 approaches that I suggested? And, if so, does it happen in the same way? How large is the file that shows the problem? If it is email-able (less than 5 meg) then email it to the address that is posted on the "contact" section of my website (found from clicking on my name here).

By the way, the reason you may have thought that the three dashes were not strings is that the string delimiters (the quotation marks) are suppressed by default in output cells. So if you execute this in an input cell

"I am a string"

The output cell will look like this

I am a string
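If in doubt, a few built-in functions will reveal whether something is a string. A short sketch:

```wolfram
s = "I am a string";

InputForm[s]   (* displays "I am a string", quotation marks included *)
StringQ[s]     (* True *)
Head["---"]    (* String *)
```

Wrapping the imported data in InputForm shows the quotation marks around every dash entry in the same way.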
POSTED BY: David Reiss

Great!

POSTED BY: David Reiss