Message Boards Message Boards

Quickly clean up missing values from hundreds of XLS files?

I have hundreds of XLS files that need to be aggregated, but I find that each file has more or less missing values or error messages. Is there any way to quickly clean up these missing values or error messages? Because I still don't know how to clean up a lot of data, so the following code only extract the data in different files.

(*   Get a list of files   *)
dir = "D:\\Data";
fileList = FileNames["*", dir];
For[i = 1, i = Length[fileList], i++,
 If[i == 1,
      rawdata = Import[fileList[[i]], {"XLS", "Data"}][[1]];    (*i=1 olny *)
     ,
  {
   tempData = Rest@Import[fileList[[i]], {"XLS", "Data"}][[1]];        (*i != 1*)
   rawdata = AppendTo[rawdata, tempData];
   }
  ]
 ]
POSTED BY: Tsai Ming-Chou
4 Replies

Thank you very much for your demonstration. I will follow this idea to practice.

POSTED BY: Tsai Ming-Chou

Tsai,

You should use Map to Import all your files instead of For

dir = "D:\\Data";
fileList = FileNames["*", dir];
data =  Map[Import[#, {"XLS", "Data"}]&,fileList];

You can then flatten the list one level if you want it in one list instead of a list of lists (I do not know how you will be using it).

What do you mean by "Clean up" data? For example if a value is "15#" should that be deleted or become "15" or something else? What should happen to blank data? Do you want to make them all the Mathematica symbol Missing[]?

Regards,

Neil

POSTED BY: Neil Singer

In the attachment, the field names are 01 to 24, representing hourly measurement data. I want to mark the contents of these fields as the value of "blank data" uniformly marked as "NA" and delete characters (for example, "15#" is marked as 15). Since the data is aggregated and will calculate on a time-by-time basis, data from different files should flatten into a significant data set for subsequent matrix operations.

data = Flatten[Map[Import[#, {"XLS", "Data"}] &, fileList], 2];
POSTED BY: Tsai Ming-Chou

I uploaded the files but have not bothered to open them in EXCELL. Rather, I import them directly into Mathematica and start 'blind'. This loads all the files into a list of files.

ClearAll[rawFiles]
rawFiles[S1] = Import["c:\\" <> "your path" <> "\\*.xls"]

When loading and wrangling unfamiliar data I always use the name[Sn] form where 'S' is step so rawFiles[S1] means rawFiles[step1]

rawFiles[S1] // Dimensions

{3, 1}

The dimensions above tell me there is an extra bracket ({ }) - typical of .xls imports. I'll come back to this later. I want to see the column headers (if any) along with the first couple rows of data.

rawFiles[S1][[1, 1, 1 ;; 3]] // Grid
rawFiles[S1][[2, 1, 1 ;; 3]] // Grid
rawFiles[S1][[3, 1, 1 ;; 3]] // Grid

output not shown

It looks like the column headers are the same for all the files but I want to make sure.

rawFiles[S1][[1, 1, 1]] == rawFiles[S1][[2, 1, 1]] == rawFiles[S1][[3, 1, 1]]

True

Now I know it's safe to take the column headers of the first file for all subsequent files. So I can take all rows for the first file and all rows accept the first row of the remaining files.

rawFiles[S2] = 
  Join[{rawFiles[S1][[1, 1, 1 ;;]], rawFiles[S1][[2, 1, 2 ;;]], 
    rawFiles[S1][[3, 1, 2 ;;]]}];

Now I get ride of the extra { }'s and generate a single set of column x row date that seems to be flat and nicely dimensioned.

Flatten[rawFiles[S2], 1] // Dimensions
rawFiles[S3] = Flatten[rawFiles[S2], 1];

{19729, 27}

At this point I want to create a Dataset[ ] object to leverage the capabilities and flexibility that's available. I start by taking the first row (the headers) and thread if over the remaining rows.

colHeaders = First[rawFiles[S3]]

{"Date", "Position", "Measuring", "01", "02", "03", "04", "05", "06", \ "07", "08", "09", "10", "11", "12", "13", "14", "15", "16", "17", \ "18", "19", "20", "21", "22", "23", "24"}

dataDS[S1] = 
 Dataset[AssociationThread[colHeaders -> #] & /@ Rest[rawFiles[S3]]]
  • Dataset output not shown

Now we can perform rigorous data wrangling in a straightforward - rule based - way. I'm going to apply some of my 'best practices' to what I see - but I'm not going to do an exhaustive inspection and data wrangle. I'll leave that for you ;)

First, I want to change all dates to DateObject[ ]'s.

dataDS[S2] = dataDS[S1][All, Query[Thread[{"Date"} -> DateObject ] ]];

Next, I always change string entries like "NR" and symbolic entries like "Null" into Missing[ ] when the type of data is known. That is, if a column should always be a number but sometimes data is missing, then make sure all data in a column has head Number or Missing. This way, in computation you know the data is directly computable or missing. There are many downstream benefits that accrue when doing analysis and computation. Look up Missing[ ] in help. Here's how I go about cleaning up all the "NR"'s in one step. For the same reason I don't want "" (blank spaces) in columns that otherwise would be numerical. Also, (assumption based on reading this post), data of this form, '..d#' should be '..d' and be understood to be a number. I add three replacement rules.

dataDS[S3] = dataDS[S2][All, All, Replace[{
     "NR" :> Missing["imported as NR"],
     "" :> Missing["imported as ''"]
     }]];

And, finally I deal with the n# cases.

dataDS[S4] = dataDS[S3][All, All, Replace[{
     v_DateObject :> v,
     v_Number :>  v,
     v_String :> 
      If[ StringTake[v, -1] == "#", ToExpression[StringDrop[v, -1]], v]
     }]];
  • dataDS[S4] not shown

    dataDS[S4][Transpose, Drop[colHeaders, 3]][All, 
      Select[! MissingQ[#] &]][All, (Histogram[#] &)]
    
  • histograms not shown

There's loads more data inspection that can be done given your goals, but after the inspection your ready for computation, which I will leave to you...

Attachments:
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract