Group Abstract

Message Boards

WOLFRAM COMMUNITY

9.3K Views

4 Replies

4 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Quickly clean up missing values from hundreds of XLS files?

Tsai Ming-Chou

Tsai Ming-Chou, National Defense Medical Center

Posted 7 years ago

I have hundreds of XLS files that need to be aggregated, but I find that each file has more or less missing values or error messages. Is there any way to quickly clean up these missing values or error messages? Because I still don't know how to clean up a lot of data, so the following code only extract the data in different files. (* Get a list of files ) dir = "D:\\Data"; fileList = FileNames["", dir]; For[i = 1, i = Length[fileList], i++, If[i == 1, rawdata = Import[fileList[[i]], {"XLS", "Data"}][[1]]; (i=1 olny ) , { tempData = Rest@Import[fileList[[i]], {"XLS", "Data"}][[1]]; (i != 1) rawdata = AppendTo[rawdata, tempData]; } ] ] Attachments: Annan_2012.xls Banqiao_2012.xls Cailiao_2012.xls

POSTED BY: Tsai Ming-Chou

4 Replies

Sort By:

Tsai Ming-Chou

Tsai Ming-Chou, National Defense Medical Center

Posted 7 years ago

Thank you very much for your demonstration. I will follow this idea to practice.

POSTED BY: Tsai Ming-Chou

Neil Singer

Neil Singer, AC Kinetics, Inc.

Posted 7 years ago

Tsai, You should use Map to Import all your files instead of For dir = "D:\\Data"; fileList = FileNames["*", dir]; data = Map[Import[#, {"XLS", "Data"}]&,fileList]; You can then flatten the list one level if you want it in one list instead of a list of lists (I do not know how you will be using it). What do you mean by "Clean up" data? For example if a value is "15#" should that be deleted or become "15" or something else? What should happen to blank data? Do you want to make them all the Mathematica symbol Missing[]? Regards, Neil

POSTED BY: Neil Singer

Tsai Ming-Chou

Tsai Ming-Chou, National Defense Medical Center

Posted 7 years ago

In the attachment, the field names are 01 to 24, representing hourly measurement data. I want to mark the contents of these fields as the value of "blank data" uniformly marked as "NA" and delete characters (for example, "15#" is marked as 15). Since the data is aggregated and will calculate on a time-by-time basis, data from different files should flatten into a significant data set for subsequent matrix operations. data = Flatten[Map[Import[#, {"XLS", "Data"}] &, fileList], 2];

POSTED BY: Tsai Ming-Chou

Mitch Stonehocker

Mitch Stonehocker, Christy Health

Posted 7 years ago

I uploaded the files but have not bothered to open them in EXCELL. Rather, I import them directly into Mathematica and start 'blind'. This loads all the files into a list of files. ClearAll[rawFiles] rawFiles[S1] = Import["c:\\" <> "your path" <> "\\.xls"] When loading and wrangling unfamiliar data I always use the name[Sn] form where 'S' is step so rawFiles[S1] means rawFiles[step1] rawFiles[S1] // Dimensions {3, 1} The dimensions above tell me there is an extra bracket ({ }) - typical of .xls imports. I'll come back to this later. I want to see the column headers (if any) along with the first couple rows of data. rawFiles[S1][[1, 1, 1 ;; 3]] // Grid rawFiles[S1][[2, 1, 1 ;; 3]] // Grid rawFiles[S1][[3, 1, 1 ;; 3]] // Grid output not shown* It looks like the column headers are the same for all the files but I want to make sure. rawFiles[S1][[1, 1, 1]] == rawFiles[S1][[2, 1, 1]] == rawFiles[S1][[3, 1, 1]] True Now I know it's safe to take the column headers of the first file for all subsequent files. So I can take all rows for the first file and all rows accept the first row of the remaining files. rawFiles[S2] = Join[{rawFiles[S1][[1, 1, 1 ;;]], rawFiles[S1][[2, 1, 2 ;;]], rawFiles[S1][[3, 1, 2 ;;]]}]; Now I get ride of the extra { }'s and generate a single set of column x row date that seems to be flat and nicely dimensioned. Flatten[rawFiles[S2], 1] // Dimensions rawFiles[S3] = Flatten[rawFiles[S2], 1]; {19729, 27} At this point I want to create a Dataset[ ] object to leverage the capabilities and flexibility that's available. I start by taking the first row (the headers) and thread if over the remaining rows. colHeaders = First[rawFiles[S3]] {"Date", "Position", "Measuring", "01", "02", "03", "04", "05", "06", \ "07", "08", "09", "10", "11", "12", "13", "14", "15", "16", "17", \ "18", "19", "20", "21", "22", "23", "24"} dataDS[S1] = Dataset[AssociationThread[colHeaders -> #] & /@ Rest[rawFiles[S3]]] Dataset output not shown Now we can perform rigorous data wrangling in a straightforward - rule based - way. I'm going to apply some of my 'best practices' to what I see - but I'm not going to do an exhaustive inspection and data wrangle. I'll leave that for you ;) First, I want to change all dates to DateObject[ ]'s. dataDS[S2] = dataDS[S1][All, Query[Thread[{"Date"} -> DateObject ] ]]; Next, I always change string entries like "NR" and symbolic entries like "Null" into Missing[ ] when the type of data is known. That is, if a column should always be a number but sometimes data is missing, then make sure all data in a column has head Number or Missing. This way, in computation you know the data is directly computable or missing. There are many downstream benefits that accrue when doing analysis and computation. Look up Missing[ ] in help. Here's how I go about cleaning up all the "NR"'s in one step. For the same reason I don't want "" (blank spaces) in columns that otherwise would be numerical. Also, (assumption based on reading this post), data of this form, '..d#' should be '..d' and be understood to be a number. I add three replacement rules. dataDS[S3] = dataDS[S2][All, All, Replace[{ "NR" :> Missing["imported as NR"], "" :> Missing["imported as ''"] }]]; And, finally I deal with the n# cases. dataDS[S4] = dataDS[S3][All, All, Replace[{ v_DateObject :> v, v_Number :> v, v_String :> If[ StringTake[v, -1] == "#", ToExpression[StringDrop[v, -1]], v] }]]; dataDS[S4] not shown dataDS[S4][Transpose, Drop[colHeaders, 3]][All, Select[! MissingQ[#] &]][All, (Histogram[#] &)] histograms not shown There's loads more data inspection that can be done given your goals, but after the inspection your ready for computation, which I will leave to you... Attachments: post2.nb

POSTED BY: Mitch Stonehocker

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback