
# NoSQL data: catching typos in human coded data in a few lines

Posted 8 years ago
Here's an example of the extremely compact code that can be written in Mathematica for real-world data analysis by leveraging operations on nested Associations.

We start with a set of human-coded clickstream samples, obtained from human review of video of EHR (Electronic Health Record) activity during real visits, and we need to catch some inevitable typos. The following table shows a fragment of the data: the top-level keys (Q115 etc.) are visit identifiers, and the second-level keys are Quantity objects giving the time of a mouse click. The associated values are tags screen-scraped from the computer UI by human coders to indicate the computer workflow. Call this nested table data[a].

First, since the timestamps are not needed in this example, we remove them with Values and perform basic normalization: lower-casing, splitting on " - ", and trimming. (stringSplit is not a built-in symbol; it is presumably a user-defined operator form of StringSplit whose definition is not shown, since the built-in StringSplit[" - "] would split the string " - " itself at whitespace.)

    data[b] = data[a][All, Values /* ToLowerCase /* stringSplit[" - "] /* Map[StringTrim]]

For example:

    data[b][2] // Normal (* only a fragment shown *)

Next, form an association tally of all the tags across all visits using the associationTally helper function:

    associationTally = Query[Tally /* SortBy[Last] /* Reverse /* Map[Apply[Rule]] /* Association]

Create the tally:

    tally = data[b][Catenate /* Flatten][associationTally] // Normal (* fragment shown *)

Now the fun part, which inverts the visit-to-tags table into a tag-to-visits table:

    data[c] = data[b][All, Flatten /* Union][
        KeyValueMap[List /* Reverse /* Apply[Rule] /* Thread /* Association]][
        Merge[Join]][KeySortBy[tally] /* Reverse][All, {Length, Identity}]

This processed table, sorted by the original tally, is ready for export to a Google Sheet for the human coders to review, and it indicates in which files each term is found, e.g.

    data[c][{400}] // Normal

    <|"satisfied on" -> {2, {"Q122", "Q141"}}|>

Granted, this still requires human review of a long list, but imagine doing this basic housekeeping with Python or R.
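To try the same pipeline without the original data, here is a minimal sketch on a made-up two-visit sample. Everything in it that is not in the post is an assumption: the toy `raw` data, the integer timestamps standing in for Quantity objects, the `stringSplit` helper definition, and the names `norm`, `invert` and `result`. It also works on plain nested Associations rather than a Dataset, writes `associationTally` as an ordinary composition instead of wrapping it in Query, and replaces the final `{Length, Identity}` step with an equivalent pure function.

```wolfram
(* Hypothetical helper: operator form of StringSplit; the post does not show its definition *)
stringSplit[sep_][s_] := StringSplit[s, sep];

(* Made-up toy data: visit ID -> <|click time -> tag|>; integers stand in for Quantity timestamps *)
raw = <|
   "Q1" -> <|1 -> "Orders - Labs", 5 -> "Notes - Satisfied on "|>,
   "Q2" -> <|2 -> "orders - labs", 9 -> "Notes - Satisified on"|>|>;

(* Drop timestamps, lower-case, split on " - ", trim each piece *)
norm = Query[All, Values /* ToLowerCase /* stringSplit[" - "] /* Map[StringTrim]][raw];

(* Tag -> count over all visits, most frequent first *)
associationTally = Tally /* SortBy[Last] /* Reverse /* Map[Apply[Rule]] /* Association;
tally = associationTally[Flatten[Values[norm]]];

(* Invert visit -> tags into tag -> {number of visits, list of visits} *)
invert = Query[All, Flatten /* Union] /*
   KeyValueMap[List /* Reverse /* Apply[Rule] /* Thread /* Association] /*
   Merge[Join] /*
   KeySortBy[tally[#] &] /* Reverse /*
   Map[{Length[#], #} &];
result = invert[norm];

(* The misspelled tag occurs in only one visit *)
result["satisified on"]
```

On this toy sample the last line should return `{1, {"Q2"}}`, so the misspelling stands out as a rare tag next to its correctly spelled neighbour and points straight at the visit to fix.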