NoSQL data: catching typos in human coded data in a few lines

Posted 6 years ago

Here's an example of the extremely compact code that can be written in Mathematica for real-world data analysis, leveraging operations on nested Associations.

Starting with a given set of human-coded clickstream samples, obtained from human review of video of EHR (Electronic Health Record) activity during real visits, we need to catch some inevitable typos.

The following table shows a fragment of the data. The top-level keys (Q115, etc.) are the visit identifiers, and the second-level keys are Quantity objects giving the time of each mouse click. The associated values are tags that human coders screen-scraped from the computer UI to indicate the computer workflow.

Call this nested table data[a]

[Image: fragment of the nested table data[a]]
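To make the shape of the data concrete, here is a minimal sketch of such a nested Association. The visit IDs match the style used in the post, but the timestamps, tags, and the names sample and toy are invented for illustration, and wrapping the result in Dataset is an assumption based on the query syntax used below.

```wolfram
(* Hypothetical sample: timestamps and tags are invented for illustration *)
sample = <|
   "Q115" -> <|
     Quantity[3, "Seconds"] -> "Chart Review - Notes",
     Quantity[12, "Seconds"] -> "Orders - Meds"|>,
   "Q122" -> <|
     Quantity[5, "Seconds"] -> "Chart Review - Labs"|>|>;

(* Assumption: the real data[a] is a Dataset wrapping a structure like this *)
toy = Dataset[sample];

Keys[sample]             (* {"Q115", "Q122"}: the visit identifiers *)
Keys[sample["Q115"]]     (* the Quantity timestamps for one visit *)
Values[sample["Q115"]]   (* {"Chart Review - Notes", "Orders - Meds"} *)
```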

First, since the timestamps are not needed in this example, we drop them with Values and apply basic normalization: convert to lower case, split each string on " - ", and trim whitespace:

data[b] = data[a][All, Values /* ToLowerCase /* stringSplit[" - "] /* Map[StringTrim]]
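Note that stringSplit (lower case) is not a built-in symbol, and the built-in StringSplit has no operator form: StringSplit[" - "] would simply split the literal string " - " on whitespace. The post does not show the helper's definition, so the following is only a sketch, assuming it is a curried wrapper around StringSplit. The sample tags and the name normalize are hypothetical.

```wolfram
(* Assumed definition, not shown in the original post *)
stringSplit[pattern_][s_] := StringSplit[s, pattern]

(* The per-visit operator chain used in the query above *)
normalize = Values /* ToLowerCase /* stringSplit[" - "] /* Map[StringTrim];

(* Hypothetical tags standing in for one visit's timestamp -> tag pairs *)
normalize[<|1 -> "Chart Review - Notes ", 2 -> "ORDERS - Meds"|>]
(* {{"chart review", "notes"}, {"orders", "meds"}} *)
```

An equivalent choice would be to inline a pure function, StringSplit[#, " - "] &, in place of the helper.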

For example,

data[b][2] // Normal (* only a fragment shown *)

[Image: output fragment of data[b][2]]

Next, form an association tally of all the tags across all visits using the associationTally helper function:

associationTally = Query[Tally /* SortBy[Last] /* Reverse /* Map[Apply[Rule]] /* Association]
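As a quick check of what the helper does, it can be applied to a small plain list. The tags below are hypothetical and include one deliberate misspelling; rare entries at the bottom of the tally are the likely typos.

```wolfram
associationTally =
  Query[Tally /* SortBy[Last] /* Reverse /* Map[Apply[Rule]] /* Association];

(* Hypothetical tags, including the misspelling "ordres" *)
associationTally[{"orders", "notes", "orders", "ordres", "orders", "notes"}]
(* <|"orders" -> 3, "notes" -> 2, "ordres" -> 1|> *)
```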

Create the tally:

tally = data[b][Catenate /* Flatten][associationTally] // Normal (* fragment shown *)

[Image: fragment of the tally]

Now the fun part:

data[c] = data[b][All, Flatten /* Union][KeyValueMap[List /* Reverse /* Apply[Rule] /* Thread /* Association]][Merge[Join]][KeySortBy[tally] /* Reverse][All, {Length, Identity}]

[Image: fragment of the resulting table]
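The one-liner above packs several steps together. The following sketch unrolls them on plain Associations rather than a Dataset, under some simplifications: the data are hypothetical, Counts stands in for the tally, and Merge uses Identity (which here gives the same result as Join applied to a single list of values).

```wolfram
(* Hypothetical normalized data: visit -> list of split tags *)
visits = <|
   "Q115" -> {{"orders", "meds"}, {"notes"}},
   "Q122" -> {{"orders"}, {"notse"}}|>;

(* Tag frequencies across all visits (stand-in for the tally) *)
counts = Counts[Flatten[Values[visits]]];

(* Step 1: unique tags per visit *)
perVisit = Map[Flatten /* Union, visits];

(* Step 2: turn each visit into an Association of tag -> visit *)
inverted = KeyValueMap[Association[Thread[#2 -> #1]] &, perVisit];

(* Step 3: merge, collecting all visits in which each tag occurs *)
merged = Merge[inverted, Identity];

(* Step 4: most frequent tags first, then prepend the number of visits *)
result = Map[{Length[#], #} &, Reverse[KeySortBy[merged, counts]]];

result["orders"]   (* {2, {"Q115", "Q122"}} *)
result["notse"]    (* {1, {"Q122"}} *)
```

A tag such as "notse" that appears in only one visit and sits near the bottom of the sorted table is the kind of entry a reviewer would check first.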

This processed table, sorted by the original tally, is ready for Export to a Google Sheet for the human coders to review. It also indicates in which files each term is found, e.g.

data[c][{400}] // Normal

<|"satisfied on" -> {2, {"Q122", "Q141"}}|>

Granted, this still requires human review of a long list, but imagine doing this basic housekeeping in Python or R.

POSTED BY: Alan Calvitti

Another post of yours has been selected for the Staff Picks group. Congratulations!

We are happy to see you at the top of the "Featured Contributor" board. Thank you for your wonderful contributions, and please keep them coming!

POSTED BY: Moderation Team