Group Abstract

Message Boards

WOLFRAM COMMUNITY

13.2K Views

3 Replies

7 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

NoSQL data workflows with Mathematica and Google // Indexing and Census

Alan Calvitti

Alan Calvitti, Ettain Group

Posted 10 years ago

Who needs SQL in the cloud age? This and followup tutorials will show how flexible data workflows can be built in Mathematica, relying on Google cloud-based apps (Drive, gsheet..) for storage, sync, and team co-editing. I'll use examples from a large, time-and-motion study - called Quick- of computer use by doctors at two large institutions. Here we show how index and perform basic file census in a few lines. The star is `keyTrie`: a one-liner recursive trie constructor that mirrors the entire filesystem (at the given root) as nested associations. Suppose your data is stored under a common folder, here "~/Google Drive/VASD/Projects/Quick/Data" . All data here is de-identified (it has been cleaned up from versions that are in secure storage). quick[index] = Dataset["~/Google Drive/VASD/Projects/Quick/Data"][fileIndex /* AssociationMap[Reverse]][keyTrie] Preliminary timing studies show linear time complexity (http://mathematica.stackexchange.com/a/84952/80) - attributable to the efficient hash tree implementation of `Association`. (Be warned if you point keyTrie at a root directory "~/Google Drive" directory will take a while to complete.) keyTrie = Query[keyGroupBy[First]] /* Query[All, KeyMap[Rest]] /* Query[All, If[Keys[#] == {{}}, First@Values[#], keyTrie[#]] &]; `keyTrie` uses the helper `keyGroupBy`, which is as useful as the built-in `GroupBy` when processing nested `Associations` - here is the operator form: keyGroupBy[f_][expr_] := Association /@ GroupBy[Normal@expr, Keys /* f]; `fileIndex` is less interesting, but showing for completion: It filters data files, eg CSV, TXT, .gsheet etc and performs a basic filename cleanup (details upon request) fileIndex[path_] := FileNames[".csv" \| ".txt" \| ".xls" \| ".xlsx" \| ".gsheet", path, Infinity] // AssociationMap[Drop[StringSplit[#, "/"], Length@StringSplit[path, "/"]] &] // Dataset // Query[All,MapAt[First@StringSplit[#, {".", "-", "_"}] &, #, -1] &]; At this point, files comprise data and metadata and are only indexed, not imported. For example, here's what a portion of the folder under `EHR` looks like quick[index]["VisitData", "EHR", preLeafMap[Keys]] where identifiers `Qxxx` are units of study (patients, visits) - extracted from filenames by `fileIndex` above. Each file is typically a CSV or gsheet, eg: quick[index]["VisitData", "EHR", First, First] "~/Google Drive/VASD/Projects/Quick/Data/VisitData/EHR/EpicAccessLogs/Q115-18_epicLog.csv" `preLeafMap` is a convenience operator form of `Map` applied at level -2: preLeafMap[f_][expr_] := Map[f, expr, {-2}]; Note that nesting of files in folders varies by data layer, eg "HumanCodedMouseActivity" contains sub-folders "UCSD", "VASD". Using `associationFlatten` a census dashboard can be obtained easily: here scaled up to "VisitData": quick[index]["VisitData", preLeafMap[KeyDrop["meta"] / Keys] /* associationFlatten][All, GroupBy[quick[data]["Master", All, "Site"]] /* {"UCSD", "VASD"}, Replace[_Missing -> 0] /* Length] where: associationFlatten[as_Association] := Map[keyFlatten, as, {0, Infinity}]; and keyFlatten[as_Association] := as // KeyValueMap[ List /* Replace[{{a_, b_Association} :> KeyMap[{a, #} &, b], {a_, b_} :> <\|a -> b\|>}]] // Map[Normal] // Flatten // Association; keyFlatten[l_List] := l; keyFlatten[a_ /; AtomQ[a]] := a; Notes: The grouping of the `Qxxx` into sites "UCSD" and "VASD" is based on metadata in the gsheet "Master" - which again is available to anyone in the team in real time for co-editing: Once `Master` is imported, the `GroupBy` can use the derived association of `PatientID` to `Site`: quick[data]["Master",All,"Site"]

POSTED BY: Alan Calvitti

3 Replies

Sort By:

EDITORIAL BOARD

EDITORIAL BOARD, WOLFRAM

Posted 9 years ago

- you earned "Featured Contributor" badge, congratulations ! Dear @Alan Calvitti, this is a great post and it has been selected for the curated Staff Picks group. Your profile is now distinguished by a "Featured Contributor" badge and displayed on the "Featured Contributor" board.

POSTED BY: EDITORIAL BOARD

Vitaliy Kaurov

Vitaliy Kaurov, WOLFRAM Research

Posted 9 years ago

Very interesting, @Alan Calvitti, thanks for sharing! Both - the research and the NoSQL workflow. But about the research, isn't it something related to mouse tracking on screen like iographica or am I way off and this is about real mice? DO you use Mathematica in your research for analysis?

POSTED BY: Vitaliy Kaurov

Alan Calvitti

Alan Calvitti, Ettain Group

Posted 9 years ago

The study uses data collected by usability software (TechSmith Morae) which is installed on doctors' computers - as well as video capture and other sensors, during real clinical encounters. I use Mathematica exclusively for research and analysis and is the centerpiece of our team-based scientific workflow. The V10 functionality released in 2014 accelerated data workflow to the point where a few lines of code have replaced pages of (V9) code.

POSTED BY: Alan Calvitti

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback