[WSS18] Empirically derived conditional distribution of System symbols

Range->Length->Part is commonly found in:

graph

Given that Mathematica is a symbolic language, it would be interesting to view the relationships between its symbols. For example, Length[List[1,2,3]] implies a relationship Length -> List.
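To make the idea concrete, here is a minimal sketch of how such a pair could be read off a single expression. The names expr and relations are only illustrative, and Hold is used so the example expression is inspected rather than evaluated.

```mathematica
(* Hold keeps Length[List[1, 2, 3]] from evaluating to 3 *)
expr = Hold[Length[List[1, 2, 3]]];

(* p is a head; s is the head of p's first argument *)
relations = Cases[expr, p_Symbol[s_Symbol[___], ___] :> (p -> s), {1, Infinity}]
(* {Length -> List} *)
```

Only the first argument is inspected here, which matches the Length -> List reading of the example above.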

Since we do not know these relationships in advance, we will have to retrieve them. For that, I headed to GitHub. Note: GitHub's public API limits the rate you can query to 1 query every 30 seconds (so the crawl takes many hours).

(* Get repositories marked with Mathematica as a language *)
githubResults = SearchRepositories["Mathematica"];

"8586 public repositories with language(s): Mathematica.
Only the first 1000 are available via the public api."


(* Get repository names *)
repositoriesWithMathematica = Normal[githubResults[All, "full_name"]];

(* Search for .nb files in these repos *)
notebookResults = SearchNotebooks[repositoriesWithMathematica];
"3525 notebooks found."


(* Now go download all of those .nbs *)
files = DownloadRawURLs[notebookResults, CreateDirectory[]]

With our 3,525 notebooks, we are ready to try to extract the Symbol -> Symbol relationships (thanks to Carl Woll for helping with this part):

Test for system symbol

SetAttributes[systemSymbol, {Listable, HoldFirst}]

systemSymbol[Symbol[_String]] = False;
systemSymbol[s_Symbol] := Context[s] === "System`"
systemSymbol[_] = False;

Extract symbol relationships

Note: we replace all non-System symbols with the placeholder "NonWolfram".

NBExpressionRelations[file_] := Module[
  {
   (* Import only the input cells, as held expressions *)
   nb = NotebookImport[file, "Input"]
   },
  (* p is a System head; s is the head of its first argument *)
  Cases[nb,
   p_Symbol?systemSymbol[s_Symbol[___], ___] :>
    If[Context[s] === "System`", RuleDelayed[p, s],
     RuleDelayed[p, "NonWolfram"]], {3, Infinity}]
  ]
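The same pattern can be tried on a single held expression without any notebook files. This is only a sketch: sysQ and relationsIn are hypothetical stand-ins for systemSymbol and the Cases call above (renamed so they do not overwrite the definitions in the post), myData and myFunc are made-up user symbols, and the level specification is {1, Infinity} because there is no notebook wrapper around the expression.

```mathematica
(* sysQ: hypothetical stand-in for systemSymbol, repeated so the sketch is self-contained *)
ClearAll[sysQ, relationsIn];
SetAttributes[sysQ, {Listable, HoldFirst}];
sysQ[s_Symbol] := Context[s] === "System`";
sysQ[_] = False;

(* relationsIn: hypothetical helper applying the same Cases pattern to one held expression *)
relationsIn[held_] := Cases[held,
   p_Symbol?sysQ[s_Symbol[___], ___] :>
    If[Context[s] === "System`", RuleDelayed[p, s],
     RuleDelayed[p, "NonWolfram"]],
   {1, Infinity}];

(* myData and myFunc are made-up user symbols, not System symbols *)
nested = relationsIn[HoldComplete[Range[Length[myData[[All, 1]]]]]];
user = relationsIn[HoldComplete[Length[myFunc[x]]]];
```

Under these assumptions, nested should contain the pairs Range :> Length and Length :> Part, and user should contain Length :> "NonWolfram". Part[myData, All, 1] contributes nothing because its first argument is a bare symbol rather than a compound expression.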

Just map NBExpressionRelations over all the files to get the data (expressionData below, one list of relations per file), and then:

(* if you want to keep them by file *)
results = AssociationThread[fileNames, expressionData]
(* or *)
data = Flatten[expressionData]

Explore System symbol relationships

(* Tally over the keys (symbols used as Head) *)
tally = Tally[data[[;; , 1]]];

If we just look at the symbols by occurrence, we see that unsurprisingly List is the most prominent:

barchart

We can view the interaction network as well, where the vertices are colored and sized by occurrence.

ToString[#[[1]]] \[DirectedEdge] ToString[#[[2]]] & /@ data;

graph

The light blue dot near the center is List.

The full graph isn't all that informative, but we can compute the frequency distribution of the symbols:

NormalizeAssociation[assoc_] :=
 With[{tots = Total[assoc]}, Reverse[Sort[Map[N[#/tots] &, assoc]]]]
(* Occurrences of symbols *)
symOccur = Association @@ Rule @@@ tally;
(* Probability of symbols *)
symProb = NormalizeAssociation[symOccur];

and then the conditional distribution:

(* Conditional occurrences of symbols *)

symCondOccur =
  Map[Association @@ Rule @@@ Tally[#[[;; , 2]]] &,
   GroupBy[Rule @@@ data, First]];
(* Conditional probability of symbols *)

symCondProb = Map[NormalizeAssociation, symCondOccur];
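As a small worked example of the conditional step, the sketch below runs the same count-then-normalize idea on four made-up relations. The names toy, toyCondOccur and toyCondProb are illustrative only, and Counts is swapped in for the Tally-based construction above; the counts are invented, not taken from the crawl.

```mathematica
(* Toy data in the same head :> argument-head form as data; the counts are made up *)
toy = {Range :> Length, Range :> Length, Range :> Table, Length :> Part};

(* For each head, count how often each argument head follows it *)
toyCondOccur = Counts /@ GroupBy[Rule @@@ toy, First -> Last];

(* Divide each inner association by its total to get P(argument head | head) *)
toyCondProb = N[#/Total[#]] & /@ toyCondOccur;

toyCondProb[Range]
```

Here Range is followed by Length in 2 of 3 cases and by Table in 1 of 3, so the conditional probabilities for Range are about 0.667 and 0.333, while Length is followed by Part with probability 1.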

With symCondProb we can traverse our network (in this case we take only the most likely symbol at each step, which may not be the best approach overall):

nextSymbol[symbol_] := Module[
  {dist, max, sel, key},
  (* SameQ (===) always gives True or False; Equal (==) stays unevaluated for unlike symbols *)
  If[symbol === Nothing, Return[Nothing]];

  dist = symCondProb[symbol];

  (* Maybe we never saw this symbol as head in our data *)
  If[MissingQ[dist], Return[Nothing]];
  max = Max[dist];
  sel = Select[dist, # == max &];

  (* Likewise, perhaps the selection failed as it was not connected to anything *)
  If[Length[sel] == 0, Return[Nothing]];
  key = First@Keys[sel];

  (* Since we always take the max,
  if key === symbol we will go in circles *)
  If[key === symbol, Return[Nothing]];
  key
  ]


Now we can see some promising results:

NestList[nextSymbol[#] &, Range, 5]

yields: {Range, Length, Part}

NestList[nextSymbol[#] &, Import, 4]

yields: {Import, StringJoin, NotebookDirectory, EvaluationNotebook}

So it seems people tend to write Range[Length[Part[...]]], e.g. Range@Length@myData[[;;,1]], and Import[StringJoin[NotebookDirectory[],...]].
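Since always taking the most likely symbol may not be the best approach, one alternative is to sample the next symbol in proportion to its conditional probability. The sketch below is not part of the original analysis: toyDist and sampleNext are hypothetical names, and the probabilities are made up rather than taken from the crawl.

```mathematica
(* Hypothetical conditional distribution with made-up probabilities *)
toyDist = <|
   Range -> <|Length -> 2/3, Table -> 1/3|>,
   Length -> <|Part -> 1|>|>;

(* Draw the next symbol with probability proportional to its weight;
   return Nothing when the symbol was never seen as a head *)
sampleNext[symbol_] := With[{dist = toyDist[symbol]},
   If[MissingQ[dist], Nothing,
    RandomChoice[Values[dist] -> Keys[dist]]]];

SeedRandom[1];
walk = NestList[sampleNext, Range, 3]
```

With this toy distribution a walk is either {Range, Length, Part} or {Range, Table}, because Nothing drops out of the resulting list just as it does with nextSymbol.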

What's next?

Next it would be interesting to throw these notebooks into a language model.

Code

Attached is the code for crawling; everything else of relevance is above.

POSTED BY: Sumner Magruder

Congratulations! This post is now a Staff Pick as distinguished by a badge on your profile! Thank you, keep it coming!

POSTED BY: EDITORIAL BOARD