[WSS18] Empirically derived conditional distribution of System symbols

Posted 2 months ago

Range->Length->Part is commonly found in:

graph

Given that Mathematica is a symbolic language, it would be interesting to view the relationships between its symbols. For example, Length[List[1,2,3]] implies a relationship Length -> List.
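As a minimal sketch of what such a relationship looks like in practice (the names `held`, `outer` and `inner` are illustrative and not part of the original post), one can hold an expression and pattern-match the head of the expression against the head of its first argument:

```wolfram
(* Hold the expression so that Length[...] is not evaluated to 3 *)
held = Hold[Length[List[1, 2, 3]]];

(* Match outer[inner[...], ...] anywhere inside and record outer -> inner *)
Cases[held,
 outer_Symbol[inner_Symbol[___], ___] :> (outer -> inner),
 {1, Infinity}]
(* {Length -> List} *)
```

Only the first argument is inspected here, which mirrors the extraction pattern used further down in the post.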

We do not have this information up front, so it has to be collected. For that, I headed to GitHub. Note: GitHub's public API limits the query rate to $1$ query every $30$ seconds (so the crawl takes many hours).

(* SearchRepositories, SearchNotebooks and DownloadRawURLs are not built-ins;
   they come from the crawling code attached to this post *)

(* Get repositories marked with Mathematica as a language *)
githubResults = SearchRepositories["Mathematica"];
(* Output: "8586 public repositories with language(s): Mathematica.
   Only the first 1000 are available via the public api." *)

(* Get repository names *)
repositoriesWithMathematica = Normal[githubResults[All, "full_name"]];

(* Search for .nb files in these repos *)
notebookResults = SearchNotebooks[repositoriesWithMathematica];
(* Output: "3525 notebooks found." *)

(* Now go download all of those .nbs *)
files = DownloadRawURLs[notebookResults, CreateDirectory[]]

With our $3,525$ notebooks, we are ready to try and extract the Symbol->Symbol relationship (thanks to Carl Woll for helping with this part):

Test for system symbol

SetAttributes[systemSymbol, {Listable, HoldFirst}]

systemSymbol[Symbol[_String]] = False;
systemSymbol[s_Symbol] := Context[s] === "System`"
systemSymbol[_] = False;
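A quick sanity check of this test might look as follows. The definitions are repeated so the snippet runs on its own, and `myFunction` is a made-up user symbol used only for illustration:

```wolfram
SetAttributes[systemSymbol, {Listable, HoldFirst}]

systemSymbol[Symbol[_String]] = False;
systemSymbol[s_Symbol] := Context[s] === "System`"
systemSymbol[_] = False;

systemSymbol[Plot]        (* True: Plot lives in the System` context *)
systemSymbol[myFunction]  (* False: a fresh symbol is created in Global` *)
systemSymbol[42]          (* False: not a symbol at all *)
```

The HoldFirst attribute keeps the argument from evaluating before its context is checked, which matters for symbols that have values.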

Extract symbol relationships

Note: we replace all non-System symbols with the placeholder "NonWolfram".

NBExpressionRelations[file_] := Module[
  {
   (* Import only the Input cells of the notebook *)
   nb = NotebookImport[file, "Input"]
   },
  (* Collect head :> inner-head pairs, e.g. Length[List[...]] gives Length :> List.
     The unused local expr was dropped: its initializer referred to nb
     before nb was localized by Module. *)
  Cases[nb,
   p_Symbol?systemSymbol[s_Symbol[___], ___] :>
    If[Context[s] === "System`", RuleDelayed[p, s],
     RuleDelayed[p, "NonWolfram"]], {3, Infinity}]
  ]

Just map this over all the files to get expressionData, and then:

(* if you want to keep them by file *)
results = AssociationThread[fileNames, expressionData]
(* or *)
data = Flatten[expressionData]
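The loop itself is not shown in the post. A minimal sketch, assuming `files` is a list of local .nb paths and `NBExpressionRelations` is defined as above, could look like this. The wrapper `safeRelations` is a hypothetical helper, and discarding any notebook that emits a message on import is a design choice of this sketch, not something the original states:

```wolfram
(* Assumes NBExpressionRelations is defined as above and
   files holds the paths of the downloaded notebooks *)

(* Hypothetical wrapper: return {} for notebooks that emit messages,
   instead of letting one bad file pollute the results *)
safeRelations[file_] := Quiet[Check[NBExpressionRelations[file], {}]];

expressionData = safeRelations /@ files;

(* Use the full paths as keys so notebooks with the same base name do not collide *)
fileNames = files;
```

With several thousand notebooks this takes a while, so wrapping the map in Monitor or exporting intermediate results may be worthwhile.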

Explore System symbol relationships

(* Tally over the keys (symbols used as Head) *)
tally = Tally[data[[;; , 1]]];

If we just look at the symbols by occurrence, we see that unsurprisingly List is the most prominent:

barchart

We can view the interaction network as well, where the vertices are colored and sized by occurrence.

edges = ToString[#[[1]]] \[DirectedEdge] ToString[#[[2]]] & /@ data;

graph

The light blue dot near the center is List.

This isn't all that informative, but we can compute the frequency distribution of the symbols:

NormalizeAssociation[assoc_] :=
 With[{tots = Total[assoc]}, Reverse[Sort[Map[N[#/tots] &, assoc]]]]
(* Occurrences of symbols *)
symOccur = Association @@ Rule @@@ tally;
(* Probability of symbols *)
symProb = NormalizeAssociation[symOccur];

and then the conditioned distribution

(* Conditional occurrences of symbols *)

symCondOccur =
  Map[Association @@ Rule @@@ Tally[#[[;; , 2]]] &,
   GroupBy[Rule @@@ data, First]];
(* Conditional probability of symbols *)

symCondProb = Map[NormalizeAssociation, symCondOccur];
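In symbols, writing $N(h \to s)$ for the number of extracted pairs with outer head $h$ and inner head $s$ (notation introduced here for illustration; it does not appear in the code), the two quantities computed above are

$$P(h) = \frac{N(h)}{\sum_{h'} N(h')}, \qquad P(s \mid h) = \frac{N(h \to s)}{\sum_{s'} N(h \to s')},$$

where $N(h) = \sum_{s} N(h \to s)$ counts how often $h$ appears as the outer head. Here symProb corresponds to $P(h)$ and symCondProb[h] to the distribution $P(\cdot \mid h)$.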

With this we can traverse our network. In this case we take only the most likely symbol at each step, which may not be the best approach overall:

nextSymbol[symbol_] := Module[
  {dist, max, sel, key},
  (* SameQ (===) always gives True or False, whereas Equal (==)
     stays unevaluated when comparing two different symbols *)
  If[symbol === Nothing, Return[Nothing]];

  dist = symCondProb[symbol];

  (* Maybe we never saw this symbol as a head in our data *)
  If[MissingQ[dist], Return[Nothing]];
  max = Max[dist];
  sel = Select[dist, # == max &];

  (* Likewise, the selection may be empty if the symbol was not connected to anything *)
  If[Length[sel] == 0, Return[Nothing]];
  key = First[Keys[sel]];

  (* Since we always take the max, key === symbol would send us in circles *)
  If[key === symbol, Return[Nothing]];
  key
  ]


Now we can see some promising results:

NestList[nextSymbol[#] &, Range, 5]

yields: {Range, Length, Part}

NestList[nextSymbol[#] &, Import, 4]

yields: {Import, StringJoin, NotebookDirectory, EvaluationNotebook}

So it seems people tend to write Range[Length[Part[...]]], e.g. Range@Length@myData[[;;,1]], and Import[StringJoin[NotebookDirectory[],...]].
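The traversal above is greedy. One alternative, sketched here under the assumption that the conditional probabilities are stored as nested associations like symCondProb, is to sample the next symbol in proportion to its conditional probability. The function `sampleNextSymbol` and the numbers in `toyCondProb` are made up for illustration and are not taken from the crawl:

```wolfram
(* Hypothetical helper: draw the next symbol at random,
   weighted by its conditional probability *)
sampleNextSymbol[condProb_Association, symbol_] :=
 With[{dist = Lookup[condProb, symbol, <||>]},
  If[Length[dist] == 0,
   Nothing, (* unseen head: stop the walk *)
   RandomChoice[Values[dist] -> Keys[dist]]]]

(* Toy distribution with made-up numbers *)
toyCondProb = <|
   Range -> <|Length -> 0.7, Table -> 0.3|>,
   Length -> <|Part -> 1.0|>
   |>;

(* A random walk of up to 3 steps; Nothing entries drop out of the list *)
NestList[sampleNextSymbol[toyCondProb, #] &, Range, 3]
```

Unlike the greedy version, repeated runs give different chains, so less common but still plausible compositions also show up.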

What's next?

Next it would be interesting to throw these notebooks into a language model.

Code

Attached is the code for crawling, everything else of relevance is above.
