[WSS18] Empirically derived conditional distribution of System symbols

Posted 1 year ago


Range->Length->Part is commonly found in: [image not preserved]


Given that Mathematica is a symbolic language, it would be interesting to view the relationships between its symbols; e.g. Length[List[1,2,3]] implies a relationship Length -> List.
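As a minimal sketch of this idea (not the code used for the crawl), the head-to-head relationship can be read off a held expression with Cases. The names `held`, `relations`, `h` and `a` are chosen only for this illustration:

```mathematica
(* Hold keeps Length[...] from evaluating to 3 before we inspect it *)
held = Hold[Length[List[1, 2, 3]]];

(* Match any subexpression h[a[...], ...] and record the pair h -> a.
   h and a are pattern names introduced for this illustration. *)
relations = Cases[held, h_Symbol[a_Symbol[___], ___] :> (h -> a), Infinity]
(* {Length -> List} *)
```

The inner list contributes nothing, because its first argument is the atom 1 rather than an expression with a symbol as its head.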

Since this information is not available up front, we have to retrieve it ourselves. For that, I headed to GitHub. Note: GitHub's public API limits the rate you can query to $1$ query every $30$ seconds (so the crawl takes many hours).

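(* SearchRepositories, SearchNotebooks and DownloadRawURLs are not built-in functions;
   they are presumably the helpers from the attached crawling code *)
(* Get repositories marked with Mathematica as a language *)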
githubResults = SearchRepositories["Mathematica"];

"8586 public repositories with language(s): Mathematica.
Only the first 1000 are available via the public api."

(* Get repository names *)
repositoriesWithMathematica = Normal[githubResults[All, "full_name"]];

(* Search for .nb files in these repos *)
notebookResults = SearchNotebooks[repositoriesWithMathematica];
"3525 notebooks found."

(* Now go download all of those .nbs *)
files = DownloadRawURLs[notebookResults, CreateDirectory[]]

With our $3,525$ notebooks, we are ready to try to extract the Symbol->Symbol relationships (thanks to Carl Woll for helping with this part):

Test for system symbol

SetAttributes[systemSymbol, {Listable, HoldFirst}]

systemSymbol[Symbol[_String]] = False;
systemSymbol[s_Symbol] := Context[s] === "System`"
systemSymbol[_] = False;

Extract symbol relationships

Note that we replace all non-System symbols with the placeholder "NonWolfram".

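NBExpressionRelations[file_] := Module[
   {nb, expr},
   (* Import the notebook's Input cells *)
   nb = NotebookImport[file, "Input"];
   (* Assumption: both patterns are applied to nb with Cases *)
   expr = Cases[nb,
     p : _Symbol?systemSymbol[s_Symbol, ___] :> Hold[p], {3, Infinity}];
   (* One rule head :> head of first argument per matching subexpression *)
   Cases[nb,
    p_Symbol?systemSymbol[s_Symbol[___], ___] :>
     If[Context[s] === "System`", RuleDelayed[p, s],
      RuleDelayed[p, "NonWolfram"]], {3, Infinity}]
   ]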

Just loop this over all the files to get the data, and then

(* if you want to keep them by file *)
results = AssociationThread[fileNames, expressionData]
(* or *)
data = Flatten[expressionData]
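The loop itself is not shown in the post; a minimal sketch could look like the following. It assumes NBExpressionRelations is defined as above, and `notebookDir` is a hypothetical variable holding the path of the folder with the downloaded notebooks:

```mathematica
(* notebookDir is a hypothetical path to the folder holding the downloaded .nb files *)
fileNames = FileNames["*.nb", notebookDir, Infinity];

(* Extract the relations notebook by notebook; Quiet suppresses import messages
   from notebooks that cannot be read *)
expressionData = Map[Quiet[NBExpressionRelations[#]] &, fileNames];
```

With thousands of notebooks this can take a while, so it may be worth saving `expressionData` to disk once it has been computed.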

Explore System symbol relationships

(* Tally over the keys (symbols used as Head) *)
tally = Tally[data[[;; , 1]]];

If we just look at the symbols by occurrence, we see that unsurprisingly List is the most prominent:
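The ranking behind such a view can be sketched on toy data. Here `toyData` is a made-up stand-in for `data`, and `toyTally` and `ranked` are names introduced for this example only:

```mathematica
(* Toy stand-in for `data`: a few hand-written head :> argument-head pairs *)
toyData = {List :> Plus, Length :> List, List :> Times, List :> List};

(* Count how often each symbol appears as a head, as in the post *)
toyTally = Tally[toyData[[All, 1]]];

(* Order the {symbol, count} pairs by descending count *)
ranked = SortBy[toyTally, -Last[#] &]
(* {{List, 3}, {Length, 1}} *)
```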


We can view the interaction network as well, where the vertices are colored and sized by occurrence.

edges = ToString[#[[1]]] \[DirectedEdge] ToString[#[[2]]] & /@ data;


The light blue dot near the center is List.
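The plotting code for the network is not shown; a rough sketch on toy data might look like this. The data, the variable names and the size scaling are all assumptions made for illustration, and only vertex size (not color) is tied to occurrence here:

```mathematica
(* Toy stand-in for `data` *)
toyData = {List :> Plus, Length :> List, List :> Times, Range :> Length};

(* Directed edges between symbol names, duplicates removed *)
edges = DeleteDuplicates[
   ToString[#[[1]]] \[DirectedEdge] ToString[#[[2]]] & /@ toyData];

(* How often each symbol occurs as a head *)
headCounts = Counts[ToString /@ toyData[[All, 1]]];
vertices = DeleteDuplicates[Flatten[List @@@ edges]];

(* The scaling 0.2 + 0.1*count is an arbitrary choice for this sketch;
   symbols never seen as a head count as 0 *)
g = Graph[vertices, edges,
   VertexLabels -> "Name",
   VertexSize -> Map[# -> 0.2 + 0.1 Lookup[headCounts, #, 0] &, vertices]]
```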

This isn't all that informative, but we can make the frequency distribution of the symbols

NormalizeAssociation[assoc_] :=
 With[{tots = Total[assoc]}, Reverse[Sort[Map[N[#/tots] &, assoc]]]]
(* Occurrences of symbols *)
symOccur = Association @@ Rule @@@ tally;
(* Probability of symbols *)
symProb = NormalizeAssociation[symOccur];

and then the conditional distribution

(* Conditional occurrences of symbols *)

symCondOccur =
  Map[Association @@ Rule @@@ Tally[#[[;; , 2]]] &,
   GroupBy[Rule @@@ data, First]];
(* Conditional probability of symbols *)

symCondProb = Map[NormalizeAssociation, symCondOccur];
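In symbols, the two distributions computed above can be written as follows. The notation is introduced here for illustration only: $N(h)$ is the number of times symbol $h$ occurs as a head, and $N(h, s)$ is the number of times $s$ is the head of the first argument of $h$.

$$P(h) = \frac{N(h)}{\sum_{h'} N(h')}, \qquad P(s \mid h) = \frac{N(h, s)}{\sum_{s'} N(h, s')}$$

Here symProb corresponds to $P(h)$ and symCondProb[h] to the distribution $P(\cdot \mid h)$.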

With this we can traverse our network (in this case we greedily take only the most likely successor at each step, which may not be the best approach overall)

nextSymbol[symbol_] := Module[
  {dist, max, sel, key},
  If[symbol === Nothing, Return[Nothing]];

  dist = symCondProb[symbol];

  (* Maybe we never saw this symbol as head in our data *)
  If[MissingQ@dist, Return[Nothing]];
  max = Max[dist];
  sel = Select[dist, max == # &];

  (* Likewise perhaps selection failed as it was not connected to anything *)
  If[sel === <||>, Return[Nothing]];
  key = First@Keys[sel];

  (* Since we always take max,
  if key === symbol we will go in circles *)
  If[key === symbol, Return[Nothing]];
  key
  ]

Now we can see some promising results:

NestList[nextSymbol[#] &, Range, 5]

yields: {Range, Length, Part}

NestList[nextSymbol[#] &, Import, 4]

yields: {Import, StringJoin, NotebookDirectory, EvaluationNotebook}

So it seems people tend to write Range[Length[Part[...]]], e.g. Range@Length@myData[[;;,1]], and Import[StringJoin[NotebookDirectory[],...]].
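Since nextSymbol always takes the single most likely successor, every run from a given symbol gives the same chain. One alternative, sketched here under assumptions and not part of the original analysis, is to sample the successor in proportion to its conditional probability. `toyCondProb` is a made-up stand-in for symCondProb, and `sampleNextSymbol` is a hypothetical name:

```mathematica
(* Toy conditional distribution with made-up probabilities, standing in for symCondProb *)
toyCondProb = <|
   Range -> <|Length -> 0.7, "NonWolfram" -> 0.3|>,
   Length -> <|Part -> 0.6, List -> 0.4|>|>;

(* Draw the next symbol at random, weighted by its conditional probability;
   symbols never seen as a head end the chain *)
sampleNextSymbol[symbol_] := Module[{dist = toyCondProb[symbol]},
   If[MissingQ[dist], Nothing,
    RandomChoice[Values[dist] -> Keys[dist]]]];

SeedRandom[1];
NestList[sampleNextSymbol, Range, 3]
```

Repeating the call without SeedRandom gives different chains, which shows more of the network than the single greedy path does.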

What's next?

Next it would be interesting to throw these notebooks into a language model.


Attached is the code for crawling; everything else of relevance is above.


