[WSS18] Empirically derived conditional distribution of System symbols

Posted 1 year ago


Given that Mathematica is a symbolic language, it would be interesting to view the relationships between its symbols. For example, Length[List[1,2,3]] implies a relationship Length -> List. As the results further down show, a chain such as Range -> Length -> Part turns out to be commonly found in real code.

Since this information is not readily available, we have to collect it ourselves. For that, I headed to GitHub. Note: GitHub's public API limits the rate you can query to $1$ query every $30$ seconds (so the crawl takes many hours).

(* Get repositories marked with Mathematica as a language *)
githubResults = SearchRepositories["Mathematica"];

"8586 public repositories with language(s): Mathematica.
Only the first 1000 are available via the public api."

(* Get repository names *)
repositoriesWithMathematica = Normal[githubResults[All, "full_name"]];

(* Search for .nb files in these repos *)
notebookResults = SearchNotebooks[repositoriesWithMathematica];
"3525 notebooks found."



With our $3,525$ notebooks, we are ready to try and extract the Symbol->Symbol relationship (thanks to Carl Woll for helping with this part):

Test for system symbol

SetAttributes[systemSymbol, {Listable, HoldFirst}]

systemSymbol[Symbol[_String]] = False;
(* context names end with a backtick *)
systemSymbol[s_Symbol] := Context[s] === "System`"
systemSymbol[_] = False;


Extract symbol relationships

Note that we replace all non-System symbols with the placeholder "NonWolfram".

NBExpressionRelations[file_] := Module[
{
(* held input expressions of the notebook *)
nb = NotebookImport[file, "Input"]
},
(* each System head p whose first argument has a symbol head s gives p :> s *)
Cases[nb,
p_Symbol?systemSymbol[s_Symbol[___], ___] :>
If[Context[s] === "System`", RuleDelayed[p, s],
RuleDelayed[p, "NonWolfram"]], {3, Infinity}]
]
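
To see what the pattern picks up without importing a notebook, here is a minimal sketch on a held toy expression. The names sysQ, held, myData and f are made up for this illustration, and the level specification is {1, Infinity} only because the toy input has fewer wrappers than an imported notebook.

```mathematica
(* hypothetical stand-in for systemSymbol, kept local to this example *)
SetAttributes[sysQ, HoldFirst];
sysQ[s_Symbol] := Context[s] === "System`";
sysQ[_] = False;

(* a held toy input standing in for the contents of one notebook *)
held = HoldComplete[Range[Length[myData]], Total[f[x]]];

(* same rule as in NBExpressionRelations, applied to the toy input *)
Cases[held,
 p_Symbol?sysQ[s_Symbol[___], ___] :>
  If[Context[s] === "System`", RuleDelayed[p, s],
   RuleDelayed[p, "NonWolfram"]],
 {1, Infinity}]
(* expected in a fresh session: {Range :> Length, Total :> "NonWolfram"} *)
```

Length[myData] produces no rule because its first argument is a bare symbol rather than an expression with a symbol head, and f is replaced by the placeholder because it lives in the Global context.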


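Just loop this over all the files to collect the results per file in expressionData, and then

(* keep expressionData as it is if you want the rules grouped by file, *)
(* or flatten everything into a single list *)
data = Flatten[expressionData]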


Explore System symbol relationships

(* Tally over the keys (symbols used as Head) *)
tally = Tally[data[[;; , 1]]];


If we just look at the symbols by occurrence, we see that, unsurprisingly, List is the most prominent.

We can view the interaction network as well, where the vertices are colored and sized by occurrence.

ToString[#[[1]]] \[DirectedEdge] ToString[#[[2]]] & /@ data;

The light blue dot near the center is List.
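
As a rough sketch of how such edges can be turned into a graph, the following uses a few made-up rules in place of the real data. The names toyData and toyEdges are assumptions for illustration, and the coloring and sizing by occurrence from the original plot are left out.

```mathematica
(* made-up rules standing in for the crawled data *)
toyData = {Range :> Length, Length :> Part, Range :> Length,
   Import :> StringJoin};

(* one directed edge per distinct head -> argument-head pair *)
toyEdges = DeleteDuplicates[
   ToString[#[[1]]] \[DirectedEdge] ToString[#[[2]]] & /@ toyData];

Graph[toyEdges, VertexLabels -> "Name"]
```

DeleteDuplicates keeps the result a simple graph; the multiplicities it discards are exactly what the occurrence counts below capture.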

This isn't all that informative, but we can make the frequency distribution of the symbols

NormalizeAssociation[assoc_] :=
With[{tots = Total[assoc]}, Reverse[Sort[Map[N[#/tots] &, assoc]]]]
(* Occurrences of symbols *)
symOccur = Association @@ Rule @@@ tally;
(* Probability of symbols *)
symProb = NormalizeAssociation[symOccur];
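
For intuition, the same normalization can be traced on a tiny hand-made tally. The counts below are invented and toyTally and toyOccur are hypothetical names; the body simply repeats what NormalizeAssociation does.

```mathematica
(* invented counts in the shape that Tally returns *)
toyTally = {{List, 6}, {Part, 3}, {Length, 1}};
toyOccur = Association @@ Rule @@@ toyTally;

(* divide by the total, then sort by descending probability *)
With[{tots = Total[toyOccur]},
 Reverse[Sort[Map[N[#/tots] &, toyOccur]]]]
(* <|List -> 0.6, Part -> 0.3, Length -> 0.1|> *)
```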


and then the conditional distribution

(* Conditional occurrences of symbols *)

symCondOccur =
Map[Association @@ Rule @@@ Tally[#[[;; , 2]]] &,
GroupBy[Rule @@@ data, First]];
(* Conditional probability of symbols *)

symCondProb = Map[NormalizeAssociation, symCondOccur];
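
In other words, for a head $h$ and an argument head $s$, the estimate is the number of times $s$ appears directly inside $h$ divided by the number of times anything appears inside $h$. A small sketch on invented rules, where toyData, toyCondOccur and toyCondProb are hypothetical names and Counts is used as a shorthand for the Tally construction above:

```mathematica
(* invented rules: Range wraps Length twice and List once *)
toyData = {Range :> Length, Range :> Length, Range :> List,
   Length :> Part};

(* counts of argument heads, grouped by the outer head *)
toyCondOccur =
  Map[Counts[#[[All, 2]]] &, GroupBy[Rule @@@ toyData, First]];

(* normalize each group so that its values sum to 1 *)
toyCondProb = Map[N[#/Total[#]] &, toyCondOccur]
```

Here toyCondProb[Range] assigns roughly 0.67 to Length and 0.33 to List, while toyCondProb[Length] puts all its mass on Part.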


With this we can traverse our network. In this case we take only the most likely successor at each step, which may not be the best approach overall.

nextSymbol[symbol_] := Module[
{
dist, max, sel, key
},
(* Nothing marks the end of a chain *)
If[symbol === Nothing, Return[Nothing]];

dist = symCondProb[symbol];

(* Maybe we never saw this symbol as head in our data *)
If[MissingQ@dist, Return[Nothing]];
max = Max[dist];
sel = Select[dist, max == # &];

(* Likewise perhaps selection failed as it was not connected to anything *)
If[sel === <||>, Return[Nothing]];
key = First@Keys[sel];

(* Since we always take max,
if key === symbol we will go in circles *)
If[key === symbol, Return[Nothing]];
key
]
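
One alternative to always taking the maximum, sketched here only as an idea and not something used for the results in this post, is to sample the next symbol in proportion to its conditional probability. The distribution toyCondProb and the helper sampleNext are made up for illustration.

```mathematica
(* invented conditional distribution, standing in for symCondProb *)
toyCondProb = <|
   Range -> <|Length -> 0.7, List -> 0.3|>,
   Length -> <|Part -> 1.|>|>;

(* hypothetical helper: weighted random draw of the next symbol *)
sampleNext[symbol_] := Module[{dist = toyCondProb[symbol]},
  (* stop when the symbol was never seen as a head *)
  If[MissingQ[dist], Return[Nothing]];
  RandomChoice[Values[dist] -> Keys[dist]]
  ]

SeedRandom[1];
NestList[sampleNext, Range, 3]
```

Repeated runs with different seeds give different chains, so less frequent but still common combinations show up as well.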


Now we can see some promising results:



NestList[nextSymbol[#] &, Range, 5]

yields: {Range, Length, Part}

NestList[nextSymbol[#] &, Import, 4]

yields: {Import, StringJoin, NotebookDirectory, EvaluationNotebook}

So it seems people tend to write Range[Length[Part[...]]], e.g. Range@Length@myData[[;;,1]], and Import[StringJoin[NotebookDirectory[],...]].

What's next?

Next it would be interesting to throw these notebooks into a language model.

Code

Attached is the code for crawling; everything else of relevance is above.