Community RSS Feed
http://community.wolfram.com
RSS Feed for Wolfram Community showing any discussions in tag Data Science sorted by activeThe voters have spoken - but what did they say?
http://community.wolfram.com/groups/-/m/t/1221950
Introduction
----------
Elections are fundamental to our democracies and way of life. The electorate bestows power on the lawmakers and public officials whose legitimisation is derived solely from the fact that they are elected. In recent times, many elections, such as the presidential elections the United States and the Brexit and general elections in the United Kingdom have seen a very pronounced polarisation the electorate. Can mathematics and the Wolfram Language help to better understand this crucial process of the most fundamental decision making process in societies? The main question I want to discuss in a sequence of posts is what can be learnt from elections. So this post is about **computational politics**.
It is important to notice that democracies rely on a general acceptance of the result of elections. Without that consensus a peaceful co-existence is at risk. Societies agree on certain electoral procedures and the results have to be understood to be final. There are, however, more fundamental questions such as how do we derive the "voters will" from ballots. That is a highly non-trivial problem. Also, elected politicians often claim that they represent the "will of the people". But is that true and what is the "will of the people"? Hence, my question: the voters have spoken - but what did they say?
This is the first of a series of posts (hopefully):
1. **Electoral systems - interpreting the will of the voters**
2. Gerrymandering - shaping the voters' will to your liking
3. [Analysis of the Brexit vote - how do we know what did the voters really want?][1]
I apologise because there won't be any nice graphics in this post, but I find the results relevant anyway. In the following posts we will apply some more sophisticated algorithms to try to figure out what the voters meant to tell us. We will then see that we can use mathematics to estimate what the voters might have wanted to tell us; and then we can correct for that.
![enter image description here][2]
Note, that this cannot be the procedure by which we determine the decision that the electorate have taken. This is all a bit confusing, but I hope that it will become a bit clearer in these posts.
All posts were developed with [Bjoern Schelter][3], who is also a member of this community. The definitions and main ideas for this post were taken from the excellent book ["Mathematics And Politics: Strategy, Voting, Power, and Proof"][4] by Alan Taylor, a text which I highly recommend. It is a treasure trove of ideas and concepts for anyone interested in this ares. It goes far beyond what is described in this post.
Social choice procedure
----------
It turns out that reading the voters' will from a ballot is much more complicated than one might think - in many situations it is impossible (mathematically provable!) to do so in a "good" way. Let's define some words first. Let's assume that each voter ranks some alternatives in their preferred order.
**Definition:** A social choice procedure is a function for which a typical input is a sequence of lists (without ties) of some set A (the set of alternatives) and the corresponding output is either an element of A, a subset of A, of "No Winner".
There is a beautiful theorem by May (1952):
**Theorem:**
If the number of people is odd and each election produces a unique winner, then the majority rule is the only social choice procedure for two alternatives that satisfies the following three conditions:
1. It treats all the voters the same: If any two voters exchange their ballots, the outcome of the election is un-effected.
2. It threats both alternatives the same: If every voter reverses his or her vote (changing a vote for a to a vote for b and vice versa), the the election outcome is reversed as well.
3. It is monotone: If some voter were to change his or her ballot from a vote for the loser to a vote for the winner, then the election outcome would be unchanged.
Six examples of social choice procedures
----------
Let us assume that we have several alternatives to be voted for:
alternatives = {a, b, c}
Then there are the voting lists of the voters. They form the ballots:
ballots = Table[RandomSample[alternatives, Length[alternatives]], {k, 4}]
(*{{a, b, c}, {c, b, a}, {a, b, c}, {a, b, c}}*)
> Condorcet's method:
>
> An alternative x is among the winners if for any other alternative y,
> at least half of the voters rank x over y on their ballots.
This is an implementation of this social choice function:
condorcetindiv[ballots_List, a_, b_] :=
Module[{},
If[a === b, "NA",
If[#[[1, 2]] >= Length[ballots]/2, {a, b}, #[[-1, 1, 1]]] &@
SortBy[Tally[Select[#, # == a || # == b &] & /@ ballots], Last]]]
condorcet[ballots_List, alternatives_] :=
Module[{},
If[Max[#] >= Length[ballots],
alternatives[[
Flatten[Position[#, Max[#]]]]], {}] &@(Count[
Flatten[Outer[condorcetindiv[ballots, #1, #2] &, alternatives,
alternatives], Infinity], #] & /@ alternatives)]
Given the ballots and the alternatives we can now compute the social choice by applying the function condorcet:
condorcet[ballots, alternatives]
{a}
Note, that the social choice is given as a list, because it is not necessarily unique. This will become important later on.
> Plurality:
>
> Declare the social choice(s) the alternative(s) with the largest
> number of first-place rankings in the ballot list.
This is the respective implementation:
plurality[ballots_] :=
Module[{}, GatherBy[SortBy[Tally[ballots[[All, 1]]], Last], Last][[-1, All, 1]]]
The social choice is computed like this:
plurality[ballots]
(*{a}*)
> Borda Count
>
> We award points to the n alternatives in a ballot as follows: the
> alternative at the bottom receives zero points; the one next to it
> receives one point and so on up to the alternative on the top of the
> list which receives (n-1) points.
This is the implementation of the social choice function:
bordacount[ballots_, alternatives_] :=
Module[{}, GatherBy[SortBy[Total[Flatten[{Reverse[Range[0, Length[alternatives] - 1]]*Transpose[ballots]}]] /. {Plus -> List, Times -> List},
First], First][[-1, All, 2]]]
which can be applied like so:
bordacount[ballots, alternatives]
(*{a}*)
> Hare System
>
> The procedure is as follows: we delete the alternative(s) that is on
> top of the fewest of the individual preference lists. The resulting
> individual preference lists are at least one shorter than before. We
> then repeat this process; the last alternative(s) is/are the
> winner(s).
Our Wolfram Language is
haresystem[ballots_] :=
Module[{},
Sort[DeleteDuplicates[Flatten[If[Length[#[[-1, 1]]] > 0, #[[-1, 1]], #[[-2, 1]]] &@
NestWhileList[((# //. (# -> Nothing & /@ Complement[alternatives, #[[All, 1]]])) //. (# -> Nothing & /@
GatherBy[SortBy[Tally[#[[All, 1]]], Last], Last][[1, All, 1]])) &, ballots, Length[DeleteDuplicates[Flatten[#, Infinity]]] > 1 &]]]]]
which is evaluated like so:
haresystem[ballots]
(*{a}*)
> Sequential pairwise voting
>
> We start with a fixed order of alternatives {a,b,c,d,..}. The first
> alternative is evaluated against the second in a one-to-one contest.
> The winning alternative (or both if there is a tie) then is compared
> to the third alternative. An alternative is deleted at the end of any
> round if it loses a one-to-one contest.
We will use a new ballot for the evaluation for clarity:
alternatives = {a, b, c, d}
ballots = Table[RandomSample[alternatives, Length[alternatives]], {k, 4}]
(*{{d, b, a, c}, {d, a, c, b}, {d, c, b, a}, {b, d, c, a}}*)
We also need a fixed agenda:
fixedagenda = alternatives[[RandomSample[RandomSample[Range[Length[alternatives]]]]]]
(*{a, b, c, d}*)
The social choice function is then implemented like so:
concorcetindivloser[ballots_List, bin_, acomp_] :=
Module[{},
If[acomp === bin,
"NA", (Which[#1[[1, 2]] === Length[ballots]/2, {"W",
"W"}, #1[[-1, 1, 1]] === bin, "W", True, "L"] &)[
SortBy[Tally[(Select[#1, #1 == acomp || #1 == bin &] &) /@
ballots], Last]]]]
seqpair[ballots_, fixedagenda_] :=
Module[{winners = {fixedagenda[[1]]}},
Do[AppendTo[winners, fixedagenda[[k]]];
winners =
Pick[winners, ! MemberQ[Flatten[#], "L"] & /@
Outer[concorcetindivloser[ballots, #1, #2] &, #, #] &@
winners], {k, 2, Length[fixedagenda]}]; winners]
We can evaluate this like so:
seqpair[ballots, fixedagenda]
(*{b}*)
> Dictatorship
>
> One of the voters is chosen and called "dictator". The alternative on
> top if their individual preference list is the social choice.
We can implement this function like so:
dictatorhip[ballots_, k_] := {ballots[[k, 1]]}
where $k$ is the index of the person who is the "dictator".
dictatorhip[ballots, 3]
(*{d}*)
A closer look at the social choice procedures
----------
All of these social choice functions are implemented in different parts of the world. It turns out that they have the curious habit of not agreeing on the social choice even if the ballots are the same!
Let's suppose we have the following ballots:
ballots = {{a, b, c, d, e}, {a, d, b, e, c}, {a, d, b, e, c}, {c, b, d, e, a}, {c, d, b, a, e}, {b, c, d, a, e}, {e, c, d, b, a}};
So there are 7 voters and the lists are the preference lists of them. Here is another representation:
Transpose@ballots // MatrixForm
![enter image description here][5]
Let's apply Condorcet's method first:
condorcet[ballots, alternatives]
(*{}*)
That means that Condorcet's method does not produce a social choice. This is obviously a critical problem of a social choice function, because it would not be suitable to lead the decision making process to a conclusion.
Let's try the plurality rule:
plurality[ballots]
(*{a}*)
It determines that the social choice is a. So the electorate has decided for alternative a.
Next up: the Borda Count:
bordacount[ballots, alternatives]
(*{b}*)
So according to the Borda Count b is the winner - in spite of the ballots not having changed!
What about the Hare system?
haresystem[ballots]
(*{c}*)
Next up, the sequential voting system. Let's choose the fixed agenda
fixedagenda = {a, b, c, d, e};
We then obtain:
seqpair[ballots, fixedagenda]
(*{d}*)
So under this social choice function d is the winner.
Last the dictatorship. Let's make person 7 the dictator:
dictatorhip[ballots, 7]
(*{e}*)
Now, e is the winner. Curious...
So in this simple case existing social choice procedures would have led to all different alternatives in spite of the ballots being exactly the same. This is a bit frightening because all these systems are actually being used today.
Optimal social choice procedure
----------
The question is whether there is an "optimal" social choice procedure. The answer to that question turns out to be no. First we define the following five desirable properties:
1. Always a winner. The procedure should always produce a winner. We have already seen that for example Condorcet's method does not give a winner for the last example.
2. Condorcet winner criterion. A social choice procedure is said to satisfy the Condorcet winner criterion (CWC) provided that\[LongDash]if there is a Condorcet winner\[LongDash]then it alone is the social choice.
3. Pareto condition. A social choice procedure is pareto if for every pair x and y of alternatives we have: If everyone prefers x to y, then y is not a social choice.
4. Monotonicity. A social choice procedure is monotone if for any x the following holds: If x is the social choice (or tied for such) and someone changes his or her preference list by moving x up one spot (that is, exchanging x's position with that of the alternative immediately above x on his or her list), then x should still be the social choice (or tied for such).
5. Independence of Irrelevant Alternatives. For every pair of alternative x and y we have: If the social choice set includes x but not y, and one or more voters change their preferences, but no one changes his or her mind about whether x is preferred to y or y to x, then the social choice set should not change so as to include y.
These are very (!) desirable properties and reasonable requirements. Unfortunately, the following theorem holds:
**THEOREM.**
There is no social choice procedure for three or more alternatives that satisfies the always-a-winner criterion, independence of irrelevant alternatives, and the Condorcet winner criterion.
There is also a more general theorem by Arrow, that -roughly speaking- that there cannot be a social choice function fulfilling some basic properties.
What happens for larger constituencies?
----------
Let's create a larger set of voters and hence individual preference lists.
alternatives = {a, b, c, d, e};
and
constituency = Table[RandomSample[alternatives], 200];
Condorcet:
condorcet[constituency, alternatives]
(*{}*)
Plurality:
plurality[constituency]
(*{b}*)
Borda count:
bordacount[constituency, alternatives]
(*{e}*)
Hare system:
haresystem[constituency]
(*{d}*)
Sequential pairwise:
fixedagenda = alternatives[[RandomSample[RandomSample[Range[Length[alternatives]]]]]];
(*{e}*)
That is not very encouraging. But perhaps the outcomes depend on the population size?
Monitor[results =
Table[{pop,
Table[constituency = Table[RandomSample[alternatives], pop];
fixedagenda =
alternatives[[
RandomSample[
RandomSample[Range[Length[alternatives]]]]]]; {Length[
DeleteCases[#, {}]], Length /@ #} &@
Union[Sort /@ {condorcet[constituency, alternatives],
plurality[constituency],
bordacount[constituency, alternatives],
haresystem[constituency],
seqpair[constituency, fixedagenda]}], {50}]}, {pop, 10, 500,
10}];, pop]
I have varied the population size of the electorate from 10 to 500 in steps of 10. The variable results looks like this (here for 500 voters):
results[[-1]]
(*{500, {{2, {0, 1, 1}}, {2, {0, 1, 2}}, {3, {0, 1, 1, 2}}, {1, {0,
1}}, {3, {0, 1, 1, 1}}, {2, {0, 1, 1}}, {2, {0, 1, 1}}, {2, {0, 1,
1}}, {3, {0, 1, 2, 2}}, {1, {0, 1}}, {2, {0, 1, 1}}, {2, {0, 1,
1}}, {1, {0, 1}}, {1, {0, 1}}, {1, {0, 1}}, {3, {0, 1, 1,
2}}, {4, {0, 1, 1, 2, 3}}, {1, {0, 1}}, {1, {0, 1}}, {1, {0,
1}}, {1, {0, 1}}, {1, {0, 1}}, {3, {0, 1, 1, 2}}, {1, {0,
1}}, {2, {0, 1, 1}}, {2, {0, 1, 2}}, {2, {0, 1, 1}}, {2, {0, 1,
1}}, {2, {0, 1, 1}}, {1, {0, 1}}, {2, {0, 1, 3}}, {1, {0,
1}}, {3, {0, 1, 1, 1}}, {1, {0, 1}}, {2, {0, 1, 1}}, {2, {0, 1,
1}}, {2, {0, 1, 2}}, {1, {0, 1}}, {3, {0, 1, 1, 1}}, {2, {0, 1,
1}}, {2, {0, 1, 1}}, {3, {0, 1, 1, 2}}, {3, {0, 1, 1, 1}}, {2, {0,
1, 1}}, {4, {0, 1, 1, 1, 2}}, {1, {0, 1}}, {2, {0, 1,
1}}, {2, {0, 1, 1}}, {1, {0, 1}}, {2, {0, 1, 1}}}}*)
The first 500 is the number of voters. Then there are 50 "realisations", i.e. 50 independent elections where voters make random choices. Each of the 50 sublists has a first entry which gives the number of different results using different social choice functions and ignoring empty sets, i.e. "no-winner" scenarios. Let's plot the histogram of the number of different outcomes using the various social choice functions:
Histogram[results[[-1, 2, All, 1]],
AxesLabel -> {"number of different outcomes", "number of elections"},
ImageSize -> Large, LabelStyle -> Directive[Bold, Medium]]
So in most cases the social choice functions give different results, even if there are 500 voters. Then we can count how many winners there are in cases of different results:
results[[-1, 2, All, 2]]
(*{{0, 1, 1}, {0, 1, 2}, {0, 1, 1, 2}, {0, 1}, {0, 1, 1, 1}, {0, 1,
1}, {0, 1, 1}, {0, 1, 1}, {0, 1, 2, 2}, {0, 1}, {0, 1, 1}, {0, 1,
1}, {0, 1}, {0, 1}, {0, 1}, {0, 1, 1, 2}, {0, 1, 1, 2, 3}, {0,
1}, {0, 1}, {0, 1}, {0, 1}, {0, 1}, {0, 1, 1, 2}, {0, 1}, {0, 1,
1}, {0, 1, 2}, {0, 1, 1}, {0, 1, 1}, {0, 1, 1}, {0, 1}, {0, 1,
3}, {0, 1}, {0, 1, 1, 1}, {0, 1}, {0, 1, 1}, {0, 1, 1}, {0, 1,
2}, {0, 1}, {0, 1, 1, 1}, {0, 1, 1}, {0, 1, 1}, {0, 1, 1, 2}, {0, 1,
1, 1}, {0, 1, 1}, {0, 1, 1, 1, 2}, {0, 1}, {0, 1, 1}, {0, 1,
1}, {0, 1}, {0, 1, 1}}*)
So all cases contain cases of "no winner", when there is a zero in the sublist. This comes mostly from Condorcet's method, which often does not give a winner. The ones indicate that in the respective outcomes there was only one winner. A two or three means that the winner was not unique, i.e. a tie. So basically of all the outcomes only the ones lead to a decision. So only
N[Count[#, 1]/Length[#]] &@Flatten[results[[-1, 2, All, 2]]]
(*0.568*)
57% of the cases lead to a unique winner. If we ignore the no-winner scenarios we see that only in
N[Count[#, 1]/Length[#]] &@(Length[#] - 1 & /@ results[[-1, 2, All, 2]])
(*0.34*)
about 34% of the cases the methods agree on the winner.
Let's see how these numbers change for different sizes of electorates. First, we compute the percentage of "unique winner scenarios" for different sizes of the electorate (ignoring no winner scenarios):
ListLinePlot[Table[{results[[k, 1]], N[Count[#, 1]/Length[#]] &@Flatten[results[[k, 2, All, 2]]]}, {k, 1, 50}],
FrameLabel -> {"size of electorate", "percentage unique winner"}, LabelStyle -> Directive[Bold, Medium], ImageSize -> Large,
PlotTheme -> "Marketing"]
![enter image description here][6]
With a bit more CPU time, we can study whether there is a saturation effect for larger population sizes...
Let's look at the agreement of the different social choice procedures:
ListLinePlot[Table[{results[[k, 1]], N[Count[#, 1]/Length[#]] &@(Length[#] - 1 & /@ results[[k, 2, All, 2]])}, {k, 1, 50}],
FrameLabel -> {"size of electorate", "percentage agreement"},
LabelStyle -> Directive[Bold, Medium], ImageSize -> Large, PlotTheme -> "Marketing"]
![enter image description here][7]
So, more often than not, the social choice procedures to not agree on the winner. This might suggest a type of independence of the population size.
Conclusion
----------
Even if the ballots are fixed deciding what the voters meant to say is by no means trivial. "Reading" the voters' minds even if they have spoken is mathematically speaking very problematic. For the case of only two alternatives (Brexit/No-Brexit) and an odd number of voters the thing appears to be easier, but as we will see later, there are many more issues.
**This is not meant to say that we should give up or ignore the outcome of elections.** To maintain order and piece as a society we need to agree on a particular social choice procedure and accept the outcome. We need to take decisions and these decisions have to be based on established rules. The only issue is that sometimes the social choice ("will of the people") depends on the social choice function and not solely on the ballots!
We have a type of "societal contract" to accept the outcome of elections under the rules (social choice function) that is agreed upon before the elections. This does not necessarily mean, however, that politicians can claim that this is the "will of the people"; it is the decision made based on a ballot and an agreed upon social choice procedure.
[1]: http://community.wolfram.com/groups/-/m/t/1222959
[2]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1423.26.23.png&userId=48754
[3]: http://community.wolfram.com/web/bschelter
[4]: https://www.amazon.co.uk/Mathematics-Politics-Strategy-Voting-Power/dp/1441926615
[5]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1422.42.52.png&userId=48754
[6]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1423.46.02.png&userId=48754
[7]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1423.47.36.png&userId=48754Marco Thiel2017-11-14T23:53:48ZProper use of PredictorMeasurements[]?
http://community.wolfram.com/groups/-/m/t/1208606
I've been playing around with `Predict[]` with multi-dimensional datasets and, for small training sets anyway, things seem to work correctly. For example,
trainingset = {<|"age" -> 47, "sex" -> "M", "height" -> 100,
"weight" -> 60|>, <|"age" -> 22, "sex" -> "M", "height" -> 90,
"weight" -> 55|>, <|"age" -> 43, "sex" -> "M", "height" -> 110,
"weight" -> 61|>, <|"age" -> 23, "sex" -> "F", "height" -> 100,
"weight" -> 41|>, <|"age" -> 33, "sex" -> "F", "height" -> 80,
"weight" -> 50|>, <|"age" -> 43, "sex" -> "F", "height" -> 70,
"weight" -> 51|>};
testset = {<|"age" -> 37, "sex" -> "M", "height" -> 100|>, <|
"age" -> 22, "sex" -> "M", "height" -> 90|>, <|"age" -> 43,
"sex" -> "F", "height" -> 80|>, <|"age" -> 33, "sex" -> "F",
"height" -> 70|>};
p1 = Predict[trainingset -> "weight", PerformanceGoal -> "Quality",
Method -> "RandomForest"];
We can get predictions from the p1 `PredictorFunction` with
Map[Append[#, "prediction" -> p1[#]] &, testset] (* this works *)
I can then compute residuals, etc., myself.
Since version 10, Wolfram Language has included the function `PredictorMeasurements[]`, and the documentation suggests that I should be able to get the predictions above, plus residual reports and other information, with
PredictorMeasurements[p1, testset]
But this does not work. I get the following error: `PredictorMeasurements::bdfmt: Argument {<|age->37,sex->M,height->100,weight->60|>,<|age->22,sex->M,height->90|>,<|age->43,sex->F,height->80|>,<|age->33,sex->F,height->70|>} should be a rule or a list of rules.`
What am I missing?Michael Stern2017-10-25T16:44:44ZPredicting Winning Odds of Italian Serie A Soccer 1934-2017
http://community.wolfram.com/groups/-/m/t/1218215
Dear Community! Although gambling is not my hobby, the validation of predictor engine inside Wolfram Mathematica 11.2 is. Sooner or later someone may get huge benefits in gambling using predictors as they have been developed already today. I got LaLiga "Outcomes" 100 % but my example will be Italian Serie A analysis, yrs 1934-2017. It is downloadable from
https://github.com/jalapic/engsoccerdata
Note that the source was found by google'ing. I am avoiding to be too lengthy and therefore only minimum amount of outputs are given. For necessary prediction only **goals** and **dates** are needed. I have always wondered whether it is useful to put into prediction as much statistical data as possible and it turns out that this must be avoided i.e. the right data for right prediction should be selected for best results. The CSV file should be first downloaded and a little-bit modified in order to use it as Mher Davtyan used it in his original example:
http://community.wolfram.com/groups/-/m/t/908804
then the path may be seen from the import function below:
oddsData =
SemanticImport[
"C:\\Users\\user\\Documents\\engsoccerdata-master\\data-raw\\italy3.\
csv", <|1 -> "Integer", 2 -> "String", 3 -> "Integer", 4 -> "String",
5 -> "String", 7 -> Automatic, 8 -> Automatic|>]
giving an output:
![Input data][1]
Yes, it consists of 25784 games. The collection of additions to Dataset are listed below and explained later:
scoreAdded =
oddsData[All,
Append[#,
"Score" ->
ToString[Slot["hgoal"]] <> ":" <> ToString[Slot["vgoal"]]] &]
homeAndAway =
scoreAdded[
Association["Home" -> GroupBy[#home &],
"Away" -> GroupBy[#visitor &]]];
data01 = (Flatten[(Transpose[{homeAndAway["Home",
scoreAdded[#]["home"], All, "Nr"] // Normal,
Thread[Rule[
homeAndAway["Home", scoreAdded[#]["home"], All,
"home"] // Normal,
homeAndAway["Home", scoreAdded[#]["home"], All,
"hgoal"] // Normal // Accumulate]]}] & /@
Range[1, 25784, 1]) // Union, 1] // Sort)[[All, 2]][[All,
2]];
data02 = (Flatten[(Transpose[{homeAndAway["Away",
scoreAdded[#]["visitor"], All, "Nr"] // Normal,
Thread[Rule[
homeAndAway["Away", scoreAdded[#]["visitor"], All,
"visitor"] // Normal,
homeAndAway["Away", scoreAdded[#]["visitor"], All,
"vgoal"] // Normal // Accumulate]]}] & /@
Range[1, 25784, 1]) // Union, 1] // Sort)[[All, 2]][[All,
2]];
data03 = (Flatten[(Transpose[{homeAndAway["Away",
scoreAdded[#]["visitor"], All, "Nr"] // Normal,
Thread[Rule[
homeAndAway["Away", scoreAdded[#]["visitor"], All,
"visitor"] // Normal,
homeAndAway["Away", scoreAdded[#]["visitor"], All,
"hgoal"] // Normal // Accumulate]]}] & /@
Range[1, 25784, 1]) // Union, 1] // Sort)[[All, 2]][[All,
2]];
data04 = (Flatten[(Transpose[{homeAndAway["Home",
scoreAdded[#]["home"], All, "Nr"] // Normal,
Thread[Rule[
homeAndAway["Home", scoreAdded[#]["home"], All,
"home"] // Normal,
homeAndAway["Home", scoreAdded[#]["home"], All,
"vgoal"] // Normal // Accumulate]]}] & /@
Range[1, 25784, 1]) // Union, 1] // Sort)[[All, 2]][[All,
2]];
addIndividualGoalsCumul[d_] :=
Join[d, Association["homeTeamHomeRating" -> (data01[[d["Nr"]]]),
"VisitorAwayRating" -> (data02[[d["Nr"]]]),
"homeTeamHomeConceded" -> (data04[[d["Nr"]]]),
"VisitorAwayConceded" -> (data03[[d["Nr"]]])]];
ratingsAdded = Map[addIndividualGoalsCumul[#] &, scoreAdded]
extractDate[a_] :=
Append[a, <|
"DateSuccessive" ->
FromDigits[StringSplit[a["Date"], "-"][[1]]] 100 - 190000 +
FromDigits[StringSplit[a["Date"], "-"][[2]]]|>]
dateNew = Map[extractDate[#] &, ratingsAdded]
addOutcome[a_Association] :=
If[a["hgoal"] == a["vgoal"], Append[a, Association["Outcome" -> 0]],
If[a["hgoal"] > a["vgoal"], Append[a, Association["Outcome" -> 1]],
Append[a, Association["Outcome" -> -1]]]]
outcomeAdded = Map[addOutcome[#] &, dateNew]
where data01, data02, data03 and data04 compute the cumulative scores and conceded scores for every team, the "DateSuccessive" modifies the "Date": 1934-06-23 to 03406 and 2017-05-28 to 11705 i.e. year and month, and "Outcome" is similar as in original post. The result table is:
![Final Dataset 1][2]
![Final Dataset 2][3]
The test data starts from index:
In[96]:= Round[0.99218 Length[outcomeAdded]]
Out[96]= 25582
and consists of all 202 games played on yr 2017 for test set for Classifier:
![Testset][4]
Above we see the Testset and needed data using the code:
n = Round[0.99218 Length[outcomeAdded]];
{trainingSet, testSet} =
TakeDrop[outcomeAdded[
All, {"DateSuccessive", "homeTeamHomeRating", "VisitorAwayRating",
"homeTeamHomeConceded", "VisitorAwayConceded", "hgoal", "vgoal",
"Outcome"}], n];
c = Classify[trainingSet -> "Outcome", Method -> "LogisticRegression"] cm = ClassifierMeasurements[c, testSet -> "Outcome"]
with following results:
![enter image description here][5]
Visualisation:
In[138]:= actaulValue = testSet[All, "Outcome"] // Normal
Out[138]= {1, 1, -1, -1, 1, 1, 1, 0, -1, -1, 1, 1, 1, 1, 1, 0, 1, -1, \
0, -1, -1, 1, 1, 1, 0, 1, -1, -1, 1, 1, -1, 0, 1, 0, 0, 1, -1, 0, 1, \
-1, -1, 1, 0, 0, -1, 1, -1, 1, -1, 1, -1, -1, 1, 1, -1, -1, 1, -1, 1, \
-1, 1, 0, 1, 1, -1, -1, -1, 1, 1, 1, 0, -1, 1, -1, 1, -1, 0, -1, 1, \
0, -1, 0, 1, -1, 1, 0, -1, -1, 0, -1, 1, 0, 1, -1, 1, 1, 1, 1, -1, \
-1, -1, 1, 1, 0, 1, 1, 0, -1, -1, 1, -1, 1, 1, -1, -1, 1, -1, 0, -1, \
0, 0, -1, 0, 0, 1, -1, -1, 1, -1, 1, 0, 1, 1, -1, 0, 0, 1, 0, -1, 0, \
1, 0, 1, 1, -1, 1, 1, -1, -1, 0, 1, -1, 0, 0, 1, 1, 0, -1, -1, -1, 1, \
-1, 0, 1, 0, 1, 1, 1, -1, -1, 0, 0, 0, 1, 1, 1, 1, -1, 1, 1, 0, -1, \
-1, 1, -1, 1, 1, -1, 1, 1, 0, 1, 1, -1, 1, 1, 0, 1, 1, 1, -1, 1}
In[139]:= predictedValue =
c[Normal[testSet[[All, Values]]][[All, ;; 7]]]
Out[139]= {1, 1, -1, -1, 1, 1, 1, 0, -1, -1, 1, 1, 1, 1, 1, 0, 1, -1, \
0, -1, -1, 1, 1, 1, 0, 1, -1, -1, 1, 1, -1, 0, 1, 0, 0, 1, -1, 0, 1, \
-1, -1, 1, 0, 0, -1, 1, -1, 1, -1, 1, -1, -1, 1, 1, -1, -1, 1, -1, 1, \
-1, 1, 0, 1, 1, -1, -1, -1, 1, 1, 1, 0, -1, 1, -1, 1, -1, 0, -1, 1, \
0, -1, 0, 1, -1, 1, 0, -1, -1, 0, -1, 1, 0, 1, -1, 1, 1, 1, 1, -1, \
-1, -1, 1, 1, 0, 1, 1, 0, -1, -1, 1, -1, 1, 1, -1, -1, 1, -1, 0, -1, \
0, 0, -1, 0, 0, 1, -1, -1, 1, -1, 1, 0, 1, 1, -1, 0, 0, 1, 0, -1, 0, \
1, 0, 1, 1, -1, 1, 1, -1, -1, 0, 1, -1, 0, 0, 1, 1, 0, -1, -1, -1, 1, \
-1, 0, 1, 0, 1, 1, 1, -1, -1, 0, 0, 0, 1, 1, 1, 1, -1, 1, 1, 0, -1, \
-1, 1, -1, 1, 1, -1, 1, 1, 0, 1, 1, -1, 1, 1, 0, 1, 1, 1, -1, 1}
In[140]:= difference = actaulValue - predictedValue
Out[140]= {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
In[141]:= N@Normalize[Counts[Abs[difference]], Total]
Out[141]= <|0 -> 1.|>
Concluding that all yr 2017 game winners were correctly classified.
Now, prediction with slightly larger set:
In[143]:= n = Round[0.99 Length[outcomeAdded]]
Out[143]= 25526
In[144]:= {trainingSetFull, testSetFull} = TakeDrop[outcomeAdded, n];
{trainingSet1,
testSet1} = {trainingSetFull[
All, {"DateSuccessive", "homeTeamHomeRating", "VisitorAwayRating",
"homeTeamHomeConceded", "VisitorAwayConceded", "hgoal", "vgoal",
"Outcome"}],
testSetFull[
All, {"DateSuccessive", "homeTeamHomeRating", "VisitorAwayRating",
"homeTeamHomeConceded", "VisitorAwayConceded", "hgoal", "vgoal",
"Outcome"}]};
![enter image description here][6]
In[156]:= predictedValue =
Round[p[Normal[testSet1[[All, Values]]][[All, ;; 7]]], 1]
Out[156]= {1, 1, -1, -1, 1, 0, 1, -1, 1, -1, 1, 0, 1, -1, 1, 1, 0, 1, \
1, 1, -1, 1, 0, 1, 1, 0, 1, 1, -1, -1, 0, -1, 1, -1, -1, 1, 1, 1, 1, \
1, 0, 1, 0, 1, 1, -1, -1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, -1, -1, 1, \
1, 1, 0, -1, -1, 1, 1, 1, 1, 1, 0, 1, -1, 0, -1, -1, 1, 1, 1, 0, 1, \
0, -1, 1, 1, -1, 0, 1, 0, 0, 0, -1, 0, 1, -1, -1, 1, 0, 0, -1, 1, -1, \
1, -1, 1, -1, -1, 1, 1, -1, -1, 1, -1, 1, -1, 1, 0, 1, 1, -1, -1, -1, \
1, 1, 1, 0, -1, 1, -1, 1, -1, 0, -1, 1, 0, -1, 0, 1, -1, 1, 0, -1, \
-1, 0, -1, 1, 0, 1, -1, 1, 1, 1, 1, -1, -1, -1, 1, 1, 0, 1, 1, 0, -1, \
-1, 1, -1, 1, 1, -1, -1, 1, -1, 0, -1, 0, 0, -1, 0, 0, 1, -1, -1, 1, \
-1, 1, 0, 1, 1, -1, 0, 0, 1, 0, -1, 0, 1, 0, 1, 0, -1, 1, 1, -1, -1, \
0, 1, -1, 0, 0, 1, 1, 0, -1, -1, -1, 1, -1, 0, 1, 0, 1, 1, 1, -1, -1, \
0, 0, 0, 1, 1, 1, 1, -1, 1, 1, 0, -1, -1, 1, -1, 1, 1, -1, 1, 1, 0, \
1, 1, -1, 0, 1, 0, 1, 1, 1, -1, 1}
In[157]:= actaulValue = testSet1[All, "Outcome"] // Normal
Out[157]= {1, 1, -1, -1, 1, 0, 1, -1, 1, -1, 1, 0, 1, -1, 1, 1, 0, 1, \
1, 1, -1, 1, 0, 1, 1, 0, 1, 1, -1, -1, 0, -1, 1, -1, -1, 1, 1, 1, 1, \
1, 0, 1, -1, 1, 1, -1, -1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, -1, -1, \
1, 1, 1, 0, -1, -1, 1, 1, 1, 1, 1, 0, 1, -1, 0, -1, -1, 1, 1, 1, 0, \
1, -1, -1, 1, 1, -1, 0, 1, 0, 0, 1, -1, 0, 1, -1, -1, 1, 0, 0, -1, 1, \
-1, 1, -1, 1, -1, -1, 1, 1, -1, -1, 1, -1, 1, -1, 1, 0, 1, 1, -1, -1, \
-1, 1, 1, 1, 0, -1, 1, -1, 1, -1, 0, -1, 1, 0, -1, 0, 1, -1, 1, 0, \
-1, -1, 0, -1, 1, 0, 1, -1, 1, 1, 1, 1, -1, -1, -1, 1, 1, 0, 1, 1, 0, \
-1, -1, 1, -1, 1, 1, -1, -1, 1, -1, 0, -1, 0, 0, -1, 0, 0, 1, -1, -1, \
1, -1, 1, 0, 1, 1, -1, 0, 0, 1, 0, -1, 0, 1, 0, 1, 1, -1, 1, 1, -1, \
-1, 0, 1, -1, 0, 0, 1, 1, 0, -1, -1, -1, 1, -1, 0, 1, 0, 1, 1, 1, -1, \
-1, 0, 0, 0, 1, 1, 1, 1, -1, 1, 1, 0, -1, -1, 1, -1, 1, 1, -1, 1, 1, \
0, 1, 1, -1, 1, 1, 0, 1, 1, 1, -1, 1}
In[158]:= difference = Round[actaulValue - predictedValue, 0.01]
Out[158]= {0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., \
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., \
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -1., 0., 0., 0., 0., 0., \
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., \
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., \
-1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., \
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., \
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., \
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., \
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., \
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., \
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., \
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., \
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., \
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., \
0., 0., 0., 0., 0., 0.}
In[159]:= N@Normalize[Counts[Abs[difference]], Total]
Out[159]= <|0. -> 0.98062, 1. -> 0.0193798|>
showing 98 % accuracy.
## 10 Random Samples ##
randomMatches = RandomSample[testSetFull, 10]
![enter image description here][7]
Goals prediction is not so exact:
![enter image description here][8]
and for visiting team:
![enter image description here][9]
giving 60 % of accuracy.
Having "Outcome" correct, I will just show two tables of Predicted and Actual scores:
![enter image description here][10]
## Conclusions ##
Clearly, game scores are harder to predict, almost 60 % are correct and 38 % differs with 1 goal. The predictor seems to work better when adding time-scale. Over the time the team's performance is changing and this must be reflected.
[1]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Capture1.JPG&userId=94243
[2]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Capture2.JPG&userId=94243
[3]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Capture3.JPG&userId=94243
[4]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Capture4.JPG&userId=94243
[5]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Capture5.JPG&userId=94243
[6]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Capture6.JPG&userId=94243
[7]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Capture7.JPG&userId=94243
[8]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Capture8.JPG&userId=94243
[9]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Capture9.JPG&userId=94243
[10]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Capture11.JPG&userId=94243Tanel Telliskivi2017-11-08T01:18:39ZThe voters have spoken - but what did they say? (Part 3)
http://community.wolfram.com/groups/-/m/t/1222959
Introduction
----------
Elections are fundamental to our democracies and way of life. The electorate bestows power on the lawmakers and public officials whose legitimisation is derived solely from the fact that they are elected. In recent times, many elections, such as the presidential elections the United States and the Brexit and general elections in the United Kingdom have seen a very pronounced polarisation the electorate. Can mathematics and the Wolfram Language help to better understand this crucial process of the most fundamental decision making process in societies? The main question I want to discuss in a sequence of posts is what can be learnt from elections. So this post is about computational politics.
It is important to notice that democracies rely on a general acceptance of the result of elections. Without that consensus a peaceful co-existence is at risk. Societies agree on certain electoral procedures and the results have to be understood to be final. There are, however, more fundamental questions such as how do we derive the "voters will" from ballots. That is a highly non-trivial problem. Also, elected politicians often claim that they represent the "will of the people". But is that true and what is the "will of the people"? Hence, my question: the voters have spoken - but what did they say?
This is the third of a series of posts:
1. [Electoral systems - interpreting the will of the voters][1]
2. Gerrymandering - shaping the voters' will to your liking
3. **Analysis of the Brexit vote - how do we know what did the voters really want?**
This part of the sequence of posts will analyse the results of the Brexit vote in the United Kingdom. The question asked was basically a yes/no question and with small majority the social choice was to conduct Brexit. The rules of this process have been followed and the decision is binding. There is, however, a related question: Did the outcome in this particular case actually give sufficient evidence that this is the "will of the people" as many politicians state. If we want to evaluate the power of evidence there are mathematically established procedures. Similar procedures are for example applied in medical trials to evaluate the evidence provided by data to support a particular hypothesis. We will apply this type of procedure to the Brexit elections.
All posts were developed with [Bjoern Schelter][2], who is also a member of this community. He is indeed the expert on these methods.
**Disclaimer:** this post might be modified slightly over the next couple of days.
Data download and preparation
----------
We will start by obtaining some of the data required for this post. The [central dataset is provided by the electoral commission][3]. It is attached to this post. We download and them import the data like so:
SetDirectory[NotebookDirectory[]]
votingData = Import["EU-referendum-result-data.csv"];
Here is a short extract of the data:
votingData[[1 ;; 5]] // Transpose // TableForm
![enter image description here][4]
Next, we will need some data on the local authorities/electoral districts from [here][5]. We then import:
localauthorities = Import["Local_Authority_Districts_December_2016_Generalised_Clipped_Boundaries_in_Great_Britain.shp"];
explanationShape = Import["Local_Authority_Districts_December_2016_Full_Clipped_Boundaries_in_Great_Britain.csv"];
These data sets contain information about the shape and location of the local authority boundaries and data on how to match that with the voting outcome in the first file. To get a first glance at the data we re-arrange and organise the data like so:
Monitor[areas =
Table[QuantityMagnitude[UnitConvert[If[Length[#] > 1, Total[GeoArea /@ #], GeoArea[#]]], "SI"] & @(localauthorities[[1, 2, 1, 2, 1, 2, 1, 1, 1]][[k]]), {k, 1, Length[localauthorities[[1, 2, 1, 2, 1, 2, 1, 1, 1]]]}];, k]
and
translateshpvote = SortBy[(Flatten /@ Transpose[{SortBy[Transpose[{Range[380], areas}], Last], SortBy[explanationShape[[2 ;;]], #[[9]] &]}])[[All, {1, 4}]], First];
The first representation is not very telling. It shows the outcome of the election.
Graphics[Flatten[
Table[{ColorDataFunction["TemperatureMap", "Gradients", {0, 1}, Blend["TemperatureMap", #]& ][(Select[
votingData[[2 ;;]], #[[4]] == translateshpvote[[k, 2]] &][[1]])[[-3]]/100.], localauthorities[[1, 2, 1, 2, 1, 2, 1, 1, 1]][[k]]}, {k, 1, 380}],1], AspectRatio -> GoldenRatio]
![enter image description here][6]
The "blue-r" the colour the higher the percentage of leave votes. Yellow indicates a preference for remain. We will need to clean the data further and will then plot it with a threshold to see which regions decided for Leave and which ones for Remain:
substitute = ({#[[1]][[{1, 4}]] -> #[[2]][[{1, 4}]], #[[2]][[{1,
4}]] -> #[[1]][[{1, 4}]]} &@((Flatten /@
Transpose[{SortBy[Transpose[{Range[380], areas}], Last],
SortBy[explanationShape[[2 ;;]], #[[9]] &]}])[[205 ;; 206]]))
translateshpvoteclean = translateshpvote /. substitute;
Graphics[Flatten[
Table[{ColorDataFunction["TemperatureMap", "Gradients", {0, 1}, Blend["TemperatureMap", #]& ][
If[# > 0.5, 1, 0] & @((Select[Select[votingData[[2 ;;]], #[[1]] < 381 &], #[[4]] == translateshpvoteclean[[k, 2]] &][[1]])[[-3]]/100.)],
localauthorities[[1, 2, 1, 2, 1, 2, 1, 1, 1]][[k]]}, {k, 1, 380}],1], AspectRatio -> GoldenRatio]
![enter image description here][7]
The strong divide between England and Scotland becomes apparent from that figure. We will use the following style of the image in much of the remainder of the post:
Graphics[Flatten[
Table[{ColorData[
"RedGreenSplit"][((Select[
Select[votingData[[2 ;;]], #[[1]] < 381 &], #[[4]] ==
translateshpvoteclean[[k, 2]] &][[1]])[[-3]]/100.)],
localauthorities[[1, 2, 1, 2, 1, 2, 1, 1, 1]][[k]]}, {k, 1, 380}],
1], Background -> Black, AspectRatio -> GoldenRatio]
![enter image description here][8]
What does the Brexit election outcome tell us? What did voters want?
----------
UK wide the percentage of leave votes (with respect to all valid votes) is:
N[Total[votingData[[2 ;;, 13]]]/Total[votingData[[2 ;;, 11]]]]
51.89%. This is in percentage of all valid votes. The percentage of leave with respect to the entire electorate is:
N[Total[votingData[[2 ;;, 13]]]/Total[votingData[[2 ;;, 6]]]]
37.4%. This is the percentage of the entire electorate who voted for remain.
N[Total[votingData[[2 ;;, 12]]]/Total[votingData[[2 ;;, 6]]]]
34.7%. The percentage of non-voters is:
N[(Total[votingData[[2 ;;, 6]]] - Total[Flatten@votingData[[2 ;;, 12 ;; 13]]])/Total[votingData[[2 ;;, 6]]]]
27.8%. This means that we do not know the will of about 28% of all voters. Following the predefined social choice procedure, the vote was narrowly won by leave. That does not necessarily imply that this is "the will of the majority of people".
What would the result be with indirect representation?
----------
If we had a representational vote we would have to count the districts for remain and leave
{Count[#, "Remain"], Count[#, "Leave"]} &@((#1 > #2) & @@@ votingData[[2 ;;, {12, 13}]] //. {False -> "Leave", True -> "Remain"})
{119,263}. This result is even more extreme for leave. This is because the population in the local authorities is very different.
Histogram[votingData[[2 ;;, 6]], PlotRange -> All, PlotTheme -> "Marketing", ImageSize -> Large,
FrameLabel -> {"population", "number of authorities"}, LabelStyle -> Directive[Bold, 15]]
![enter image description here][9]
This shows that there are large fluctuations in the population of the areas. Let's represent that slightly differently. We first collect data about Leave votes, Electorate, Turnout and Electorate.
electionEUresolved =
Transpose@({#[[1]], #[[2]], #[[3]], #[[4]]} &@Transpose@((({#[[13]], #[[11]], #[[9]], #[[6]]} & /@ (Reverse@SortBy[votingData[[2 ;;]], #[[9]] &])))));
BubbleChart[{#[[3]], N@#[[1]]/#[[2]], #[[4]]} & /@ electionEUresolved,
BubbleSizes -> Small,
ChartLabels ->
Placed[(Reverse@SortBy[votingData[[2 ;;]], #[[9]] &])[[
All, {5, 3}]], Tooltip],
Epilog -> {Red, Line[{{50, 0.5}, {100, 0.5}}]},
PlotTheme -> "Marketing", BubbleScale -> "Area", ImageSize -> Large,
FrameLabel -> {"turnout", "percentage leave votes"},
LabelStyle -> Directive[Bold, 15]]
![enter image description here][10]
The bubble sizes (areas) correspond to the size of the electorate. Let's see whether there is a correlation with turnout:
lm = LinearModelFit[{#[[3]], N@#[[1]]/#[[2]]} & /@ electionEUresolved,x, x, Weights -> electionEUresolved[[All, 4]]]
![enter image description here][11]
The following table shows that the positive slope is highly significant.
lm["ParameterTable"]
![enter image description here][12]
We can add that to the figure:
Show[BubbleChart[{#[[3]], N@#[[1]]/#[[2]], #[[4]]} & /@
electionEUresolved, BubbleSizes -> Small,
ChartLabels ->
Placed[(Reverse@SortBy[votingData[[2 ;;]], #[[9]] &])[[
All, {5, 3}]], Tooltip],
Epilog -> {Red, Line[{{50, 0.5}, {100, 0.5}}]},
PlotTheme -> "Marketing", BubbleScale -> "Area", ImageSize -> Large,
FrameLabel -> {"turnout", "percentage leave votes"},
LabelStyle -> Directive[Bold, 15]],
Plot[lm[x], {x, 50, 85}, PlotStyle -> White]]
![enter image description here][13]
Seeing such a linear dependence, raises the natural question to which extent the result is biased by such correlations in the data. Linear Models are a powerful tool to account for various potential influencing factors on the final outcome of the election.
Linear Modelling
----------
Linear Models are a generalisation of linear regression; linear models allow us to model a Gaussian Distributed outcome by various inputs that may or may not be continuous. To demonstrate this we use the simple voting data; the dependent, outcome, variable is the percentage of people who voted to leave the EU.
percentleave = N[votingData[[2 ;;, 13]]/votingData[[2 ;;, 11]]];
collatedData = (Flatten /@
Transpose[{Table[
Select[votingData, explanationShape[[k, 2]] === #[[4]] &], {k,
2, Length[explanationShape[[1 ;;]]]}],
explanationShape[[2 ;;]]}]);
We can then prepare the data for the linear model:
inputLinearModel =
Flatten /@ Transpose[{# - Mean[#] &@ collatedData[[All, 9]], collatedData[[All, 3]], # - Mean[#] &@ (collatedData[[All, 6]]/collatedData[[All, 30]]),
N[collatedData[[All, 13]]/collatedData[[All, 11]]]}];
The steps above generate the table that defines the input into the linear model. All continuous input variables have been centralised such that their mean is zero. This is not needed but simplifies the interpretability of the results. This input variables to the linear model are turnout, i.e., the fraction of people who voted in the given local administrative area, region, i.e., the rough geographical region (non continuous, nominal variable), and a proxy for the population density.
lmodel = LinearModelFit[inputLinearModel, {turnout, region, popdensity}, {turnout, region, popdensity}, NominalVariables -> region];
lmodel["ParameterTable"]
![enter image description here][14]
The model output shows the estimates with respect to the intercept (1) in the table. The missing region is Yorkshire and The Humber such that Mathematica relates all results to this region. Note, that the estimate of 0.548879 means that Yorkshire and The Humber would have voted to leave the EU with 54.89%; this is the model output and does not necessarily match the actual result of the election. Since all continuous variables are mean zero, these 54.89% would be the result for Yorkshire and The Humber under the assumption that the have a typical UK average population density and turnout. The model can obviously be used to approximate the result for Yorkshire and The Humber better by evaluating the function with the correct values for the area. All other influencing factors are estimated with respect to this result. In other words to obtain the result for Scotland, the Scotland result would need to be added to the Yorkshire and The Humber result; so the best model guess for Scotland if otherwise indistinguishable from the while UK would have been -23.93%+54.89%=30.96%. This obviously deviates from the actual result for Scotland, which is likely caused by the other covariates, the turnout and population density in Scotland are different from the average UK value.
Next we want to analyse the same model but with data for Gibraltar and Ni added. We first need the respective areas of the regions:
fixNIGI =
Join[#[[1 ;; -2]],
ConstantArray[Missing[],
8], {#[[-1]]}, {Missing[]}] & /@ (Flatten /@
Transpose[{Select[
votingData, #[[1]] > 380 &], {QuantityMagnitude[
UnitConvert[
Entity["AdministrativeDivision", {"NorthernIreland",
"UnitedKingdom"}]["Area"], Quantity[1, (("Meters")^2)]]],
QuantityMagnitude[
UnitConvert[Entity["Country", "Gibraltar"]["Area"],
Quantity[1, (("Meters")^2)]]]}}]);
Then we can add the data to the table, create the input the linear model and estimate it:
collatedDatacomplete = Join[collatedData, fixNIGI];
inputLinearModelcomplete =
Flatten /@ Transpose[{# - Mean[#] &@ collatedDatacomplete[[All, 9]],
collatedDatacomplete[[All, 3]], # - Mean[#] &@ (collatedDatacomplete[[All, 6]]/collatedDatacomplete[[All, 30]]),
N[collatedDatacomplete[[All, 13]]/collatedDatacomplete[[All, 11]]]}];
lmodelcomplete = LinearModelFit[inputLinearModelcomplete, {turnout, region, popdensity}, {turnout, region, popdensity}, NominalVariables -> region]
lmodelcomplete["ParameterTable"]
![enter image description here][15]
The values match the result above quite well, obviously there is now a result for NI as well. The key thing about linear models is that one can combine the various areas to get a area weighted estimate for the election result. To achieve this we need to obtain the parameters from the table above.
parametermatrix = Delete[lmodelcomplete["ParameterTableEntries"], {{2}, {-1}}]
To achieve a area weighted result, we need to obtain the weights for each region which is defined as the number of entries per area divided by the total number of entries:
weights = N[#/Total[#]] &@RotateRight[SortBy[Tally[inputLinearModelcomplete[[All, 2]]], First]][[All, 2]]
As all parameters are given with respect to one area and present differences to this area, we need to add all estimates to the intercept, see example for Scotland above. This is achieved through linear algebra, by defining a so called contrast matrix.
contrast = IdentityMatrix[Length[weights]];
contrast[[1]] = ConstantArray[1, Length[weights]];
contrast = Transpose[contrast]
Multiplying the matrix to the parameter estimates gives the response functions, the model results for each area.
responsefunctions = contrast.parametermatrix[[All, 1]]
Multiplying this with the weights for each area results in a weighted mean for the UK. It is worth noting that this result is no longer influenced by the turnout and the average population density as these are 0 for the UK, as we subtracted their overall mean in the beginning.
weights.responsefunctions
So the model's referendum result would have been 52.99% in favour of leave, so higher than the actual result. Importantly, Linear Models provide errors for each parameter estimate, such that we can estimate the overall uncertainty of this result. As standard errors are not additive but variances are, we obtain the estimated standard error for this result as follows, assuming independence of the estimates
seresponsefunctions = contrast.(parametermatrix[[All, 2]]^2)
The final standard error is
Sqrt[weights.seresponsefunctions]
0.0255668. This means that best model guess is that 52.99% p/m 2.56% are in favour of leave, using a Gaussian approximation, this implies that there is a
Probability[x < -(weights.responsefunctions - 0.5)/Sqrt[weights.seresponsefunctions], x \[Distributed] NormalDistribution[0, 1]]
12.11% chance that this result was obtained although the majority of the UK electorate does actually not want to leave the EU. We emphasise that by no means are we challenging the result of the referendum, we are just arguing that given model assumptions the results obtained are actually compatible with a population split in which the majority does not want to leave the EU. In many fields such as Science and Medicine, a 12.11% 'risk' that the result is actually a false positive, i.e., the conclusions drawn and actions taken are wrong, is considered too high, 5% or 1% are typical values, in particle Physics the threshold is much lower. So it remains to be investigated if statements that the election results reflect the true opinion of the population. Again, we do not challenge the outcome of the election as there are clear rules to the election that are known in advance such as that it is everybody's free choice to vote or not and how the final result is obtained. There is just a potential non-causal link to people's real opinions, which is often assumed but not necessarily correctly so.
Getting more data...
----------
An interesting thing to check is whether we can learn something from adding additional data on the various voting districts. The data is from [here][16]. We import the file with the data
additionaldata = Import["EU-referendum-results-and-characteristics-data.xlsx"];
We also use the electoral region id data from above:
ids = collatedDatacomplete[[All, 4]];
Then we need to deal with incomplete data/clean the data up:
fillers = ConstantArray[Missing[], #] & /@ Length /@ additionaldata[[2 ;; 11, 4]];
We now can generate a fixed data table:
fulldatatable =
Table[Join[collatedDatacomplete[[k]], Flatten@Table[If[Length[#] > 0, #, fillers[[i - 1]]] &@
Select[additionaldata[[i]], MemberQ[#, ids[[k]]] &], {i, 2, 11}]] /. "" -> Missing[], {k, 1, Length[ids]}];
To better work with the data we then generate a look-up table for the various columns:
datadescriptors =
Rule @@@ (Transpose[{Range[Length[#]], #}] & @
Join[votingData[[1]], explanationShape[[1]],
additionaldata[[2, 9]], additionaldata[[3, 1]],
additionaldata[[4, 8]], additionaldata[[5, 9]],
additionaldata[[6, 8]], additionaldata[[7, 4]],
additionaldata[[8, 8]], additionaldata[[9, 4]],
additionaldata[[10, 3]], additionaldata[[11, 3]]])
![enter image description here][17]
The following shows again a part of the table of all the data we can use now:
![enter image description here][18]
We can then generate some first part of the input for the linear model.
inputLinearModel =
Flatten /@ Transpose[{# - Mean[#] &@ collatedData[[All, 9]], collatedData[[All, 3]], # - Mean[#] &@ (collatedData[[All, 6]]/collatedData[[All, 30]]),
N[collatedData[[All, 13]]/collatedData[[All, 11]]]}];
Again we need to clean the data
fulldatatable[[All, 198]] /. "#" -> Missing[]
subMissing[x_] := If[StringContainsQ[ToString[FullForm[x]], "Missing"], Missing[], x ]
SetAttributes[subMissing, Listable]
This generates the full input for the linear model:
inputLinearModelcomplete =
Flatten /@
Transpose[{fulldatatable[[All, 3]], (# - Mean[#]) &@
fulldatatable[[All,
9]], (# - Mean[#]) &@ (fulldatatable[[All, 6]]/
fulldatatable[[All, 30]]), (# -
Mean[DeleteMissing[#]]) &@ (subMissing /@ (fulldatatable[[
All, 198]]/fulldatatable[[All, 6]])), (# -
Mean[DeleteMissing[#]]) &@ (fulldatatable[[All, 243]] /.
"#" -> Missing[]), (# -
Mean[DeleteMissing[#]]) &@ (subMissing /@ (fulldatatable[[
All, 227]]/fulldatatable[[All, 6]])),
N[fulldatatable[[All, 13]]/fulldatatable[[All, 11]]]}];
Here are a couple of lines of the input data we choose for our linear model:
inputLinearModelcomplete[[1 ;; 20]] // TableForm
![enter image description here][19]
Some more cleaning:
inputLinearModelcomplete = Select[subMissing /@ inputLinearModelcomplete, ! MemberQ[#, Missing[]] &];
Next we fit the linear model:
lmodelfulldata =
LinearModelFit[inputLinearModelcomplete, {region, turnout, popdensity, highqual, earning, nchildren}, {region, turnout, popdensity, highqual,
earning, nchildren}, NominalVariables -> {region}];
lmodelfulldata["ParameterTable"]
![enter image description here][20]
It is easy to see that the income and the number of children in the household significantly influence the preference for/against Brexit. We now use the parameter matrix to compute the weights and contrasts as before:
parametermatrix = lmodelfulldata["ParameterTableEntries"][[1 ;; 12]];
weights = N[#/Total[#]] &@RotateRight[SortBy[Tally[inputLinearModelcomplete[[All, 2]]], First]][[All, 2]];
contrast = IdentityMatrix[Length[weights]];
contrast[[1]] = ConstantArray[1, Length[weights]]; contrast = Transpose[contrast];
With the contrasts we can calculate the response functions:
responsefunctions = contrast[[1 ;; 12, 1 ;; 12]].parametermatrix[[1 ;; 12, 1]]
What is the best estimate for the "will of the voters"?
----------
We an now also use the linear model to estimate what the most likely scenario (based on the model) for the "underlying will of the voter" is given the data from above and the result of the election. We first prepare the data we need, which this time includes the outcome of the election.
inputLinearModelcompletewRcode =
Flatten /@
Transpose[{fulldatatable[[All, 4]],
fulldatatable[[All, 3]], (# - Mean[#]) &@
fulldatatable[[All,
9]], (# - Mean[#]) &@ (fulldatatable[[All, 6]]/
fulldatatable[[All, 30]]), (# -
Mean[DeleteMissing[#]]) &@ (subMissing /@ (fulldatatable[[
All, 198]]/fulldatatable[[All, 6]])), (# -
Mean[DeleteMissing[#]]) &@ (fulldatatable[[All, 243]] /.
"#" -> Missing[]), (# -
Mean[DeleteMissing[#]]) &@ (subMissing /@ (fulldatatable[[
All, 227]]/fulldatatable[[All, 6]])),
N[fulldatatable[[All, 13]]/fulldatatable[[All, 11]]]}];
This is how the data looks:
inputLinearModelcompletewRcode[[1 ;; 20]] // TableForm
![enter image description here][21]
Again we clean the data a bit:
inputLinearModelcompletewRcode = Select[subMissing /@ inputLinearModelcompletewRcode, ! MemberQ[#, Missing[]] &];
Now we can use the linear model to estimate the actual percentage of yes/no votes in the different voting areas:
codepercent = Transpose[{inputLinearModelcompletewRcode[[All, 1]], MapThread[lmodelfulldata, Transpose[inputLinearModelcompletewRcode[[All, 2 ;; -2]]]]}]
First we represent the actual election outcome. Green corresponds to "Remain" and red to "Leave" votes.
Graphics[Flatten[
Table[{ColorData[
"RedGreenSplit"][((Select[
Select[votingData[[2 ;;]], #[[1]] < 381 &], #[[4]] ==
translateshpvoteclean[[k, 2]] &][[1]])[[-3]]/100.)],
localauthorities[[1, 2, 1, 2, 1, 2, 1, 1, 1]][[k]]}, {k, 1, 380}],
1], Background -> Black, AspectRatio -> GoldenRatio]
![enter image description here][22]
We can now calculate the best estimate of the Leave/Remain split based on our model:
polygonscoloured =
Table[{ColorData["RedGreenSplit"][1. - ((Select[codepercent, #[[1]] == translateshpvoteclean[[k, 2]] &][[1]])[[-1]]) /. Missing[] -> 0.],
localauthorities[[1, 2, 1, 2, 1, 2, 1, 1, 1]][[k]]}, {k, 1, Length[translateshpvote]}];
Graphics[polygonscoloured, AspectRatio -> GoldenRatio, Background -> Black]
![enter image description here][23]
The split between England and Scotland has become even more pronounced. Note that we haven't dealt very cleanly with the few missing data cases. Next, we can compare the outcome of the elections with the best estimate of the actual Remain/Leave distribution.
Next, we compare the election outcome to our best model estimate:
GraphicsRow[{
Graphics[
Flatten[Table[{ColorData[
"RedGreenSplit"][((Select[
Select[votingData[[2 ;;]], #[[1]] < 381 &], #[[4]] ==
translateshpvoteclean[[k, 2]] &][[1]])[[-3]]/100.)],
localauthorities[[1, 2, 1, 2, 1, 2, 1, 1, 1]][[k]]}, {k, 1,
380}], 1], Background -> Black, AspectRatio -> GoldenRatio],
Graphics[Table[{ColorData["RedGreenSplit"][
1 - ((Select[
codepercent, #[[1]] ==
translateshpvoteclean[[k, 2]] &][[1]])[[-1]]) /.
Missing[] -> 0.],
localauthorities[[1, 2, 1, 2, 1, 2, 1, 1, 1]][[k]]}, {k, 1,
Length[translateshpvote]}], Background -> Black,
AspectRatio -> GoldenRatio],
Graphics[Flatten[
Table[{ColorData[
"RedGreenSplit"][(0.5 +
3.*((1 - ((Select[
codepercent, #[[1]] ==
translateshpvoteclean[[k, 2]] &][[1]])[[-1]] /.
Missing[] ->
0.)) - ((Select[
Select[votingData[[2 ;;]], #[[1]] <
381 &], #[[4]] ==
translateshpvoteclean[[k, 2]] &][[1]])[[-3]]/
100.)))],
localauthorities[[1, 2, 1, 2, 1, 2, 1, 1, 1]][[k]]}, {k, 1,
380}], 1], Background -> Black, AspectRatio -> GoldenRatio]},
ImageSize -> Full]
![enter image description here][24]
On the left we have the outcome of the elections. In the middle we represent the estimate of the remain/leave ratio based on our linear model. The right figure shows the size of the "correction". The model shows a strong separation between Scotland and most regions in England. The exception in England is the region around London.
Area weighted model
----------
Next we will pool the different regions to better understand regional difference.
codepercent = Transpose[{inputLinearModelcompletewRcode[[All, 1]], MapThread[lmodelfulldata, Transpose[inputLinearModelcompletewRcode[[All, 2 ;; -2]]]]}]
We basically set all the inputs to zero apart from the region data
codepercentregion =
Transpose[{inputLinearModelcompletewRcode[[All, 1]],
MapThread[lmodelfulldata,
Transpose[(Flatten /@
Transpose[{inputLinearModelcompletewRcode[[All, 1 ;; 2]],
ConstantArray[0, {382, 6}]}])[[All, 2 ;; -2]]]]}];
GraphicsRow[{Graphics[
Flatten[Table[{ColorData[
"RedGreenSplit"][((Select[
Select[votingData[[2 ;;]], #[[1]] < 381 &], #[[4]] ==
translateshpvoteclean[[k, 2]] &][[1]])[[-3]]/100.)],
localauthorities[[1, 2, 1, 2, 1, 2, 1, 1, 1]][[k]]}, {k, 1,
380}], 1], Background -> Black, AspectRatio -> GoldenRatio],
Graphics[Table[{ColorData["RedGreenSplit"][
1 - (((If[# == {}, {Select[
votingData, #[[4]] ==
translateshpvoteclean[[k, 2]] &][[1, {4, -2}]]/{1,
100.}}, #] &@
Select[codepercentregion, #[[1]] ==
translateshpvoteclean[[k, 2]] &])[[1]])[[-1]])],
localauthorities[[1, 2, 1, 2, 1, 2, 1, 1, 1]][[k]]}, {k, 1,
Length[translateshpvote]}], Background -> Black,
AspectRatio -> GoldenRatio],
Graphics[Flatten[
Table[{ColorData[
"RedGreenSplit"][(0.5 +
3.*((1 - (((If[# == {}, {Select[
votingData, #[[4]] ==
translateshpvoteclean[[k, 2]] &][[1, {4, -2}]]/{1,
100.}}, #] &@
Select[
codepercentregion, #[[1]] ==
translateshpvoteclean[[k, 2]] &])[[
1]])[[-1]])) - (((If[# == {}, {Select[
votingData, #[[4]] ==
translateshpvoteclean[[k, 2]] &][[1, {4, -2}]]/{1,
100.}}, #] &@
Select[
Select[
votingData[[2 ;;]], #[[1]] < 381 &], #[[4]] ==
translateshpvoteclean[[k, 2]] &])[[1]])[[-3]]/
100.)))],
localauthorities[[1, 2, 1, 2, 1, 2, 1, 1, 1]][[k]]}, {k, 1,
380}], 1], Background -> Black, AspectRatio -> GoldenRatio]},
ImageSize -> Full]
![enter image description here][25]
This provides the smoothed picture as regions are pooled and weighted according to their representation in the UK. As before the leftmost image represents the outcome of the election, the middle one is the best region weighted estimate and the right figure shows the differences. The strong divide between the different regions is enhanced in this representation.
Population weighted representation
----------
The representations above have the critical drawback that equal area sizes do not necessarily correspond to equally sized electorates. We can represent everything such that area corresponds to population size. This will decrease the size of the regions with low population densities such as Scotland. The following file (attached) contains a hexagonal lattice that describes the areas in the UK.
hexagonshape = Import["GB_Hex_Cartogram_LAs.shp"]
![enter image description here][26]
The following file contains a lookup table of regions vs area codes.
hexagonlegend = Import["GB_Hex_Cartogram_LAs.dbf"];
hexagonlegend[[1 ;; 10, {3, 2}]] // TableForm
![enter image description here][27]
We now need to translate between the hexagonal structure and the voting areas from our model.
tranlatehexagonvoteclean =
Table[Select[hexagonlegend, #[[2]] == translateshpvoteclean[[i, 2]] &], {i, 1,Length[translateshpvoteclean]}][[All, 1, 1]];
We can then represent the estimate from our model in an population/area adjusted representation.
Graphics[Table[{ColorData["RedGreenSplit"][1 - (( If[# == {}, {Select[votingData, #[[4]] ==
translateshpvoteclean[[k, 2]] &][[1, {4, -2}]]/{1, 100.}}, #] &@
Select[codepercent, #[[1]] == translateshpvoteclean[[k, 2]] &])[[1]][[-1]]) /. Missing[] -> 0.],
hexagonshape[[1, 2, 1, 2]][[tranlatehexagonvoteclean[[k]]]]}, {k, 1, Length[tranlatehexagonvoteclean]}], Background -> Black]
![enter image description here][28]
Preliminary summary
----------
These calculations show that we might have to perform some sort of estimation procedure to better understand the "will of the people". The method we have used is quite standard in for example medical studies. This is, however, not a suitable way of decision making in societies. As said many times before, the decision is based on an agreed upon procedure. It is not based on a statistical analysis of the data, which as in our case is model dependent.
Bootstrapping
----------
Note, that there are different approaches to try to mathematically guesstimate what the people who did not vote think about Brexit. Alternatively to the method described above we could pool all data from all local authorities and scale up to the entire electorate (unnecessary) and bootstrap on that.
The argument of many politicians would be that the non-voters essentially behave like the voters. In that case we basically scale up the results from the voters to the entire electorate. The bootstrapping in this case would look like this:
Monitor[fullpoolresults =
Table[Count[RandomChoice[Total /@ Transpose[{N@#[[13]]/#[[11]]*#[[6]], #[[6]] - N@#[[13]]/#[[11]]*#[[6]]} & /@ (Reverse@
SortBy[votingData[[2 ;;]], #[[9]] &])] -> {"Leave", "Remain"}, 46500001], "Remain"]/46500001., {i, 1, 100}];, i]
We can fit a smooth kernel distribution to the outcomes of the bootstrapping procedure:
distfullpool = SmoothKernelDistribution[fullpoolresults]
As expected the probability that more than 50% of the electorate would have wanted to Remain is (numerically) zero.
Probability[x > 0.50, x \[Distributed] distfullpool]
(*0*)
So, under this model assumption it is certain that people really wanted Brexit.
The assumption that the electorate is that homogeneous is quite questionable. We might therefore choose to bootstrap with respect to the local authorities. In other words, we assume that it is on the authority level that voters are similar.
Monitor[localauthresults =
Table[N@(Total[#[[All, 13]]/#[[All, 11]]*#[[All, 6]]]/
Total[#[[All, 6]]]) &@
RandomChoice[(Reverse@SortBy[votingData[[2 ;;]], #[[9]] &])[[All,
6]] -> (Reverse@SortBy[votingData[[2 ;;]], #[[9]] &]),
Length[votingData[[2 ;;]]]], {i, 1, 1000}], I];
This bootstraps for the percentage of Remain votes. We can then again fit a distribution:
distlocalauthorities = SmoothKernelDistribution[localauthresults]
and plot the result:
Plot[PDF[distlocalauthorities, x], {x, 0.45, 0.55}, PlotRange -> All,
FrameLabel -> {"percent leave", "probablity density"},
LabelStyle -> Directive[Bold, 18], ImageSize -> Large,
PlotTheme -> "Marketing"]
![enter image description here][29]
This means that the average of the bootstrap results for Remain votes is below 50%.
Mean[distlocalauthorities]
0.494498. This is, however, quite close to 50%. There is a substantial area above the 50% level:
Probability[x > 0.50, x \[Distributed] distlocalauthorities]
0.17122. This suggests that there is a 17% probability that the voter actually wanted to tell us that they want to remain. This is a substantial fraction/probability. Under these assumptions it is by no means clear that the electorate wanted to leave. Once again, this is not how we take democratic decisions. The decision making process is determined by the outcome of the election and not by an estimate of the voters' will. The point is that it is questionable whether politicans can claim that the voters' will is clear.
The last approach is to bootstrap over the local authorities according to the number of ballots cast:
localvotersresults =
Table[N@(Total[#[[All, 13]]]/Total[#[[All, 11]]]) &@RandomChoice[(Reverse@SortBy[votingData[[2 ;;]], #[[9]] &])[[All, 11]] -> (Reverse@SortBy[votingData[[2 ;;]], #[[9]] &]), Length[votingData[[2 ;;]]]], {10000}];
Once more we estimate the distribution:
localvotersdist = SmoothKernelDistribution[localvotersresults]
and plot the resulting probability density function:
Plot[PDF[localvotersdist, x], {x, 0.45, 0.55}, PlotRange -> All,
FrameLabel -> {"percent leave", "probablity density"},
LabelStyle -> Directive[Bold, 18], ImageSize -> Large,
PlotTheme -> "Marketing"]
![enter image description here][30]
Probability[x < 0.50, x \[Distributed] localvotersdist]
0.487381. So under this assumption we would interpret the voters' will as nearly completely undecided. There is a 50%/50% split, i.e. it is practically equally likely that the voters wanted to remain as that voters wanted to leave.
Conclusion
----------
"The will of the voters" is difficult to understand in close elections. It is by no means as clear that the winning alternative of an election represents the "will of the people" when elections are very close. Politicians often like to stress that they represent the will of the people after winning an election. Mathematically speaking, this is not necessarily true. Note, that this is irrelevant for the decision making process. As a society we have decided to abide by fixed electoral rules. Only people who actually vote will make their voices heard. We can mathematically try to figure out what the entire electorate wanted, or to be precise make probability statements about what they wanted. These statements are clearly model dependent.
[1]: http://community.wolfram.com/groups/-/m/t/1221950
[2]: http://community.wolfram.com/web/bschelter
[3]: https://www.electoralcommission.org.uk/find-information-by-subject/elections-and-referendums/past-elections-and-referendums/eu-referendum/electorate-and-count-information
[4]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1600.47.30.png&userId=48754
[5]: http://esriuktechnicalsupportopendata-techsupportuk.opendata.arcgis.com/datasets/686603e943f948acaa13fb5d2b0f1275_0
[6]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1600.55.48.png&userId=48754
[7]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1600.59.21.png&userId=48754
[8]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.01.56.png&userId=48754
[9]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.09.35.png&userId=48754
[10]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.09.59.png&userId=48754
[11]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.12.00.png&userId=48754
[12]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.12.45.png&userId=48754
[13]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.13.40.png&userId=48754
[14]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.17.37.png&userId=48754
[15]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.20.53.png&userId=48754
[16]: https://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0ahUKEwihifGz0cHXAhXHKcAKHdtTDU0QFggmMAA&url=http://researchbriefings.files.parliament.uk/documents/CBP-7639/EU-referendum-results-and-characteristics-data.xlsx&usg=AOvVaw3RFJU1dCOJM-tukE0VZGpx
[17]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.28.13.png&userId=48754
[18]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.29.24.png&userId=48754
[19]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.32.00.png&userId=48754
[20]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.33.34.png&userId=48754
[21]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.37.36.png&userId=48754
[22]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.39.19.png&userId=48754
[23]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.40.41.png&userId=48754
[24]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.42.25.png&userId=48754
[25]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.44.22.png&userId=48754
[26]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.47.25.png&userId=48754
[27]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.49.14.png&userId=48754
[28]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.51.29.png&userId=48754
[29]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.55.45.png&userId=48754
[30]: http://community.wolfram.com//c/portal/getImageAttachment?filename=Screenshot2017-11-1601.58.42.png&userId=48754Marco Thiel2017-11-16T02:01:31ZClassifying Japanese characters from the Edo period
http://community.wolfram.com/groups/-/m/t/1221098
Introduction
----------
I recently came across a post for a computer program that in some fields intends to compete with the Wolfram Language and which has a toolbox for Machine Learning. I wanted to compare the work described with the workflow in Mathematica. The challenge is to classify old Japanese Characters from texts from the so-called Edo period:
Style[StringTake[WikipediaData["Edo period"], 601], 16]
![enter image description here][1]
The characters are written in running style and apparently are difficult to read for Japanese speakers today. We will use two approaches of machine learning in the Wolfram Language to tackle this problem.
Historical Background
-----------------------------
Before we proceed, let's have a look at the Tokugawa shogunate. I will use code borrow from [@Vitaliy Kaurov][at0] from his great article on ["Computational history: countries that are gone"][2]. The following lines compute the historic boundaries of the shogunate:
Monitor[tokugawaPOLY =
Table[EntityValue[Entity["HistoricalCountry", "TokugawaShogunate"],
EntityProperty["HistoricalCountry",
"Polygon", {"Date" -> DateObject[{t}]}]], {t, 1600, 1870}];, t]
We then only filter for "changes" in the boundaries:
tokugawaPOLY // Length
tokugawaPOLYcomp =
DeleteMissing[
DeleteDuplicates[Transpose[{Range[1600, 1870], tokugawaPOLY}],
Last[#1] == Last[#2] &], 1, 2];
tokugawaPOLYcomp // Length
In the 271 I consider in the first place there are only 4 changes to the boarders, which we can plot like so:
GeoGraphics[{EdgeForm[Red], GeoStyling[Opacity[.1]], #} & /@
tokugawaPOLYcomp[[All, 2]], GeoProjection -> "Mercator",
ImageSize -> 800, GeoBackground -> GeoStyling["StreetMap"],
GeoRange -> Entity["Country", "Japan"],
GeoRangePadding -> Quantity[800, "Kilometers"], GeoZoomLevel -> 6]
![enter image description here][3]
So the boarders were indeed very stable for a long period reflecting the "no more wars" philosophy of the day.
Arts and culture
----------
The stability led to a very rich art and literature scene. Here are some images that reflect the style of the time:
ImageCollage["Thumbnail" /. Normal[Normal[WebImageSearch["Edo Period"]]], Method -> "Rows"]
![enter image description here][4]
Our dataset
----------
Also literature flourished during this period. Our dataset will be taken from the [Center for Open Data in the Humanities][5]. The keen reader might have noticed that the page is in Japanese. Mathematica can translate the page, and [so can Google][6]. The dataset is called the "Japanese classic registered glyph data set"; 日本古典籍字形データセット.
TextTranslation["日本古典籍字形データセット"]
"Japan classics-type datasets". Following the first link (grey box) on the original Japanese page one gets to t[his page][7], from the bottom of which (look for a download symbol and the text "(TAR+GZ 24.57 MB)" the dataset can be downloaded. This is the google translation of the reference:
> Japanese Classic Statement Character Data Set" (Kokugaku Kenkyu Other
> Collection / CODH Processing)
which in the original is "日本古典籍字形データセット』（国文研ほか所蔵／CODH加工".
The glyphs come from "bamboo letters" from 15 classic texts; the dataset contains more than 20000 glyphs of the 10 most frequent symbols. These letters were annotated by the "National Institute of Letters of Japan" (google). The dataset was also used in a [data/machine learning challenge][8].
After extracting the file to a folder on the desktop I can import the images with:
characterImages = Import["/Users/thiel/Desktop/pmjt_sample_20161116/train_test_file_list.h5", "Data"];
Here is a sample of the glyphs:
Grid[Partition[ImageAdjust[Image[#/256.]] & /@ characterImages[[3, 1 ;; 24]], 6], Frame -> All]
![enter image description here][9]
Similar to the MNIST dataset, they consist of 28 times 28 pixels. They give the grey level as integers, which is why I divide by 256. This shows the pixelation of the glyphs:
Image[characterImages[[3, 12]]/256.] // ImageAdjust
![enter image description here][10]
The dataset conveniently contains a training and a test set. There are
characterImages[[3]] // Length
19909 glyphs in the training set. All of these are annotated:
Flatten[characterImages[[4]]]
![enter image description here][11]
We can convert the training data to images like so:
Monitor[trainingset = Table[ImageAdjust[Image[characterImages[[3, k]]/256.]] -> Flatten[characterImages[[4]]][[k]], {k, 1, 19909}];, k]
The test set contains
characterImages[[1]] // Length
3514 glyphs. The annotation of which are found in
characterImages[[2]] // Flatten // Short
![enter image description here][12]
This is how we prepare the test set data for our machine learning:
Monitor[testset = Table[ImageAdjust[Image[characterImages[[1, k]]/256.]] -> Flatten[characterImages[[2]]][[k]], {k, 1, 3514}];, k]
What are the 10 most frequent symbols?
----------
In one of the annotation files we find the character codes of the symbols so that we can map the classification to actual symbols. These are the glyphs we will consider:
Rasterize /@ (Style[
FromCharacterCode[
ToExpression["16^^" <> StringJoin[Characters[#]]]],
100] & /@ {"3057", "306B", "306E", "3066", "308A", "3092",
"304B", "304F", "304D", "3082"})
![enter image description here][13]
We can also make an image collage of them:
ImageCollage[
Image /@ (Rasterize /@ (Style[
FromCharacterCode[
ToExpression["16^^" <> StringJoin[Characters[#]]]],
100] & /@ {"3057", "306B", "306E", "3066", "308A", "3092",
"304B", "304F", "304D", "3082"})), Method -> "Rows"]
![enter image description here][14]
I though it would be nice if they were in a more calligraphic form. So I tried to used ImageRestyle to try and make them more calligraphic. I think that the attempt was not very successful, but I want to show it anyway.
First I download a couple of calligraphic symbols:
caligraphy = "Thumbnail" /. Normal[Normal[WebImageSearch["japanese calligraphy"]]]
I liked the 5th one, which I will use for the style transfer:
calisymbols =
ImageRestyle[#, caligraphy[[5]],
PerformanceGoal ->
"Quality"] & /@ (Rasterize /@ (Style[
FromCharacterCode[
ToExpression["16^^" <> StringJoin[Characters[#]]]],
100] & /@ {"3057", "306B", "306E", "3066", "308A", "3092",
"304B", "304F", "304D", "3082"}))
![enter image description here][15]
We can also remove the background to get an alternative representation:
![enter image description here][16]
Machine Learning (Classify)
----------
Our first approach to classify the glyphs will be via Classify. The classification is very fast (only a couple of seconds on my machine) and quite accurate:
standardcl = Classify[trainingset, ValidationSet -> testset, PerformanceGoal -> "Quality"]
We can first calculate the ClassifierMeasurements
cmstandard = ClassifierMeasurements[standardcl, testset]
and the accuracy:
cmstandard["Accuracy"]
With 92.5% this is quite accurate and also much faster than the competitor product. The confusion plot shows that it is a very reasonable accuracy - in particular when considering that native speakers have problems identifying the glyphs.
cmstandard["ConfusionMatrixPlot"]
![enter image description here][17]
Here are some glyphs that the Classifier identified as class "0":
cmstandard["Examples" -> {0, 0}][[1 ;; 20]]
![enter image description here][18]
They look quite diverse, but so does the original set:
RandomSample[Select[trainingset, #[[2]] == 0 &], 20]
![enter image description here][19]
It is quite impressive that the Classify function manages to achieve this level of accuracy. We can, however, do better!
Machine Learning (NetTrain)
----------
We can improve the accuracy by training our own custom made network, which in this case will be the standard net that is often used for the MNIST dataset in Mathematica. It turns out that we do not even need a GPU for the training.
lenet = NetChain[{ConvolutionLayer[20, 5], Ramp, PoolingLayer[2, 2],
ConvolutionLayer[50, 5], Ramp, PoolingLayer[2, 2], FlattenLayer[],
500, Ramp, 10, SoftmaxLayer[]},
"Output" -> NetDecoder[{"Class", Range[0, 9]}],
"Input" -> NetEncoder[{"Image", {28, 28}, "Grayscale"}]]
![enter image description here][20]
This we can then train:
lenet = NetTrain[lenet, trainingset, ValidationSet -> testset, MaxTrainingRounds -> 20];
![enter image description here][21]
The training is done in about 4:30 minutes, but it appears that 20 training rounds are not necessary. Here are some results when applying the classifier:
imgs = Keys@RandomSample[testset, 60]; Thread[imgs -> lenet[imgs]]
![enter image description here][22]
Let's see how well it does:
cm = ClassifierMeasurements[lenet, testset]
cm["ConfusionMatrixPlot"]
![enter image description here][23]
This corresponds to an accuracy of
cm["Accuracy"]
95.6%, which is really impressive given the dataset. The following table illustrates the difficulty:
Grid[Table[
Prepend[Select[trainingset, #[[2]] == k &][[1 ;; 10]][[All, 1]],
ImageResize[calisymbols[[k + 1]], 28]], {k, 0, 9}], Frame -> All,
Background -> {{Red}, None}]
![enter image description here][24]
The first (red) column is derived from the glyph based on the character code - using our "calligraphy-style". The remaining columns show items from the training set, i.e. glyphs that were manually classified to belong to the given group.
Conclusion
----------
I do not speak/read Japanese and have no more than Wikipedia knowledge about the Edo period. I still find it quite amazing that using the Wolfram Language's high level Classify function and one of the pre-implemented networks, we can achieve quite a remarkable classification. I am sure that the network can be improved. I would also like to see a more complete dataset with more than 10 different symbols.
[at0]: http://community.wolfram.com/web/vitaliyk
[1]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-13at23.24.03.png&userId=48754
[2]: http://community.wolfram.com/groups/-/m/t/852277
[3]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-13at23.30.18.png&userId=48754
[4]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-13at23.33.24.png&userId=48754
[5]: http://codh.rois.ac.jp/char-shape/
[6]: https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=http://codh.rois.ac.jp/char-shape/
[7]: http://codh.rois.ac.jp/char-shape/book/
[8]: https://translate.google.com/translate?hl=en&sl=ja&tl=en&u=http://codh.rois.ac.jp/seminar/japanese-character-challenge-20170210/&sandbox=1
[9]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-13at23.56.19.png&userId=48754
[10]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-13at23.57.57.png&userId=48754
[11]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-13at23.59.36.png&userId=48754
[12]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-14at00.01.19.png&userId=48754
[13]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-14at00.04.22.png&userId=48754
[14]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-14at00.05.04.png&userId=48754
[15]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-14at00.07.44.png&userId=48754
[16]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-14at00.08.53.png&userId=48754
[17]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-14at00.15.09.png&userId=48754
[18]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-14at00.17.07.png&userId=48754
[19]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-14at00.18.25.png&userId=48754
[20]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-14at00.21.43.png&userId=48754
[21]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-14at00.22.50.png&userId=48754
[22]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-14at00.27.23.png&userId=48754
[23]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-14at00.28.27.png&userId=48754
[24]: http://community.wolfram.com//c/portal/getImageAttachment?filename=ScreenShot2017-11-14at00.40.41.png&userId=48754Marco Thiel2017-11-14T00:49:30Z[✓] Get list of industry segments using FinancialData?
http://community.wolfram.com/groups/-/m/t/1222525
In the helpfile I found two examples to get a list of industry-segments:
Take[FinancialData["DrugManufacturers-Major", "Members"], 20]
FinancialData["Publishing-Books", "Members"]
I would like to get a list of companies from different industrial area's. I tried several names like for example `FinancialData["banks", "Members"]` with the same result: Missing["NotAvailable"].
Is there a list of industrie-segments availableMichiel van Mens2017-11-15T08:17:12Z