Message Boards

Avoid Interpreter ZIP Code issues? // Unreliable access to knowledge base


Hi, Wolfram Community.

I've been trying to run a line like this:

zips = Interpreter["ZIPCode"] /@ {"95014", "01545", "94087", "95129", "01810", "10471", "02067", "01720", ... }

over a list of some 1200 ZIP codes. The goal is to have each element of the list recognized as a ZIP code entity, and then to assign each of those ZIP codes a scalar value. (See here.)

Every time I run the line I get a different result; it seems that, depending on whether the knowledge base is available, sometimes the codes are turned into entities and sometimes they are not.

Here are a few screenshots of my output. At first everything looks fine:

[screenshot: output with all codes interpreted correctly]

Then the trouble starts, in a different flavor every time:

[screenshots: outputs with various Failure results]

Does anyone know what is going on here and, most importantly, how I can perform this computation reliably? Any ideas are welcome.

POSTED BY: Jorge Mahecha
8 months ago

Dear Jorge,

that does sometimes happen when one requests lots of data from the servers and the internet connection is a bit flaky or the server is very busy. Here are some remarks:

  1. You apply the Interpreter function one by one:

    zips = Interpreter["ZIPCode"] /@ {"95014", "01545", "94087", "95129", "01810", "10471", "02067", "01720"}

    Instead you might want to try running them "in one go":

    zips = Interpreter["ZIPCode"][{"95014", "01545", "94087", "95129", "01810", "10471", "02067", "01720"}]

    That is much more efficient and saves time. You should get better results, but it does not necessarily resolve your problem.

  2. This is not a really good solution, but you can try to iterate the procedure. For example you can run it once like so:

    zip1 = Transpose[{#, Interpreter["ZIPCode"][#] } & @(ToString /@ Range[85001, 85055])]

    where I use the Range command to generate a list of zip codes. This command leads to a result like this:

[screenshot: the resulting list of {string, entity} pairs]

Now we can iterate the procedure until there is no change:

iterativeList=NestWhileList[(Transpose[{#, Interpreter["ZIPCode"][#] } & @Select[#, (Head[#[[2]]] === Failure) &][[All, 1]]]) &, zip1, Unequal, All]

The idea is to select those that have not been interpreted correctly and do so until there is no change. The result would be this:

Join[Select[zip1, Head[#[[2]]] === Entity &], DeleteDuplicatesBy[Reverse@Flatten[iterativeList, 1], #[[1]] &]]

In my case that still contains some Failures, but we can eliminate them like this:

DeleteDuplicates[Select[Join[Select[zip1, Head[#[[2]]] === Entity &], DeleteDuplicatesBy[Reverse@Flatten[iterativeList, 1], #[[1]] &]], Head[#[[2]]] === Entity &]]

It is not quite ideal, but you get slightly better results.
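The same retry-until-stable idea can also be written a bit more compactly with FixedPoint. This is only a sketch under the same assumptions as above (zip1 is a list of {string, result} pairs; retryFailures is a helper name I made up):

    (* Sketch: re-interpret only the entries that came back as Failure, and
       stop when a pass produces no change in the list of pairs. *)
    retryFailures[pairs_] := Module[{failed, retried},
      failed = Select[pairs, Head[#[[2]]] === Failure &][[All, 1]];
      If[failed === {}, Return[pairs]];
      retried = Transpose[{failed, Interpreter["ZIPCode"][failed]}];
      Join[Select[pairs, Head[#[[2]]] === Entity &], retried]
    ]
    cleaned = FixedPoint[retryFailures, zip1, 10]  (* cap at 10 passes *)

Like the NestWhileList version, this cannot guarantee success if the server keeps refusing the same codes, but the cap keeps it from looping forever.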

  3. You could also try this:

    Entity["ZIPCode", #]&/@ (ToString /@ Range[85001, 85055])

That is much, much faster, at least in my case. Perhaps you can try it and let me know how it works for you.



PS: I assume that the list of zip codes is nothing secret. Could you post it, i.e. attach it as a CSV file or similar?

POSTED BY: Marco Thiel
8 months ago

Dear Marco, Thank you very much for your valuable insights.

There are several things I'd like to comment on. First, long story short: I was lucky enough to run the whole thing once and get no errors at all. Only once in the 10 or more times I've tried. I immediately saved the results, of course. If any of the higher powers are reading this too, please know that this is undoubtedly 100% on the Wolfram servers, I'm sorry to say. I hope this issue can be improved.

I compared the one-by-one and all-at-once options you mentioned, and in fact the all-at-once version ran in roughly a third of the time the one-by-one did:

zips1 = Timing[Interpreter["ZIPCode"] /@ {"95014", "01545", ... }] gives 294.467 seconds (and a ton of errors), while zips2 = Timing[Interpreter["ZIPCode"][{"95014", "01545", ...}]] gives 102.366 seconds, with a fair share of errors too.

After cleaning up the data I came up with a list of 1177 zip codes and student counts (which I'm uploading for the sake of the exercise; it's not secret, indeed). After I got the zip code list with no errors, I associated it with the number of students per zip code:

zipst = Transpose[{zips, students}]

And then did this:


And I obtained the following: [screenshot: the resulting map]

Which is great in principle, but it introduces a different set of challenges, like zooming in on relevant regions to make the colors actually visible. I'm playing now with different options for GeoRegionValuePlot to improve this visualization.
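For the zooming problem, one option (just a sketch; it assumes zipst is the list of {entity, count} pairs built above) is to pass an entity to the GeoRange option so the plot is clipped to a single state:

    (* Sketch: restrict the plot to Massachusetts via GeoRange *)
    GeoRegionValuePlot[zipst, GeoRange -> Entity["AdministrativeDivision", {"Massachusetts", "UnitedStates"}]]

GeoRange also accepts explicit {{latMin, latMax}, {lonMin, lonMax}} bounds if you want finer control over the window.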

The data for this problem is attached (ZipSt.xls).

POSTED BY: Jorge Mahecha
8 months ago

Hi Jorge,

thank you for posting the data. It makes it easier to understand what we are talking about. I am not really very familiar with this type of thing, but here is something that might be useful:

zipstudents = Import["~/Desktop/ZipSt.xls"][[1, 2 ;;]];
zipentities = Entity["ZIPCode", ToString[#]] & /@ zipstudents[[All, 1]];
entityzipstudents = Transpose[Join[{zipentities}, Transpose[zipstudents]]];

puts the data in a useful format. It is faster than the Interpreter approach and does a reasonable job.

Then you can use the new DynamicGeoGraphics:

DynamicGeoGraphics[Flatten[{EdgeForm[Black], FaceForm[ColorData["TemperatureMap"][Log[#[[2]]]/Log[417.]]], 
Polygon[#[[1]]]} & /@ Select[entityzipstudents[[All, {1, -1}]], Head[#[[1]]["Polygon"]] =!= Missing &]]]

You should obtain a dynamic interface. It is a bit sluggish, but it works:

[screenshot: the dynamic map interface]

You can move the centre of the image with the mouse and use the +/- at the lower right corner to zoom in or out. It is more responsive if you first zoom in a bit and then move the centre.



PS: The colour-scaling is of course a matter of taste.

POSTED BY: Marco Thiel
8 months ago

It also appears that there were lots of attendees from around the Boston area:

[screenshot: map of the Boston area]

Given that you are from Boston College

StringTake[WikipediaData["Boston College"], 1985]

[screenshot: Wikipedia excerpt on Boston College]

you will probably be interested in that area. You can also calculate the distance between the different zip code areas and the Boston College:

Quiet[distances = 
  GeoDistance[Entity["University", "BostonCollege::m4rnc"], #] & /@  cleanentities[[All, 1]]]

Here is a histogram of these distances:

Histogram[Cases[distances, _Quantity], 200, ImageSize -> Large, 
 PlotTheme -> "Marketing", LabelStyle -> Directive[Bold, Medium]]

[screenshot: histogram of distances]

Of course this is not really a fair histogram, because there are different numbers of participants, so we have to include that:

weighteddistances = 
Cases[Flatten[ConstantArray[#[[1]], #[[2]]] & /@ Transpose[{distances, Floor /@ cleanentities[[All, -1]]}]], _Quantity];

and then

Histogram[weighteddistances, 200, ImageSize -> Large, PlotTheme -> "Marketing", LabelStyle -> Directive[Bold, Medium]]

[screenshot: histogram of weighted distances]

The average travel distance is (ignoring the zip codes we could not identify):

Mean@QuantityMagnitude@UnitConvert[weighteddistances, Quantity[1, "Miles"]]

1026.43 miles.



POSTED BY: Marco Thiel
8 months ago

Thank you, Marco.

I just want to add a couple of things to your amazing contributions. First, in regard to the issue of actually getting the data into the form that is required, I got this great suggestion from the people at tech support. It involves defining an object called data:

data = {"95014", "01545", "94087", "95129", "01810", ...}

And then turning it into entities:

ziplist = Map[Entity["ZIPCode", #] &, data]

Provided that you have a curated list where every single entry is an actual zip code, this procedure seems to work reliably. Working in this way, I was able to get a map of an interesting region that you identified as well (MA) by doing this:

GeoRegionValuePlot[zipst, GeoRange -> {{41.5, 43.}, {-72., -70.}}, 
 GeoLabels -> (Tooltip[#1, ZIPCodeData[#2, "Cities"]] &)]

This results in a nice map with tooltips that looks like this:

[screenshot: map with city tooltips]

If you export to an HTML file, the tooltips show up. Many thanks for taking the time to discuss these issues.
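The export step itself can be as simple as the following sketch (plot is a hypothetical name assumed to hold the GeoRegionValuePlot output above):

    (* Sketch: write the plot out as an HTML page *)
    plot = GeoRegionValuePlot[zipst, GeoRange -> {{41.5, 43.}, {-72., -70.}}, GeoLabels -> (Tooltip[#1, ZIPCodeData[#2, "Cities"]] &)];
    Export["zipmap.html", plot]

The file name "zipmap.html" is just an example, of course.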

POSTED BY: Jorge Mahecha
8 months ago
