
How to speed up entity queries?

Posted 7 years ago

Mathematica provides access to a huge amount of curated data. But most of this is so slow and so inconvenient to retrieve that it is literally next to useless.

Take this simple query as an example:

t = AbsoluteTime[];
Entity["Plant", "Species:GlycineMax"]["TaxonomyGraph"] // AbsoluteTiming
AbsoluteTime[] - t

This took 3.5 minutes (!!!) on my machine, despite AbsoluteTiming reporting merely 9 seconds. On a second try, after restart, it took 4.5 minutes.

This is a typical problem when retrieving curated data. Even the 9 seconds would be much too slow for anything other than a one-time interactive query. Use in a program (loop) is out of the question.

Is there a fix for this kind of problem?

Given the great amount of effort Wolfram put into developing this functionality, why are these basic usability issues not being fixed? I am a bit puzzled because being "knowledge-based" is the main marketing point of the Wolfram Language.

Does anyone on this forum seriously use these functions? If yes, how can you manage the terrible performance?

POSTED BY: Szabolcs Horvát
4 Replies
Posted 2 years ago

This thread is a few years old, so I don't know whether this existed when the post was made, and I definitely lack the expertise of the other commenters. However, as of version 13.0 there is a useful function called

EntityPrefetch[] (*For example EntityPrefetch["Plant"] *)

In[1]:= t = AbsoluteTime[];
EntityPrefetch["Plant"] // AbsoluteTiming
AbsoluteTime[] - t

Out[2]= {4.69674, Success[
 "Prefetch", <|"MessageTemplate" -> "Prefetch successful.", 
   "Values" -> 26194950, "Type" -> "Plant"|>]}

Out[3]= 4.7544438

That mostly bypasses the issue. I ran the test and it took no more than 6 minutes to access 4000 entries, with the data collected correctly. Sorry I didn't document the whole process: once you have downloaded the data for an entity type you can't time it again unless you delete the downloaded files, and I do not know where those files are located.

POSTED BY: Updating Name
Posted 7 years ago

Never display entities:

t = AbsoluteTime[];
AbsoluteTiming[
  ent = Entity["Plant", "Species:GlycineMax"]["TaxonomyGraph"]][[1]]
AbsoluteTime[] - t

2.05579

2.057142

AbsoluteTiming[ToBoxes[ent]][[1]]

79.7398

A previous spelunking session showed me that EntityValue makes calls to Internal`MWACompute, which if I remember correctly just calls the Wolfram|Alpha API (you can actually completely spelunk how it makes these calls I believe; haven't figured out how to abuse that yet.)

The display call clearly asks for way too much data, which it stores in $UserBaseDirectory/Knowledgebase. So I think that for this plant dataset, the first time you evaluate it, it downloads a bunch of data, which causes the slowdown you see.

I tried to illustrate that:

retDat =
  AssociationMap[
   With[{
      ent = AbsoluteTiming[RandomEntity[#]], 
      size = AbsoluteTiming[EntityValue[#, "EntityCount"]]},
     <|
      "Size" -> size[[2]],
      "DisplayTime" -> AbsoluteTiming[ToBoxes[ent[[2]]]][[1]],
      "RetrievalTime" -> ent[[1]],
      "SizeRetrievalTime" -> size[[1]]
      |>
     ] &,
   entNames (* A cached version of EntityValue[] *)
   ];

ListPlot[
 KeyValueMap[
  Callout[{#2["Size"], #2["DisplayTime"]}, #] &,
  retDat
  ]
 ]

[Image: plot of display time versus entity count, labeled by entity type]

But this seems pretty random, so I don't really know. Maybe RandomEntity is messing with things. More likely I'm just wrong about that. The data may also be thrown off if EntityValue retrieves via a paclet mechanism.

Unfortunately I've got no real way to work around this slowdown, except for never displaying an Entity (which I try never to do).
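As a rough sketch of the "never display" approach: end every entity computation with a semicolon and inspect only cheap, plain-text summaries. The variable names `plants` and `graphs` are made up for illustration, and I have not timed this on a fresh cache.

```wolfram
(* Sketch only: plants and graphs are illustrative names. *)
plants = {Entity["Plant", "Species:GlycineMax"]};  (* semicolon: the entity is never typeset *)

(* One batched EntityValue call for the whole list, again with the output suppressed. *)
graphs = EntityValue[plants, "TaxonomyGraph"];

(* Look at heads and plain-string names instead of displaying the entities themselves. *)
Head /@ graphs
CommonName[plants]
```

The point is only that the front end never has to build boxes for an Entity object, which is the step that took 79.7 seconds above.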

POSTED BY: b3m2a1

Here AbsoluteTiming reported 8.49393 seconds for the Entity[] call, while the AbsoluteTime difference was 127.83453 seconds (2 min 7 s). This is Mathematica 10.4 on Windows 10 64-bit Pro, update 1709. If you do the same thing again, you get 0.0114 vs. 10.7457 (computers are obviously intended to do things again).

But now, keep your socks on: if the notebook is closed (so Mathematica exits too) and the notebook is then opened again, it once more shows 'Initializing Knowledge Base Connection' but returns in 4.68854 vs. 6.31252. Then the question is: what has been returned?


Usually the solution is to localize or cache the data you need, if that is possible. Then you only need to check from time to time whether the curators have changed something.
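A minimal sketch of such a local cache, assuming the values you need survive a round trip through Put and Get. The names `cacheFile`, `cache` and `cachedEntityValue` are hypothetical, not built-in functions, and the file location is an arbitrary choice.

```wolfram
(* Sketch only: cacheFile, cache and cachedEntityValue are hypothetical names. *)
cacheFile = FileNameJoin[{$UserBaseDirectory, "entity-cache.m"}];

(* Load the cache written by an earlier session, or start with an empty one. *)
cache = If[FileExistsQ[cacheFile], Get[cacheFile], <||>];

cachedEntityValue[ent_, prop_] :=
 Module[{key = ToString[{ent, prop}, InputForm]},
  If[! KeyExistsQ[cache, key],
   (* Slow path: ask the knowledge base once, then save the result to disk. *)
   (* Note that a Missing[...] or failed result would be cached as well. *)
   cache[key] = EntityValue[ent, prop];
   Put[cache, cacheFile]];
  cache[key]]

(* Usage: the trailing semicolon avoids displaying the result. *)
tax = cachedEntityValue[Entity["Plant", "Species:GlycineMax"], "TaxonomyGraph"];
```

To pick up changes made by the curators, delete the cache file (or individual keys) from time to time and let the values be fetched again.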

POSTED BY: Udo Krause

I'm also curious about which step takes how much time. Is it setting up the connection? Is it the (large) size of the data? Is it the interpretation? Is it the database lookup?

2 minutes 50 seconds on my machine, by the way…

POSTED BY: Sander Huisman