Retrieve a property for *all* stars with StarData?

Posted 2 years ago
6 Replies
2 Total Likes

We can easily get properties of individual stars with StarData, i.e. StarData[star, property]. But this is not very useful. I might as well look up the star in Wikipedia or elsewhere.

Such a database becomes useful not as an encyclopedia, but as a comprehensive dataset which we can use to calculate various statistics and relationships between properties, for example to plot a Hertzsprung–Russell diagram.

How can I retrieve a property of all stars in the database?

This seems to work in principle, but in practice it takes such a long time that it is plainly unusable. It would probably take more than an hour to finish.

magn = StarData[EntityClass["Star", All], "AbsoluteMagnitude"];

In that time I could google up several easier to use databases, download the data, figure out how to import it to Mathematica, etc. And that's just one property.

Is there a better way then? Is there something wrong with the syntax I am using that causes this to be so slow? If not, then what is the point of all these *Data functions, given that several of them are practically unusable unless we're satisfied with looking at items (stars) one by one? Is there anyone on this forum who is able to make real practical use of this function? If yes, how?

StarData contains about 100,000 stars. That's not a lot. 100,000 floating point numbers take less than a megabyte of storage, and arithmetic on such arrays typically takes less than a millisecond.

6 Replies

Hi Szabolcs,

If I do:

magn = StarData[All, "AbsoluteMagnitude"];

I get something like this:

[screenshot not preserved]

That finishes in a minute or so (in steps of 2500). But then it starts downloading more data, this time in steps of only 64, which indeed takes forever. Are you seeing the same? I agree, this should take no more than a second. But all the *Data functions in Mathematica have been notoriously slow since their introduction (V6?).

I agree, steps of 2500 (not to mention 64) seem ridiculously small by today's standards.

I honestly don't know what takes so long: is it the connection method, the protocol, the server, the databases?

Yes, that is what happens. I think that first it downloads all the stars, then it downloads the property for each.

Not all *Data functions are so slow. The new, Entity-based ones are much worse than the older ones.

It turns out that AstronomicalData is actually usable here

magn = AstronomicalData["Star", "AbsoluteMagnitude"];

This first downloads a big part (or all?) of the AstronomicalData dataset, which is then kept on the hard drive. The download completes in a couple of minutes. Once that is done, it's still not exactly fast, but it is unquestionably usable. It evaluates in about 22 seconds on my computer.

Compare that to StarData, which will re-download for every session and is plainly unusable.

There was another thread like this one that found different behavior for StarData and AstronomicalData:

The only result there was to use parallelisation, because each kernel then creates its own connection and downloads data...
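A minimal sketch of that parallel approach, assuming that StarData can be called from subkernels and that each subkernel opens its own connection, as described above. The chunk size of 2500 and the variable names are arbitrary choices for illustration, and no speed-up is guaranteed if the server throttles requests.

```mathematica
(* Entity list for all stars; this call is itself slow according to this thread *)
stars = StarData[];

(* Split into chunks of at most 2500 entities; the last chunk may be shorter *)
chunks = Partition[stars, 2500, 2500, 1, {}];

(* Each subkernel requests the property for the chunks it is given;
   ParallelMap launches the configured subkernels if none are running *)
magn = Join @@ ParallelMap[StarData[#, "AbsoluteMagnitude"] &, chunks];
```

Smaller chunks spread the work more evenly across kernels, at the price of more round trips to the server.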

I like the EntityValue method, since you can have it return a nested Association

EntityValue[StarData[], {"EffectiveTemperature", "AbsoluteMagnitude"}, "EntityPropertyAssociation"];

But it's running unbelievably slowly for me today, with the same timing as the StarData[.....] method.
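Once the download does finish, the nested Association could be turned into a Hertzsprung–Russell diagram roughly as follows. This is only a sketch: it assumes that each inner association lists the two properties in the order they were requested, that every value is a number, a Quantity, or a Missing[...], and that ListPlot in your version accepts ScalingFunctions -> "Reverse". The names data, rows and pts are illustrative.

```mathematica
data = EntityValue[StarData[],
   {"EffectiveTemperature", "AbsoluteMagnitude"},
   "EntityPropertyAssociation"];

(* One {temperature, magnitude} pair per star *)
rows = Values /@ Values[data];

(* Drop stars with a missing value, then strip units where present *)
clean = Map[QuantityMagnitude, Select[rows, FreeQ[#, _Missing] &], {2}];

(* Keep positive temperatures only and use their base-10 logarithm on the x axis *)
pts = Cases[clean, {t_?Positive, m_} :> {Log10[t], m}];

(* Hot stars on the left and bright stars at the top, as is conventional *)
ListPlot[pts,
 ScalingFunctions -> {"Reverse", "Reverse"},
 Frame -> True,
 FrameLabel -> {"log10 effective temperature", "absolute magnitude"}]
```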

You can have those forms returned with StarData directly too, and the syntax is nice, clearly better than that of AstronomicalData. It's the performance alone that ruins this.

Posted 2 years ago

StarData indeed has much worse performance than AstronomicalData, and a few more hiccups.

  1. Retrieving all entities, e.g., StarData[], takes a few minutes. However, allNames = StarData["EntityCanonicalNames"] returns almost immediately with a list of all star names that can be used in further queries (this idiom has to be discovered first, while StarData[] is the more obvious choice). There appear to be no integrity checks. Sometimes the list is corrupt, so it needs to be checked first.
  2. Combined downloads, e.g., y = StarData[x, {"AlphanumericName", "DistanceFromSun"}], take forever, e.g., 6h+, for no apparent reason; not even throttling should be that slow. This was doable with AstronomicalData. In this slow mode there were also spurious retrieval failures, e.g., {{"HIP109905", Missing["RetrievalFailure"]}}, where the object should clearly be available. Distances also come in mixed units, e.g., kpc for the failed case above instead of the default ly, requiring further manual cleanup.
  3. StarData[allNames, "singleProperty"] appears to run a bit faster, initially in steps of 64 instead of 1, but sometimes throttling down to steps of 1. So if I want to sort by DistanceFromSun and retrieve data for a subset of, e.g., 1000 stars, the fastest way is to download the list of canonical entity names, then all the distances, then Join the two lists. This takes ~50 min in the best case, possibly several hours. That is still much faster than the 6h+ with combined properties. In any case, it is all ridiculously slow.
  4. The entity queries appear to be somewhat unwieldy. That the entity name list is retrieved in no time suggests that the problem probably lies in retrieving the entities themselves and in the inferences necessary to retrieve more than one property.
  5. In addition, sometimes the kernel or all of MMA dies, requiring several hours of download again, so be careful and cache your data locally once retrieved. (Somehow AstronomicalData appeared to do more local caching. Could someone please confirm?) The old data caching infrastructure for curated data unfortunately appears to be gone; it may be a good idea to reestablish it. Even caching the whole database locally may be an effective workaround for now.
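The local caching and cleanup described in the list above might be sketched like this. The cache path and the names cacheFile, distances, failed and inLy are hypothetical, and the StarData["EntityCanonicalNames"] idiom is taken from item 1 rather than from documented behavior, so treat this as one possible workaround and not a tested recipe.

```mathematica
(* Hypothetical cache location; any writable path will do *)
cacheFile = FileNameJoin[{$UserBaseDirectory, "stardata-distances.m"}];

(* Name list idiom from item 1 above *)
allNames = StarData["EntityCanonicalNames"];

(* Download only if there is no cached copy yet, then write the result to disk *)
distances = If[FileExistsQ[cacheFile],
   Get[cacheFile],
   With[{fresh = StarData[allNames, "DistanceFromSun"]},
    Put[fresh, cacheFile];
    fresh]];

(* Indices of entries that came back as Missing[...], e.g. retrieval failures,
   so that only those need to be requested again *)
failed = Flatten[Position[distances, _Missing, {1}]];

(* Bring mixed units (ly, kpc, ...) to light years; leave Missing entries alone *)
inLy = Map[If[QuantityQ[#], UnitConvert[#, "LightYears"], #] &, distances];
```

Put and Get store the full expressions, units included, so a kernel crash after the first successful download no longer costs hours of re-downloading.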

In its current form, StarData does not appear viable for work, only for single toy queries in Wolfram Alpha. The amounts of data are small (e.g., 6 x 1000 entities/values), and my notebook is under 1MB.

As a classic professional MMA user, I would much rather have these data issues fixed than have more tablet functionality and phone apps implemented. Can I have a choice between using entities and just numerical values? Do I really need those units everywhere, buying one more inference layer?

(Remark: slowness is unlikely due to my machine or internet connection. The machine is a hexacore E5 workstation with 64GB ECC memory and Windows 8.1 providing a very robust high performance environment.)

Further input and hints for optimization are of course appreciated!
