
Notebook Assistant: a review

I've been testing the usefulness of the new Notebook Assistant with Mathematica 14.2, and I have some thoughts. I've been a user of Mathematica since version 1.1 and have personally witnessed some of the boom-and-bust cycles of artificial intelligence. In the 1990s, I developed and implemented a highly successful expert system which, although primitive compared to current AI efforts, was state-of-the-art at the time.

All of the demos of the notebook assistant are impressive, and it is remarkable that it works as well as it does. Once I got to the hands-on stage, I realized that this is really a 0.6 release, rather than a solid 1.0 release.

When the system works, it is quite impressive. However, the stochastic nature of the underlying LLM and its tendency to 'hallucinate' (I believe that this is the current descriptive term for "making things up") limit the real-world usefulness of the tool.

That is not to say that it is not useful at all. It can suggest things that an experienced user may employ. However, usefulness is a relative term. I recall that John Cage used imperfections in manuscript paper as a useful source of ideas. This does not mean that paper imperfections are a real compositional technique -- the technique and artistry were in Cage's mind.

As for the Notebook Assistant, its tendency to hallucinate limits its utility in operation. I had expected that code output by the assistant would be verified for correct syntax before being presented to the user. After all, WL checks my syntax: I get instant feedback from all the red and orange boxes when I have a misplaced bracket or comma. It should be easy to verify at least the syntax, and if the check fails, have the AI try again without showing me the bad code.
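This kind of gate is already expressible in WL itself. As a minimal sketch (my own illustration, not a description of how the assistant works internally), SyntaxQ could be used to reject malformed output before a user ever sees it:

```wolfram
(* SyntaxQ returns True only when a string parses as valid WL input.
   A generation loop could simply retry until this check passes. *)
SyntaxQ["Total[{1, 2, 3}]"]  (* True *)
SyntaxQ["Total[{1, 2, 3}"]   (* False: unbalanced bracket *)
```

This checks syntax only, of course -- it says nothing about whether the code does what was asked.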

When asked, the assistant can verify code, and it is often successful, but not often enough for practical purposes. Even when errors are pointed out, it often cannot fix problems.

In many cases, problems can be resolved interactively, but it would have been significantly faster for me to simply write the code in the first place without the assistant.

It is my belief that these issues are inherent in the design of the LLMs, and no tweaking of the way the LLM works (pre- or post-processing, for example) will resolve the problem.

There is ample literature on the shortcomings of LLMs, starting with the GIGO issue: with source material that is sexist, racist, and of overall questionable quality, it is no surprise that the LLMs' output is sexist, racist, and at best mediocre. The basis for the language model is an extreme formalist approach. This may make some sense in mathematics, where we can (it is hoped) provide an exact definition for each term and operation, but this is patently untrue for natural language and, I submit, even for constructed languages.

I would suggest the book Hermeneutics: A Very Short Introduction by Jens Zimmermann, or the talk I gave at a recent WTC on the topic of hermeneutics.

One thing I noticed in my explorations: several times, after a failure to suggest working code, I would type a correct solution into my notebook. Once I did this, the assistant would invariably use this solution, even after I had restarted Mathematica. This indicates to me that it is not regenerating responses using the LLM technique, but is 'remembering' my code in some fashion. If this is real (I think it is), it would make testing the assistant more difficult, since it would appear to work better than it really does.

I can see that LLMs are not the only possible model. It will take some research, but I believe that a language model that embraces metaphor and deep context can be constructed, and perhaps will not require the brute-force methods that current models do. In addition, Wolfram already has a natural language processing engine (Wolfram|Alpha, etc.) that, while it does not have the range of the assistant, could possibly be expanded to handle the code-generating aspects of the process.

I am intrigued by the promise of the Notebook Assistant.

I would welcome a way to discover all the functionality in the core Wolfram Language, and especially the repositories. (In early versions, each release came with a hardcover book, which I read cover to cover. Current versions have so many functions that Stephen himself has stated that he is not aware of all of them.)

I would welcome a second pair of eyes when I am trying to debug code. The current assistant is simply not competent enough to be relied on.

I would welcome an assistant that would handle the boilerplate of taking my code and preparing it for the Function Repository, for example.

I have seen demos where all of these tasks were successfully performed, so I know that these should be achievable goals. For a practicing WL user, the current iteration is more a proof of concept than a reliable tool.

I really want this idea to work. I have a license for a year, and will be evaluating the assistant from time to time.

Bottom line: the current iteration of Notebook Assistant is a pricey toy rather than a practical tool for most WL users.

5 Replies

George and others: of course, the so-called hallucinations are problems that must be brought to an absolute minimum. Unlike some people, I have a lot of patience to put up with obvious errors and find a way of getting at the truth. How do I know the difference? I don't always, but fortunately hallucinations don't seem to correlate with difficulty! Sometimes I can ask a competing AI in a different context. Sometimes I ask another for a proof while passing it off as my own naive idea. Sometimes I do a little research. Sometimes the hallucination just makes no sense!

As far as "reasoning limitations" go -- for example, proving something clearly provable that I've never seen written out and am too lazy to think through on my own -- giving the Notebook Assistant hints from my best reasoning as I go, and asking for a step or two at a time, seems to work well so far.

Here is an example of where we both just had to hold each other's hand to scratch out a proof that was sufficient for me:

I think it would be useful to have user input on what the Notebook Assistant should do, and this may help the software engineers craft a tool that is useful in the real world.

This discussion is not about how these goals are attained. I may be skeptical about the general utility of LLMs, but that is really not the point. Wolfram Language and its ecosystem have grown large and complicated enough that even the most experienced users can use help.

Some of the items on this wish list have been suggested by demos of the current iteration, plus some of my own exploration, so I think that they can be realized.

The main issue in any of this is what 'reliable' really means in context. For example, LLMs may produce usable output 90% of the time for general use, but 90% is not sufficient for mission-critical tasks, and I count producing code as mission critical.

So here is my first stab at a list:

  1. Discovery. The assistant should be able to find useful functions in both the built-in language and the function repositories. That is, if I ask the assistant to make a function or answer a question, it will (always) make use of built-in functions or vetted functions in the Function Repository. Stephen has commented that the language has grown so large that even he cannot remember all the functions.

  2. Validation. This wish has several parts. The first is that any code the assistant presents should have been validated before the user sees it. Validation comes at several levels, of course. The lowest level, and one that the code should always meet, is syntax validation. (The code will not produce any red or orange boxes, for example.) The next level is that the code should pass some basic unit tests. This, I realize, is a non-trivial task, but I think it is doable with the current level of technology, and probably without requiring machine learning of any kind. Finally, the code should be checked to see whether it actually answers the user's question. I think that this is the intent of the Notebook Assistant, and it may be difficult to achieve without technology beyond LLMs. Again, there are degrees of compliance, and getting 'close', for a suitable definition of 'close', may be good enough. I understand that there will probably be some iteration needed to get a satisfactory solution, so the object is to get close enough that the amount of back-and-forth is minimized.

  3. Debugging. This wish is related to the second one, except that there the object is validation of code that the assistant creates, while here the object is to fix or improve the user's code. Clearly, if the assistant can do this task with the user's code, it should be able to do it with its own code. [In my experience, this is the most frustrating aspect of the current Notebook Assistant: it can't fix simple errors in its own code, let alone more subtle problems with my code.]

  4. Boilerplate operations. Stephen has hinted at this task. Once I have some function working, I should be able to ask the assistant to prepare it for the Function Repository, for example. There are many other tasks of a similar nature.
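The first two levels of the validation wish can be sketched with existing built-in tools. The following is a hypothetical pipeline of my own devising (the function name square and the test values are illustrative assumptions, not anything the assistant actually does): SyntaxQ gates the syntax, and VerificationTest runs a basic unit test.

```wolfram
(* Hypothetical two-stage check for assistant output held as a string. *)
candidate = "square[x_] := x^2";
If[SyntaxQ[candidate],
  ToExpression[candidate];           (* stage 1 passed: define the function *)
  VerificationTest[square[4], 16],   (* stage 2: a basic unit test *)
  Missing["SyntaxError"]             (* reject before the user ever sees it *)
]
```

The third level -- checking that the code answers the user's actual question -- has no such mechanical test, which is exactly why it is the hard part.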


That's it for now. I would really appreciate comments and additions.

Wolfram Language (indeed, any advanced software system) needs this level of technology to deal with increasingly complicated tasks. Beginners, as well as advanced users, need help, although of different types. I think that the community can help with the design, as well as being beta testers for new systems.

Posted 1 day ago

This is a great review of the Notebook Assistant. It’s impressive but has issues like incorrect code and missing syntax checks. Your point about it ‘remembering’ past inputs is interesting. Does this happen in all notebooks or just the same one? Also, do you think combining AI with Wolfram’s existing tools could make it more reliable?

POSTED BY: Jaymal Raja

I did get some entertainment value trying to get the notebook assistant to do the right thing, but mostly it highlighted the limitations of the LLM model.

It reminds me of the old Eliza program from the late 1960s. I programmed a very slightly enhanced version of the program in BASIC in the 1970s. Surprisingly, some grad students (in physical biochemistry) were convinced of its intelligence. The simulation was convincing, as long as you stayed within a very narrow domain of discourse. LLMs are immensely more complex and have a much wider domain, but there are limits.

Like any model, it is wrong, but it can be useful. The Ptolemaic model of the solar system was quite useful if all you wanted was a prediction of where the planets would be, within a certain error. It is entirely useless for navigating from Earth to Mars.

LLMs seem to be useful, as long as their tendency to hallucinate does not matter. Being able to write legal code seems to be beyond their domain of use, though.

For myself, I have given the people running the Wolfram Issue Tracker something to work on, but I have other things to do. I look at this release as a massive beta test, and I look forward to people with more patience (and spare time) providing the feedback to turn this into a product worthy of the main product.

As I indicated, on philosophical grounds, I do not think that LLMs will ever be free of the tendency to hallucinate. The field at present seems to be dominated by strict positivists, though, so it will take people with a different view of language to come up with a better model.

Posted 1 month ago

From comment - "Bottom line: the current iteration of Notebook Assistant is a pricy toy rather than a practical tool for most WL users."

Totally agree with you. I am a hobbyist and have been using Mathematica for approx. 20 years. I immediately "invested" in the annual Notebook Assistant subscription on release in December and could not get it to give me any meaningful result without syntax errors -- very disappointing.

As a MINIMUM requirement, Notebook Assistant SHOULD return code that works and does NOT have syntax errors. I also subscribe to ChatGPT Plus and Gemini Advanced to help me with my coding and syntax errors; both of those were working better than Wolfram's Notebook Assistant at correcting syntax and producing working code. That said, I recently upgraded to Mathematica 14.2 and the Notebook Assistant is working better. So glad other people have been feeling similar frustration...

I expected that an expensive paid add-on like Wolfram's Notebook Assistant would produce perfect, syntax-error-free code. Initially I was very disappointed, but I think it may be better now -- let's see...

POSTED BY: Lea Rebanks
