The environment in which we want to train the agent is a control system where the current state is identified by a classifier that quantizes a continuous range of environment states into a smaller set of good and bad states. So our problem is similar to the common pole-balancing example, with the environment states quantized by a CNN.
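To make the setup concrete, here is a minimal Python sketch of the kind of environment I mean. It is an illustration only, not our actual system: the toy dynamics, the simple threshold in `classify` that stands in for the CNN, and all names (`QuantizedEnv`, `GOOD`, `BAD`, the 0.2 limit) are assumptions made up for this example.

```python
import math
import random

GOOD, BAD = 0, 1  # hypothetical discrete labels seen by the agent


def classify(angle, limit=0.2):
    """Stand-in for the CNN: quantize a continuous angle into GOOD or BAD."""
    return GOOD if abs(angle) < limit else BAD


class QuantizedEnv:
    """Toy pole-like system; the agent only ever sees the quantized state."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.angle = 0.0
        self.velocity = 0.0

    def reset(self):
        # Start close to upright, so the first observation is GOOD.
        self.angle = self.rng.uniform(-0.05, 0.05)
        self.velocity = 0.0
        return classify(self.angle)

    def step(self, action, dt=0.02):
        # action 1 pushes one way, any other action pushes the opposite way.
        force = 1.0 if action == 1 else -1.0
        # Simplified dynamics, not a physically accurate cart-pole model.
        self.velocity += (9.8 * math.sin(self.angle) - force) * dt
        self.angle += self.velocity * dt
        state = classify(self.angle)
        reward = 1.0 if state == GOOD else 0.0
        done = state == BAD
        return state, reward, done
```

The agent would interact only through `reset` and `step`, so the continuous angle never leaves the environment; swapping the threshold for a trained classifier would not change that interface.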
Your comment about using SystemModeler to build the complex environment, but not to train the agents, is exactly what I wanted to understand. Based on that comment, it appears that we may be able to stay within the Mathematica framework for our implementation if we can adequately represent our environment with a CNN. Do you agree?
Thanks