This sounds like a really interesting project! I am also new to SystemModeler, so I am really just brainstorming here.
There has been work on camera models in Modelica. You would need to know the physical coordinates of all of the trackers and run them through each camera's model. Also, I would think computing occlusions would be difficult, but if you are familiar with ray-tracing strategies, I suppose that would work.
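To make the "run trackers through the camera model" step concrete, here is a minimal sketch of a pinhole projection, assuming you have each camera's intrinsics `K` and world-to-camera pose `R`, `t` (all values below are made-up placeholders, not from any real calibration):

```python
import numpy as np

def project_point(X_world, K, R, t):
    """Project a 3D world point into pixel coordinates with a pinhole model.

    K: 3x3 intrinsic matrix; R, t: world-to-camera rotation and translation.
    Returns (u, v) pixel coordinates, or None if the point is behind the camera.
    """
    X_cam = R @ X_world + t          # world frame -> camera frame
    if X_cam[2] <= 0:                # behind the image plane: not visible
        return None
    uvw = K @ X_cam                  # perspective projection to homogeneous pixels
    return uvw[:2] / uvw[2]          # normalize by depth

# Hypothetical camera: 800-px focal length, 640x480 sensor, identity pose
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
R, t = np.eye(3), np.zeros(3)

tracker = np.array([0.1, -0.05, 2.0])   # a marker 2 m in front of the camera
print(project_point(tracker, K, R, t))  # -> [360. 220.]
```

A real model would also need lens distortion and the sensor's field-of-view limits, but this is the core mapping each tracker would go through per camera.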
Once you can model how tracker coordinates turn into pixel measurements on each CMOS sensor, you could use blob detection to extract the final xy pixel-space coordinates of the trackers visible to each camera. Alternatively, you could assume each tracker is a point, project it through the camera equation, and round to an integer to get the center pixel value.
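For the blob-detection route, a common mocap trick is an intensity-weighted centroid, which gives sub-pixel marker positions. A minimal sketch, assuming a single bright IR marker per region and a made-up threshold:

```python
import numpy as np

def blob_centroid(img, threshold=0.5):
    """Intensity-weighted centroid of bright pixels: a minimal 'blob detector'
    for one IR marker. Returns sub-pixel (x, y), or None if nothing is lit."""
    ys, xs = np.nonzero(img > threshold)
    if xs.size == 0:
        return None
    w = img[ys, xs]                       # use brightness as the weight
    return (xs @ w / w.sum(), ys @ w / w.sum())

# Synthetic frame: a small Gaussian spot centered at (12.3, 7.8)
yy, xx = np.mgrid[0:20, 0:32]
frame = np.exp(-((xx - 12.3)**2 + (yy - 7.8)**2) / 2.0)
print(blob_centroid(frame))   # close to (12.3, 7.8), despite integer pixels
```

With multiple markers in view you would first split the thresholded mask into connected components and run this per component.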
Once you have that, you can implement triangulation of each tracker's location. This step is fuzzy to me, as I am not familiar enough with mocap to know how trackers are typically corresponded between multiple camera views. Perhaps you will need to run some numeric minimization process, similar to what is done in visual odometry algorithms, where features must be corresponded from one image to the next.
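Assuming the correspondence problem is solved, the triangulation itself is standard: the linear (DLT) method recovers a 3D point from two pixel observations and the two cameras' projection matrices. A sketch with two hypothetical cameras (all numbers invented for illustration):

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one point from two pixel observations.

    P1, P2: 3x4 camera projection matrices; uv1, uv2: (u, v) pixel coords.
    Stacks the constraint A X = 0 and solves for the homogeneous point via SVD.
    """
    A = np.vstack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                       # null-space vector = homogeneous solution
    return X[:3] / X[3]

# Two hypothetical cameras 0.5 m apart along x, identical intrinsics
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

# Simulate the pixel measurements of a known marker, then recover it
X_true = np.array([0.2, -0.1, 3.0])
h1, h2 = P1 @ np.append(X_true, 1), P2 @ np.append(X_true, 1)
uv1, uv2 = h1[:2] / h1[2], h2[:2] / h2[2]
print(triangulate(P1, P2, uv1, uv2))   # recovers ~[0.2, -0.1, 3.0]
```

With noisy pixel measurements you would follow this linear solution with a nonlinear reprojection-error minimization, which connects back to the visual-odometry analogy above.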
As far as cameras go, you could use FLIR cameras (very good frame rates and resolutions available, plus multi-camera synchronization, but very expensive). Whatever you go with, you probably want a global shutter so you do not have to worry about rolling-shutter artifacts from row-by-row CMOS readout. You could use a microcontroller to drive IR-LED rings on the cameras and trigger the shutters at the right time, and you could integrate this into the camera model similar to how the authors of the linked paper model the shutter as a physical process.
I would love to know if you have had any more thoughts about this project in the past couple of days!