Good points from Bianca and Richard. Your questions highlight some paths for experimentation while also exposing some possible problems. The main one, I think, is the choice of programming language, along with more basic questions such as what a program is and whether there are different ways of thinking.
I am not an expert and am hardly familiar with educational research, but I found some old articles from the 1980s. One of them is about the cognitive effects of learning computer programming. The authors examined Logo and the claims made about it, and concluded that there was no evidence that it helped students learn rigorous thinking (traditional math) or that it helped mathematical exploration (something the computer advocates were claiming). The latter could be the thing to test about Wolfram education.
It is worth noting that the idea of "what is a program" has evolved quite a bit. Nowadays we think of many different ways of interacting, from traditional code to dynamic graphical manipulations to natural language Wolfram Alpha queries, and all of these count as programming. You also have simple programs like cellular automata. One consequence is that the notion of bugs in a program is not as relevant as it was (or as it is with other programming languages).
Maybe it is naïve to say this, but I would claim that rigorous thinking is not the exclusive domain of mathematics. I am not sure, but it seems to me that different programming languages promote different types of thinking. (Anyone disagree?) Maybe the right thing to test is whether someone exposed to Wolfram Education in its various guises can independently explore a problem. This is, I think, the main claim: that learning this language will let you explore solutions to problems you could not solve before.
If the subjects did not know how they were being tested, one measure could be the volume of material they produced, and another could be originality (high scores for distinct results). Of course, it would be easy to cheat on those measures, but a bigger problem is the control group: people who do not know Wolfram Language could not produce any material (except on paper). Likewise, it would not be a fair test to give the same math problem to students who can use Wolfram Language and to those who cannot. It would be like a road race where one guy gets a car.
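To make the two measures above a bit more concrete, here is a rough sketch in Python of how volume and originality might be scored. Everything in it is my own assumption for illustration: the function name `score_submissions`, the idea of representing each result as a short text label, and the rule that a result shared by k subjects earns each of them 1/k points. It is one possible scoring scheme, not an established method.

```python
from collections import Counter


def score_submissions(submissions):
    """Score each subject on volume and originality.

    submissions: dict mapping a subject's name to a list of result labels
    (hypothetical representation: one short string per piece of material).
    Returns a dict mapping subject -> (volume, originality).
    """
    # Count how many subjects produced each distinct result.
    shared_by = Counter()
    for results in submissions.values():
        shared_by.update(set(results))

    scores = {}
    for subject, results in submissions.items():
        volume = len(results)  # raw amount of material, repeats included
        # Assumed rule: a result produced by k subjects is worth 1/k,
        # so results nobody else found score highest.
        originality = sum(1 / shared_by[r] for r in set(results))
        scores[subject] = (volume, originality)
    return scores


example = {
    "A": ["rule 30 plot", "rule 110 plot", "prime spiral"],
    "B": ["rule 30 plot", "rule 30 plot"],
}
scores = score_submissions(example)
for subject, (volume, originality) in scores.items():
    print(subject, volume, originality)
```

Note that subject B gets credit for volume by repeating the same result, which is exactly the kind of cheating mentioned above; the originality score ignores repeats, but it still depends on someone deciding which results count as "the same".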
It's hard to find a fair comparison because learning to program in something as versatile as Wolfram Language affects so many other potential activities. (Music?)