Interesting post, Arnoud! It struck me as odd that the pitch shift changed which word he seemed to be saying. When I first listened to the recording, I could hear "laurel" if I focused on the lower frequencies, but "yanny" if I focused on the higher ones. I first tried a lowpass filter to see whether I could remove the higher-frequency noise and preserve more of the original voice:
yl = Import["~/Downloads/yanny-laurel.mp4"]
lpyl = LowpassFilter[yl, Quantity[500, "Hertz"]];
Spectrogram[lpyl]

Then, by trial and error, I tried to isolate the high frequencies instead. A highpass filter left only some faint sounds, so I amplified the result, highpass filtered it once more to remove a little extra low-pitched noise, and finished with a lowpass filter to get rid of the noise introduced by all the amplification and filtering. I can somewhat hear a muffled whisper of "yanny" and less of "laurel". Perhaps others more knowledgeable about signal processing could venture further down this path.
hpyl = LowpassFilter[
   HighpassFilter[
    AudioAmplify[HighpassFilter[yl, Quantity[8000, "Hertz"]], 200],
    Quantity[8000, "Hertz"]],
   Quantity[10000, "Hertz"]];
Spectrogram[hpyl]

I wasn't satisfied with the highpass filtered result, since it wasn't distinct enough to say for certain that this is where "yanny" is coming from. The lowpass filtered result, on the other hand, made it pretty clear to me that the original speaker was saying "laurel". The most interesting result I found was that when I applied the same pitch shift adjustments you made to the lowpass filtered recording, it only sounds like "laurel", whether shifted up or down, which was convincing enough for me.
audios = Table[AudioPitchShift[lpyl, r], {r, 0.5, 1.5, .1}];
a1 = AudioJoin[audios];
I attached all the sound clips from above.