Thanks @Eduardo Serna. I chose this network because it is currently one of the most suitable audio feature extractors in the Wolfram Neural Net Repository. The layers I added are the ones typically used for sequence classification. I took inspiration from the sentiment classification example on the BERT page of the Neural Net Repository (see section: "Train a classifier model with the subword embeddings"), because text is a sequence too, just like audio.
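For readers who have not seen that BERT example, a classification head of this kind can be sketched roughly as follows. This is only an illustration modeled on that example: the feature size (768) and the class labels are placeholders, not the values used with the audio network.

```wolfram
(* Placeholder values, assumed for illustration only *)
featureDim = 768;
classes = {"classA", "classB"};

(* Head that turns a variable-length sequence of feature vectors into one class *)
head = NetChain[{
    DropoutLayer[],                                (* regularization, active only during training *)
    NetMapOperator[LinearLayer[Length[classes]]],  (* per-step class scores *)
    AggregationLayer[Max, 1],                      (* max-pool the scores over the sequence *)
    SoftmaxLayer[]                                 (* normalize to class probabilities *)
   },
   "Input" -> {"Varying", featureDim},
   "Output" -> NetDecoder[{"Class", classes}]
  ];

(* Sanity check on a random 10-step sequence; the net is untrained, so the label is arbitrary *)
NetInitialize[head][RandomReal[1, {10, featureDim}]]
```

During training, only this head needs to learn if the feature extractor is applied to the data beforehand, which keeps training cheap.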
Concerning the mushroom classifier idea, it would be cool to use geographic data and WeatherData to create a more advanced classifier with priors. See below a toy example as a starting point for such a mushroom classifier.
Recently, I learnt about an ambitious project that tries to identify new species of mushrooms via their spores, which will be trapped automatically from the air. It seems that we still don't know much about fungi: mycologists have estimated that only ~10 percent of the predicted number of fungi (which includes mushrooms and yeasts) has been described.