Recent computational models of the acquisition of spoken language via
grounding in perception exploit associations between the spoken and visual
modalities and learn to represent speech and visual data in a joint vector
space. A major unresolved issue from the point of ecological valid