Large-scale vision 2D vision language models, such as CLIP can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3d vision models. However, current methods require supervised pre-training for such alignment, and the performance of such 3D zero-shot models remains su