The head pose estimation (HPE) task requires a sophisticated understanding of
3D spatial relationships and precise numerical output of the yaw, pitch, and
roll Euler angles. Previous HPE studies are mainly based on non-large language
models (Non-LLMs), which rely on close-up human heads cropped from the full
image as inputs and lack robustness in real-world scenarios. In this paper, we
present a novel framework to enhance the HPE prediction task by leveraging the
visual grounding capability of CogVLM. CogVLM is a vision language model (VLM)
whose grounding capability lets it predict object bounding boxes (BBoxes), which
enables HPE training and prediction using the full image as input. To
integrate the HPE task into the VLM, we first cope with the catastrophic
forgetting problem in large language models (LLMs) by investigating the
rehearsal ratio in the data rehearsal method. Then, we propose and validate a
LoRA layer-based model merging method, which preserves the integrity of the
parameters, to enhance the HPE performance in the framework. The results show
that our HPE-CogVLM achieves a 31.5% reduction in Mean Absolute Error for HPE
prediction over the current Non-LLM-based state-of-the-art in cross-dataset
evaluation. Furthermore, we compare our LoRA layer-based model merging method
with LoRA fine-tuning only and other merging methods in CogVLM. The results
demonstrate our framework outperforms them in all HPE metrics.
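
As an illustration of the evaluation metric named above, the following is a minimal sketch of Mean Absolute Error over yaw, pitch, and roll. The function names, the tuple layout, and the wrap-around handling of angles are assumptions made for this example and are not taken from the paper.

```python
def angular_abs_error(pred_deg, true_deg):
    """Absolute difference between two angles in degrees, wrapped to [0, 180].

    The wrap-around is an assumption of this sketch; a plain absolute
    difference is also common when angles stay within a narrow range.
    """
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def hpe_mae(preds, targets):
    """Per-angle MAE and their mean for lists of (yaw, pitch, roll) tuples."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and equal in length")
    sums = [0.0, 0.0, 0.0]
    for pred, target in zip(preds, targets):
        for i in range(3):
            sums[i] += angular_abs_error(pred[i], target[i])
    yaw, pitch, roll = (s / len(preds) for s in sums)
    return {"yaw": yaw, "pitch": pitch, "roll": roll,
            "mae": (yaw + pitch + roll) / 3.0}


# Hypothetical predictions and ground truth, in degrees.
example = hpe_mae([(10, 0, 0), (350, 5, -5)], [(0, 0, 0), (10, 0, 0)])
```

A relative reduction such as the one reported above would then be computed as (baseline MAE - new MAE) / baseline MAE.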

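This work uses the visual grounding capability of CogVLM to propose a new framework that enhances the head pose estimation task. By mitigating catastrophic forgetting in large language models and introducing a LoRA layer-based model merging method, it effectively improves head pose estimation performance and outperforms existing methods on multiple metrics.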

HPE-CogVLM: New Head Pose Grounding Task Exploration on Vision Language Model

Spatial relationships between objects provide important information for
text-based image retrieval. As users are more likely to describe a scene from a
real-world perspective, using 3D spatial relationships rather than 2D
relationships that assume a particular viewing direction, one of the main
challenges is to infer the 3D structure that bridges images with users' text
descriptions. However, direct inference of 3D structure from images requires
learning from large-scale annotated data. Since interactions between objects
can be reduced to a limited set of atomic spatial relations in 3D, we study the
possibility of inferring 3D structure from a text description rather than an
image, applying physical relation models to synthesize holistic 3D abstract
object layouts satisfying the spatial constraints present in a textual
description. We present a generic framework for retrieving images from a
textual description of a scene by matching images with these generated abstract
object layouts. Images are ranked by matching object detection outputs
(bounding boxes) to 2D layout candidates (also represented by bounding boxes)
which are obtained by projecting the 3D scenes with sampled camera directions.
We validate our approach using public indoor scene datasets and show that our
method outperforms baselines built upon object occurrence histograms and
learned 2D pairwise relations.
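
The ranking step described above can be sketched as follows, under simplifying assumptions: boxes are `(x1, y1, x2, y2)` tuples in a shared coordinate frame, a layout is scored by the mean best same-label IoU, and an image takes the score of its best layout candidate. The scoring rule and all function names are hypothetical and are not the matching procedure of the paper. In particular, this sketch does not enforce one-to-one matching between detections and layout objects.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def layout_score(detections, layout):
    """Mean, over layout objects, of the best IoU with a same-label detection."""
    if not layout:
        return 0.0
    total = 0.0
    for label, box in layout:
        total += max(
            (iou(box, det_box) for det_label, det_box in detections
             if det_label == label),
            default=0.0,
        )
    return total / len(layout)


def rank_images(image_detections, layout_candidates):
    """Sort image ids by their best score over all 2D layout candidates."""
    scores = {
        image_id: max((layout_score(dets, layout) for layout in layout_candidates),
                      default=0.0)
        for image_id, dets in image_detections.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical layout candidate and detections.
layouts = [[("bed", (0, 0, 2, 2)), ("lamp", (2, 0, 3, 1))]]
images = {
    "a": [("bed", (0, 0, 2, 2)), ("lamp", (2, 0, 3, 1))],
    "b": [("bed", (1, 1, 3, 3))],
    "c": [],
}
ranking = rank_images(images, layouts)
```

Taking the maximum over layout candidates reflects that each candidate corresponds to one sampled camera direction, so an image only needs to agree with one of them.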

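Using physical relation models, the approach infers 3D structure from a text description by synthesizing abstract object layouts that satisfy the spatial constraints present in the description. It ranks images by matching object detection outputs against 2D layout candidates represented as bounding boxes, and thereby retrieves images that match the textual description of a scene. Its performance exceeds that of baselines built upon object occurrence histograms and learned 2D pairwise relations.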