Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), designed for detailed localized captioning (DLC). DAM preserves both local details