Generating long-range skeleton-based human actions has been a challenging
problem because a small deviation in a single frame can cause a malformed action
sequence. Most existing methods borrow ideas from video generation and
naively treat skeleton nodes/joints as pixels of images without c