Protein structure prediction is an important method for understanding gene translation and protein function in structural biology. AlphaFold introduced the Transformer model to protein structure prediction and reached atomic accuracy. However, training and inference with the AlphaFold model are time-consuming and expensive because of its unusual performance characteristics and large memory consumption. In this paper, we propose FastFold, a highly efficient implementation of the protein structure prediction model for both training and inference. FastFold includes a series of GPU optimizations based on a thorough analysis of AlphaFold's performance. In addition, with \textit{Dynamic Axial Parallelism} and \textit{Duality Async Operation}, FastFold achieves high model-parallel scaling efficiency, surpassing existing popular model parallelism techniques. Experimental results show that FastFold reduces overall training time from 11 days to 67 hours and achieves a $7.5\sim9.5\times$ speedup for long-sequence inference. Furthermore, we scaled FastFold to 512 GPUs and achieved an aggregate of 6.02 PetaFLOP/s with 90.1\% parallel efficiency. The implementation can be found at https://github.com/hpcaitech/FastFold.

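This paper presents FastFold, an efficient implementation of the AlphaFold model. It uses Dynamic Axial Parallelism and Duality Async Operations to improve the scaling efficiency of model parallelism, and it also introduces AutoChunk, which automatically determines a chunking strategy to reduce memory cost during inference. Experimental results show that FastFold reduces overall training time from 11 days to 67 hours and achieves a 7.5-9.5× speedup for long-sequence inference. In addition, we scaled FastFold to 512 GPUs, reaching an aggregate throughput of 6.02 PetaFLOP/s with 90.1% parallel efficiency.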
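
As a rough sanity check on the headline figures, the short Python sketch below converts them into a wall-clock training speedup and an average per-GPU throughput. It uses only the numbers quoted above; the variable names are illustrative. It also assumes the two training times can be compared directly as wall-clock durations, which says nothing about whether the hardware budgets behind them were the same.

```python
# Back-of-the-envelope arithmetic on the reported figures (illustrative only).

baseline_hours = 11 * 24   # reported baseline training time: 11 days, in hours
fastfold_hours = 67        # reported FastFold training time, in hours
training_speedup = baseline_hours / fastfold_hours  # wall-clock ratio only

aggregate_pflops = 6.02    # reported aggregate throughput, PetaFLOP/s
num_gpus = 512             # reported GPU count
# Average throughput per GPU, converted from PetaFLOP/s to TeraFLOP/s.
per_gpu_tflops = aggregate_pflops * 1000 / num_gpus

print(f"training wall-clock speedup: {training_speedup:.2f}x")   # 3.94x
print(f"average per-GPU throughput: {per_gpu_tflops:.2f} TFLOP/s")  # 11.76 TFLOP/s
```

The per-GPU figure is an average over the 512-GPU run. Turning the 90.1% parallel efficiency into a single-GPU baseline would require knowing which reference configuration the efficiency was measured against, and the text above does not state it.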

FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours