We study the predictability of online speech on social media, and whether predictability improves with information beyond a user's own posts. Recent work suggests that the predictive information contained in posts written by a user's peers can surpass that of the user's own posts. Motivated by the success of large language models, we test this hypothesis empirically. We define unpredictability as a measure of the model's uncertainty, i.e., its negative log-likelihood on future tokens given context. As the basis of our study, we collect a corpus of 6.25M posts from more than five thousand X (formerly Twitter) users and their peers. Across three large language models ranging in size from 1 billion to 70 billion parameters, we find that predicting a user's posts from their peers' posts performs poorly. Moreover, the value of the user's own posts for prediction is consistently higher than that of their peers' posts. Across the board, we find that the predictability of social media posts remains low, comparable to predicting financial news without context. We extend our investigation with a detailed analysis of the causes of unpredictability and the robustness of our findings. Specifically, we observe that a significant amount of predictive uncertainty comes from hashtags and @-mentions. Moreover, our results replicate if, instead of prompting the model with additional context, we finetune the model on that context.
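
As an illustration of the uncertainty measure defined above (negative log-likelihood on future tokens given context), the following is a minimal sketch rather than the evaluation code used in the study. It assumes that per-token probabilities have already been obtained from some language model; the probability values are made up for the example, and the function names `mean_nll_bits` and `context_gain_bits` are hypothetical. It reports the average negative log-likelihood in bits per token and how much that quantity drops when context is added.

```python
import math
from typing import Sequence


def mean_nll_bits(token_probs: Sequence[float]) -> float:
    """Average negative log-likelihood in bits per token.

    token_probs holds the probability a model assigned to each
    observed token of a post (hypothetical inputs in this sketch).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


def context_gain_bits(no_context: Sequence[float],
                      with_context: Sequence[float]) -> float:
    """Reduction in bits per token when context is supplied."""
    return mean_nll_bits(no_context) - mean_nll_bits(with_context)


# Made-up probabilities for the same four tokens of one post,
# scored under three hypothetical prompting conditions.
no_context = [0.25, 0.125, 0.5, 0.125]
with_user_context = [0.5, 0.25, 0.5, 0.25]
with_peer_context = [0.25, 0.25, 0.5, 0.125]

user_gain = context_gain_bits(no_context, with_user_context)
peer_gain = context_gain_bits(no_context, with_peer_context)
print(f"no context:   {mean_nll_bits(no_context):.2f} bits/token")
print(f"user context: gain of {user_gain:.2f} bits/token")
print(f"peer context: gain of {peer_gain:.2f} bits/token")
```

In this toy setup, a lower value means the post is more predictable, and a larger gain means the added context carries more predictive information. The numbers here are chosen only to show the arithmetic and say nothing about the actual results.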

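We study the predictability of online speech on social media and whether it improves with information beyond a user's own posts, and we test this hypothesis empirically using large language models. Our corpus comprises 6.25 million posts from more than five thousand X (formerly Twitter) users and their peers. Across three large language models ranging in size from 1 billion to 70 billion parameters, we find that predicting a user's posts from their peers' posts performs poorly. Moreover, a user's own posts are consistently more valuable for prediction than their peers' posts. Overall, the predictability of social media posts is low, comparable to predicting financial news without context. We extend our study with a detailed analysis of the causes of predictive uncertainty and the robustness of our results. In particular, we observe that hashtags and @-mentions are two important sources of predictive uncertainty. Moreover, our results replicate when, instead of supplying additional context in the prompt, we finetune on that context.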

Limits to Predicting Online Speech Using Large Language Models