50 Million Tweets, 50% Unclassified: A Statistical Anomaly

A dataset of 50 million tweets was processed with an incremental (online) BERTopic model because the full dataset did not fit in memory. The model used MiniBatchKMeans as its clustering step and was meant to assign every tweet to a topic. After the run finished, however, half of the tweets were unclassified: their topic column in the SQL database was null. This is unexpected, because MiniBatchKMeans assigns every data point to its nearest cluster and therefore should not produce outliers. The unclassified tweets showed no apparent differences from the classified ones. Taken together, this points to either a mistake in how the clustering algorithm was applied or a fault in the data processing pipeline, rather than to a property of the tweets themselves.
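The reasoning about outliers can be illustrated with a toy sketch. The code below is not scikit-learn's MiniBatchKMeans or the asker's pipeline; it is a minimal standard-library stand-in for the k-means assignment step, and the names `assign_topics`, `centroids`, and `points` are hypothetical. It only shows that a nearest-centroid rule returns a valid cluster index for every point, however far that point lies from all centroids.

```python
import math


def assign_topics(points, centroids):
    """Return the index of the nearest centroid for every point.

    Toy stand-in for the k-means assignment step (an assumption for
    illustration, not the library's actual implementation).
    """
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        # The position of the smallest distance is always a valid index,
        # so there is no "outlier" label such as -1 or None.
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical data: the last point is far away from both centroids.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, -0.2), (9.0, 11.0), (500.0, -500.0)]

labels = assign_topics(points, centroids)

# Collect any point that did not receive a valid cluster index.
unassigned = [i for i, label in enumerate(labels) if label is None or label < 0]
print(len(labels), "labels,", len(unassigned), "unassigned")
```

If the real model behaves the same way, a null topic is unlikely to come from the clustering step itself. A plausible place to look, offered here as an assumption rather than a diagnosis, is whether every batch of tweets was actually passed through the model and whether its labels were written back to the database.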

Source: stackoverflow.com