A machine learning novice is tackling fraud detection with a highly imbalanced dataset: 96% of transactions are normal and only 4% are fraudulent. The 32 GB training file causes memory allocation errors even when attempting to read just 1 million rows, and after cutting the dataset down to 100,000 rows, the XGBoost model still struggles to detect fraud cases. The user has tried various XGBoost parameters but seeks advice on class balancing techniques and on handling large datasets. Suggestions include SMOTE for oversampling the minority class, reading the data in chunks to manage memory, and exploring other algorithms or ensemble methods to improve fraud detection performance.
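A minimal sketch of the two main suggestions, assuming the file is a CSV with numeric features and a binary label column; the file path, the `is_fraud` column name, the chunk size, and the 10% majority downsampling rate are all hypothetical placeholders, not details from the question:

```python
import pandas as pd
import xgboost as xgb
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Read the 32 GB file in manageable pieces instead of all at once.
# Keep every fraud row; downsample the majority class per chunk.
chunks = []
for chunk in pd.read_csv("transactions.csv", chunksize=500_000):  # hypothetical path/size
    fraud = chunk[chunk["is_fraud"] == 1]                          # hypothetical label column
    normal = chunk[chunk["is_fraud"] == 0].sample(frac=0.1, random_state=42)
    chunks.append(pd.concat([fraud, normal]))
df = pd.concat(chunks, ignore_index=True)

X = df.drop(columns=["is_fraud"])
y = df["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Option 1: re-weight the positive class inside XGBoost itself.
# A common heuristic: scale_pos_weight = (negative count) / (positive count).
ratio = (y_train == 0).sum() / (y_train == 1).sum()
clf = xgb.XGBClassifier(scale_pos_weight=ratio, eval_metric="aucpr")
clf.fit(X_train, y_train)

# Option 2: oversample the minority class with SMOTE before training.
# Resample only the training split, never the test split.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
clf2 = xgb.XGBClassifier(eval_metric="aucpr")
clf2.fit(X_res, y_res)
```

With only 4% positives, plain accuracy is misleading (a model predicting "normal" for everything scores 96%), which is why the sketch evaluates with AUC-PR; precision, recall, and F1 on the fraud class are similarly more informative choices here.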
Source: stackoverflow.com
