When given a dataset and a challenge to outperform a friend's model, statisticians often follow a rigorous strategy. The dataset usually contains fewer than 100,000 rows and 100 columns, so it fits comfortably into memory. The task is typically classification, regression, or time series forecasting, with minimal data cleaning required. Here's the approach:

- **Establish a Test Harness:** If there is a holdout test set, run a train/test split sensitivity analysis to find a ratio that preserves both the data distribution and the performance distribution. Without a holdout set, use 3×10-fold cross-validation (10-fold cross-validation repeated three times) to evaluate models.
- **Baseline and Spot Checking:** Start with dummy models to establish a performance floor. Then spot-check a suite of scikit-learn models with default configurations, followed by standard configurations, and test advanced models from libraries like xgboost or lightgbm.
- **Hyperparameter Tuning:** Focus on the top-performing models, tuning them with grid search or Bayesian optimization. Background processes run continuous optimization.
- **Pipeline Optimization:** Experiment with data preprocessing and feature engineering to find transformations that improve model performance.
- **Ensemble Methods:** Combine the best models through stacking, voting, or averaging, with runs scheduled every 30 minutes to explore diverse model combinations.
- **Iterate:** Continuously refine models: background tasks keep optimizing while foreground tasks explore new ideas. Results and configurations are stored in an SQLite database for easy access and analysis.
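The 3×10-fold harness above can be sketched with scikit-learn's `RepeatedStratifiedKFold`. The dataset and model here are illustrative placeholders (`make_classification`, logistic regression), not part of the original post:

```python
# Minimal sketch of the 3x10-fold cross-validation test harness.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the "small enough to fit in memory" dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# 10-fold CV repeated 3 times gives 30 score estimates per model.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, scoring="accuracy", cv=cv, n_jobs=-1
)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and standard deviation over all 30 folds gives a distribution to compare models against, rather than a single noisy number.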
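The baseline-plus-spot-check step might look like the following. The particular models are an assumed sample of a larger suite; the key point is that the dummy baseline sets the floor every other model must beat:

```python
# Dummy baseline first, then default-configuration models, on one shared harness.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),  # performance floor
    "logistic": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=1),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)
    results[name] = scores.mean()
    print(f"{name:>10}: {scores.mean():.3f} ({scores.std():.3f})")
```

Any model that cannot beat the dummy baseline is discarded immediately; gradient-boosting libraries like xgboost or lightgbm would be appended to the same dict.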
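Grid-search tuning of a top-performing model can be sketched as below; the grid is deliberately tiny for illustration, and a real run (or a Bayesian-optimization alternative such as optuna) would cover far more of the space:

```python
# Exhaustive grid search over a small, illustrative hyperparameter grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

grid = {"n_estimators": [50, 100], "max_features": ["sqrt", "log2"]}
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=1),
    param_grid=grid,
    scoring="accuracy",
    cv=RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1),
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
```

Using the same 3×10-fold scheme inside the search keeps tuned scores directly comparable with the spot-check results.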
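Stacking, one of the ensemble methods named above, can be sketched with `StackingClassifier`; the two base models and the logistic-regression meta-learner are assumed choices for illustration:

```python
# Stack two diverse base models; a meta-learner combines their
# out-of-fold predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=1)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # internal folds used to generate out-of-fold predictions
)
# 5-fold outer CV here to keep the nested evaluation cheap.
scores = cross_val_score(stack, X, y, scoring="accuracy", cv=5, n_jobs=-1)
print(f"stacking accuracy: {scores.mean():.3f}")
```

`VotingClassifier` covers the voting/averaging variants with the same `estimators` list.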
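Finally, logging every run to SQLite, as the post describes, needs only the standard library. The schema, table name, and the inserted example row are assumptions for illustration:

```python
# Persist each experiment's model name, configuration, and score to SQLite.
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # a real run would use e.g. "experiments.db"
conn.execute(
    """CREATE TABLE IF NOT EXISTS results (
           id INTEGER PRIMARY KEY,
           model TEXT,
           config TEXT,
           score REAL,
           logged_at TEXT DEFAULT CURRENT_TIMESTAMP
       )"""
)
# Illustrative placeholder row, not a real measured result.
conn.execute(
    "INSERT INTO results (model, config, score) VALUES (?, ?, ?)",
    ("random_forest", json.dumps({"n_estimators": 100}), 0.913),
)
conn.commit()
rows = list(conn.execute("SELECT model, config, score FROM results"))
print(rows)
conn.close()
```

Storing the configuration as JSON keeps the schema stable while models with different hyperparameters come and go; a `SELECT ... ORDER BY score DESC` then answers "what's winning right now" at any point in the iteration loop.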
Source: www.reddit.com









