In a project to predict house prices in Bangalore, a data scientist ran into a problem with the location data. The dataset used to build the predictive model includes variables such as location, area type, society, and total square footage. The location column is difficult to use because of its high cardinality: it contains approximately 1,000 unique values. The first approach, suggested by DeepSeek, was to keep only the top-N most frequent locations. However, this excluded a substantial amount of data and risked discarding valuable information. The data scientist is now looking for alternative ways to handle this high-cardinality column and is asking the broader data science community for its experience.
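As a rough illustration of the kind of alternative being asked about, the sketch below shows two common options that keep every row, unlike a top-N filter: grouping rare locations into a single catch-all label, and frequency encoding. It is not taken from the original question. The column name `location`, the helper names, the `min_count` threshold, the `"other"` label, and the toy data are all assumptions made for the example.

```python
import pandas as pd


def group_rare_categories(series, min_count=10, other_label="other"):
    """Replace categories seen fewer than min_count times with other_label.

    No rows are dropped; rare values are merged into one bucket.
    """
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep the original value where it is not rare, else use the bucket label.
    return series.where(~series.isin(rare), other_label)


def frequency_encode(series):
    """Map each category to its relative frequency in the column."""
    freq = series.value_counts(normalize=True)
    return series.map(freq)


# Toy data (hypothetical, only to show the mechanics).
df = pd.DataFrame(
    {"location": ["Whitefield"] * 3 + ["Hebbal"] * 2 + ["Yelahanka"]}
)

# Assumed threshold: locations with fewer than 2 rows count as rare.
df["location_grouped"] = group_rare_categories(df["location"], min_count=2)
df["location_freq"] = frequency_encode(df["location"])
```

Grouping reduces the number of distinct values before one-hot encoding, while frequency encoding yields a single numeric column regardless of cardinality. The threshold is a tuning choice, and frequencies should be computed on the training split only so that information from held-out data does not leak into the features.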
Source: stackoverflow.com















