I'm working on a project with a dataset that has quite a lot of missing values—really a lot.
Here's the output of colSums(is.na(dati_train)), showing the number of missing values per column:
> colSums(is.na(dati_train)) # Number of NAs per column
PAID POINT_OF_SALE EVENT_ID YEAR
0 0 0 0
MONTH N_SUBSCRIPTIONS PRICE PHONE_NUMBER
0 0 0 0
PROP_CONBINI PAYMENT_TYPE FAV_GENRE AGE
0 0 967 1723
DAYS_FROM_PROMO BOOKS_PAID N_TRANSACTIONS N_ITEMS
0 5574 5574 0
DATE_LAST_PURCHASE CUSTOMER_SINCE MAIL SUBSCR_CANC
5574 5574 0 0
MARGIN
5574
>
The dataset has around 17,000 observations, so dropping rows with missing values is not an option. Here's my current approach to handle the missing values, and I’d like your feedback:
For "FAV_GENRE" and "AGE": Since the number of missing values is relatively small, I’m considering using multiple imputation to fill them in.
For the other variables: The missing values are systematically distributed among them, so I was thinking of:
- Creating a new binary flag variable to indicate whether the value is missing or not.
- Training logistic regression and LDA models, including these flags as features. I’ve read that this is a common practice, but I’ve never done it before.
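The flag-plus-fill idea in the two bullets above can be sketched like this (toy frame with invented values; column names follow the question):

```python
import numpy as np
import pandas as pd

# Toy stand-in for two of the block-missing columns; values invented.
df = pd.DataFrame({
    "BOOKS_PAID": [3.0, np.nan, 7.0, np.nan],
    "MARGIN": [12.5, np.nan, 4.0, np.nan],
})

# One 0/1 indicator per column, then a neutral fill so logistic
# regression / LDA get a complete design matrix; the flag preserves
# the "was missing" signal that the fill would otherwise erase.
for col in ["BOOKS_PAID", "MARGIN"]:
    df[col + "_NA"] = df[col].isna().astype(int)
    df[col] = df[col].fillna(df[col].median())
```

Note that because the 5574-row block is missing on the same rows, all of these flags will be identical; keep just one shared indicator, or logistic regression and LDA will hit perfect collinearity.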
Using tree-based models like Random Forest and XGBoost: I know these models can handle missing values, but I’ve never worked with missing data in these algorithms. Are there any best practices I should follow?
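On the tree-based side: XGBoost (and LightGBM) handle NaNs natively by learning a per-split "default direction" for missing values, while many Random Forest implementations (e.g. scikit-learn's RandomForestClassifier) do not and need imputation first. Here is a pure-NumPy toy of the default-direction idea, using a hypothetical stump_error helper on invented data:

```python
import numpy as np

def stump_error(x, y, threshold, nan_goes_left):
    """Misclassification error of a one-split stump that predicts the
    majority class on each side; NaNs follow the chosen default
    direction, the same idea as XGBoost's sparsity-aware splits."""
    miss = np.isnan(x)
    left = np.zeros(x.shape, dtype=bool)
    left[~miss] = x[~miss] < threshold   # avoid NaN comparisons
    if nan_goes_left:
        left |= miss
    err = 0
    for side in (left, ~left):
        if side.any():
            majority = np.bincount(y[side]).argmax()
            err += int((y[side] != majority).sum())
    return err / len(y)

# Invented toy data: the missing values all belong to class 1, so the
# stump fits better when NaNs are routed right, with the high values.
x = np.array([1.0, 2.0, np.nan, 8.0, 9.0, np.nan])
y = np.array([0, 0, 1, 1, 1, 1])
errors = {d: stump_error(x, y, 5.0, nan_goes_left=d) for d in (True, False)}
```

XGBoost evaluates both directions at each candidate split and keeps the better one, so the main practical advice is to pass NaNs through unchanged rather than filling them with a sentinel like -999.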
Since I also have to make predictions on another dataset with similar missing value patterns, simply removing missing values is not an option. Does my approach make sense? Are there better alternatives in cases like this?
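Before committing to "systematic", it is worth verifying that the five 5574-NA columns really are missing on exactly the same rows; a pandas check on a toy stand-in (invented values, one deliberately mismatched row):

```python
import numpy as np
import pandas as pd

# Toy stand-in for two of the 5574-NA columns, with one row where
# the missingness patterns deliberately differ.
df = pd.DataFrame({
    "BOOKS_PAID": [1.0, np.nan, 2.0, np.nan],
    "MARGIN":     [5.0, np.nan, 1.0, 7.0],
})
mask = df[["BOOKS_PAID", "MARGIN"]].isna()
# True iff every column is missing on exactly the same rows as the first.
identical = bool(mask.eq(mask.iloc[:, 0], axis=0).all().all())
```

On the real data, if the check returns True, one shared flag describes the whole block, and the pattern likely just means those customers have no purchase history.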
Let me know what you think—thanks in advance!
Asked Dec 27, 2024 at 10:40 by giulio lo verde
Comments:
- First, I believe this should be asked on Stack Exchange and not Stack Overflow, as it is not code based. Second, how do you plan on doing LDA on categorical variables? – Onyambu, Dec 27, 2024 at 15:38
- It appears you want a conceptual discussion. I strongly advise you to move this discussion to Cross Validated (Stack Exchange) to get answers, and keep asking only code-related questions here. – hamagust, Dec 27, 2024 at 20:40
1 Answer
I would try a multiple imputation model. That's actually not much missing data, and a multiple imputation model should handle it fine. See this writeup for more info.
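If you go the multiple-imputation route, the mechanics look roughly like this toy NumPy sketch (made-up AGE values; in practice use mice in R or scikit-learn's IterativeImputer rather than hand-rolling draws): impute m times, fit the model on each completed copy, and pool the predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
age = np.array([34.0, np.nan, 51.0, np.nan, 27.0])  # invented values

# Draw each missing AGE from a normal fitted to the observed values,
# m times, keeping all m completed copies; a real MI routine models
# each variable conditionally on the others instead.
obs = age[~np.isnan(age)]
m = 5
completed = []
for _ in range(m):
    filled = age.copy()
    filled[np.isnan(age)] = rng.normal(obs.mean(), obs.std(ddof=1),
                                       size=int(np.isnan(age).sum()))
    completed.append(filled)
# Fit your model once per completed copy, then average the m predictions.
```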
python - Handling Systematic Missing Values in a Dataset for Logistic Regression, LDA, and Tree-Based Models - Stack Overflow