I'm working on a project with a dataset that has a large number of missing values.

Here's the output of colSums(is.na(dati_train)), showing the number of missing values per column:

> colSums(is.na(dati_train))   # Number of NAs per column
              PAID      POINT_OF_SALE           EVENT_ID               YEAR 
                 0                  0                  0                  0
             MONTH    N_SUBSCRIPTIONS              PRICE       PHONE_NUMBER
                 0                  0                  0                  0
      PROP_CONBINI       PAYMENT_TYPE          FAV_GENRE                AGE
                 0                  0                967               1723
   DAYS_FROM_PROMO         BOOKS_PAID     N_TRANSACTIONS            N_ITEMS
                 0               5574               5574                  0
DATE_LAST_PURCHASE     CUSTOMER_SINCE               MAIL        SUBSCR_CANC
              5574               5574                  0                  0
            MARGIN
              5574
> 

The dataset has around 17,000 observations, so dropping rows with missing values is not an option. Here's my current approach to handling them; I'd like your feedback:

  1. For "FAV_GENRE" and "AGE": Since the number of missing values is relatively small, I’m considering using multiple imputation to fill them in.

  2. For the other variables: BOOKS_PAID, N_TRANSACTIONS, DATE_LAST_PURCHASE, CUSTOMER_SINCE, and MARGIN each have exactly 5,574 NAs, which suggests they are missing together for the same rows. Since the missingness looks systematic, I was thinking of:

    • Creating a new binary flag variable to indicate whether the value is missing or not.
    • Training logistic regression and LDA models, including these flags as features. I’ve read that this is a common practice, but I’ve never done it before (see the first sketch after this list).

  3. Using tree-based models like Random Forest and XGBoost: I know these models can handle missing values, but I’ve never worked with missing data in these algorithms. Are there any best practices I should follow? (The second sketch below is what I had in mind.)
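
Concretely, the first sketch is what I mean by the flag idea in item 2. For the sketch I'm assuming that the five 5,574-NA columns are missing in the same rows and that SUBSCR_CANC is a 0/1 response; both are assumptions of the sketch, not facts I've verified.

# Columns sharing the 5,574-NA pattern (from the colSums output above)
miss_cols <- c("BOOKS_PAID", "N_TRANSACTIONS", "DATE_LAST_PURCHASE",
               "CUSTOMER_SINCE", "MARGIN")

# Check they really are missing in the same rows (should print TRUE);
# if not, I'd make one flag per column instead
all(sapply(miss_cols, function(v)
  identical(is.na(dati_train[[v]]), is.na(dati_train$BOOKS_PAID))))

# Single shared missingness flag
dati_train$MISS_HISTORY <- as.integer(is.na(dati_train$BOOKS_PAID))

# glm() and MASS::lda() drop rows containing NA, so after flagging I
# fill the numeric columns with a neutral value (the flag keeps the
# signal); date columns would first need converting to numbers
for (v in miss_cols) {
  if (is.numeric(dati_train[[v]])) {
    dati_train[[v]][is.na(dati_train[[v]])] <-
      median(dati_train[[v]], na.rm = TRUE)
  }
}

# Illustrative formula only; the real model would use more predictors
fit <- glm(SUBSCR_CANC ~ MISS_HISTORY + BOOKS_PAID + N_TRANSACTIONS +
             MARGIN + PRICE + N_ITEMS,
           data = dati_train, family = binomial)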

Since I also have to make predictions on another dataset with similar missingness patterns, I need an approach that works at prediction time too. Does my approach make sense? Are there better alternatives in cases like this?
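
As for item 3, this second sketch is roughly how I would feed the data, NAs and all, to xgboost; the same transformation would then run unchanged on the prediction dataset. (Same assumption as above: SUBSCR_CANC is a 0/1 response.)

library(xgboost)

# xgboost learns a default direction for NAs at each split, so the
# missing values can stay in the feature matrix
feat <- setdiff(names(dati_train)[sapply(dati_train, is.numeric)],
                "SUBSCR_CANC")
X <- data.matrix(dati_train[, feat])   # NAs are preserved
y <- dati_train$SUBSCR_CANC

dtrain <- xgb.DMatrix(X, label = y, missing = NA)
bst <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 200)

# The prediction set goes through the identical transformation
# ("dati_test" is just my placeholder name for it):
# pred <- predict(bst, xgb.DMatrix(data.matrix(dati_test[, feat]), missing = NA))

From what I've read, R's randomForest package, unlike xgboost, does not accept NAs directly (it needs na.roughfix or rfImpute), so for native missing-value handling I'd lean on xgboost.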

Let me know what you think—thanks in advance!

asked Dec 27, 2024 at 10:40 by giulio lo verde
  • First, I believe this should be asked on Stack Exchange rather than Stack Overflow, as it is not code-based. Second, how do you plan on doing LDA with categorical variables? – Onyambu, Dec 27, 2024 at 15:38
  • It appears you want a conceptual discussion. I strongly advise you to move it to Cross Validated / Stack Exchange to get answers, and to keep only code-related questions here. – hamagust, Dec 27, 2024 at 20:40

1 Answer


I would try multiple imputation. That's actually not much missing data, and a multiple imputation model should handle it fine. See this writeup for more info.
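
For example, with the mice package a minimal sketch could look like this (the glm formula and the SUBSCR_CANC response are placeholders; substitute your actual analysis model):

library(mice)

# Drop identifier-like columns before imputing (assuming that is what
# PHONE_NUMBER, MAIL and EVENT_ID are)
dat <- dati_train[, setdiff(names(dati_train),
                            c("PHONE_NUMBER", "MAIL", "EVENT_ID"))]

# m = 5 completed datasets; mice's defaults use pmm for numeric columns
# and logistic/polytomous models for factors such as FAV_GENRE
imp <- mice(dat, m = 5, seed = 1)

# Fit the analysis model on each completed dataset, then pool the
# estimates with Rubin's rules
fits <- with(imp, glm(SUBSCR_CANC ~ AGE + FAV_GENRE + PRICE + BOOKS_PAID,
                      family = binomial))
summary(pool(fits))

One caveat for your prediction set: run the test data through the same imputation procedure (for instance, impute it jointly with the training data, or with models fitted on the training data) rather than imputing it in isolation.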
