I'm working on a project with a dataset that has a large number of missing values.

Here's the output of colSums(is.na(dati_train)), showing the number of missing values per column:

> colSums(is.na(dati_train))   # Number of NAs per column
              PAID      POINT_OF_SALE           EVENT_ID               YEAR 
                 0                  0                  0                  0
             MONTH    N_SUBSCRIPTIONS              PRICE       PHONE_NUMBER
                 0                  0                  0                  0
      PROP_CONBINI       PAYMENT_TYPE          FAV_GENRE                AGE
                 0                  0                967               1723
   DAYS_FROM_PROMO         BOOKS_PAID     N_TRANSACTIONS            N_ITEMS
                 0               5574               5574                  0
DATE_LAST_PURCHASE     CUSTOMER_SINCE               MAIL        SUBSCR_CANC
              5574               5574                  0                  0
            MARGIN
              5574
> 

The dataset has around 17,000 observations, so dropping rows with missing values is not an option. Here's my current approach to handling them; I'd like your feedback:

  1. For "FAV_GENRE" and "AGE": Since the number of missing values is relatively small, I’m considering using multiple imputation to fill them in.

  2. For the other variables: BOOKS_PAID, N_TRANSACTIONS, DATE_LAST_PURCHASE, CUSTOMER_SINCE, and MARGIN each have exactly 5,574 NAs, which suggests they are missing together for the same rows. Since the missingness looks systematic, I was thinking of:

    • Creating a new binary flag variable to indicate whether the value is missing or not.
    • Training logistic regression and LDA models, including these flags as features. I’ve read that this is a common practice, but I’ve never done it before (see the first sketch after this list).

  3. Using tree-based models like Random Forest and XGBoost: I know these models can handle missing values, but I’ve never worked with missing data in these algorithms. Are there any best practices I should follow? (The second sketch below is what I had in mind.)
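
Concretely, the first sketch is what I mean by the flag idea in item 2. For the sketch I'm assuming that the five 5,574-NA columns are missing in the same rows and that SUBSCR_CANC is a 0/1 response; both are assumptions of the sketch, not facts I've verified.

# Columns sharing the 5,574-NA pattern (from the colSums output above)
miss_cols <- c("BOOKS_PAID", "N_TRANSACTIONS", "DATE_LAST_PURCHASE",
               "CUSTOMER_SINCE", "MARGIN")

# Check they really are missing in the same rows (should print TRUE);
# if not, I'd make one flag per column instead
all(sapply(miss_cols, function(v)
  identical(is.na(dati_train[[v]]), is.na(dati_train$BOOKS_PAID))))

# Single shared missingness flag
dati_train$MISS_HISTORY <- as.integer(is.na(dati_train$BOOKS_PAID))

# glm() and MASS::lda() drop rows containing NA, so after flagging I
# fill the numeric columns with a neutral value (the flag keeps the
# signal); date columns would first need converting to numbers
for (v in miss_cols) {
  if (is.numeric(dati_train[[v]])) {
    dati_train[[v]][is.na(dati_train[[v]])] <-
      median(dati_train[[v]], na.rm = TRUE)
  }
}

# Illustrative formula only; the real model would use more predictors
fit <- glm(SUBSCR_CANC ~ MISS_HISTORY + BOOKS_PAID + N_TRANSACTIONS +
             MARGIN + PRICE + N_ITEMS,
           data = dati_train, family = binomial)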

Since I also have to make predictions on another dataset with similar missingness patterns, I need an approach that works at prediction time too. Does my approach make sense? Are there better alternatives in cases like this?
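
As for item 3, this second sketch is roughly how I would feed the data, NAs and all, to xgboost; the same transformation would then run unchanged on the prediction dataset. (Same assumption as above: SUBSCR_CANC is a 0/1 response.)

library(xgboost)

# xgboost learns a default direction for NAs at each split, so the
# missing values can stay in the feature matrix
feat <- setdiff(names(dati_train)[sapply(dati_train, is.numeric)],
                "SUBSCR_CANC")
X <- data.matrix(dati_train[, feat])   # NAs are preserved
y <- dati_train$SUBSCR_CANC

dtrain <- xgb.DMatrix(X, label = y, missing = NA)
bst <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 200)

# The prediction set goes through the identical transformation
# ("dati_test" is just my placeholder name for it):
# pred <- predict(bst, xgb.DMatrix(data.matrix(dati_test[, feat]), missing = NA))

From what I've read, R's randomForest package, unlike xgboost, does not accept NAs directly (it needs na.roughfix or rfImpute), so for native missing-value handling I'd lean on xgboost.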

Let me know what you think—thanks in advance!

asked Dec 27, 2024 at 10:40 by giulio lo verde
  • First, I believe this should be asked on Stack Exchange rather than Stack Overflow, as it is not code-based. Second, how do you plan on doing LDA with categorical variables? – Onyambu, Dec 27, 2024 at 15:38
  • It appears you want a conceptual discussion. I strongly advise you to move it to Cross Validated / Stack Exchange to get answers, and to keep only code-related questions here. – hamagust, Dec 27, 2024 at 20:40

1 Answer


I would try multiple imputation. That's actually not much missing data, and a multiple imputation model should handle it fine. See this writeup for more info.
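
For example, with the mice package a minimal sketch could look like this (the glm formula and the SUBSCR_CANC response are placeholders; substitute your actual analysis model):

library(mice)

# Drop identifier-like columns before imputing (assuming that is what
# PHONE_NUMBER, MAIL and EVENT_ID are)
dat <- dati_train[, setdiff(names(dati_train),
                            c("PHONE_NUMBER", "MAIL", "EVENT_ID"))]

# m = 5 completed datasets; mice's defaults use pmm for numeric columns
# and logistic/polytomous models for factors such as FAV_GENRE
imp <- mice(dat, m = 5, seed = 1)

# Fit the analysis model on each completed dataset, then pool the
# estimates with Rubin's rules
fits <- with(imp, glm(SUBSCR_CANC ~ AGE + FAV_GENRE + PRICE + BOOKS_PAID,
                      family = binomial))
summary(pool(fits))

One caveat for your prediction set: run the test data through the same imputation procedure (for instance, impute it jointly with the training data, or with models fitted on the training data) rather than imputing it in isolation.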
