pandas - Extracting data in a certain format from Excel using Python

After loading the data into a DataFrame, it looks like this:

data 1
id	datum
2000	2024.09.02
2903	2024.09.02
data 2
id	datum
4000	2024.09.02
4001	2024.09.02

The expected output should look like this:

id	datum	data
2000	2024.09.02	1
2903	2024.09.02	1
4000	2024.09.02	2
4001	2024.09.02	2

import pandas as pd
import numpy as np

# Step 1: Read the Excel file
# Replace 'your_file_path.xlsx' with the actual path to your Excel file
file_path = 'your_file_path.xlsx'

# Load the Excel file
xls = pd.ExcelFile(file_path)

# Assuming 'Datei2' is the name of the sheet we want to read
df = pd.read_excel(xls, sheet_name='Datei2')
df['Column1 - Copy'] = df['Column1']
split_columns = df['Column1 - Copy'].str.split(' ', expand=True)
split_columns.columns = [f'Column1 - Copy.{i+1}' for i in range(split_columns.shape[1])]
df = pd.concat([df, split_columns], axis=1)

# Step 6: Fill Down Values
# (Series.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas)
df['Column1 - Copy.16'] = df['Column1 - Copy.16'].ffill()

asked Nov 18, 2024 at 20:46 by niki (edited Nov 18, 2024 at 20:55)

Please take some time to read How to Ask and How to create a Minimal, Reproducible Example. – LMC, commented Nov 18, 2024 at 20:47
What is the point of the added code? Your sample doesn't have any of these columns. Please edit your question to fix the code so that it meaningfully fits your sample data, and explain how it relates to either input or desired output. – ouroboros1, commented Nov 18, 2024 at 21:29

2 Answers

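Assuming your Excel file looks exactly like your sample, you can use cumsum to put the rows into groups: each "data N" marker row has NaN in the second column, so a cumulative sum over isna() increases by one at the start of every block. Then filter the DataFrame to remove the marker rows and the repeated header rows: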

# replace 'your_file_path.xlsx' with the actual path to your Excel file
df = pd.read_excel('your_file_path.xlsx', header=None, names=['id', 'datum'])

       id       datum
0  data 1         NaN
1      id       datum
2    2000  2024.09.02
3    2903  2024.09.02
4  data 2         NaN
5      id       datum
6    4000  2024.09.02
7    4001  2024.09.02

df['data'] = (df['datum'].isna()).cumsum()

       id       datum  data
0  data 1         NaN     1
1      id       datum     1
2    2000  2024.09.02     1
3    2903  2024.09.02     1
4  data 2         NaN     2
5      id       datum     2
6    4000  2024.09.02     2
7    4001  2024.09.02     2

df = df[~df['datum'].eq('datum')].dropna(how='any')

     id       datum  data
2  2000  2024.09.02     1
3  2903  2024.09.02     1
6  4000  2024.09.02     2
7  4001  2024.09.02     2
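
The steps above can be tried without an Excel file. The following is a minimal sketch, assuming the sheet is read exactly as in the sample; the in-memory frame raw and the helper name label_blocks are hypothetical stand-ins, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for:
#   pd.read_excel(path, header=None, names=['id', 'datum'])
raw = pd.DataFrame(
    {
        "id": ["data 1", "id", 2000, 2903, "data 2", "id", 4000, 4001],
        "datum": [np.nan, "datum", "2024.09.02", "2024.09.02",
                  np.nan, "datum", "2024.09.02", "2024.09.02"],
    }
)


def label_blocks(df):
    """Number each block by counting marker rows, then drop non-data rows."""
    out = df.copy()
    # Marker rows ("data N") have NaN in 'datum', so the running count
    # of NaNs goes up by one at the start of every block.
    out["data"] = out["datum"].isna().cumsum()
    # Drop the repeated header rows, then the marker rows (which contain NaN).
    out = out[out["datum"].ne("datum")].dropna(how="any")
    return out.reset_index(drop=True)


result = label_blocks(raw)
print(result)
```

Note that this numbers the blocks by their order of appearance; it does not read the number out of the "data N" text, so it only matches the labels when the blocks are numbered 1, 2, 3, ... in sequence.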

import pandas as pd

# Step 1: Read the Excel file (replace 'your_file_path.xlsx' with the actual file path)
file_path = 'your_file_path.xlsx'

# Assuming the data spans multiple blocks in a single sheet
df = pd.read_excel(file_path, header=None)

# Step 2: Identify blocks of data and label them
# A block starts with "data X" where X is the block number
df['block'] = df[0].str.extract(r'data (\d+)', expand=False).ffill()

# Step 3: Remove rows that are not part of the main data:
# the "data X" marker rows and the repeated "id / datum" header rows
# (without the header filter, astype(int) in Step 5 fails on the string 'id')
is_marker = df[0].str.contains('data', na=False)
is_header = df[0].eq('id')
df = df[~(is_marker | is_header)].copy()

# Step 4: Rename columns and clean up the DataFrame
df.columns = ['id', 'datum', 'block']
df = df.dropna(subset=['id', 'datum'])

# Step 5: Clean and format the columns
df['id'] = df['id'].astype(int)
df['datum'] = pd.to_datetime(df['datum'], errors='coerce').dt.strftime('%Y.%m.%d')
df['block'] = df['block'].astype(int)
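
For comparison, here is a self-contained sketch of the same regex-based idea that runs on in-memory data instead of an Excel file. The frame raw, the helper name extract_blocks, and the output column name data (chosen to match the expected output in the question) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_excel(path, header=None)
raw = pd.DataFrame(
    [
        ["data 1", np.nan],
        ["id", "datum"],
        [2000, "2024.09.02"],
        [2903, "2024.09.02"],
        ["data 2", np.nan],
        ["id", "datum"],
        [4000, "2024.09.02"],
        [4001, "2024.09.02"],
    ]
)


def extract_blocks(df):
    """Read the block number from each "data N" row and attach it to the rows below."""
    first = df[0].astype(str)  # compare everything as text
    out = pd.DataFrame(
        {
            "id": df[0],
            "datum": df[1],
            # Capture N from "data N", then carry it down to the following rows.
            "data": first.str.extract(r"^data (\d+)$", expand=False).ffill(),
        }
    )
    # Keep only real data rows: no "data N" markers, no repeated "id" headers.
    keep = ~first.str.startswith("data") & first.ne("id")
    out = out[keep].dropna(subset=["id", "datum"])
    return out.astype({"id": int, "data": int}).reset_index(drop=True)


result = extract_blocks(raw)
print(result)
```

Because the number is taken from the marker text rather than counted, this variant keeps whatever label the sheet uses, even if the blocks are not numbered consecutively.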
