# After loading the data into a DataFrame, it looks like this:
data 1 | |
---|---|
id | datum |
2000 | 2024.09.02 |
2903 | 2024.09.02 |
data 2 | |
id | datum |
4000 | 2024.09.02 |
4001 | 2024.09.02 |
# The expected result should look like this:
id | datum | data |
---|---|---|
2000 | 2024.09.02 | 1 |
2903 | 2024.09.02 | 1 |
4000 | 2024.09.02 | 2 |
4001 | 2024.09.02 | 2 |
import pandas as pd
import numpy as np
# Step 1: Read the Excel file
# Replace 'your_file_path.xlsx' with the actual path to your Excel file
file_path = 'your_file_path.xlsx'
# Load the Excel file
xls = pd.ExcelFile(file_path)
# Assuming 'Datei2' is the name of the sheet we want to read
df = pd.read_excel(xls, sheet_name='Datei2')
df['Column1 - Copy'] = df['Column1']
split_columns = df['Column1 - Copy'].str.split(' ', expand=True)
split_columns.columns = [f'Column1 - Copy.{i+1}' for i in range(split_columns.shape[1])]
df = pd.concat([df, split_columns], axis=1)
# Step 6: Fill Down Values
df['Column1 - Copy.16'] = df['Column1 - Copy.16'].ffill()  # fillna(method='ffill') is deprecated in recent pandas
asked Nov 18, 2024 at 20:46 by niki, edited Nov 18, 2024 at 20:55
- 1 Please, take some time to read how to ask and How to create a Minimal, Reproducible Example – LMC Commented Nov 18, 2024 at 20:47
- 2 What is the point of the added code? Your sample doesn't have any of these columns. Please edit your question to fix the code so that it meaningfully fits your sample data, and explain how it relates to either input or desired output. – ouroboros1 Commented Nov 18, 2024 at 21:29
2 Answers
Assuming your Excel file looks exactly like the sample you posted, you can use cumsum
to put your data into groups and then remove the unnecessary rows by filtering the DataFrame:
df = pd.read_excel('your_file_path.xlsx', header=None, names=['id', 'datum'])  # replace with your file path
id datum
0 data 1 NaN
1 id datum
2 2000 2024.09.02
3 2903 2024.09.02
4 data 2 NaN
5 id datum
6 4000 2024.09.02
7 4001 2024.09.02
df['data'] = (df['datum'].isna()).cumsum()
id datum data
0 data 1 NaN 1
1 id datum 1
2 2000 2024.09.02 1
3 2903 2024.09.02 1
4 data 2 NaN 2
5 id datum 2
6 4000 2024.09.02 2
7 4001 2024.09.02 2
df = df[~df['datum'].eq('datum')].dropna(how='any')
id datum data
2 2000 2024.09.02 1
3 2903 2024.09.02 1
6 4000 2024.09.02 2
7 4001 2024.09.02 2
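The steps above can be tried without an Excel file. The following is a minimal sketch, assuming the sheet reads in exactly as shown; the `raw` frame is a hand-built stand-in for the `pd.read_excel(...)` result, and `label_blocks` is a hypothetical helper name, not something from the answer.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for pd.read_excel(..., header=None, names=['id', 'datum']).
# Assumption: the "data N" marker rows have an empty second cell (NaN).
raw = pd.DataFrame(
    {
        "id": ["data 1", "id", 2000, 2903, "data 2", "id", 4000, 4001],
        "datum": [np.nan, "datum", "2024.09.02", "2024.09.02",
                  np.nan, "datum", "2024.09.02", "2024.09.02"],
    }
)


def label_blocks(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: tag each row with a running block counter."""
    out = df.copy()
    # Every marker row has a NaN in 'datum', so the cumulative count of
    # NaNs increases by one at the start of each block.
    out["data"] = out["datum"].isna().cumsum()
    # Drop the repeated "id | datum" header rows, then the marker rows
    # (the only rows that still contain a NaN).
    out = out[out["datum"].ne("datum")].dropna(how="any")
    return out.reset_index(drop=True)


result = label_blocks(raw)
print(result)
```

Note that the `data` column here is a counter of blocks in the order they appear, not the number written in the "data N" label. The two only agree when the labels are numbered 1, 2, 3, ... without gaps.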
import pandas as pd
# Step 1: Read the Excel file (replace 'your_file_path.xlsx' with the actual file path)
file_path = 'your_file_path.xlsx'
# Assuming the data spans multiple blocks in a single sheet
df = pd.read_excel(file_path, header=None)
# Step 2: Identify blocks of data and label them
# A block starts with "data X" where X is the block number
df['block'] = df[0].str.extract(r'data (\d+)', expand=False).ffill()
# Step 3: Remove the "data X" marker rows and the repeated "id | datum" header rows
# (without dropping the header rows, astype(int) in Step 5 fails on the string 'id')
df = df[~df[0].str.contains('data', na=False) & df[0].ne('id')].copy()
# Step 4: Rename columns and clean up the DataFrame
df.columns = ['id', 'datum', 'block']
df = df.dropna(subset=['id', 'datum'])
# Step 5: Clean and format the columns
df['id'] = df['id'].astype(int)
df['datum'] = pd.to_datetime(df['datum'], errors='coerce').dt.strftime('%Y.%m.%d')
df['block'] = df['block'].astype(int)
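The two answers differ in where the block number comes from: one counts marker rows, the other reads the number out of the "data X" label. Here is a small sketch of that difference, assuming (hypothetically) that the blocks were labelled "data 3" and "data 7" instead of 1 and 2. The `raw` frame and the column names `block_from_label` and `block_from_count` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for pd.read_excel(file_path, header=None).
# Assumption: block labels are not consecutive ("data 3", "data 7").
raw = pd.DataFrame(
    {
        0: ["data 3", "id", 2000, 2903, "data 7", "id", 4000, 4001],
        1: [np.nan, "datum", "2024.09.02", "2024.09.02",
            np.nan, "datum", "2024.09.02", "2024.09.02"],
    }
)

# Block number taken from the label text, then filled down the block.
labels = raw[0].astype(str).str.extract(r"^data\s+(\d+)$", expand=False)
from_label = labels.ffill()

# Block number as a running count of marker rows (empty second cell).
from_count = raw[1].isna().cumsum()

# Keep only real data rows: not a "data X" marker and not an "id" header.
keep = labels.isna() & raw[0].ne("id")

tidy = pd.DataFrame(
    {
        "id": raw.loc[keep, 0].astype(int),
        "datum": raw.loc[keep, 1],
        "block_from_label": pd.to_numeric(from_label[keep]),
        "block_from_count": from_count[keep],
    }
).reset_index(drop=True)

print(tidy)
```

If the number in the label matters, the label-based column is the one to keep. If only "first block, second block, ..." matters, the counter is simpler and does not depend on the label format.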