# After loading the data into a DataFrame, it looks like this:

data 1
id datum
2000 2024.09.02
2903 2024.09.02
data 2
id datum
4000 2024.09.02
4001 2024.09.02

# The expected result should look like this:

id datum data
2000 2024.09.02 1
2903 2024.09.02 1
4000 2024.09.02 2
4001 2024.09.02 2

import pandas as pd

# Step 1: Read the Excel file
# Replace 'your_file_path.xlsx' with the actual path to your Excel file
file_path = 'your_file_path.xlsx'

# Load the Excel file
xls = pd.ExcelFile(file_path)

# Assuming 'Datei2' is the name of the sheet we want to read
df = pd.read_excel(xls, sheet_name='Datei2')

# Step 2: Copy the column and split the copy on spaces into separate columns
df['Column1 - Copy'] = df['Column1']
split_columns = df['Column1 - Copy'].str.split(' ', expand=True)
split_columns.columns = [f'Column1 - Copy.{i+1}' for i in range(split_columns.shape[1])]
df = pd.concat([df, split_columns], axis=1)

# Step 3: Fill down values (Series.ffill() replaces the deprecated fillna(method='ffill'))
df['Column1 - Copy.16'] = df['Column1 - Copy.16'].ffill()

Asked by niki on Nov 18, 2024 at 20:46; edited Nov 18, 2024 at 20:55.

  • Please take some time to read how to ask and how to create a Minimal, Reproducible Example. – LMC, Nov 18, 2024 at 20:47
  • What is the point of the added code? Your sample doesn't have any of these columns. Please edit your question to fix the code so that it meaningfully fits your sample data, and explain how it relates to either input or desired output. – ouroboros1, Nov 18, 2024 at 21:29

2 Answers

Assuming your Excel file looks exactly like the sample you posted, you can use cumsum to assign each row to its block and then filter out the rows you don't need:

df = pd.read_excel('your_file_path.xlsx', header=None, names=['id', 'datum'])
       id       datum
0  data 1         NaN
1      id       datum
2    2000  2024.09.02
3    2903  2024.09.02
4  data 2         NaN
5      id       datum
6    4000  2024.09.02
7    4001  2024.09.02

df['data'] = (df['datum'].isna()).cumsum()
        id       datum  data
0  data 1         NaN     1
1      id       datum     1
2    2000  2024.09.02     1
3    2903  2024.09.02     1
4  data 2         NaN     2
5      id       datum     2
6    4000  2024.09.02     2
7    4001  2024.09.02     2

# Drop the repeated 'id datum' header rows, then the 'data N' marker rows (they contain NaN)
df = df[~df['datum'].eq('datum')].dropna(how='any')
     id       datum  data
2  2000  2024.09.02     1
3  2903  2024.09.02     1
6  4000  2024.09.02     2
7  4001  2024.09.02     2
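
To try the cumsum approach above without an Excel file, here is a minimal sketch that builds the same frame in memory. The rows are copied from the question's sample; the variable names `raw` and `result`, and the final `astype`/`reset_index` clean-up, are additions for illustration and not part of the answer. It assumes the "data N" marker rows have an empty second cell.

```python
import numpy as np
import pandas as pd

# In-memory stand-in for pd.read_excel(..., header=None, names=['id', 'datum']).
# Assumption: the "data N" marker rows have an empty second cell (NaN).
raw = pd.DataFrame(
    {
        "id": ["data 1", "id", 2000, 2903, "data 2", "id", 4000, 4001],
        "datum": [np.nan, "datum", "2024.09.02", "2024.09.02",
                  np.nan, "datum", "2024.09.02", "2024.09.02"],
    }
)

# Each marker row has NaN in 'datum', so the running count of NaNs
# numbers the blocks 1, 2, ... in order of appearance.
raw["data"] = raw["datum"].isna().cumsum()

# Drop the repeated header rows, then the marker rows (which contain NaN).
result = raw[~raw["datum"].eq("datum")].dropna(how="any")

# Optional clean-up (not in the answer): integer ids and a fresh index.
result = result.astype({"id": int}).reset_index(drop=True)

print(result)
```

Note that this numbers the blocks by position rather than by the number written in the marker row. If the markers were "data 3" and "data 7", the `data` column would still contain 1 and 2.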

import pandas as pd

# Step 1: Read the Excel file (replace 'your_file_path.xlsx' with the actual file path)
file_path = 'your_file_path.xlsx'

# Assuming the data spans multiple blocks in a single sheet
df = pd.read_excel(file_path, header=None)

# Step 2: Identify blocks of data and label them
# A block starts with "data X" where X is the block number
df['block'] = df[0].str.extract(r'data (\d+)', expand=False).ffill()

# Step 3: Remove the "data X" marker rows and the repeated "id datum" header rows
# (without dropping the header rows, astype(int) in Step 5 fails on the string 'id')
df = df[~df[0].str.contains('data', na=False)]
df = df[df[0] != 'id'].copy()

# Step 4: Rename columns and drop incomplete rows
df.columns = ['id', 'datum', 'block']
df = df.dropna(subset=['id', 'datum'])

# Step 5: Clean and format the columns
df['id'] = df['id'].astype(int)
df['datum'] = pd.to_datetime(df['datum'], errors='coerce').dt.strftime('%Y.%m.%d')
df['block'] = df['block'].astype(int)
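
The same steps can be sketched on an in-memory frame, which makes it easier to check the result against the expected output in the question. The function name `label_blocks` is hypothetical, the sample rows are taken from the question, and the explicit removal of the repeated `id`/`datum` header rows is an assumption about what the sheet contains.

```python
import pandas as pd


def label_blocks(raw: pd.DataFrame) -> pd.DataFrame:
    """Turn stacked 'data N' blocks into one table with a 'data' column.

    Hypothetical helper; expects two columns (0 and 1), as returned by
    pd.read_excel(..., header=None).
    """
    first = raw[0].astype(str)

    # Pull N out of the "data N" marker rows and carry it down the block.
    block = first.str.extract(r"^data (\d+)$", expand=False).ffill()

    # Keep only real data rows: neither a marker nor a repeated header.
    keep = ~first.str.startswith("data") & first.ne("id")

    out = raw.loc[keep].copy()
    out.columns = ["id", "datum"]
    out["id"] = out["id"].astype(int)
    out["data"] = block[keep].astype(int)
    return out.reset_index(drop=True)


# Sample rows from the question (marker rows have an empty second cell).
sample = pd.DataFrame(
    [
        ["data 1", None],
        ["id", "datum"],
        [2000, "2024.09.02"],
        [2903, "2024.09.02"],
        ["data 2", None],
        ["id", "datum"],
        [4000, "2024.09.02"],
        [4001, "2024.09.02"],
    ]
)

print(label_blocks(sample))
```

Unlike a positional counter, this reads the block number from the marker text itself, so a sheet with "data 3" followed by "data 7" would be labelled 3 and 7.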

Tags: pandas · Extracting certain format data from Excel using Python · Stack Overflow