Published 2022-10-16 · Updated 2022-10-16 · Data Science · 2 min read (about 353 words)

Data Cleaning 2

Some operations you may need when facing a new dataset (pandas, numpy), mixed edition 2

import pandas as pd
import numpy as np
import os
import warnings
# First, silence the annoying warning messages
warnings.filterwarnings('ignore')

Add

Joining two DataFrames

df = df.set_index('station_id').join(df2.set_index('station_id'))
# Match the contents of df2 onto df via station_id
# Note that after the join, station_id becomes the index of df,
# and the newly added columns must not clash with existing column names
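
A minimal sketch of the join above on toy data. The frames `trips`, `stations`, and `other` and their columns are made up for illustration. It shows the default left-join behaviour, how to get station_id back as a column, and one way around a column-name clash.

```python
import pandas as pd

# Hypothetical toy data; station_id is the shared key.
trips = pd.DataFrame({"station_id": [1, 2, 3], "trips": [10, 20, 30]})
stations = pd.DataFrame({"station_id": [1, 2], "name": ["A", "B"]})

# join() is a left join by default: every row of `trips` is kept,
# and keys with no match in `stations` get NaN in the new columns.
joined = trips.set_index("station_id").join(stations.set_index("station_id"))

# reset_index() turns station_id back into an ordinary column.
joined = joined.reset_index()

# If both frames share a non-key column name, join() raises a ValueError
# unless lsuffix and/or rsuffix is given to disambiguate.
other = pd.DataFrame({"station_id": [1, 2], "trips": [5, 6]})
suffixed = trips.set_index("station_id").join(
    other.set_index("station_id"), rsuffix="_other"
)
```

If you would rather keep station_id as a column throughout, `pd.merge(trips, stations, on="station_id", how="left")` is a common alternative.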

Using apply

def func(row):
    return row.x + row.y

df['sum'] = df.apply(func, axis=1)

# Equivalent to
df['sum'] = df.apply(lambda row: row.x + row.y, axis=1)

Speeding up apply

Link1 Zhihu

Link2 CSDN
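
The contents of the two linked posts are not reproduced here, so the following is not necessarily what they recommend. One widely used way to avoid a slow row-wise apply is to express the same computation as column operations. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same x / y columns as the apply example.
df = pd.DataFrame({"x": np.arange(5), "y": np.arange(5) * 2})

# Row-wise apply: calls a Python function once per row.
slow = df.apply(lambda row: row.x + row.y, axis=1)

# Vectorized: a single operation over whole columns, same result.
fast = df["x"] + df["y"]

# Conditional logic can often be written with np.where
# instead of an if/else inside the applied function.
label = np.where(df["x"] + df["y"] > 5, "high", "low")
```

On large frames the vectorized form is usually much faster, because the loop runs inside NumPy rather than in Python.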

Delete

Dropping rows with missing values

df.dropna(inplace=True)
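
Dropping every row that has any missing value can throw away more data than intended. A small sketch on a made-up frame, showing the `subset` and `how` parameters of `dropna` for narrowing what counts as "missing":

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in columns a and b.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7, 8, 9],
})

# Default: drop a row if any column is NaN.
any_missing = df.dropna()

# Only look at column "a" when deciding which rows to drop.
key_only = df.dropna(subset=["a"])

# Drop a row only when both "a" and "b" are NaN.
all_missing = df.dropna(how="all", subset=["a", "b"])
```

These calls return new frames and leave `df` untouched; pass `inplace=True` to modify `df` directly, as in the line above.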

Query

Selecting specific columns

df = df.loc[:, ['columns1', 'columns2', 'columns3']]
# or
df = df[['columns1', 'columns2', 'columns3']]

Filtering rows on one or more columns

Combine conditions with & and | (not and/or), and wrap each condition in parentheses, e.g.,

df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])

import datetime
open_time = datetime.datetime.strptime('2019-07-25 00:00:00', '%Y-%m-%d %H:%M:%S')
close_time = datetime.datetime.strptime('2019-07-25 23:59:59', '%Y-%m-%d %H:%M:%S')

df = df[(df['start_time'] >= open_time) & (df['end_time'] <= close_time) & (df['end_time'] > df['start_time'])]
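
A self-contained version of the filter above, run on three made-up rows so the effect of each condition is visible. `pd.Timestamp` is used here in place of `datetime.strptime`; that is a substitution for brevity, not a required change.

```python
import pandas as pd

# Hypothetical toy data: row 0 is valid, row 1 starts before opening,
# row 2 ends before it starts.
df = pd.DataFrame({
    "start_time": ["2019-07-25 08:00:00", "2019-07-24 23:50:00", "2019-07-25 10:00:00"],
    "end_time": ["2019-07-25 09:00:00", "2019-07-25 00:10:00", "2019-07-25 09:30:00"],
})
df["start_time"] = pd.to_datetime(df["start_time"])
df["end_time"] = pd.to_datetime(df["end_time"])

open_time = pd.Timestamp("2019-07-25 00:00:00")
close_time = pd.Timestamp("2019-07-25 23:59:59")

# Each condition is a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than >= and <=.
mask = (
    (df["start_time"] >= open_time)
    & (df["end_time"] <= close_time)
    & (df["end_time"] > df["start_time"])
)
kept = df[mask]
```

Building the mask as a named variable first makes it easy to inspect how many rows each condition removes, e.g. with `mask.sum()`.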

Using groupby

Example: counting rows per city and year

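# num_cities and num_years must be defined beforehand; this also assumes
# city and year are already encoded as integer indices starting at 0
res = np.zeros(shape=[num_cities, num_years])
group = df.groupby(['city', 'year'])
for (c, y), sub in group:
    res[c, y] = len(sub)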
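
When city and year are not already integer codes, the same count table can be built without the explicit loop. A sketch on made-up data, using `size()` and `unstack()`:

```python
import pandas as pd

# Hypothetical toy data with string city names and real year values.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "year": [2019, 2020, 2019, 2019, 2020],
})

# size() counts rows per (city, year) group; unstack() pivots year into
# columns, and fill_value=0 fills combinations that never occur.
counts = df.groupby(["city", "year"]).size().unstack(fill_value=0)

# Plain NumPy array, rows ordered by city and columns by year.
matrix = counts.to_numpy()
```

`counts` keeps the city and year labels on its index and columns, so `counts.loc["B", 2019]` looks up a cell by name rather than by position.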

Update

Changing a column's data type

df['column'] = df['column'].astype(int)
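
`astype(int)` raises an error if the column contains missing values or strings that are not numbers. A sketch of one way to handle that on made-up data, using `pd.to_numeric` with `errors="coerce"` and the nullable `Int64` dtype:

```python
import pandas as pd

# Hypothetical toy column with one value that is not a number.
df = pd.DataFrame({"column": ["1", "2", "n/a"]})

# errors="coerce" turns unparseable values into NaN instead of raising.
numeric = pd.to_numeric(df["column"], errors="coerce")

# The nullable Int64 dtype (capital I) can hold integers alongside missing values.
df["column"] = numeric.astype("Int64")
```

Whether to coerce bad values to missing or to fail loudly depends on the dataset; coercion is convenient but can hide data-quality problems.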

Renaming columns

df.rename(columns={'c1': 'c2', 'c3': 'c4'}, inplace=True)

Data Cleaning 2

https://lionelsy.github.io/blog/2022/10/16/P22-pandas2/

Author: Shuyu Zhang

Published: 2022-10-16

Updated: 2022-10-16

#Python
