Published 2021-05-16 · Updated 2022-10-16 · Data Science

Data Cleaning

A few simple operations (pandas, seaborn) that are often needed when working with a new dataset.

Definition

import pandas as pd
df = pd.DataFrame(values, index, columns=['A', 'B'])  # build a custom DataFrame from values, row labels, and column names
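As a minimal sketch of the constructor call above, the snippet below fills in `values` and `index` with small made-up data. The array contents and the row labels `r1` to `r3` are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 3 rows, 2 columns
values = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
index = ["r1", "r2", "r3"]  # hypothetical row labels

# Passing index and columns by keyword makes the call easier to read
df = pd.DataFrame(values, index=index, columns=["A", "B"])
print(df)
```

The number of rows in `values` has to match the length of `index`, and the number of columns has to match the length of `columns`.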

Inspection

df.dtypes  # dtype of each column

df.isnull().sum()  # number of NaN values in each column

df.loc[index, columns]  # sub-table of df with row labels index and column labels columns
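The three inspection calls can be tried together on a toy frame. This is only a sketch: the column contents and the row labels are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
df = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0], "B": ["x", "y", None]},
    index=["r1", "r2", "r3"],
)

dtypes = df.dtypes                    # one dtype per column; A is float64
missing = df.isnull().sum()           # missing-value count per column
subset = df.loc[["r1", "r3"], ["A"]]  # rows r1 and r3, column A only

print(dtypes)
print(missing)
print(subset)
```

Note that `df.loc` selects by label, so the row and column arguments are labels (or lists or slices of labels), not integer positions.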

Removing Duplicates

df = df.drop_duplicates(keep='first')  # drop duplicate rows, keeping the first occurrence of each
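A small sketch of how `drop_duplicates` behaves, using an invented three-row frame. By default the call returns a new DataFrame and leaves the original untouched, so the result has to be assigned.

```python
import pandas as pd

# Hypothetical frame in which rows 0 and 1 are identical
df = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "x", "y"]})

# keep="first" retains the first of each group of duplicate rows
deduped = df.drop_duplicates(keep="first")

print(len(df), len(deduped))
```

The surviving rows keep their original index labels; calling `reset_index(drop=True)` afterwards renumbers them if that is needed.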

Time

df['time'] = pd.to_datetime(df['time'])  # convert the 'time' column of df to datetime

dates = pd.date_range("1 1 2016", periods=24*4, freq="15min")
# generate a DatetimeIndex from a given start, length, and interval
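The two time helpers above might be combined as follows. The sample timestamps and the `value` column are assumptions; using the converted column as the index is one common choice, not something the original post prescribes.

```python
import pandas as pd

# Hypothetical raw data with timestamps stored as strings
df = pd.DataFrame(
    {"time": ["2016-01-01 00:00", "2016-01-01 00:15"], "value": [1.0, 2.0]}
)

# to_datetime returns a new Series, so assign it back to the column
df["time"] = pd.to_datetime(df["time"])
df = df.set_index("time")  # a DatetimeIndex enables slicing by date strings

# One day of timestamps at 15-minute intervals: 24 * 4 = 96 entries
dates = pd.date_range("2016-01-01", periods=24 * 4, freq="15min")
print(dates[0], dates[-1], len(dates))
```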

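Recording Missing Values

import numpy as np
tag = np.isnan(df.values)  # boolean mask of missing entries; requires all-numeric columns
tag = tag.astype('float32')  # 1 marks a missing entry, 0 a present one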
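`np.isnan` raises a `TypeError` on object-dtype arrays, which is what `df.values` becomes when the frame mixes numbers and strings. As a hedged alternative, the sketch below builds the same mask with `DataFrame.isna`, which works for any column type. The sample frame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
df = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 4.0]})

# isna() gives a boolean frame; convert to a float32 array (1 = missing, 0 = present)
tag = df.isna().to_numpy().astype("float32")
print(tag)
```

Computing the mask before interpolation preserves a record of which entries were originally missing.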

Interpolation

df.interpolate(method='linear', limit_direction='forward', axis=0, inplace=True)  # fill missing values in place by linear interpolation along each column
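To make the effect of `limit_direction='forward'` concrete, here is a sketch on an invented single-column frame. It returns a new frame instead of using `inplace=True`, so the original stays available for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a leading, an interior, and a trailing gap
df = pd.DataFrame({"A": [np.nan, 1.0, np.nan, 3.0, np.nan]})

filled = df.interpolate(method="linear", limit_direction="forward", axis=0)
print(filled)
```

With forward filling, the interior gap is interpolated and the trailing gap takes the last valid value, but the leading gap stays NaN because no earlier value exists. Passing `limit_direction='both'` would fill it as well.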

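Simple Visualization

import seaborn as sns
sns.set_theme(style="whitegrid")
# df is expected to have a DatetimeIndex; stations is a list of column names (not defined in this post)
show = df.loc['2017-01-01 14:00:00':'2017-01-02 14:00:00', stations[:3]]  # one day of data for the first three stations
sns.lineplot(data=show, palette="tab10", linewidth=2.5)  # one line per column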

https://lionelsy.github.io/blog/2021/05/16/P15/

Author: Shuyu Zhang


#Python
