Manipulating Data in Python 3 with Pandas¶
<Sources> https://www.kaggle.com/rtatman/manipulating-data-with-the-tidyverse/notebook (Manipulating Data with the Tidyverse, Rachael Tatman)
https://www.kaggle.com/crawford/agricultural-survey-of-african-farm-households (dataset: Agricultural Survey of African Farm Households)
https://www.kaggle.com/learn/pandas (Pandas, Aleksey Bilogur)
https://pandas.pydata.org/ (pandas)
https://assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf (Python For Data Science Cheat Sheet)
Loading Pandas¶
In [1]:
import pandas as pd
Reading a dataset¶
In [2]:
farmdata = pd.read_csv("../datasets/data.csv")
In [3]:
type(farmdata)
Out[3]:
In [4]:
farmdata.head()
Out[4]:
In [5]:
farmdata.shape
Out[5]:
In [6]:
farmdata.dtypes
Out[6]:
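The inspection steps above can be tried without the survey file. The sketch below reads a tiny CSV from an in-memory string instead of ../datasets/data.csv; the three rows and their values are made up for illustration, and only the column names vname, fplots and gender1 are borrowed from later cells.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the real survey file.
csv_text = """vname,fplots,gender1
Tikare,3,1
Tikare,9,2
Kongoussi,1,1
"""

# read_csv accepts any file-like object, not only a path.
toy = pd.read_csv(io.StringIO(csv_text))

print(type(toy))    # the DataFrame class
print(toy.head(2))  # first two rows
print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # one dtype per column
```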
Selecting one or more columns¶
Using attribute access, a column label, and df.drop().
Selecting a column¶
In [7]:
farmdata.gender1.head()
Out[7]:
In [8]:
farmdata["gender1"].head()
Out[8]:
In [9]:
farmdata.drop(['gender1'],axis=1).head()
Out[9]:
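As a quick check of the three cells above, this sketch uses a made-up frame (the values are hypothetical) to show that attribute access and label access return the same column, and that drop() returns a new frame rather than modifying the original.

```python
import pandas as pd

# Hypothetical toy frame; the real farmdata has many more columns.
toy = pd.DataFrame({
    "gender1": [1, 2, 1],
    "gender2": [2, 2, 1],
    "hhsize": [4, 6, 3],
})

by_attribute = toy.gender1   # attribute access
by_label = toy["gender1"]    # label access, works for any column name

# Equivalent to toy.drop(["gender1"], axis=1).
without_gender1 = toy.drop(columns=["gender1"])

print(by_attribute.equals(by_label))
print(list(without_gender1.columns))
print(list(toy.columns))     # the original frame still has gender1
```

Label access is the safer habit: attribute access fails for column names that contain spaces or collide with DataFrame method names.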
Selecting the columns that start with "gender"¶
Using a regular expression.
In [10]:
import re
In [11]:
farmdata[[i for i in list(farmdata) if re.search("^gender", i) is not None]].head()
Out[11]:
Removing the columns that start with "gender"¶
Using a regular expression.
In [12]:
farmdata[[i for i in list(farmdata) if re.search("^gender", i) is None]].head()
Out[12]:
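pandas also offers built-in ways to express the two selections above without a list comprehension. The following is a sketch of one alternative, not the approach used in the cells above; the toy frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with two "gender" columns and two others.
toy = pd.DataFrame({
    "gender1": [1, 2],
    "gender2": [2, 1],
    "age1": [40, 35],
    "hhsize": [4, 6],
})

# Keep the columns whose name matches the pattern.
gender_cols = toy.filter(regex="^gender")

# Remove them instead: build a boolean mask over the column names and negate it.
is_gender = toy.columns.str.contains("^gender", regex=True)
other_cols = toy.loc[:, ~is_gender]

print(list(gender_cols.columns))
print(list(other_cols.columns))
```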
Selecting two columns¶
In [13]:
farmdata[["gender1","gender2"]].head()
Out[13]:
Removing two columns¶
In [14]:
farmdata.drop(["gender1","gender2"],axis=1).head()
Out[14]:
Selecting several columns¶
Using a regular expression.
In [15]:
farmdata[[i for i in list(farmdata) if re.search("[12]$", i) is not None]].head()
Out[15]:
In [16]:
farmdata[[i for i in list(farmdata) if re.search("ge", i) is not None]].head()
Out[16]:
In [17]:
farmdata[[i for i in list(farmdata) if re.search("(^age[1-9]$)|(^age1[01]$)", i) is not None]].head()
Out[17]:
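To see what the last pattern accepts, it helps to run it against a plain list of names. The names below are hypothetical stand-ins for column names; the pattern is the one from cell 17.

```python
import re

# Pattern from the cell above: age1 to age9, or age10 and age11.
pattern = r"(^age[1-9]$)|(^age1[01]$)"

# Hypothetical column names used only to probe the pattern.
names = ["age1", "age9", "age10", "age11", "age12", "page1", "gender1"]

matched = [name for name in names if re.search(pattern, name)]
print(matched)  # ['age1', 'age9', 'age10', 'age11']
```

The anchors matter here: without ^ the name page1 would match, and without $ the name age12 would match through its age1 prefix.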
Selecting one or more rows¶
Using boolean indexing with .loc
In [18]:
farmdata.loc[farmdata.vname == "Tikare"]
Out[18]:
In [19]:
farmdata.loc[farmdata.fplots >= 9]
Out[19]:
- < : less than
- <= : less than or equal to
- > : greater than
- >= : greater than or equal to
- == : equal to
- != : not equal to
- | : or
- & : and
In [20]:
farmdata.loc[(farmdata.vname == "Tikare") & (farmdata.gender3 > 1)]
Out[20]:
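The operators listed above combine element-wise into boolean masks. The sketch below uses a made-up four-row frame (values are hypothetical) to show "and", "or" and negation with ~. Each comparison needs its own parentheses because & and | bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Hypothetical rows reusing column names from the cells above.
toy = pd.DataFrame({
    "vname": ["Tikare", "Tikare", "Kongoussi", "Kaya"],
    "fplots": [3, 9, 10, 1],
    "gender3": [1, 2, 2, 1],
})

# Both conditions must hold.
both = toy.loc[(toy.vname == "Tikare") & (toy.gender3 > 1)]

# At least one condition must hold.
either = toy.loc[(toy.vname == "Tikare") | (toy.fplots >= 9)]

# ~ negates a mask: every row whose vname is not "Tikare".
negated = toy.loc[~(toy.vname == "Tikare")]

print(list(both.index))
print(list(either.index))
print(list(negated.index))
```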
Adding new variables ("pd.concat()")¶
Testing with sample data¶
In [21]:
SubsetOfFarmdata = farmdata[
[i for i in list(farmdata) if re.search("^gender", i) is not None]
].head()
SubsetOfFarmdata
Out[21]:
In [22]:
CountOfMen = (SubsetOfFarmdata==1).sum(axis=1)
CountOfMen
Out[22]:
In [23]:
CountOfWomen = (SubsetOfFarmdata==2).sum(axis=1)
CountOfWomen
Out[23]:
In [24]:
SubsetOfFarmdata = pd.concat([SubsetOfFarmdata,CountOfMen,CountOfWomen],axis=1)
SubsetOfFarmdata = SubsetOfFarmdata.rename(columns={0:"CountOfMen", 1: "CountOfWomen"})
SubsetOfFarmdata
Out[24]:
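The concat-then-rename steps above can also be written with DataFrame.assign, which names the new columns directly. This is a sketch of that alternative on a made-up frame; the values are hypothetical, and the coding 1 = man, 2 = woman is taken from the counting cells above.

```python
import pandas as pd

# Hypothetical sample; None stands for a household member who was not recorded.
subset = pd.DataFrame({
    "gender1": [1, 2, 1],
    "gender2": [2, 2, 1],
    "gender3": [1, None, 2],
})

# Compute the counts from the original gender columns only.
gender_cols = subset.filter(regex="^gender")

with_counts = subset.assign(
    CountOfMen=(gender_cols == 1).sum(axis=1),
    CountOfWomen=(gender_cols == 2).sum(axis=1),
)

print(with_counts)
```

Missing values compare as False against both 1 and 2, so they are counted in neither column.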
Making a column, Men Plus Women¶
In [25]:
farmdata = pd.concat([farmdata,
((farmdata[[i for i in list(farmdata) if re.search("^gender", i) is not None]] == 1) |
(farmdata[[i for i in list(farmdata) if re.search("^gender", i) is not None]] == 2)).sum(axis=1)
],axis=1)
farmdata = farmdata.rename(columns={0:"MenPlusWomen"})
farmdata.head()
Out[25]:
Making a column, landowner¶
In [26]:
farmdata = pd.concat([farmdata,
((farmdata.tenure1 == 1) |
(farmdata.tenure1 == 2)).rename(0)
], axis=1)
farmdata = farmdata.rename(columns={0 : "landowner"})
farmdata.head()
Out[26]:
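A boolean column like landowner can also be created by direct assignment, with isin() replacing the chain of == comparisons. The sketch below shows that alternative on made-up tenure1 values; it is not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical tenure codes; None stands for a missing answer.
toy = pd.DataFrame({"tenure1": [1, 2, 3, None]})

# True where tenure1 is 1 or 2, False otherwise (including missing values).
toy["landowner"] = toy["tenure1"].isin([1, 2])

print(toy)
```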
Changing the order of rows¶
In [31]:
farmdata[["MenPlusWomen","hhsize"]].sort_values(
by="MenPlusWomen").head()
Out[31]:
Setting "ascending=False" for descending sort¶
In [34]:
farmdata[["MenPlusWomen","hhsize"]].sort_values(
by="MenPlusWomen",ascending=False).head()
Out[34]:
By default, sort_values() places NaN after every number, even infinity (na_position="last"), whatever the sort direction.
In [68]:
pd.DataFrame([2**10, float('nan'), -2**5, float("inf"), float("-inf"), float('nan')]).sort_values(by=0)
Out[68]:
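The placement of NaN is controlled by the na_position parameter of sort_values. The sketch below sorts a small Series in both directions and then moves the missing value to the front.

```python
import pandas as pd

values = pd.Series([2**10, float("nan"), -2**5, float("inf"), float("-inf")])

ascending = values.sort_values()                      # NaN comes last
descending = values.sort_values(ascending=False)      # NaN still comes last
nan_first = values.sort_values(na_position="first")   # NaN comes first

print(ascending.tolist())
print(descending.tolist())
print(nan_first.tolist())
```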
Summarizing variables ("pd.DataFrame.describe()")¶
In [70]:
farmdata.describe()
Out[70]:
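On a frame with mixed column types, describe() summarizes only the numeric columns unless told otherwise. A small sketch on made-up data (the values are hypothetical):

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
toy = pd.DataFrame({
    "fplots": [1, 2, 3, 10],
    "vname": ["Tikare", "Tikare", "Kaya", "Kongoussi"],
})

summary = toy.describe()                    # numeric columns only
full_summary = toy.describe(include="all")  # adds the non-numeric columns

print(summary)
print(full_summary)
```

The row labelled 50% is the median, which is less sensitive than the mean to outliers such as the value 10 here.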
Grouping sets of observations ("pd.DataFrame.groupby()")¶
In [75]:
farmdata.groupby(["landowner"])["fplots"].median()
Out[75]:
In [76]:
farmdata.groupby(["fplots"]).count()
Out[76]:
In [77]:
farmdata.groupby(["landowner","fplots"]).count()
Out[77]:
In [113]:
counts = farmdata.groupby(["interviewer"]).count()
counts.loc[counts["Unnamed: 0"] > 50, ["Unnamed: 0"]].sort_values(by="Unnamed: 0", ascending=False)
Out[113]:
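The grouping steps above can be reproduced on a small made-up frame. In the sketch below the values are hypothetical, and size() is used as an alternative to count(): size() returns one row count per group, whereas count() returns the number of non-missing values for every column.

```python
import pandas as pd

# Hypothetical interviews reusing column names from the cells above.
toy = pd.DataFrame({
    "interviewer": ["A", "A", "A", "B", "C", "C"],
    "landowner": [True, False, True, True, False, False],
    "fplots": [2, 4, 6, 1, 3, 5],
})

# Median number of plots per landowner group.
median_plots = toy.groupby("landowner")["fplots"].median()

# Number of rows per interviewer, keeping only those with at least 2 rows,
# sorted from most to fewest.
interviews = toy.groupby("interviewer").size()
busy = interviews[interviews >= 2].sort_values(ascending=False)

print(median_plots)
print(busy)
```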