티스토리 뷰

Manipulating Data in Python3 with the Pandas

Manipulating Data in Python3 with the Pandas.

<출처>

https://www.kaggle.com/rtatman/manipulating-data-with-the-tidyverse/notebook (Manipulating Data with the Tidyverse, Rachael Tatman)

https://www.kaggle.com/crawford/agricultural-survey-of-african-farm-households (dataset: Agricultural Survey of African Farm Households)

https://www.kaggle.com/learn/pandas (Pandas, Aleksey Bilogur)

https://pandas.pydata.org/ (pandas)

https://assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf (Python For Data Science Cheat Sheet)


<함께보기>

Manipulating Data in R with the Tidyverse

Loading Pandas

In [1]:
import pandas as pd

Reading a dataset

In [2]:
farmdata = pd.read_csv("../datasets/data.csv")
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:2785: DtypeWarning: Columns (403,405,832,833,841,844,857,915,916,928,929,941,942,954,955,957,958,967,968,970,971,980,981,993,994,996,997,1006,1007,1009,1010,1022,1023,1035,1036,1048,1049,1071,1072,1084,1085,1087,1088,1097,1098,1100,1101,1110,1111,1113,1114,1126,1127,1139,1140,1149,1150,1152,1153,1162,1163,1165,1166,1178,1179,1191,1192,1204,1205,1217,1218,1227,1228,1230,1231,1240,1241,1243,1244,1256,1257,1269,1270,1282,1283,1295,1296,1306,1308,1309,1319,1321,1322,1335,1347,1348,1373,1374,1386,1387,1399,1400,1412,1413,1451,1452,1453,1480,1481,1482,1534,1535,1536,1537,1538,1539,1540,1541,1542,1543,1544,1545,1546,1547,1548,1549,1654,1669,1704,1732,1734,1735,1736,1737,1738,1739,1740,1741,1742,1743,1744,1746,1747,1748,1749,1750,1751,1752,1753) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [3]:
farmdata.__class__
Out[3]:
pandas.core.frame.DataFrame
In [4]:
farmdata.head()
Out[4]:
Unnamed: 0 hhcode gender1 gender2 gender3 gender4 gender5 gender6 gender7 gender8 ... longtermrainfallshifts2 longtermrainfallshifts3 adjtempshifts1_1 adjtempshifts2_1 adjtempshifts3_1 adjrainfallshifts1_1 adjrainfallshifts2_1 adjrainfallshifts3_1 interviewer vname
0 0 10010601 1.0 2.0 1.0 1.0 2.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
1 1 10010602 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
2 2 10010603 2.0 1.0 2.0 2.0 2.0 1.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
3 3 10010604 2.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
4 4 10010605 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .

5 rows × 1754 columns

In [5]:
farmdata.shape
Out[5]:
(9597, 1754)
In [6]:
farmdata.dtypes
Out[6]:
Unnamed: 0                   int64
hhcode                       int64
gender1                    float64
gender2                    float64
gender3                    float64
gender4                    float64
gender5                    float64
gender6                    float64
gender7                    float64
gender8                    float64
age1                       float64
age2                       float64
age3                       float64
age4                       float64
age5                       float64
age6                       float64
age7                       float64
age8                       float64
educ1                      float64
educ2                      float64
educ3                      float64
educ4                      float64
educ5                      float64
educ6                      float64
educ7                      float64
educ8                      float64
married1                   float64
married2                   float64
married3                   float64
married4                   float64
                            ...   
ad7616                     float64
ad7617                     float64
ad7618                     float64
ad7619                     float64
ad7620                     float64
ad7621                     float64
ad7622                     float64
ad7623                     float64
ad7624                      object
ad7625                     float64
clmadaptstrategy1           object
clmadaptstrategy2           object
clmadaptstrategy3           object
constraint1                 object
constraint2                 object
constraint3                 object
longtermtempshifts1         object
longtermtempshifts2         object
longtermtempshifts3         object
longtermrainfallshifts1     object
longtermrainfallshifts2     object
longtermrainfallshifts3    float64
adjtempshifts1_1            object
adjtempshifts2_1            object
adjtempshifts3_1            object
adjrainfallshifts1_1        object
adjrainfallshifts2_1        object
adjrainfallshifts3_1        object
interviewer                 object
vname                       object
Length: 1754, dtype: object

Selecting one or more columns

Using label, index and df.drop().

Selecting a column

In [7]:
farmdata.gender1.head()
Out[7]:
0    1.0
1    1.0
2    2.0
3    2.0
4    1.0
Name: gender1, dtype: float64
In [8]:
farmdata["gender1"].head()
Out[8]:
0    1.0
1    1.0
2    2.0
3    2.0
4    1.0
Name: gender1, dtype: float64
In [9]:
farmdata.drop(['gender1'],axis=1).head()
Out[9]:
Unnamed: 0 hhcode gender2 gender3 gender4 gender5 gender6 gender7 gender8 age1 ... longtermrainfallshifts2 longtermrainfallshifts3 adjtempshifts1_1 adjtempshifts2_1 adjtempshifts3_1 adjrainfallshifts1_1 adjrainfallshifts2_1 adjrainfallshifts3_1 interviewer vname
0 0 10010601 2.0 1.0 1.0 2.0 1.0 1.0 1.0 57.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
1 1 10010602 2.0 1.0 1.0 1.0 2.0 1.0 1.0 61.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
2 2 10010603 1.0 2.0 2.0 2.0 1.0 2.0 2.0 47.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
3 3 10010604 2.0 2.0 1.0 1.0 1.0 1.0 1.0 51.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
4 4 10010605 2.0 1.0 1.0 1.0 2.0 1.0 1.0 56.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .

5 rows × 1753 columns

Selecting the columns that start with "gender"

Using regular expression.

In [10]:
import re
In [11]:
farmdata[[i for i in list(farmdata) if re.search("^gender",i) != None]].head()
Out[11]:
gender1 gender2 gender3 gender4 gender5 gender6 gender7 gender8 gender9 gender10 ... gender29 gender30 gender31 gender32 gender33 gender34 gender35 gender36 gender37 gender38
0 1.0 2.0 1.0 1.0 2.0 1.0 1.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2.0 1.0 2.0 2.0 2.0 1.0 2.0 2.0 2.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 38 columns

Removing the columns that start with "gender"

Using regular expression.

In [12]:
farmdata[[i for i in list(farmdata) if re.search("^gender",i) == None]].head()
Out[12]:
Unnamed: 0 hhcode age1 age2 age3 age4 age5 age6 age7 age8 ... longtermrainfallshifts2 longtermrainfallshifts3 adjtempshifts1_1 adjtempshifts2_1 adjtempshifts3_1 adjrainfallshifts1_1 adjrainfallshifts2_1 adjrainfallshifts3_1 interviewer vname
0 0 10010601 57.0 50.0 28.0 15.0 44.0 23.0 20.0 17.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
1 1 10010602 61.0 49.0 38.0 35.0 30.0 40.0 33.0 24.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
2 2 10010603 47.0 29.0 13.0 27.0 4.0 0.0 21.0 3.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
3 3 10010604 51.0 35.0 27.0 21.0 20.0 18.0 16.0 14.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
4 4 10010605 56.0 49.0 27.0 25.0 24.0 17.0 22.0 16.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .

5 rows × 1716 columns

Selecting two columns

In [13]:
farmdata[["gender1","gender2"]].head()
Out[13]:
gender1 gender2
0 1.0 2.0
1 1.0 2.0
2 2.0 1.0
3 2.0 2.0
4 1.0 2.0

Removing two columns

In [14]:
farmdata.drop(["gender1","gender2"],axis=1).head()
Out[14]:
Unnamed: 0 hhcode gender3 gender4 gender5 gender6 gender7 gender8 age1 age2 ... longtermrainfallshifts2 longtermrainfallshifts3 adjtempshifts1_1 adjtempshifts2_1 adjtempshifts3_1 adjrainfallshifts1_1 adjrainfallshifts2_1 adjrainfallshifts3_1 interviewer vname
0 0 10010601 1.0 1.0 2.0 1.0 1.0 1.0 57.0 50.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
1 1 10010602 1.0 1.0 1.0 2.0 1.0 1.0 61.0 49.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
2 2 10010603 2.0 2.0 2.0 1.0 2.0 2.0 47.0 29.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
3 3 10010604 2.0 1.0 1.0 1.0 1.0 1.0 51.0 35.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .
4 4 10010605 1.0 1.0 1.0 2.0 1.0 1.0 56.0 49.0 ... NaN NaN NaN NaN NaN NaN NaN NaN . .

5 rows × 1752 columns

Selecting several columns

Using regular expression.

In [15]:
farmdata[[i for i in list(farmdata) if re.search("[12]$",i) != None]].head()
Out[15]:
gender1 gender2 age1 age2 educ1 educ2 married1 married2 farmwork1 farmwork2 ... longtermtempshifts1 longtermtempshifts2 longtermrainfallshifts1 longtermrainfallshifts2 adjtempshifts1_1 adjtempshifts2_1 adjtempshifts3_1 adjrainfallshifts1_1 adjrainfallshifts2_1 adjrainfallshifts3_1
0 1.0 2.0 57.0 50.0 7.0 0.0 1.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1.0 2.0 61.0 49.0 6.0 5.0 1.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2.0 1.0 47.0 29.0 3.0 9.0 3.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2.0 2.0 51.0 35.0 1.0 6.0 3.0 3.0 1.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1.0 2.0 56.0 49.0 0.0 1.0 1.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 398 columns

In [16]:
farmdata[[i for i in list(farmdata) if re.search("ge",i) != None]].head()
Out[16]:
gender1 gender2 gender3 gender4 gender5 gender6 gender7 gender8 age1 age2 ... subs2get subs3get subs4get wagehhc wagehhf wagehhm wagehrc wagehrf wagehrm farmingexperience
0 1.0 2.0 1.0 1.0 2.0 1.0 1.0 1.0 57.0 50.0 ... 2.0 2.0 2.0 NaN NaN NaN NaN NaN NaN NaN
1 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 61.0 49.0 ... 2.0 2.0 2.0 NaN NaN NaN NaN NaN NaN NaN
2 2.0 1.0 2.0 2.0 2.0 1.0 2.0 2.0 47.0 29.0 ... 2.0 2.0 2.0 NaN NaN NaN NaN NaN 1000.0 NaN
3 2.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 51.0 35.0 ... 2.0 2.0 2.0 NaN NaN NaN 500.0 500.0 750.0 NaN
4 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 56.0 49.0 ... 2.0 2.0 2.0 300.0 500.0 500.0 NaN NaN NaN NaN

5 rows × 87 columns

In [17]:
farmdata[[i for i in list(farmdata) if re.search("(^age[1-9]$)|(^age1[01]$)",i) != None]].head()
Out[17]:
age1 age2 age3 age4 age5 age6 age7 age8 age9 age10 age11
0 57.0 50.0 28.0 15.0 44.0 23.0 20.0 17.0 15.0 13.0 15.0
1 61.0 49.0 38.0 35.0 30.0 40.0 33.0 24.0 56.0 24.0 21.0
2 47.0 29.0 13.0 27.0 4.0 0.0 21.0 3.0 17.0 15.0 12.0
3 51.0 35.0 27.0 21.0 20.0 18.0 16.0 14.0 13.0 9.0 3.0
4 56.0 49.0 27.0 25.0 24.0 17.0 22.0 16.0 11.0 11.0 38.0

Selecting one or more rows

Using index

In [18]:
farmdata.loc[farmdata.vname == "Tikare"]
Out[18]:
Unnamed: 0 hhcode gender1 gender2 gender3 gender4 gender5 gender6 gender7 gender8 ... longtermrainfallshifts2 longtermrainfallshifts3 adjtempshifts1_1 adjtempshifts2_1 adjtempshifts3_1 adjrainfallshifts1_1 adjrainfallshifts2_1 adjrainfallshifts3_1 interviewer vname
5 5 10010606 1.0 2.0 1.0 1.0 1.0 1.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 186 Tikare
6 6 10010607 1.0 2.0 1.0 2.0 1.0 1.0 1.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 186 Tikare
7 7 10010608 1.0 1.0 2.0 2.0 1.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 186 Tikare
8 8 10010609 1.0 2.0 2.0 1.0 2.0 2.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 186 Tikare
9 9 10010610 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 186 Tikare

5 rows × 1754 columns

In [19]:
farmdata.loc[farmdata.fplots >= 9]
Out[19]:
Unnamed: 0 hhcode gender1 gender2 gender3 gender4 gender5 gender6 gender7 gender8 ... longtermrainfallshifts2 longtermrainfallshifts3 adjtempshifts1_1 adjtempshifts2_1 adjtempshifts3_1 adjrainfallshifts1_1 adjrainfallshifts2_1 adjrainfallshifts3_1 interviewer vname
4758 4758 50030430 1.0 2.0 1.0 2.0 1.0 1.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 4 NaN
6839 6839 70062420 1.0 1.0 1.0 1.0 1.0 2.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 29 NaN
6947 6947 70090418 1.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 31 NaN

3 rows × 1754 columns

  • <  : less than
  • <= : less than or equal to
  • >  : greater than
  • >= : greater than or equal to
  • == : equal to
  • != : not equal to
  • |  : or
  • &  : and
In [20]:
farmdata.loc[(farmdata.vname == "Tikare") & (farmdata.gender3 > 1)]
Out[20]:
Unnamed: 0 hhcode gender1 gender2 gender3 gender4 gender5 gender6 gender7 gender8 ... longtermrainfallshifts2 longtermrainfallshifts3 adjtempshifts1_1 adjtempshifts2_1 adjtempshifts3_1 adjrainfallshifts1_1 adjrainfallshifts2_1 adjrainfallshifts3_1 interviewer vname
7 7 10010608 1.0 1.0 2.0 2.0 1.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 186 Tikare
8 8 10010609 1.0 2.0 2.0 1.0 2.0 2.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 186 Tikare

2 rows × 1754 columns

Adding new variables ("pd.concat()")

Testing with a sample data

In [21]:
SubsetOfFarmdata = farmdata[
    [i for i in list(farmdata) if re.search("^gender",i) != None]
    ].head()
SubsetOfFarmdata
Out[21]:
gender1 gender2 gender3 gender4 gender5 gender6 gender7 gender8 gender9 gender10 ... gender29 gender30 gender31 gender32 gender33 gender34 gender35 gender36 gender37 gender38
0 1.0 2.0 1.0 1.0 2.0 1.0 1.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2.0 1.0 2.0 2.0 2.0 1.0 2.0 2.0 2.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 38 columns

In [22]:
CountOfMen = (SubsetOfFarmdata==1).sum(axis=1)
CountOfMen
Out[22]:
0    11
1    12
2     5
3     6
4     9
dtype: int64
In [23]:
CountOfWomen = (SubsetOfFarmdata==2).sum(axis=1)
CountOfWomen
Out[23]:
0    11
1     8
2     7
3     5
4    15
dtype: int64
In [24]:
SubsetOfFarmdata = pd.concat([SubsetOfFarmdata,CountOfMen,CountOfWomen],axis=1)
SubsetOfFarmdata = SubsetOfFarmdata.rename(columns={0:"CountOfMen", 1: "CountOfWomen"})
SubsetOfFarmdata
Out[24]:
gender1 gender2 gender3 gender4 gender5 gender6 gender7 gender8 gender9 gender10 ... gender31 gender32 gender33 gender34 gender35 gender36 gender37 gender38 CountOfMen CountOfWomen
0 1.0 2.0 1.0 1.0 2.0 1.0 1.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 11 11
1 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 12 8
2 2.0 1.0 2.0 2.0 2.0 1.0 2.0 2.0 2.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 5 7
3 2.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 6 5
4 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 9 15

5 rows × 40 columns

Making a column, Men Plus Women

In [25]:
farmdata = pd.concat([farmdata,
          ((farmdata[[i for i in list(farmdata) if re.search("^gender",i) != None]] == 1) |
           (farmdata[[i for i in list(farmdata) if re.search("^gender",i) != None]] ==2)).sum(axis=1)
          ],axis=1)
farmdata = farmdata.rename(columns={0:"MenPlusWomen"})
farmdata.head()
Out[25]:
Unnamed: 0 hhcode gender1 gender2 gender3 gender4 gender5 gender6 gender7 gender8 ... longtermrainfallshifts3 adjtempshifts1_1 adjtempshifts2_1 adjtempshifts3_1 adjrainfallshifts1_1 adjrainfallshifts2_1 adjrainfallshifts3_1 interviewer vname MenPlusWomen
0 0 10010601 1.0 2.0 1.0 1.0 2.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN . . 22
1 1 10010602 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN . . 20
2 2 10010603 2.0 1.0 2.0 2.0 2.0 1.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN . . 12
3 3 10010604 2.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN . . 11
4 4 10010605 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN . . 24

5 rows × 1755 columns

Making a column, landowner

In [26]:
farmdata = pd.concat([farmdata,
                     ((farmdata.tenure1 == 1) |
                      (farmdata.tenure1 == 2)).rename(0)
                     ], axis=1)
farmdata = farmdata.rename(columns={0 : "landowner"})
farmdata.head()
Out[26]:
Unnamed: 0 hhcode gender1 gender2 gender3 gender4 gender5 gender6 gender7 gender8 ... adjtempshifts1_1 adjtempshifts2_1 adjtempshifts3_1 adjrainfallshifts1_1 adjrainfallshifts2_1 adjrainfallshifts3_1 interviewer vname MenPlusWomen landowner
0 0 10010601 1.0 2.0 1.0 1.0 2.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN . . 22 True
1 1 10010602 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN . . 20 True
2 2 10010603 2.0 1.0 2.0 2.0 2.0 1.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN . . 12 True
3 3 10010604 2.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN . . 11 True
4 4 10010605 1.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 ... NaN NaN NaN NaN NaN NaN . . 24 True

5 rows × 1756 columns

Changing the order of rows

In [31]:
farmdata[["MenPlusWomen","hhsize"]].sort_values(
    by="MenPlusWomen").head()
Out[31]:
MenPlusWomen hhsize
2257 0 8.0
2421 0 5.0
2422 0 4.0
2423 0 8.0
2424 0 8.0

Setting "ascending = 0" for descending sort

In [34]:
farmdata[["MenPlusWomen","hhsize"]].sort_values(
    by="MenPlusWomen",ascending = 0).head()
Out[34]:
MenPlusWomen hhsize
438 38 39.0
729 38 41.0
402 38 38.0
433 38 41.0
367 37 37.0

NA is a value which is larger than any number in the function "arrange" (even infinity!).

In [68]:
pd.DataFrame([2**10, float('nan'), -2**5, float("inf"), float("-inf"), float('nan')]).sort_values(by=0)
Out[68]:
0
4 -inf
2 -32.000000
0 1024.000000
3 inf
1 NaN
5 NaN

Summarizing variables ("pd.DF.describe()")

In [70]:
farmdata.describe()
Out[70]:
Unnamed: 0 hhcode gender1 gender2 gender3 gender4 gender5 gender6 gender7 gender8 ... ad7617 ad7618 ad7619 ad7620 ad7621 ad7622 ad7623 ad7625 longtermrainfallshifts3 MenPlusWomen
count 9597.000000 9.597000e+03 8799.00000 8578.000000 8301.000000 7824.000000 6926.000000 5697.000000 4436.000000 3382.000000 ... 6411.000000 6572.000000 6405.000000 6514.000000 6476.000000 6403.000000 6401.000000 6587.000000 0.0 9597.000000
mean 4798.000000 5.085130e+07 1.07660 1.788995 1.449584 1.450665 1.485273 1.502896 1.538774 1.530751 ... 0.015286 0.094796 -0.010304 0.089346 0.089098 0.070904 0.027339 0.255048 NaN 6.569240
std 2770.559601 2.862039e+07 0.26597 0.408046 0.497482 0.497592 0.574047 0.509771 0.498551 0.499127 ... 1.248806 1.261920 1.239152 1.265469 1.268909 1.269755 1.254381 1.301236 NaN 4.290431
min 0.000000 1.001060e+07 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... -99.000000 -99.000000 -99.000000 -99.000000 -99.000000 -99.000000 -99.000000 -99.000000 NaN 0.000000
25% 2399.000000 2.806043e+07 1.00000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN 4.000000
50% 4798.000000 5.004011e+07 1.00000 2.000000 1.000000 1.000000 1.000000 2.000000 2.000000 2.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN 6.000000
75% 7197.000000 8.002082e+07 1.00000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 NaN 8.000000
max 9596.000000 1.000901e+08 2.00000 2.000000 2.000000 2.000000 25.000000 9.000000 2.000000 2.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 NaN 38.000000

8 rows × 1582 columns

Grouping sets of observations ("pd.DF.groupby()")

In [75]:
farmdata.groupby(["landowner"])["fplots"].median()
Out[75]:
landowner
False    1.0
True     2.0
Name: fplots, dtype: float64
In [76]:
farmdata.groupby(["fplots"]).count()
Out[76]:
Unnamed: 0 hhcode gender1 gender2 gender3 gender4 gender5 gender6 gender7 gender8 ... adjtempshifts1_1 adjtempshifts2_1 adjtempshifts3_1 adjrainfallshifts1_1 adjrainfallshifts2_1 adjrainfallshifts3_1 interviewer vname MenPlusWomen landowner
fplots
0.0 27 27 27 27 25 25 22 16 10 7 ... 0 0 0 0 0 0 2 2 27 27
1.0 4213 4213 3716 3572 3416 3180 2713 2097 1538 1105 ... 415 15 1 408 27 3 3238 1415 4213 4213
2.0 2576 2576 2356 2309 2244 2101 1861 1540 1186 896 ... 192 8 1 193 17 0 1893 1067 2576 2576
3.0 1545 1545 1481 1466 1434 1367 1246 1059 849 657 ... 59 20 9 58 18 14 1178 782 1545 1545
3.3 1 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 1
4.0 1145 1145 1144 1135 1115 1087 1025 934 811 682 ... 0 0 0 0 0 0 886 739 1145 1145
5.0 38 38 36 34 32 32 29 26 23 18 ... 0 0 0 0 0 0 36 14 38 38
6.0 11 11 11 11 11 10 10 8 8 7 ... 0 0 0 0 0 0 11 5 11 11
8.0 9 9 9 7 7 7 7 5 3 3 ... 0 0 0 0 0 0 5 3 9 9
10.0 2 2 2 2 2 2 2 2 1 1 ... 0 0 0 0 0 0 2 0 2 2
12.0 1 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 1

11 rows × 1755 columns

In [77]:
farmdata.groupby(["landowner","fplots"]).count()
Out[77]:
Unnamed: 0 hhcode gender1 gender2 gender3 gender4 gender5 gender6 gender7 gender8 ... longtermrainfallshifts3 adjtempshifts1_1 adjtempshifts2_1 adjtempshifts3_1 adjrainfallshifts1_1 adjrainfallshifts2_1 adjrainfallshifts3_1 interviewer vname MenPlusWomen
landowner fplots
False 0.0 27 27 27 27 25 25 22 16 10 7 ... 0 0 0 0 0 0 0 2 2 27
1.0 1889 1889 1497 1439 1371 1269 1078 832 569 385 ... 0 330 11 1 329 20 1 1369 513 1889
2.0 853 853 689 670 648 588 520 422 318 237 ... 0 144 3 0 145 13 0 572 349 853
3.0 364 364 349 343 337 326 300 253 197 136 ... 0 15 2 0 15 2 1 297 231 364
4.0 279 279 278 276 269 260 243 215 185 142 ... 0 0 0 0 0 0 0 193 173 279
5.0 9 9 9 9 9 9 9 8 7 5 ... 0 0 0 0 0 0 0 8 5 9
6.0 3 3 3 3 3 3 3 2 2 1 ... 0 0 0 0 0 0 0 3 2 3
8.0 3 3 3 2 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 2 1 3
10.0 1 1 1 1 1 1 1 1 0 0 ... 0 0 0 0 0 0 0 1 0 1
True 1.0 2324 2324 2219 2133 2045 1911 1635 1265 969 720 ... 0 85 4 0 79 7 2 1869 902 2324
2.0 1723 1723 1667 1639 1596 1513 1341 1118 868 659 ... 0 48 5 1 48 4 0 1321 718 1723
3.0 1181 1181 1132 1123 1097 1041 946 806 652 521 ... 0 44 18 9 43 16 13 881 551 1181
3.3 1 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 1
4.0 866 866 866 859 846 827 782 719 626 540 ... 0 0 0 0 0 0 0 693 566 866
5.0 29 29 27 25 23 23 20 18 16 13 ... 0 0 0 0 0 0 0 28 9 29
6.0 8 8 8 8 8 7 7 6 6 6 ... 0 0 0 0 0 0 0 8 3 8
8.0 6 6 6 5 6 6 6 4 2 2 ... 0 0 0 0 0 0 0 3 2 6
10.0 1 1 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 1 0 1
12.0 1 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 1

19 rows × 1754 columns

In [113]:
farmdata.groupby(["interviewer"]).count().loc[(farmdata.groupby(["interviewer"]).count() > 50)["Unnamed: 0"],["Unnamed: 0"]].sort_values(by="Unnamed: 0",ascending=0)
Out[113]:
Unnamed: 0
interviewer
. 1124
4 851
70 123
162 64
228 63
65 60
223 60
42 60
62 60
84 60
237 55
329 55
327 53


공지사항
최근에 올라온 글
최근에 달린 댓글
Total
Today
Yesterday
링크
TAG
more
«   2025/08   »
1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31
글 보관함