Manipulating Data in R with the Tidyverse¶
<Source> https://www.kaggle.com/rtatman/manipulating-data-with-the-tidyverse/notebook (Manipulating Data with the Tidyverse, Rachael Tatman)
https://www.kaggle.com/crawford/agricultural-survey-of-african-farm-households (dataset: Agricultural Survey of African Farm Households)
https://www.tidyverse.org/ (Tidyverse)
https://github.com/tidyverse/magrittr (magrittr)
https://github.com/tidyverse/dplyr (dplyr)
<See also>
Manipulating Data in Python3 with the Pandas
Loading Tidyverse¶
In [1]:
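# install.packages("tidyverse")  # run once if the package is not installed yet
library(tidyverse)               # library() stops with an error if the package is missing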
Reading a data set¶
In [2]:
farmdata <- read_csv("../datasets/data.csv")
In [3]:
class(farmdata)
In [4]:
problems(farmdata)
In [5]:
head(farmdata)
In [6]:
dim(farmdata)
Piping ("%>%")¶
In [7]:
columnsAlphaOrder_withPiping <- farmdata %>%
names() %>%
sort()
In [8]:
columnsAlphaOrder_withoutPiping <- sort(names(farmdata))
In [9]:
identical(columnsAlphaOrder_withPiping,columnsAlphaOrder_withoutPiping)
In [10]:
columnsAlphaOrder <- columnsAlphaOrder_withPiping
head(columnsAlphaOrder)
Selecting one or more columns ("select()")¶
Selecting a column¶
In [11]:
farmdata %>%
select(gender1) %>%
head()
Removing a column¶
In [12]:
farmdata %>%
select(-gender1) %>%
head()
Selecting the columns that start with "gender"¶
In [13]:
farmdata %>%
select(starts_with("gender")) %>%
head()
Removing the columns that start with "gender"¶
In [14]:
farmdata %>%
select(-starts_with("gender")) %>%
head()
Selecting two columns¶
In [15]:
farmdata %>%
select(gender1,age1) %>%
head()
Removing two columns¶
In [16]:
farmdata %>%
select(-gender1,-gender2) %>%
head()
Selecting several columns¶
In [17]:
farmdata %>%
select(ends_with("1"),ends_with("2")) %>%
head()
In [18]:
farmdata %>%
select(contains("ge")) %>%
head()
In [19]:
farmdata %>%
select(age1:age11) %>%
head()
In [20]:
farmdata %>%
select(num_range("age",1:11)) %>%
head()
Selecting one or more rows ("filter()")¶
In [21]:
farmdata %>%
filter(vname == "Tikare")
In [22]:
farmdata %>%
filter(fplots >= 9)
- < : less than
- <= : less than or equal to
- > : greater than
- >= : greater than or equal to
- == : equal to
- != : not equal to
- | : or
- & : and
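The operators above can be combined inside one filter() call. A minimal sketch on a made-up tibble (the `toy` data borrows column names from the survey, but the values are invented), which also shows that filter() drops rows where the condition evaluates to NA:

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration.
toy <- tibble(
  vname  = c("Tikare", "Tikare", "Kaya", "Kaya"),
  fplots = c(3, 10, 9, NA)
)

# `|` keeps a row when either condition is TRUE (TRUE | NA is still TRUE).
either <- toy %>% filter(vname == "Kaya" | fplots >= 10)

# `&` keeps a row only when both conditions are TRUE;
# TRUE & NA is NA, so the row with a missing fplots is dropped.
both <- toy %>% filter(vname == "Kaya" & fplots >= 9)

print(either)
print(both)
```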
In [23]:
farmdata %>%
filter(vname == "Tikare" & gender3 > 1)
Adding new variables ("mutate()")¶
Testing with sample data¶
In [24]:
SubsetOfFarmdata <- farmdata %>%
  select(starts_with("gender")) %>%
  head()
SubsetOfFarmdata
In [25]:
CountOfMen <- SubsetOfFarmdata %>%
  {. == 1} %>%                   # logical matrix: TRUE where the code is 1
  apply(1, sum, na.rm = TRUE)    # row-wise count of TRUEs, ignoring NAs
CountOfMen
In [26]:
CountOfMen <- SubsetOfFarmdata %>%
  magrittr::equals(1) %>%        # same comparison as {. == 1}
  rowSums(na.rm = TRUE)
CountOfMen
In [27]:
CountOfWomen <- SubsetOfFarmdata %>%
  {. == 2} %>%
  apply(1, sum, na.rm = TRUE)
CountOfWomen
In [28]:
SubsetOfFarmdata %>%
mutate(CountOfMen, CountOfWomen)
Making a column, MenPlusWomen¶
In [29]:
farmdata <- farmdata %>%
  mutate(
    # count the gender columns coded 1 or 2 in each row, ignoring NAs
    MenPlusWomen = farmdata %>%
      select(starts_with("gender")) %>%
      {. == 1 | . == 2} %>%
      apply(1, sum, na.rm = TRUE)
  )
farmdata %>% head()
Making a column, landowner¶
In [30]:
farmdata <- farmdata %>%
mutate(landowner = (tenure1 == 1 | tenure1 == 2))
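One detail of the comparison above: when tenure1 is NA, `==` returns NA, so landowner is NA as well. A minimal sketch, assuming made-up tenure codes (the `toy` tibble and the `owner_or` / `owner_in` names are hypothetical), comparing it with `%in%`, which returns FALSE for NA. Which behaviour is preferable depends on how missing answers should be treated.

```r
library(dplyr)

# Hypothetical tenure codes; NA stands for a missing answer.
toy <- tibble(tenure1 = c(1, 2, 3, NA))

toy <- toy %>%
  mutate(
    owner_or = (tenure1 == 1 | tenure1 == 2),  # NA stays NA
    owner_in = tenure1 %in% c(1, 2)            # NA becomes FALSE
  )

print(toy)
```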
Changing the order of rows ("arrange()")¶
In [31]:
farmdata %>%
arrange(MenPlusWomen) %>%
head()
In [32]:
farmdata %>%
select(MenPlusWomen, hhsize) %>%
arrange(MenPlusWomen) %>%
head()
Using desc() for descending sort¶
In [33]:
farmdata %>%
select(MenPlusWomen, hhsize) %>%
arrange(desc(MenPlusWomen)) %>%
head()
In "arrange", NA always sorts to the end, after every number (even infinity!), whether the order is ascending or descending.
In [34]:
arrange(data.frame(val = c(2^10, NA, -2^5, Inf, -Inf, NA)), val)
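A quick check of where NA lands under both sort directions, reusing the values from the cell above. This is a minimal sketch assuming a current dplyr release; the names `vals`, `ascending` and `descending` are only for illustration.

```r
library(dplyr)

vals <- data.frame(val = c(2^10, NA, -2^5, Inf, -Inf, NA))

ascending  <- arrange(vals, val)
descending <- arrange(vals, desc(val))

# The numbers flip between the two orderings, but the NA rows stay last in both.
print(ascending$val)
print(descending$val)
```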
Converting a variable to a single value ("summarize()" or "summarise()")¶
Calculating mean of variables¶
In [35]:
farmdata %>%
  summarize(meanMenPlusWomen = mean(MenPlusWomen),
            meanhhsize = mean(hhsize, na.rm = TRUE))
In [36]:
# summarise() is the British-spelling alias of summarize()
farmdata %>%
  summarise(meanMenPlusWomen = mean(MenPlusWomen),
            meanhhsize = mean(hhsize, na.rm = TRUE))
Counting NAs of variables¶
In [37]:
farmdata %>%
  summarize(
    # MenPlusWomen was built with na.rm, so it is never NA; a 0 means no gender column held a 1 or 2
    missingMenPlusWomen = sum(MenPlusWomen == 0),
    missinghhsize = sum(is.na(hhsize))
  )
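To count NAs in many columns at once, across() can apply the same is.na() sum to each column. A minimal sketch, assuming dplyr 1.0.0 or later (where across() is available) and a made-up tibble whose values are invented:

```r
library(dplyr)

# Hypothetical toy data; column names mirror the survey, values are invented.
toy <- tibble(
  hhsize = c(4, NA, 6, NA),
  fplots = c(1, 2, NA, 3)
)

# One output column per input column, each holding that column's NA count.
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

print(na_counts)
```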
Grouping sets of observations ("group_by()")¶
In [38]:
farmdata %>%
  group_by(landowner) %>%
  summarize(plots = median(fplots, na.rm = TRUE))
In [39]:
table(farmdata$fplots)
farmdata %>%
group_by(fplots) %>%
tally()
farmdata %>%
group_by(fplots) %>%
count()
In [40]:
table(farmdata$landowner, farmdata$fplots)
farmdata %>%
group_by(landowner, fplots) %>%
tally() %>%
na.omit()
In [41]:
farmdata %>%
group_by(interviewer) %>%
tally() %>%
filter(n > 50) %>%
arrange(desc(n))
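The group_by() + tally() + arrange(desc(n)) pattern can also be written with count(), whose sort argument orders the result by descending n. A minimal sketch on made-up interviewer labels (the `toy` data and the threshold of 1 are hypothetical):

```r
library(dplyr)

# Hypothetical interviewer labels, invented for illustration.
toy <- tibble(interviewer = c("a", "b", "a", "c", "a", "b"))

long_form <- toy %>%
  group_by(interviewer) %>%
  tally() %>%
  filter(n > 1) %>%
  arrange(desc(n))

# count() groups, tallies and (with sort = TRUE) sorts in one step.
short_form <- toy %>%
  count(interviewer, sort = TRUE) %>%
  filter(n > 1)

print(long_form)
print(short_form)
```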