MAT381E-Fall21
diff --git a/.DS_Store b/.DS_Store
14 KB
diff --git a/.gitignore b/.gitignore
+4
diff --git a/01-importing.Rmd b/01-importing.Rmd
+235
@@ -0,0 +1,4 @@
+.Rproj.user
+.Rhistory
+.RData
+.Ruserdata
@@ -0,0 +1,235 @@
+
+---
+title: 'Week 2: Examples for Data Importing'
+---
+
+# Load packages
+```{r}
+library(readr)
+library(readxl)
+```
+
+# Let's focus on the readr package first
+# Check the current directory
+
+```{r}
+getwd()
+```
+
+# Visit Environmental Performance Index at https://epi.yale.edu/
+# Go to "Downloads > EPI2020Raw data"
+# https://epi.yale.edu/downloads/epi2020rawdata20200604v02.zip
+# Download this file to your Desktop and unzip it.
+# Focus on the CTH_raw_na file and move it to the data folder under your working directory
+
+```{r}
+# move the file to working directory
+#file.copy("/Users/gulinan/Desktop/epi2020rawdata20200604v02/CTH_raw_na.csv", "data/CTH_raw_na.csv")
+# getwd() returns your working directory.
+# To list all the files in the data folder under it:
+list.files(file.path(getwd(), "data"))
+```
+
+# CTH_raw_na is a csv file 
+# Have a look at the first three lines of the CTH data
+```{r}
+# read_lines is a function from readr package
+# ?read_lines
+# read_lines() reads up to n_max lines from a file.
+read_lines("data/CTH_raw_na.csv", n_max=3)
+```
+
+# The first line is the header
+# read_* functions all assume col_names = TRUE by default.
+# read_* functions skip empty rows by default through skip_empty_rows = TRUE
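+
+# A minimal sketch of these two defaults, assuming a small hypothetical
+# csv file written to a temporary path (it is not part of the course data)
+# and a readr version that supports show_col_types:
+```{r}
+library(readr)
+
+# Hypothetical toy file: a header, an empty line, and two data rows
+tmp <- tempfile(fileext = ".csv")
+writeLines(c("country,year,value", "", "TUR,2014,1.5", "FRA,2014,NA"), tmp)
+
+# Defaults: the first line becomes the column names, the empty line is skipped
+with_header <- read_csv(tmp, show_col_types = FALSE)
+
+# col_names = FALSE treats the first line as data and names the columns X1, X2, X3
+no_header <- read_csv(tmp, col_names = FALSE, show_col_types = FALSE)
+
+dim(with_header)
+colnames(no_header)
+```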
+
+
+```{r}
+# more info
+?read_csv
+```
+
+```{r}
+# Read the data into R
+cth_data <- read_csv("data/CTH_raw_na.csv")
+```
+# When everything is OK, you will see your data in the Environment pane
+```{r}
+# Investigate the data carefully
+View(cth_data)
+```
+
+```{r}
+# Check the types of variables: character, numeric, integer
+str(cth_data)
+```
+
+# Note: NA stands for missing values in the data 
+
+```{r}
+# Check the class of the data
+class(cth_data)
+```
+
+```{r}
+# Check the dimensions
+dim(cth_data)
+```
+
+```{r}
+# Check the column names
+colnames(cth_data)
+```
+
+```{r}
+# Not comfortable with the column names?
+# Rename every column except the first three as the years 1950, ..., 2014
+colnames(cth_data)[-(1:3)] <- as.character(1950:2014)
+```
+
+```{r}
+# Check the column names one more time!
+colnames(cth_data)
+```
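+
+# The renaming above hard-codes the years 1950 to 2014 and fails if the
+# number of year columns differs. A hedged alternative is to strip a common
+# prefix from the existing names. The sketch below uses a hypothetical toy
+# data frame; the "CTH.raw." prefix is an assumption, so check
+# colnames(cth_data) before adapting it.
+```{r}
+# Hypothetical toy data with three id columns and two year columns
+toy <- data.frame(
+  code = 1, iso = "AAA", country = "Toyland",
+  CTH.raw.1950 = 0.1, CTH.raw.1951 = 0.2
+)
+
+# Keep the first three names; drop the assumed prefix from the rest
+year_cols <- -(1:3)
+colnames(toy)[year_cols] <- sub("^CTH\\.raw\\.", "", colnames(toy)[year_cols])
+
+colnames(toy)
+```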
+
+
+# readr describes its file argument as
+# "Either a path to a file, a connection, or literal data",
+# and paths starting with http:// or https:// are downloaded automatically.
+# Let's import a csv file from the internet into R now
+
+# Visit Google's COVID-19 mobility report
+# at https://www.google.com/covid19/mobility/
+# Global data is available to be downloaded as a csv file
+# at the following url:
+# https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv?cachebust=c050b74b9ee831a7
+
+```{r}
+# First specify the url address of the data
+url <- "https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv?cachebust=c050b74b9ee831a7"
+# Then read it into R (this takes a while)
+google_mobility <- read_csv(url)
+```
+
+```{r}
+# Investigate it as you wish
+View(google_mobility)
+```
+
+```{r}
+# or download this file to your local computer
+# download.file() is a function from the utils package
+download.file(url, "data/google_mobility.csv")
+```
+
+
+# Let's focus on the readxl package
+# readxl is installed with the tidyverse, but it is not a core tidyverse
+# package, so library(tidyverse) does not load it.
+# For that reason, load it explicitly (install it first if needed)
+```{r}
+# install.packages("readxl")
+library(readxl)
+```
+
+# Visit the TransMonEE web site (Monitoring the situation of
+# children and women in Europe and Central Asia)
+# at http://transmonee.org/
+# Go to http://transmonee.org/database/ and
+# download the Excel file named
+# "Population at the beginning of the year by sex and selected age groups"
+# into the data folder under your working directory and rename it as
+# Population-1989-2015.xlsx since the original file name is very long.
+
+```{r}
+# List the sheet names
+excel_sheets("data/Population-1989-2015.xlsx")
+```
+# read_excel() reads both xls and xlsx files and detects the format from the extension.
+```{r}
+pop <- read_excel("data/Population-1989-2015.xlsx")
+View(pop)
+```
+# The Excel file is very crowded. It consists of many tables and notes in text format.
+# Let's focus on Total population on January 1 which is located between 5th and
+# 39th rows. Note that the 5th row is a heading.
+```{r}
+pop_red <- read_excel("data/Population-1989-2015.xlsx", range = cell_rows(5:39))
+colnames(pop_red)[1:2] <- c("Country", "Number")
+View(pop_red)
+```
+# Now it looks better, although the new data set still contains
+# a few empty rows. We can get rid of these rows
+# once we are familiar with the dplyr package.
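+
+# Until then, a minimal base R sketch of dropping rows that are empty in
+# every column. The toy data frame and its values are made up for
+# illustration; with the real data, pop_red would take the place of toy.
+```{r}
+# Hypothetical toy data: rows 2 and 4 are completely empty
+toy <- data.frame(
+  Country = c("A", NA, "B", NA),
+  Number  = c(10, NA, 20, NA)
+)
+
+# A row is empty when it has no non-missing cell
+all_missing <- rowSums(!is.na(toy)) == 0
+toy_clean <- toy[!all_missing, ]
+
+toy_clean
+```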
+
+
+# However, what if you have a very big data set?
+# In that case, it is better to get more help from
+# https://readxl.tidyverse.org/articles/sheet-geometry.html
+# https://readxl.tidyverse.org/articles/cell-and-column-types.html
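+
+# The range helpers covered in the sheet-geometry article can be tried
+# without downloading anything. This sketch assumes the datasets.xlsx
+# example workbook that ships with readxl; it is not the TransMonEE file.
+```{r}
+library(readxl)
+
+# Path to an example workbook bundled with readxl
+path <- readxl_example("datasets.xlsx")
+
+# Rows 1 to 4 only: row 1 is used as the header, so 3 data rows remain
+first_rows <- read_excel(path, range = cell_rows(1:4))
+
+# Columns B to D only, all rows
+some_cols <- read_excel(path, range = cell_cols("B:D"))
+
+# An explicit rectangle given as a spreadsheet-style range
+block <- read_excel(path, range = "C1:E4")
+
+dim(first_rows)
+dim(some_cols)
+dim(block)
+```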
+
+
+# Sometimes, the Excel file may involve multiple sheets
+# Go to http://transmonee.org/database/download/ and
+# download the Excel file of the TransMonEE full database for 2019 and
+# save the file into the data folder under your working directory. Do this only once.
+```{r}
+url <- "http://transmonee.org/wp-content/uploads/2016/05/TM-2019-EN-June.xlsx"
+# mode = "wb" keeps the binary xlsx file intact on Windows
+download.file(url, "data/TM-2019-EN-June.xlsx", mode = "wb")
+```
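+
+# One way to make sure the download happens only once is to check for the
+# file first. download_once() below is a hypothetical helper written for
+# this note, not a function from utils or any other package.
+```{r}
+# Download url to destfile unless destfile already exists.
+# Returns TRUE (invisibly) if a download happened, FALSE otherwise.
+download_once <- function(url, destfile) {
+  if (file.exists(destfile)) {
+    return(invisible(FALSE))
+  }
+  # Create the target folder if it is missing
+  dir.create(dirname(destfile), showWarnings = FALSE, recursive = TRUE)
+  # mode = "wb" keeps binary files such as xlsx intact on Windows
+  download.file(url, destfile, mode = "wb")
+  invisible(TRUE)
+}
+
+# Usage (commented out so that knitting does not hit the network):
+# download_once(url, "data/TM-2019-EN-June.xlsx")
+```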
+# Check the number of sheets
+```{r}
+excel_sheets("data/TM-2019-EN-June.xlsx")
+```
+# It consists of 6 sheets
+```{r}
+full_data <- read_excel("data/TM-2019-EN-June.xlsx")
+View(full_data)
+```
+
+# By default, read_excel() reads the first sheet.
+# Use the sheet argument to read another one.
+```{r}
+juvenile_data <- read_excel("data/TM-2019-EN-June.xlsx", sheet = "5. Juvenile Justice & Crime")
+View(juvenile_data)
+```
+
+# More work has to be done to retrieve this data!
+# Any ideas and suggestions may contribute to the R Community!
+
+################## Study at home ###########################################################################
+# Iterating over multiple files or multiple
+# sheets can be done via the purrr package.
+# https://readxl.tidyverse.org/articles/articles/readxl-workflows.html
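+
+# A minimal sketch of reading every sheet of a workbook into a named list
+# with purrr. It assumes the datasets.xlsx example workbook bundled with
+# readxl; for the TransMonEE file, the sheets would likely need extra
+# cleaning that is not shown here.
+```{r}
+library(readxl)
+library(purrr)
+
+# Example workbook that ships with readxl
+path <- readxl_example("datasets.xlsx")
+
+# Sheet names, reused as the names of the resulting list
+sheets <- excel_sheets(path)
+
+# Read each sheet; set_names() makes map() return a list named by sheet
+all_sheets <- map(set_names(sheets), function(s) read_excel(path, sheet = s))
+
+names(all_sheets)
+```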
+
+# If you have any solutions, share them on Twitter with the #mat381e, #rstats and #unicef hashtags!
+
+# https://rviews.rstudio.com/2019/10/09/building-interactive-world-maps-in-shiny/
+
+
+# Sometimes, your data may be in a Google Sheets spreadsheet,
+# such as the gapminder data at
+# https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit#gid=780868077
+# More info can be found at
+# https://moderndive.com/2-viz.html
+# Then you can use the googlesheets4 package to read
+# this data into R
+
+```{r}
+#install.packages("googlesheets4")
+library(googlesheets4)
+url <- "https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit#gid=780868077"
+gapmind <- read_sheet(url)
+```
+
+
+# Before using this package,
+# it is worth reading
+# https://www.tidyverse.org/google_privacy_policy/
+
+# Or you can just download the Google Spreadsheet as a csv or xls file and then
+# read it via read_csv or read_excel function.
+
+# You can write out your files as well.
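+
+# A minimal sketch of writing a data frame to a csv file with write_csv()
+# from readr and reading it back. The toy data frame and the temporary
+# path are assumptions for illustration; in class you would write to
+# something like "data/my_file.csv".
+```{r}
+library(readr)
+
+# Hypothetical toy data with one missing value
+scores <- data.frame(country = c("TUR", "FRA"), value = c(1.5, NA))
+
+# Write to a temporary file; missing values are written as NA by default
+out <- tempfile(fileext = ".csv")
+write_csv(scores, out)
+
+# Peek at the raw lines, then read the file back
+read_lines(out)
+back <- read_csv(out, show_col_types = FALSE)
+```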
+
+
+
+