Using R at school to graph temperature over time in a city

This page is part of a series under tag r-at-school. The series is about improving how we teach Science in schools. Find out more: Using R at school

In this note, the focus is mainly, but not exclusively, on 10th graders. This recipe describes a workshop that can be carried out in class or given as a homework assignment. Since only very basic use of R is required, the level is designed for young students, but the activity can be extended to high-schoolers too.

Scenario

The class is asked to take a city in the world and evaluate the trend of ground temperature over the year. These requirements should be kept in mind:

1. The source of data can be anything, as long as it is reliable. This helps students understand that reliable sources matter and that not every website can be trusted for good data. So, sources like Wikipedia [1] are good; data from an unknown blogger, not so much.
2. The granularity of data is up to the students. Measurements at either day level or month level are welcome. The more granular, the better though.
3. Data must span one year. Typically it is easy to find such data calculated as averages over several years, and this is fine too. Students should, however, take note of how the data was calculated.
4. Any city is fine. The student should take note of its geographical location and altitude.
5. Two students should not choose the exact same location, unless they manage to find 2 different reliable data sources. In that case they can compare the trends, evaluate the quality of their sources and start a discussion about the differences they find.

Motivation

The main idea is to promote the analysis of data and raise awareness about climate and how it changes across the geography of our world. This assignment can help students understand the geodiversity of Earth and draw important conclusions. Students can also fact-check the information they find in their books, and directly experience the meaning of that data.

At the end of this experience, students will gain more knowledge about how to import data from the web using R, and how to process such information and plot it.

Execution

R is good with data and everything that revolves around it (statistics, data manipulation, plotting and so forth). That is why, for this assignment, we need to write very little code. In this example, we choose the city of Tokyo, Japan; the data we are going to use comes from the Tokyo Wikipedia page [2].

1. So, in our R session, let’s start by installing the packages we need:
install.packages("XML")
install.packages("RCurl")

Package XML is required to read HTML pages (Internet pages), and package RCurl to download content from the Internet.

1. We then want to load these packages into our workspace, plus one more:
library("XML")
library("RCurl")
library("stringr")

Package stringr will be used later for text manipulation. It is not part of base R, so if loading it fails, install it first with install.packages("stringr").

1. Next, let’s store the URL of the page we want to extract data from in a variable:
url <- "https://en.wikipedia.org/wiki/Tokyo"
1. Let’s download the Wikipedia page into a variable by using function getURL in the RCurl package:
page <- getURL(url)
1. Now that we have downloaded the page, let’s read its content. One powerful tool is function readHTMLTable in the XML package. This function is able to read tables from a web page.
tables <- readHTMLTable(page)
1. Function readHTMLTable will load all the tables it can find in the page, so let’s see how many were found:
length(tables)
## [1] 36
1. Every student will get a different number. The challenge here is to figure out which table is the one we want among the 36 we got. To do this, let’s just check each one of them: look at its content and try to see if it contains the data we saw in the page:
tables[[1]]
tables[[2]]
tables[[3]]
# Continue until seeing the table we want...

As soon as we hit, in our case, table number 6, we find that it is the table we want:

summary(tables[[6]][1:3])
##                                      V1             V2             V3
##  Average high °C (°F)                 :1   -9.2(15.4):1   -7.9(17.8):1
##  Average low °C (°F)                  :1   0.9(33.6) :1   1.7(35.1) :1
##  Average precipitation days (= 0.5 mm):1   184.5     :1   10.4(50.7):1
##  Average precipitation mm (inches)    :1   2.8       :1   165.8     :1
##  Average relative humidity (%)        :1   22.6(72.7):1   24.9(76.8):1
##  Average snowfall cm (inches)         :1   (Other)   :7   (Other)   :7
##  (Other)                              :8   NA's      :2   NA's      :2

Function summary [3] can be used to generate a more compact representation of the table. Here we used the range [1:3] to print only the first 3 columns, because the table is quite wide.
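Checking tables one at a time gets tedious on pages with many tables. As a rough sketch, the search can be automated by looking for a label we expect to see in the table we want. The helper name find_table and the toy list below are made up for illustration; with real data one would pass the tables list returned by readHTMLTable and a label seen on the page, such as "Daily mean".

```r
# Hypothetical helper: return the indices of the tables whose first column
# contains the given label (plain text match, not a regular expression).
find_table <- function(tables, label) {
  hits <- sapply(tables, function(t) {
    # Skip NULL entries and empty tables
    if (is.null(t) || !is.data.frame(t) || ncol(t) == 0) return(FALSE)
    any(grepl(label, as.character(t[[1]]), fixed = TRUE))
  })
  which(unname(hits))
}

# Toy stand-in for the list of tables read from a page
toy_tables <- list(
  data.frame(V1 = c("Country", "Region"), V2 = c("Japan", "Kanto")),
  NULL,
  data.frame(V1 = c("Average high", "Daily mean"), V2 = c("9.8", "5.2"))
)

find_table(toy_tables, "Daily mean")
```

The result is only a hint: it is still worth printing the candidate table to confirm it is the one shown on the page.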

1. Let’s save the table in its own variable:
table <- tables[[6]]

Plotting the table

At this point we have located the table we are interested in. We now need to extract its data and plot it.

1. The table shows different series of values (rows), but we will pick only one. The teacher is expected to choose which type of data the class should work on, for example: Average High, Record High, Daily Mean and so on. In this example, we ask students to plot the Daily Mean for each month, which means we are interested in the 5th row:
avg_mean <- table[5,]
avg_mean # Print the row
##                   V1        V2        V3        V4         V5         V6
## 5 Daily mean °C (°F) 5.2(41.4) 5.7(42.3) 8.7(47.7) 13.9(57.0) 18.2(64.8)
##           V7         V8         V9        V10        V11        V12
## 5 21.4(70.5) 25.0(77.0) 26.4(79.5) 22.8(73.0) 17.5(63.5) 12.1(53.8)
##         V13        V14
## 5 7.6(45.7) 15.4(59.7)

The previous command extracts the 5th row of table with all its columns. Square brackets take the form [<row-index>, <col-index>]; leaving the column index empty selects all columns.
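To make the bracket notation concrete, here is a small sketch on a toy data frame. The data frame df and its values are placeholders invented for this example, not part of the Wikipedia table.

```r
# Toy data frame standing in for the table (values are placeholders)
df <- data.frame(V1 = c("a", "b", "c"), V2 = c(1, 2, 3), V3 = c(10, 20, 30))

df[2, ]     # 2nd row, all columns
df[, 3]     # all rows, 3rd column (returned as a plain vector)
df[2, 3]    # a single cell: 2nd row, 3rd column
df[2, 2:3]  # 2nd row, columns 2 to 3
```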

We now need to use the values, however we notice 3 problems:

• The first column should be ignored because it contains the name of the series.
• The last column should be ignored as it contains the overall year statistic.
• Every remaining column does not show one number, but it shows the temperature in this format: <degrees-c> (<degrees-f>).

The teacher should point out to students that we want to use one measurement unit only, so a choice will have to be made between Celsius and Fahrenheit. In this example we choose Celsius.
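The two units carry the same information, which students can verify themselves. The sketch below converts the Celsius values printed in the row above and compares them with the Fahrenheit values from the same row. The helper name c_to_f is an assumption made for this example.

```r
# Celsius and Fahrenheit monthly values as printed in the row above
celsius    <- c(5.2, 5.7, 8.7, 13.9, 18.2, 21.4, 25.0, 26.4, 22.8, 17.5, 12.1, 7.6)
fahrenheit <- c(41.4, 42.3, 47.7, 57.0, 64.8, 70.5, 77.0, 79.5, 73.0, 63.5, 53.8, 45.7)

# Hypothetical helper: convert Celsius to Fahrenheit
c_to_f <- function(c) c * 9 / 5 + 32

# Round to one decimal, like the table does, then compare
converted <- round(c_to_f(celsius), 1)
all.equal(converted, fahrenheit)
```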

1. We need to process the elements of avg_mean so that only the first part is kept and whatever is inside the parentheses is ignored. To do this, we use a slightly advanced technique called Regex (regular expressions):
values <- lapply(avg_mean, function (v) { as.numeric(str_extract(v, "[-\\d\\.]*")) })

Function lapply applies a custom function to each element v of the list. The function we define uses the str_extract [4] function in package stringr to extract the part we want from the text. The result is still text: R does not yet know that it represents a number, so we convert it with function as.numeric.
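It can help to try the regular expression on a few sample cells before applying it to the whole row. The sketch below uses three cells copied from the row printed earlier; the variable names cells, parts and nums are chosen for this example only.

```r
library("stringr")

# Sample cells copied from the row printed earlier
cells <- c("Daily mean °C (°F)", "5.2(41.4)", "13.9(57.0)")

# Take the run of digits, dots and minus signs at the start of each cell
parts <- str_extract(cells, "[-\\d\\.]*")

# Convert the extracted text to numbers; the label has no leading number,
# so it cannot be converted and ends up as NA
nums <- as.numeric(parts)
nums
```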

If the teacher wants to use Fahrenheit, then the previous code gets a little more complicated:

values <- lapply(avg_mean, function (v) {
  s <- str_extract(v, "\\([-\\d\\.]*\\)") # Extract the parenthesized expression
  s <- sub("(", "", s, fixed = TRUE)      # Remove the opening bracket '('
  s <- sub(")", "", s, fixed = TRUE)      # Remove the closing bracket ')'
  as.numeric(s)                           # Convert the string to a number
})

The snippet above first extracts, for every element of avg_mean, the quantity in parentheses. The problem is that the extracted string still contains the brackets, so we need to remove them, hence the lengthy code.
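As a possible shortcut, not the approach used in this recipe, stringr also offers str_match, which returns the text captured by a group in the pattern. With a capture group placed inside the escaped brackets, the brackets never end up in the result. The sample cells below are copied from the row printed earlier.

```r
library("stringr")

cells <- c("5.2(41.4)", "13.9(57.0)")

# The inner parentheses form a capture group; str_match returns a matrix whose
# first column is the whole match and whose second column is the first group
m <- str_match(cells, "\\(([-\\d\\.]*)\\)")
fahrenheit <- as.numeric(m[, 2])
fahrenheit
```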

If you inspect the content of values, you will see that the first element is NA. That is because the first column contained text that was not a number, so function as.numeric could not convert it.

1. We now want to remove the first and last elements from the list:
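# Remove the first column as it is NA
values[sapply(values, is.na)] <- NULL

# Remove the last column
mask <- rep(FALSE, length(values)) # Start with a mask that selects nothing
mask[length(values)] <- TRUE       # Select only the last element
values[mask] <- NULL               # Apply the mask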

Removing the first column is easy: we assign NULL to every element that is NA, which here is only the first one. The last column requires more lines, as we need to build a mask that selects it. Bitmasks are a topic I cover in this note: R hack - Bitmasks on vectors to change specific elements.
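The same two removals can be tried on a short toy list first. In the sketch below, vals is a made-up stand-in for values, holding an NA, two monthly figures and the yearly figure. The last line shows head with a negative count, a base R alternative for dropping the final element.

```r
# Toy list standing in for values: an NA, two monthly values, the yearly value
vals <- list(NA, 5.2, 5.7, 15.4)

# Drop every NA element using a logical mask built by sapply
vals[sapply(vals, is.na)] <- NULL

# Build a mask that is TRUE only at the last position, then apply it
mask <- seq_along(vals) == length(vals)
vals[mask] <- NULL

vals  # only the two monthly values remain

# A shorter alternative for dropping the last element of a list or vector
head(list(5.2, 5.7, 15.4), -1)
```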

1. At this point we want to simplify the data structure we have. Variable values is a list, which has info about columns and rows. We now want to just have a vector of values without anything else:
vector <- unlist(values, use.names = FALSE)
vector # Print the final vector
##  [1]  5.2  5.7  8.7 13.9 18.2 21.4 25.0 26.4 22.8 17.5 12.1  7.6

Function unlist will simplify the list and give us a vector of numbers.

1. We are now ready to plot vector:
plot(vector, type = "l", main = "Avg mean monthly temperature in Chiyoda ward, Tokyo (1981-2010)", xlab = "Months", ylab = "Temperature (Celsius)")

Function plot will take vector and generate a line plot out of it. The values on the x-axis are the indices of the vector and, therefore, they correspond to the months of the year.
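If the class prefers month names to indices on the x-axis, one possible variation is to suppress the default axis and draw a custom one. This is an optional sketch: it relies on the built-in constant month.abb (English month abbreviations) and re-enters the values printed above in a variable called temps, a name chosen for this example.

```r
# Monthly daily-mean temperatures for Tokyo, as extracted above (Celsius)
temps <- c(5.2, 5.7, 8.7, 13.9, 18.2, 21.4, 25.0, 26.4, 22.8, 17.5, 12.1, 7.6)

# Draw the line without the default x-axis (xaxt = "n") ...
plot(temps, type = "l", xaxt = "n",
     main = "Avg mean monthly temperature in Chiyoda ward, Tokyo (1981-2010)",
     xlab = "Months", ylab = "Temperature (Celsius)")

# ... then add an axis with one tick per month, labelled Jan to Dec
axis(side = 1, at = seq_along(temps), labels = month.abb)
```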

Analysis

Once students finish and deliver the assignments, they should be encouraged to report the following information:

1. The source of data they used.
2. Granularity of data: how many samples over time? (per month, daily, etc.).
3. Some information about the geographical location of the city they chose, with emphasis on altitude.

The class should enter a discussion about the findings each student was able to retrieve. The following points can be good suggestions about how to drive such discussion:

• Students should be asked to take note of the general trend shown by their plots by asking the following questions:
  • Does the graph show a constant, increasing or decreasing pattern? Does it exhibit a bell shape?
  • How does the plot relate to the seasonal climate of the region? Is it possible to find a correspondence? In this case it is important to point out to students that some areas have a 2-season climate, while others have a 4-season climate.
  • All other conditions being equal, do big cities exhibit higher temperatures than smaller ones?
• Students who chose the same city can compare their data and evaluate the quality of their data sources.
• The teacher can help students create a world map and place pins with printed temperature charts for each city they chose. The class can then have a nice overview of the different zones of the world they covered.
  • This can be done on a physical paper map.
  • Every student’s plot can be printed on paper and pinned on the map. By doing so, it is possible to have a global vision of the trends they found.
• Students can be asked to compare their findings and draw conclusions. They can be asked questions like:
  • How does the temperature trend change between cities close to the sea and cities far from it?
  • How does the temperature trend change between cities at different altitudes?
  • How does the temperature trend change between cities located at different distances from the equator/poles?
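To support this discussion with numbers, students can compute a few simple figures from their vector. The sketch below uses the Tokyo values extracted earlier; the variable names are chosen for this example only.

```r
# Monthly daily-mean temperatures for Tokyo, as extracted earlier (Celsius)
temps <- c(5.2, 5.7, 8.7, 13.9, 18.2, 21.4, 25.0, 26.4, 22.8, 17.5, 12.1, 7.6)
names(temps) <- month.abb                  # label each value with its month

warmest <- names(temps)[which.max(temps)]  # month with the highest value
coldest <- names(temps)[which.min(temps)]  # month with the lowest value
annual_range <- max(temps) - min(temps)    # spread between the two extremes
yearly_mean <- mean(temps)                 # compare with the table's last column

warmest
coldest
annual_range
round(yearly_mean, 1)
```

For Tokyo, the rounded mean of the monthly values agrees with the yearly figure in the last column of the table (15.4), which is a quick sanity check on the extracted data. Comparing the annual range between coastal and inland cities, or across altitudes, gives the class a concrete number to discuss.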

Closing remarks

This recipe is marked as intermediate in difficulty because it involves a lot of data structure manipulation, and that is where things can sometimes go wrong. It all depends on the tables students choose. To make sure that all students succeed in the steps provided above, the teacher can ask them to use only articles from Wikipedia, as these steps were tested against those.

The teacher is encouraged to use this recipe in classes where students are learning about climate and geography. If the teacher can link the workshop/assignment described here to those lessons, it will be extremely beneficial for students. The goal is always to use R as an experience integrated with the learning program students are following.

This recipe does not focus on the worrying issue of Climate Change but on climate in general. The teacher should be aware of this in order not to mislead pupils during the workshop. The data collected here is not suited to studying Climate Change, which requires comparing the yearly trends of cities across different years; no such analysis is performed in this recipe. Other recipes addressing Climate Change as their main theme will be published.

1. Students should be warned that Wikipedia articles are not necessarily written by experts, so the data they provide might not be 100% accurate. Nonetheless, teachers should also let students know that Wikipedia usually reports the source of each article, data table or figure. Checking those references is a good exercise: it shows students how sources should be verified, and those sources are usually reliable agencies.

2. Please note that the data reported here was taken from this Wikipedia page on the day this note was published.

3. Note that summary lists the values of each column in alphabetical order rather than in row order. So, to find the index of the row to extract in later steps, do not look at summary(tables[[6]]) but at tables[[6]] itself.

4. The second parameter we pass to the function does the magic. String "[-\\d\\.]*" instructs the function to take the run of characters at the start of the text that are digits, minus signs or decimal points, and to ignore everything after it.