# Using R at school to find correlation in search terms

This page is part of a series under the tag r-at-school, about improving how we teach science in schools. Find out more: Using R at school

In this note, I want to focus on a basic type of research that students can perform from early grades. The level is easy: any student from 8th grade up can do this.

## Scenario

The class is asked to evaluate the statistical correlation of a pair of sampled data sets $(X, Y)$. To make the experience more interesting, the class is asked to use data sets coming from trends[^1] of search terms in Google.

The main idea is to let students evaluate the Kendall Rank Correlation Coefficient $\tau$ of the two data sets by using R.

Students can carry out the task individually, or they can team up in small groups during this workshop. The teacher will assign, or let students decide, which pair of search terms to use, with the following conditions:

1. Each group will evaluate a different pair of search terms.
2. A date range for the samples must be chosen by the teacher and given to each group as a requirement, so that all students evaluate trends over the same time range.
3. A geographical context must be chosen. This is up to the teacher and depends on the theme, if any; depending on the theme, it might (or might not) be required that all students focus on the same geographical area.

The teacher is encouraged to find a theme for the pairs of search terms to assign, or choose, during the workshop, in order to build a consistent story. Later, students can be asked to compare their results, relate them to each other, and draw further conclusions in a joint cross-team effort.

### A brief introduction to Kendall’s coefficient

The teacher can briefly explain to students, depending on their level, what information the coefficient provides. For very young classes, it is enough to mention that $X$ and $Y$ are related when $\tau \approx \pm 1$. If this happens, then the two search terms are related, and it becomes worth discussing whether a cause-effect relationship might explain the link (remembering that correlation alone does not prove causation).

It is, finally, important to explain to students which events they are going to try to relate. If data set $X$ corresponds to search term $s_X$ and data set $Y$ to search term $s_Y$, then the events they will be correlating are:

• Event $\epsilon_X$ relating to data set $X$: People searching for $s_X$.
• Event $\epsilon_Y$ relating to data set $Y$: People searching for $s_Y$.

If students find that $\tau \approx 0$, then there is no correlation between people searching for $s_X$ and $s_Y$. If $|\tau| \approx 1$, the two search interests are strongly associated: it is plausible that people searching for $s_X$ also searched for $s_Y$ (or the other way around), although the coefficient alone does not tell us which, if either, caused the other.
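Before touching real data, the difference between these two regimes can be shown on a pair of made-up toy series (the numbers below are invented purely for illustration):

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2, 4, 5, 7, 8, 10, 11, 13)   # grows whenever x grows
z <- c(5, 1, 4, 2, 6, 3, 7, 2)      # no consistent pattern

cor(x, y, method = "kendall")  # 1: every pair of points is ranked the same way
cor(x, z, method = "kendall")  # close to 0: the rankings do not agree
```

Kendall's $\tau$ only looks at rankings, which is why the perfectly concordant pair scores exactly 1 even though $y$ is not a straight-line function of $x$.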

## Motivation

The Kendall coefficient is quite easy to understand (and to calculate by hand) compared to the other correlation coefficients available, hence the choice.

This recipe is intended to make students familiar with the concepts of Big Data and statistics, and with how to draw scientifically valid conclusions from information harvested on the web. It is also possible to relate this recipe to the increasingly worrying topics of fake news, fact-checking and misinformation in general. The goal of this experience is to let students understand how possible cause-effect relations between different phenomena can be investigated in a mathematically sound way. Furthermore, students will gain insight into how to use Google's functionality to their advantage, getting better data out of search terms and search trends, and how to use such information in their future research tasks.

### Choosing the theme

Very good themes for the teacher to choose from are those relating to society and culture. It is a good idea to look among topics defining the current political debate, for example Brexit. Topicality is not a must, though: it is also possible to consider important events from the recent past[^2], like the 2016 American election.

## Execution

In this example, we choose Brexit as the theme and evaluate two pairs of search terms: one which will prove to be highly correlated, and another with no correlation. The first pair of (highly correlated) search terms is:

• Search term $s_X$: Brexit.
• Search term $s_Y$: European Union.

The second pair of (non-correlated) search terms is:

• Search term $s_X$: Brexit.
• Search term $s_Z$: Eurovision.

The referendum on Brexit was held on June 23rd 2016. Since we want to understand what information people were trying to get ahead of the vote, we will consider trends from January 1st 2016 to June 23rd 2016 in the United Kingdom.

1. The first thing to do is to get the data. Navigate to Google Trends and type the first search term: Brexit. We also need to specify the correct dates and United Kingdom as the country. After submitting the query, the page shows the following chart:

*Figure: trend of the search term 'Brexit' in Google Search from Jan 1 2016 to Jun 23 2016 (United Kingdom).*

What we need to do is click the CSV icon (top right corner of the chart, first icon) to download the data in CSV format. This downloads a .csv file which we will later load and read using R.

2. After downloading the file, move it inside your R workspace directory and rename it to: gtrend_brexit.csv.

However, there is a problem with the CSV files that Google Trends emits: they are wrongly formatted! This means that R will not be able to read them properly if we don't fix them. Luckily, the fix is pretty easy.

3. Open the .csv file you just downloaded with a text editor (Notepad on Windows, TextEdit on Mac, Vim or GEdit on Unix); you should see something like this:

```
Category: All categories

Day,brexit: (United Kingdom)
2016-01-01,0
2016-01-02,<1
2016-01-03,<1
2016-01-04,<1
2016-01-05,<1
2016-01-06,<1
...
```

The CSV format requires the first line to contain the names of the columns, separated by commas, and each following line to contain the values of the columns, again separated by commas. As you can see, the first three lines are wrong[^3].

4. Fix the first three lines by removing the first two entirely and, in the third, removing the colon and everything after it:

```
Day,brexit
2016-01-01,0
2016-01-02,<1
2016-01-03,<1
2016-01-04,<1
2016-01-05,<1
2016-01-06,<1
...
```
5. Make sure to save the file. Next, in your R session, let's load and parse[^4] the file we have just modified:

```r
gtrend_brexit <- read.csv("gtrend_brexit.csv")
gtrend_brexit # Print the table to see the content
```
6. At this stage, R has loaded the whole table, but we do not need the first column; we just want the vector of values, ordered by time. So let's overwrite the variable gtrend_brexit:

```r
gtrend_brexit <- gtrend_brexit[[2]]
```

You can type the variable name again to see that only the second column, holding the values, remains. Variable gtrend_brexit is now a plain vector of values.

7. The data sometimes reports <1 as a value. Google Trends rounds the numbers and, when a value is less than 1, it does not report the exact decimal value. We want to turn those non-numbers into numbers, approximating them to 0:

```r
gtrend_brexit <- as.numeric(as.character(gtrend_brexit))
gtrend_brexit[sapply(gtrend_brexit, is.na)] <- 0
gtrend_brexit # Print the final processed values
##   [1]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##  [18]   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   1   0
##  [35]   0   1   0   1   1   1   0   1   0   0   0   1   1   1   1   2   3
##  [52]   4   6   4   3   2   2   2   2   2   2   2   2   2   1   1   2   2
##  [69]   3   2   2   2   1   2   2   1   1   1   1   1   2   2   2   2   1
##  [86]   2   2   1   1   2   2   1   1   1   2   2   1   2   2   1   1   2
## [103]   2   2   2   2   2   2   3   3   3   2   3   3   3   3   2   2   3
## [120]   2   2   2   2   2   3   3   2   2   2   4   3   4   4   5   4   5
## [137]   4   5   5   4   5   4   4   6   6   7   6   5   4   4   4   8  10
## [154]   8   9   7   9  12  13  11  14  14  13  13  15  21  24  20  18  18
## [171]  20  29  33  49 100
```

Function as.character makes sure every value is treated as text, so that as.numeric can convert that text into a number. Function as.numeric returns NA when the conversion is not possible; for this reason, in the second line, we build a mask selecting all the NAs in the vector, and the outermost assignment sets them to 0. This trick is explained in more detail in my note: R hack - Bitmasks on vectors to change specific elements.
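The NA-masking trick can be seen in isolation on a tiny vector (note that is.na is itself vectorized, so the sapply call above is not strictly needed; a direct is.na(...) gives the same mask):

```r
v <- as.numeric(c("0", "<1", "3", "<1"))  # "<1" cannot be parsed -> NA (R emits a warning)
v                                         # 0 NA 3 NA
v[is.na(v)] <- 0                          # the logical mask selects the NAs; assign 0 to them
v                                         # 0 0 3 0
```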

### Repeat the process

We have successfully loaded into our R session the vector for the trends of search term $s_X$; we now need to repeat the same process from step 1 to step 7, but using different files and different variables:

1. Go back to step 1 and execute the same steps again, but this time searching for the term European Union; name the file gtrend_eu.csv and the variable gtrend_europeanunion.

2. Go back to step 1 and execute the same steps again, but this time searching for the term Eurovision; name the file gtrend_eurovision.csv and the variable gtrend_eurovision.
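Since the same pipeline runs three times, the loading and cleaning work (read the CSV, keep the values column, approximate <1 to 0) can be collected into a small helper function. This is only a sketch: it assumes each CSV header has already been fixed by hand as described above, and the name read_gtrend is my own choice:

```r
# Load a fixed Google Trends CSV and return the vector of trend values.
# Assumes the file header has already been repaired (first line "Day,term").
read_gtrend <- function(path) {
  values <- read.csv(path)[[2]]               # keep only the values column
  values <- as.numeric(as.character(values))  # "<1" becomes NA here (with a warning)
  values[is.na(values)] <- 0                  # approximate "<1" to 0
  values
}

# Usage (after downloading and fixing each file):
# gtrend_brexit        <- read_gtrend("gtrend_brexit.csv")
# gtrend_europeanunion <- read_gtrend("gtrend_eu.csv")
# gtrend_eurovision    <- read_gtrend("gtrend_eurovision.csv")
```

For a classroom, the step-by-step manual version above is still valuable; the helper is mostly useful to the teacher when testing candidate search term pairs beforehand.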

### Correlating data

We have now loaded all three series of data for search terms $s_X$, $s_Y$ and $s_Z$. We can move on to the second part: evaluating the correlation.

1. [Optional] Now that you have all the data, you could have students plot the three series and see the trends (it is always nice to have diagrams):

```r
plot(gtrend_brexit, type = "l", main = "Search terms trends before Brexit",
     xlab = "Time before referendum", ylab = "Trend value")
lines(gtrend_europeanunion, lty = "dashed", col = "red")
lines(gtrend_eurovision, lty = "dashed", col = "blue")
legend("topleft", legend = c("Brexit", "European Union", "Eurovision"),
       col = c("black", "red", "blue"), lty = c("solid", "dashed", "dashed"))
```

2. Let's perform the correlation. The function we need lives in the stats package, which ships with R's base installation and is normally attached by default, so nothing needs to be installed; loading it explicitly does no harm:

```r
library("stats")
```
3. The function we want is cor, and its method argument lets us compute different kinds of correlation coefficients:

```r
cor(gtrend_brexit, gtrend_europeanunion, method = "kendall")
## [1] 0.6578461
cor(gtrend_brexit, gtrend_eurovision, method = "kendall")
## [1] 0.2870491
```

As we can see, the correlation in $\left(s_X, s_Y\right)$ is greater than in $\left(s_X, s_Z\right)$. The fact that $\tau_{X,Y} > 0.5$ indicates a good level of correlation, while $\tau_{X,Z}$ is much closer to $0$, indicating that the second pair is only weakly correlated.
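For older classes, one step further is possible: besides the coefficient itself, R's cor.test function reports a p-value estimating how unlikely the observed association would be if the two series were unrelated. A quick sketch on short made-up vectors (the numbers are invented for illustration):

```r
x <- c(1, 2, 4, 3, 6, 5, 8, 9)
y <- c(2, 1, 5, 4, 7, 8, 9, 10)   # mostly concordant with x

test <- cor.test(x, y, method = "kendall")
test$estimate   # Kendall's tau for the pair
test$p.value    # small value -> association unlikely to be a coincidence
```

This gives students a first taste of statistical significance without leaving base R.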

If the teacher decided to perform the optional plotting step, it is worth pointing out to the class that the graphs of the correlated pair follow a similar trend, while the other pair does not show similar patterns.

## Analysis

At the end of the experience, students can engage in a conversation about the relations of cause and effect between the different search terms. This type of analysis can lead to very interesting retrospective material when analyzing events that happened in the recent past.

Teachers are encouraged to promote cross-comparison of results across the different teams in a class when assigning similar search term pairs. By doing so, it is possible to show students how similar search terms can lead to quite different results.

Finally, this workshop is designed to let students appreciate the many connections between data samples that can be extracted from search engines. The key learning points for students are:

• Understanding that every time we search in Google, our search activity is recorded; Google Trends exposes this activity only in aggregated form, not tied to individual identities.
• Search data can be gathered, processed and analyzed to find answers to many questions.
• Research on search term trends can help analyze social phenomena.
• Research on search term trends can be used as a fact checking mechanism.

## Closing remarks

This recipe is adequate for a very young audience, as the type of data is very predictable and the manipulation requires only simple steps.

Teachers are encouraged to choose themes that are topical and deeply related to social and/or cultural phenomena not too far back in time. This helps students relate the research to their lives, and opens up a subsequent debate on the results they are able to gather.

Teachers are also encouraged to prepare, and test beforehand, a pool of search term pairs to use in the context of the chosen theme. This will later help them draw the most important conclusions from the findings students gather.

[^1]: Google provides a way to extract information about how often a search term was googled in a specific range of time; the service is called Google Trends.

[^2]: Not too long ago, though, as Google does not have search information about events that happened much earlier than approximately the year 2000.

[^3]: I genuinely do not know why this issue exists. It is clearly a mistake, and it seems strange for a company like Google to make it.

[^4]: Parsing is an operation which involves reading a certain text and converting it into a data structure. In this case, the CSV file needs to be converted into a table.