Sunday, May 24, 2020

Visualizing Traffic on my Blog using R

I've been a data analyst in the past and one thing I can say for sure is that we don't have to be great analysts or statisticians to be able to read basic graphs and understand trends. Visuals are all around us, whether it's stock market trends or data around the dreaded Covid-19 pandemic. By now, I'm sure all of us have heard about "flattening the curve". It literally took a pandemic for us to know what it means, but the point I'm trying to make is that we are surrounded by data and people should ideally know how to understand it. I recently learnt the basics of R, which is a programming language mostly used in data analytics, statistical analysis and visualization. R is a good language to learn for data analysts and statisticians, and it resonates really well with professionals who know SQL.

In this post, I'll visualize traffic coming to my blog since 2017 (data captured in Google Analytics) and show some commonly used graphs and visualizations using RStudio. The most obvious trend you'll see is that traffic has gradually increased since I started writing again in January 2018 and has really spiked in the last few months. So, let's dive in!

Basics of R

As I explained earlier, R is a programming language used to analyze existing trends and also do predictive analytics using statistics. For the purposes of this article, I'm using RStudio to run basic R commands to create simple visuals such as bar graphs and line graphs, and slightly more complex visuals such as bubble charts, a word cloud and a map, using some commonly used packages.

Data Frame

The first step before doing any analysis in R is data wrangling, which is the manipulation and transformation of data into a format you can use for analysis. In R, we do this by creating a data frame, which is essentially a table that is typically populated by importing a .CSV file. Other formats such as SPSS or Stata are also supported for advanced use cases, through packages such as foreign or haven.

A data frame contains rows and columns and can be compared to an Excel spreadsheet. The other thing to keep in mind is that it's okay to do some data transformation in the source file itself before bringing data into R, but a lot of the manipulation is done directly in R. In my case, I modified the source .CSV files exported from Google Analytics for basic data formatting, such as switching the metrics to Number format.

The command to bring data into R via a .CSV file takes the form df <- read.csv("<file>.csv", stringsAsFactors = FALSE), where df is the data frame, read.csv is the function which reads a .CSV file and stringsAsFactors = FALSE ensures that text columns are kept as character strings instead of being converted into factors, which keeps the source data format intact (this has been the default since R 4.0.0). The original file contained Page name and some common metrics such as Page views, Visits etc.
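To make the import step concrete, here's a minimal sketch. The CSV contents and the column names (Page, Pageviews, Visits) are made up for illustration and written to a temporary file so the example runs on its own; they are not the actual Google Analytics export.

```r
# Write a tiny, made-up export to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Page,Pageviews,Visits",
  "/,1200,900",
  "/funnel-drop-off-rate,2500,2100"
), csv_path)

# stringsAsFactors = FALSE keeps the Page column as plain character strings.
df <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the column types of the resulting data frame.
str(df)
```

Running str(df) is a quick way to confirm that the metrics came in as numbers and the page names as text before doing any plotting.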


R packages are reusable code libraries that provide additional functionality to R and help simplify tasks. You can install packages using the install.packages() function and invoke them using the library() function. In my case, I'm using the following packages:

  • library(ggplot2)
  • library(lubridate)
  • library(ggwordcloud)
  • library(maps)
  • library(dplyr) 
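As a rough sketch of the install-then-load pattern described above, the snippet below checks which of these packages are already installed and attaches only those. The install.packages() call is left commented out because it needs internet access and a CRAN mirror; the variable names are just for illustration.

```r
pkgs <- c("ggplot2", "lubridate", "ggwordcloud", "maps", "dplyr")

# TRUE for every package that is already installed on this machine.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

# One-time install of anything missing (uncomment to run):
# install.packages(missing_pkgs)

# Attach the packages that are available.
for (p in pkgs[is_installed]) {
  library(p, character.only = TRUE)
}
```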

Finally, before we take a look at the visuals, one other thing to note is that I did some basic data manipulation in R to convert the Month and Year values into proper dates using the ymd() and mdy() functions from the lubridate package. There are a lot of other things that I could cover under basics but that's outside the scope of this post.
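Here's a small sketch of what that date conversion can look like. The input strings and the "monthly" data frame are made up; the assumption is that the month arrives as text such as "2018-01" and needs to become a real Date so it can go on a time axis.

```r
library(lubridate)

# ymd() reads year-month-day text, mdy() reads month-day-year text.
d1 <- ymd("2018-01-15")
d2 <- mdy("01/15/2018")

# Made-up monthly data: turn a "YYYY-MM" label into the first day of that month.
monthly <- data.frame(
  Month = c("2018-01", "2018-02", "2018-03"),
  Visits = c(120, 150, 180),
  stringsAsFactors = FALSE
)
monthly$Date <- ymd(paste0(monthly$Month, "-01"))
```

Both d1 and d2 end up as the same Date value, which is the point of these helpers: different text layouts, one consistent date type.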

Visualizing Blog Traffic Trends

In this section, I've inserted the graphs created in RStudio which were saved as images. I've used the "ggplot2" package, which is a very popular R library for visualizing data. I'll admit that my blog does not get a ton of traffic, but that is not my intent as I'm not in any competition to artificially inflate my traffic. My intent is to share what I know with others and document things for myself for future reference. Let's take a look at some of the trends.

Show Visits and Page Views for the Last 3 Years (Line)

In this line graph, I've visualized Visits (called Sessions in Google Analytics) and Page views for the last 3 years. 

  • If you look closely at the label I manually added, traffic started gradually increasing once I started writing again in early 2018.
  • The biggest spike happened thanks to my last post about Real-Time CDP, which I wrote last month. This shows how much my readers want to consume content about the latest and greatest technology from Adobe, especially if it's around Adobe Experience Platform.

I mentioned earlier that it's very common for analysts to modify the source data before bringing it into R, which is what I did to generalize the page names by removing the month and year using the MID() function in Excel.

As promised, here's the code sample to generate this visual. Please note that I created a separate data frame called 'dfl' which contained Month, Visits and Page views. Also, note that the file format is .Rmd (R Markdown), which is a format that combines text, R code and its output in a single document.
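A minimal sketch of this kind of line graph, not the exact script behind the image: the data frame name dfl and its three columns come from the description above, while the numbers, the Pageviews column spelling and the annotation text are made up.

```r
library(ggplot2)

# Made-up monthly totals standing in for the real dfl data frame.
dfl <- data.frame(
  Month = seq(as.Date("2017-10-01"), by = "month", length.out = 6),
  Visits = c(80, 75, 90, 140, 180, 230),
  Pageviews = c(110, 100, 130, 210, 260, 340)
)

p_line <- ggplot(dfl, aes(x = Month)) +
  # One line per metric; the colour labels become the legend entries.
  geom_line(aes(y = Visits, colour = "Visits")) +
  geom_line(aes(y = Pageviews, colour = "Page views")) +
  # Manually placed text label, similar in spirit to the one in the graph.
  annotate("text", x = as.Date("2018-01-01"), y = 300,
           label = "Started writing again") +
  scale_x_date(date_labels = "%b %Y") +
  labs(x = "Month", y = "Count", colour = "Metric")

print(p_line)
```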

Top 10 Pages Visited from Jan-2017 to Apr-2020 (Flipped Bar Graph)

In this bar graph, I'm showing the top 10 pages by Visits over the last 3 years.

  • The most popular page is the one I wrote to show the calculation of funnel drop off rate way back when I started blogging. This shows that there's still a sizable audience looking for the calculation of basic metrics such as drop off rate, as the traffic source of this page is primarily search engines.
  • The other popular page is the homepage, which may mean that people get sent directly to my homepage via search. Again, this is inclusive of the last 3 years' worth of data so more analysis is needed to understand this fully, which is outside the scope of this post.

Below is the code I wrote.
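A minimal sketch of a flipped bar graph in ggplot2, under the assumption that the data frame has one row per page with a Visits total. The page names and numbers here are invented, and the top-10 selection is done in base R.

```r
library(ggplot2)

# Made-up page totals; the real data came from Google Analytics.
pages <- data.frame(
  Page = c("/", "/drop-off-rate", "/post-a", "/post-b", "/post-c", "/post-d",
           "/post-e", "/post-f", "/post-g", "/post-h", "/post-i", "/post-j"),
  Visits = c(900, 2100, 400, 350, 300, 280, 250, 220, 200, 150, 90, 60),
  stringsAsFactors = FALSE
)

# Keep the 10 pages with the most Visits.
top_pages <- head(pages[order(-pages$Visits), ], 10)

p_bar <- ggplot(top_pages, aes(x = reorder(Page, Visits), y = Visits)) +
  geom_col() +
  # coord_flip() turns the vertical bars into horizontal ones.
  coord_flip() +
  labs(x = NULL, y = "Visits")

print(p_bar)
```

The reorder() call sorts the bars by Visits so the most visited page ends up at the top once the axes are flipped.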

Top 10 Pages by Visits and Bounce Rate from Jan-2017 to Apr-2020 (Bubble Chart)

In this chart, I show the top 10 pages (last 3 years), visualizing Visits and Bounce Rate. The color of the bubble represents the page name and the size of the bubble is tied to Visits.

  • The most popular page (drop off rate) has the highest Bounce Rate and Visits, which shows that readers searching for drop off rate who come to my blog are MOSTLY interested in this type of content and nothing else.
  • The homepage ("/") has the lowest Bounce Rate of 66%, which makes sense because users who land there typically search for or click into a post they came to read, as opposed to just staying on the homepage.

Below is the code snippet.
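A minimal sketch of a bubble chart along these lines. Only the 66% homepage figure is taken from the text above; the other pages, Visits and Bounce Rate values are made up, and the column names are assumptions.

```r
library(ggplot2)

# Illustrative data: one row per page.
bubble <- data.frame(
  Page = c("/", "/drop-off-rate", "/post-a", "/post-b", "/post-c"),
  Visits = c(900, 2100, 400, 350, 300),
  BounceRate = c(66, 92, 85, 80, 78),
  stringsAsFactors = FALSE
)

p_bubble <- ggplot(bubble, aes(x = Visits, y = BounceRate,
                               size = Visits, colour = Page)) +
  # Semi-transparent points so overlapping bubbles stay readable.
  geom_point(alpha = 0.7) +
  # Widen the size range so differences in Visits are visible.
  scale_size(range = c(3, 15)) +
  labs(x = "Visits", y = "Bounce Rate (%)")

print(p_bubble)
```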

Top Traffic Sources from Jan-2017 to Apr-2020 (Stacked Area Chart)

In this chart, I visualize the top traffic sources sending traffic to my blog for the last 3 years.

  • Organic search has traditionally been the top traffic source for my blog, but the interesting thing is that a lot of visitors come directly to my blog by typing the URL, which is very surprising to me unless they bookmark it.
  • Traffic via Social channels started appearing once I began sharing my blog posts on LinkedIn and Twitter in early 2018, which explains the trend.

Below is the code snippet.
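A minimal sketch of a stacked area chart, assuming the data is in "long" form with one row per month and traffic source. The source names mirror the ones discussed above; the numbers are invented.

```r
library(ggplot2)

# Made-up long-format data: 4 months x 3 traffic sources.
traffic <- data.frame(
  Month = rep(seq(as.Date("2018-01-01"), by = "month", length.out = 4), times = 3),
  Source = rep(c("Organic Search", "Direct", "Social"), each = 4),
  Visits = c(100, 120, 150, 200,   # Organic Search
             40, 50, 60, 80,       # Direct
             0, 10, 30, 50),       # Social
  stringsAsFactors = FALSE
)

# geom_area() stacks the groups on top of each other by default.
p_area <- ggplot(traffic, aes(x = Month, y = Visits, fill = Source)) +
  geom_area() +
  scale_x_date(date_labels = "%b %Y") +
  labs(x = "Month", y = "Visits", fill = "Traffic source")

print(p_area)
```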

Age and Gender Data Captured since late 2018 (Pie Chart and Bar Graph)

Now, some Adobe Analytics users might be surprised to learn how Google Analytics captures demographic data. This is done by enabling the Demographics and Interests reports in Google Analytics, which use data collected from IDFA and Google advertising cookies to help in retargeting. Again, none of the data shown here is even borderline PII, so Google appears to have taken privacy regulations into consideration. It might be a good addition for Adobe Analytics if it could receive similar data from the Adobe Ad Cloud platform.

There's not much to say here as these graphs are self-explanatory, but I manually added the percentage (total is ~5500 Visitors) to show the breakdown of Gender in the Pie chart. The code snippet is shown below.
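A minimal sketch of both charts. ggplot2 has no dedicated pie geom, so a common approach is a single stacked bar drawn in polar coordinates. The gender split and age groups below are invented (only the rough total of 5500 matches the text), and the percentage labels are computed rather than typed in by hand.

```r
library(ggplot2)

# Made-up split that adds up to 5500 Visitors.
gender <- data.frame(
  Gender = c("Male", "Female"),
  Visitors = c(3300, 2200),
  stringsAsFactors = FALSE
)
gender$Share <- round(100 * gender$Visitors / sum(gender$Visitors), 1)

# Pie chart: one stacked bar wrapped around a circle.
p_pie <- ggplot(gender, aes(x = "", y = Visitors, fill = Gender)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(Share, "%")),
            position = position_stack(vjust = 0.5)) +
  theme_void()

# Bar graph of made-up age groups.
age <- data.frame(
  AgeGroup = c("18-24", "25-34", "35-44", "45-54"),
  Visitors = c(900, 2600, 1400, 600),
  stringsAsFactors = FALSE
)
p_age <- ggplot(age, aes(x = AgeGroup, y = Visitors)) +
  geom_col() +
  labs(x = "Age group", y = "Visitors")

print(p_pie)
print(p_age)
```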

Word Cloud showing Internal Search Terms since early 2019

A word cloud is a commonly used visual to show search terms or popular tags which people are looking at. I've loaded the "ggwordcloud" package to do this.

  • Given that I've written extensively about Adobe Audience Manager, it's not surprising that a lot of the search terms are tied to AAM.
  • This also tells me what else I can write about based on what people are searching for.

Below is the code snippet.
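A minimal sketch of a word cloud with ggwordcloud, assuming a data frame with one row per search term and a count of searches. The terms and counts are invented for illustration and are not the blog's real internal search data.

```r
library(ggplot2)
library(ggwordcloud)

# Made-up internal search terms and how often each was searched.
terms <- data.frame(
  Term = c("audience manager", "aam", "traits", "segments",
           "launch", "analytics", "id sync", "data feed"),
  Searches = c(40, 35, 22, 18, 15, 12, 9, 6),
  stringsAsFactors = FALSE
)

# Word placement is randomized, so fix the seed for a repeatable layout.
set.seed(42)

p_cloud <- ggplot(terms, aes(label = Term, size = Searches)) +
  geom_text_wordcloud() +
  # Scale text area with the count and cap the largest word.
  scale_size_area(max_size = 12) +
  theme_minimal()

print(p_cloud)
```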

Map Showing Visitors by Country from Jan-2017 to Apr-2020

Finally, this visual, created using the "maps" and "viridisLite" packages (viridisLite is not part of base R but is typically installed as a dependency alongside ggplot2), shows which country is the most popular in terms of the total number of Visitors. As shown below, the United States is the most popular, followed by India. These are the two countries I'm associated with, so I'm not surprised :)

Below is the code snippet for this.
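A minimal sketch of a country map, assuming one row per country with a Visitors total. It uses ggplot2's map_data() (which needs the maps package installed) and the built-in viridis fill scale rather than calling viridisLite directly, so it may differ from the original approach. The visitor counts are made up, and the country names have to match the spellings used by the maps package (for example "USA" rather than "United States").

```r
library(ggplot2)
library(dplyr)

# World country outlines from the maps package, as a data frame.
world <- map_data("world")

# Made-up visitor totals keyed by the maps package's region names.
visitors <- data.frame(
  region = c("USA", "India", "UK"),
  Visitors = c(3000, 1500, 400),
  stringsAsFactors = FALSE
)

# left_join() keeps every outline row in its original order,
# which matters because polygons are drawn row by row.
world_visits <- left_join(world, visitors, by = "region")

p_map <- ggplot(world_visits,
                aes(x = long, y = lat, group = group, fill = Visitors)) +
  geom_polygon(colour = "white") +
  # Countries without data are shown in light grey.
  scale_fill_viridis_c(na.value = "grey90") +
  coord_quickmap() +
  theme_void()

print(p_map)
```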

Adobe Experience Platform Data Science Workspace

One of the best things about Adobe Experience Platform is that it provides you the ability to run SQL queries and run ML algorithms or models directly on the data (in XDM format) in the tool which has never been the case in the past. This is super powerful and completely eliminates the additional time it would take to export the data and make it available in a platform outside of Adobe. 

Data Science Workspace integrates Jupyter notebooks, which is a very popular open source application that allows you to run ML models and perform data visualization in programming languages such as Python. The reason I'm mentioning it in this post is that it can also run code written in R, so theoretically everything that I showed you can work in Data Science Workspace, but my understanding is that it requires the underlying data to be in XDM format. I haven't dabbled with it due to data access constraints but here's how you can access DSW and run R commands in case you have access.

So that was it! I hope this post piqued your curiosity about R and its data visualization capabilities.
