Lesson 1: Why Use R?
R is a programming language used for data analysis, data visualization, and statistics. R is widely used in academia, government, and industry by statisticians, researchers, data scientists, and other quantitative practitioners.
Free and open source
R is free and open-source software, meaning that the source code is publicly available and there is no cost to use it. Open-source tools are popular in software development because they allow transparency as well as collaboration. R and Python are two open-source programming languages that are widely used for data analysis.
As open-source software, R allows users to create their own add-ons with additional functionality (R packages) that other users can install. Many of these packages are created and maintained by members of the R user community, who release regular updates and add new features. Others are developed and maintained by organizations and companies.
Posit is one such company that manages multiple packages and other R tools. Posit is also the developer of RStudio, the environment through which you will access the R software, and it maintains the tidyverse and other popular R packages (R packages will be discussed further in Lesson 4). RStudio and tidyverse packages are free and accessible to the public; Posit provides them at no cost.
In contrast to R and Python, SAS, Stata, SPSS, Tableau, and Power BI are examples of closed-source software used for data analysis and visualization. The source code for closed-source software is not publicly available, and the software usually costs money to use—typically in the form of licenses that must be renewed. SAS, Stata, and SPSS are often used for statistical programming, while Tableau and Power BI are frequently used for data visualization and business intelligence. One reason Tableau and Power BI are widely used is that they allow users to create interactive dashboards with point-and-click workflows that don’t require code to build. One downside to these tools is that they do not provide the same level of flexibility in visualization as R, Python, or custom web development.
Quality assurance
The reports and documents that corrections analysts produce are relied upon by leadership, oversight bodies, the media, and the public, so it’s essential that the numbers and data that are published are correct. The ability to reproduce, re-create, and review the steps of a data analysis project is key for maintaining quality assurance.
As a programming language, R has an advantage over point-and-click tools such as Microsoft Excel and Tableau in this respect. To perform any computation or data manipulation in R, a user must write code: a list of instructions for the computer to follow. Because the instructions are written down, one user can get the exact same output as another user who runs the same code on the same data.
When analysis tasks are completed in Excel, the “instructions” are not saved because the analysis is conducted by clicking buttons and copying and pasting cells, formulas, and data. This makes reproducibility more difficult, as the user is responsible for documenting exactly what tasks were completed in what order.
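As a small illustration of code serving as a saved set of instructions, the sketch below uses only base R and a made-up table of admissions (the facilities, months, and counts are hypothetical). Anyone who runs these lines should get the same totals and shares, and a reviewer can read each step in order.

```r
# Hypothetical data: monthly admissions at two made-up facilities.
admissions <- data.frame(
  facility = c("A", "A", "B", "B"),
  month    = c("2024-01", "2024-02", "2024-01", "2024-02"),
  count    = c(120, 135, 80, 95)
)

# Step 1: total admissions per facility.
totals <- aggregate(count ~ facility, data = admissions, FUN = sum)

# Step 2: each facility's share of all admissions, as a percentage.
totals$share <- round(100 * totals$count / sum(totals$count), 1)

print(totals)
```

The script itself is the documentation: there is no separate list of clicks or copy-and-paste steps to remember.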
Data management
R also provides additional security and control when managing data. Raw data is imported into R, and any adjustments to the data are recorded in the code rather than made to the original file. Excel workbooks can easily become messy or inaccurate if a user accidentally edits a value within a cell. Even with protected cells, Excel does not save information on the source of the raw data or changes that were made to the data. Using a programming language like R means that there is a clear record of the data source and the steps that are taken to clean, manipulate, and analyze the data.
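The sketch below shows one way this can look in practice, using base R only. The file contents, column names, and cleaning rules are hypothetical; the example writes its own small "raw" file to a temporary location so that it can run anywhere.

```r
# Hypothetical raw file, written to a temporary location so the sketch
# is self-contained. In real work this would be an existing data extract.
raw_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,facility,age",
  "1,north,34",
  "2,North,-1",
  "3,south,51"
), raw_path)

# Import: the code records exactly where the data came from.
raw <- read.csv(raw_path, stringsAsFactors = FALSE)

# Clean a copy; each adjustment is a visible, reviewable line.
clean <- raw
clean$facility <- toupper(clean$facility)  # standardize facility names
clean$age[clean$age < 0] <- NA             # assumed rule: negative ages are missing
```

The file on disk is never modified, so the original values can always be compared against the cleaned ones.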
Data size
Microsoft Excel worksheets are limited to 1,048,576 rows. A programming language like R, on the other hand, can handle much larger datasets, limited mainly by the computer's available memory (and larger still when paired with an external database). Administrative datasets often exceed Excel's row limit, so a programming language or database is essential for performing the analysis.
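To make the difference in scale concrete, the following sketch builds a simulated table with 2 million rows and summarizes it using base R. The data is randomly generated and the column names are hypothetical; how large a dataset R can hold in practice depends on the memory of the computer running it.

```r
# Excel's per-worksheet row limit (2^20 = 1,048,576 rows).
excel_row_limit <- 2^20

# Simulated data with 2 million rows (all values are made up).
set.seed(1)
n <- 2e6
stays <- data.frame(
  facility = sample(c("A", "B", "C"), n, replace = TRUE),
  days     = sample(1:365, n, replace = TRUE)
)

# Confirm the table is larger than a single Excel worksheet allows.
nrow(stays) > excel_row_limit

# Average length of stay per facility, computed over all rows.
avg_days <- tapply(stays$days, stays$facility, mean)
```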
Flexible output
An exciting feature of R is that it allows users to create outputs in many formats, such as HTML, PDF, and Microsoft Word documents. This flexibility allows a single tool to create interactive dashboards, web applications, PDF reports, presentations, and more. Moreover, one of the strengths of R is the ability to create professional-quality data visualizations, from bar charts and line charts to maps and interactive visualizations.
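As a minimal example of one output format, the sketch below draws a bar chart and saves it as a PDF using only the graphics tools that ship with R. The facility names and counts are hypothetical. Full reports in HTML or Word are typically produced with add-on packages such as rmarkdown, which are not shown here.

```r
# Hypothetical counts for three made-up facilities.
population <- c(North = 410, South = 365, East = 290)

# Open a PDF graphics device; everything drawn until dev.off()
# is written to this file.
chart_path <- file.path(tempdir(), "population_chart.pdf")
pdf(chart_path, width = 6, height = 4)
barplot(population,
        main = "Population by facility",
        ylab = "People")
dev.off()

# Check that the file was created.
file.exists(chart_path)
```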
Automation
Because R is a programming language, it allows a user to “script” or template repetitive tasks. For example, a user can create a new report when a new month of data is ready by simply re-executing the same code with the new data. The same monthly process in Excel may involve manually changing cells and copying and pasting data, which adds time and the possibility of introducing errors.
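A minimal sketch of this idea follows. The function name `monthly_summary`, the `count` column, and the figures are all hypothetical; the point is only that the same instructions run unchanged on each new month of data.

```r
# Build a one-line summary for any month of data that has a `count` column.
monthly_summary <- function(data, month_label) {
  sprintf("%s: %d records, %d total admissions",
          month_label, nrow(data), sum(data$count))
}

# Two months of made-up data with the same structure.
january  <- data.frame(facility = c("A", "B"), count = c(120L, 80L))
february <- data.frame(facility = c("A", "B"), count = c(135L, 95L))

# The same code produces each month's summary.
report_jan <- monthly_summary(january, "2024-01")
report_feb <- monthly_summary(february, "2024-02")
```

When next month's data arrives, only the input changes; the summary logic is written once.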
Ad-hoc requests
Given R’s steep learning curve, it can be hard to commit to learning how to do tasks within R that you already know how to complete in Excel, especially if it’s a “one-off” or ad-hoc request. Let’s say you’re given a task to filter some data and create a chart. What if, a few months later, you get the same request, or someone wants to see the same chart but with a different metric? If you’re using Excel, you will likely have to start the whole process over again. But if you fulfilled the initial request in R, you can re-run the code with different inputs to respond to the second request. R takes more effort to set up and learn, but in the long run it will make your work more efficient.
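The scenario above might look something like the following sketch, written in base R. The data, the column names, and the helper `metric_by_facility` are hypothetical; the idea is that the follow-up request changes one argument rather than the whole workflow.

```r
# Hypothetical data with two metrics per facility and year.
facility_data <- data.frame(
  facility   = c("A", "B", "A", "B"),
  year       = c(2023, 2023, 2024, 2024),
  admissions = c(1400, 900, 1500, 950),
  releases   = c(1350, 920, 1420, 980)
)

# Filter to one year and pull out one metric, named by its column.
metric_by_facility <- function(data, metric, year) {
  rows <- data[data$year == year, ]
  setNames(rows[[metric]], rows$facility)
}

# First request: admissions in 2024.
first_request <- metric_by_facility(facility_data, "admissions", 2024)

# Follow-up request: same view, different metric.
second_request <- metric_by_facility(facility_data, "releases", 2024)

# Either result can be charted the same way, e.g. barplot(second_request).
```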