Best practices for using R in corrections analysis
The primary tools used for data analysis should be code-based.
Code-based tools allow for:
Reproducibility
Quality assurance and transparency
Automation
Each step of a data analytic workflow can be replicated and repeated.
Code-based tools have advantages over point-and-click tools such as Microsoft Excel and Tableau when it comes to reproducibility. To perform any computation or data manipulation in a programming language, a user must write code, which is a list of instructions for the computer to follow. Because those instructions are saved, one analyst can get exactly the same output as any other analyst who runs the same code on the same data.
When analysis tasks are completed in Excel, the “instructions” are not saved because the analysis is conducted by clicking buttons and copying and pasting cells, formulas, and data. This makes reproducibility more difficult, as the analyst is responsible for documenting exactly what tasks were completed in what order.
This reproducibility allows for both quality assurance and automation.
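As a minimal sketch of what "saved instructions" look like in practice, the short R script below records each step of a small analysis. The data frame, its column names (facility, admit_month), and its values are invented for illustration and do not come from any agency data. Anyone who runs the script gets the same counts, because every step is written down.

```r
# Hypothetical example data: one row per admission (all values invented).
admissions <- data.frame(
  facility = c("A", "A", "B", "B", "B"),
  admit_month = c("2023-01", "2023-02", "2023-01", "2023-01", "2023-02"),
  stringsAsFactors = FALSE
)

# Step 1: keep only the January admissions.
january <- admissions[admissions$admit_month == "2023-01", ]

# Step 2: count admissions per facility and store the result as a data frame.
january_counts <- as.data.frame(
  table(facility = january$facility),
  responseName = "admissions",
  stringsAsFactors = FALSE
)

print(january_counts)
```

In a spreadsheet, the filter and the count would be a series of clicks that leave no record. Here, both steps can be read, rerun, and checked by a colleague.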
Transparency into the logic and methods used to manipulate and analyze data enables robust quality assurance processes.
When data analysis is done in code, the logic and methods used to clean, manipulate, and analyze data are transparent and visible. This visibility allows colleagues to verify the accuracy of the analysis through code review and other quality assurance processes. It is nearly impossible to review every step of an analysis completed in a non-code-based tool.
Code-based analysis also allows an agency, if it wishes, to make public the source code used for its analysis, models, or dashboards. This level of transparency gives researchers, the media, and the public greater confidence that the data reported by the agency is accurate.
Reports and analysis can be automated to save time and reduce errors.
Code-based tools allow analysts to “script” or template repetitive tasks. For example, an analyst can create a new report when a new month of data is ready by simply re-executing the same code with the new data. The same monthly process in Excel may involve manually changing cells and copying and pasting data, which adds time and creates opportunities for error.
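One way to template a monthly report, sketched here under stated assumptions, is to wrap the analysis in a function and call it on each new month of data. The function name build_monthly_report, the column names (facility, length_of_stay), and the data values are all hypothetical. A real report would likely read each month's extract from a file or database rather than define it inline.

```r
# Hypothetical helper: summarise one month of data by facility.
# Assumes `monthly_data` has `facility` and `length_of_stay` columns.
build_monthly_report <- function(monthly_data) {
  # Average length of stay per facility.
  report <- aggregate(length_of_stay ~ facility, data = monthly_data, FUN = mean)
  # Give the summary column a descriptive name.
  names(report)[names(report) == "length_of_stay"] <- "mean_length_of_stay"
  report
}

# Invented data standing in for two monthly extracts.
january <- data.frame(
  facility = c("A", "A", "B"),
  length_of_stay = c(10, 20, 30)
)
february <- data.frame(
  facility = c("A", "B", "B"),
  length_of_stay = c(12, 28, 32)
)

# The same code produces each month's report; only the input changes.
january_report <- build_monthly_report(january)
february_report <- build_monthly_report(february)

print(january_report)
print(february_report)
```

Because the logic lives in one function, a correction made there applies to every future report, with no need to repeat a manual fix each month.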