Course SDA101
Introduction to Statistics Using R
Duration: 5 Days
Course Synopsis
This course provides an introduction to, and review of, statistics using R. It covers practical applications of R and the basics of R programming for statistical testing, exploratory data analysis and data reporting. It is a practical course for those whose jobs entail reporting and analysing data, and is suitable for data administrators, data analysts and technologists.
R is a free, open source and relatively easy to use package. The purpose of this course is not so much to cover the mathematical theory of statistics as to show the practical use of statistics in day-to-day reporting and data analysis. It is part of a series of courses for organisations that plan to adopt open source solutions (e.g. because the cost of licensing or updating commercial software is prohibitive) and wish to standardise some or all of their applications on stable, high-quality open source software.
Attendees are assumed to have a knowledge of mathematics to GCSE or AS level and some experience of simple programming, e.g. VBA or spreadsheet macros, basic PHP or basic JavaScript. Attendees are also expected to have basic experience of working with databases, e.g. Access, SQLite or MySQL. An awareness of statistics is assumed, but no formal knowledge of statistics is required.
The course provides a thorough introduction to programming in R, as well as to relational databases and SQL. The practical exercises cover not only the basics of applying statistical tests to data but also the extraction of data from relational databases (e.g. MySQL, Access, PostgreSQL, Oracle and SQL Server), the extraction of data from CSV files, graphical plotting of data (both 2D and 3D plots) and exploratory data analysis. The use of R in conjunction with Microsoft Excel is also covered. The course is heavily oriented towards practical work: 40% theory and 60% hands-on exercises.
Possible follow-on courses include:
- Applied Scientific and Statistical data analysis using Python and R
- Data mining and data exploration using Python and R
- Applied GIS using PostGIS and R
- Advanced R - Bioconductor Programming
- Advanced C++ and R programming
Course Outline
Introduction to Statistics Using R
- Practical Applications of Exploratory Data Analysis and Statistics
- An overview of the background and features of the R statistical programming system
Introduction and Overview of Data Manipulation in R
- Basic data types and data operations
- Creating small datasets
- Basic R data types and variables
- R functions
- Basic R functions for manipulating data
- Concatenating data and combining variables - c, cbind and rbind functions
- Functions for working with collections of data items - vector, matrix, data.frame and list
- Functions for manipulating text and dates
- Functions for sorting, ranking and printing
- Introduction to importing and exporting data
- Excel
- CSV files
- Databases
- Other statistical packages
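As an illustration of the data manipulation topics in this module, here is a minimal sketch using base R only. The variable names, the small dataset and the temporary CSV file are made up for this example and are not taken from the course materials.

```r
# Combine individual values into vectors with c()
height_cm <- c(150, 160, 170)
weight_kg <- c(50, 60, 70)

# cbind() binds vectors together as columns, rbind() binds them as rows
by_column <- cbind(height_cm, weight_kg)   # 3 rows, 2 columns
by_row    <- rbind(height_cm, weight_kg)   # 2 rows, 3 columns

# A data frame holds named columns, possibly of different types
people <- data.frame(
  name      = c("Ann", "Bob", "Cy"),
  height_cm = height_cm,
  weight_kg = weight_kg,
  stringsAsFactors = FALSE
)

# Sort the rows by height, tallest first
people_sorted <- people[order(people$height_cm, decreasing = TRUE), ]

# Export to a temporary CSV file and import it again
csv_path <- tempfile(fileext = ".csv")
write.csv(people, csv_path, row.names = FALSE)
people_again <- read.csv(csv_path, stringsAsFactors = FALSE)

print(dim(by_column))
print(people_sorted)
```

Importing from Excel, databases or other statistical packages normally relies on add-on packages, so it is left out of this sketch.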
Foundations of programming in R
- Basic Programming Concepts and Elements
- Sequence, Choice, Iteration
- Basic R data types and variables
- Simple data types, complex data types - Variables and Data Frames
- Functions
- Object-oriented programming - combining a data type and its associated operations
- Lists and classes
- Variables, data subsets and data frames
- Accessing variables from a Data Frame
- Passing data to functions
- How functions "return" values
- Accessing data subsets
- Creating and combining data subsets
- Categorical variables
- Some useful basic R functions
- str
- attach
- tapply, sapply, lapply
- summary
- table
- Choice using if statements
- Iteration and looping
- Filtering data using choice and iteration
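The programming topics in this module can be sketched in a few lines of base R. The data frame `scores`, the threshold of 25 and the helper function `above` are hypothetical and exist only for this illustration.

```r
# A small data frame with a categorical variable (factor) and a numeric variable
scores <- data.frame(
  group = factor(c("a", "b", "a", "b", "a")),
  score = c(10, 20, 30, 40, 50)
)

str(scores)             # structure: column names and types
summary(scores$score)   # five-number summary plus the mean
table(scores$group)     # counts per category

# tapply(): apply mean() to score within each level of group
group_means <- tapply(scores$score, scores$group, mean)

# sapply(): apply a function to each selected column, returning a vector
col_max <- sapply(scores["score"], max)

# A function "returns" the value of its last evaluated expression
above <- function(df, threshold) {
  df[df$score > threshold, ]
}
high <- above(scores, 25)

# The same filter written with iteration (for) and choice (if)
keep <- logical(nrow(scores))
for (i in seq_len(nrow(scores))) {
  if (scores$score[i] > 25) {
    keep[i] <- TRUE
  }
}
high_loop <- scores[keep, ]

print(group_means)
print(high)
```

Both approaches select the same rows. In everyday R code the vectorised form inside `above` is usually preferred over an explicit loop.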
Plotting and Graphing
- The plot function
- Effective use of symbols, colours and sizes
- Line smoothing
- Standard plots
- Pie charts, bar charts and strip charts
- Boxplots
- Cleveland Dotplots, Pairplots and Coplots
- Combining plots
- Modifying plots, labels, titles, and advanced customization
- The Lattice package and Trellis graphics
- Multipanel scatterplots, boxplots, histograms etc.
- 3D scatterplots and contour plots
- Saving graphics and graphic formats
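A minimal sketch of some of the plotting topics above, assuming base R graphics and the built-in `mtcars` dataset (chosen here for illustration only; the datasets used on the course are not specified). It draws a scatterplot with a smoothed line next to a boxplot and saves both to a temporary PDF file.

```r
# Open a PDF graphics device; everything plotted goes to this file
out_path <- tempfile(fileext = ".pdf")
pdf(out_path, width = 8, height = 4)

# Combine two plots side by side on one page
par(mfrow = c(1, 2))

# Scatterplot with a chosen symbol, colour, axis labels and title
plot(mtcars$wt, mtcars$mpg,
     pch = 19, col = "steelblue",
     xlab = "Weight (1000 lb)", ylab = "Miles per gallon",
     main = "Fuel economy vs weight")

# Add a smoothed line (lowess) on top of the points
lines(lowess(mtcars$wt, mtcars$mpg), col = "red", lwd = 2)

# Boxplot of fuel economy by number of cylinders
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Cylinders", ylab = "Miles per gallon",
        main = "Fuel economy by cylinders")

# Close the device so the file is written to disk
dev.off()
```

Other devices such as `png()` follow the same pattern of opening the device, plotting, then calling `dev.off()`.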
Basic Statistics and Modeling in R
- Overview of statistics and statistical testing
- Basic statistical tests, power, and sample size functions
- t-tests and non-parametric procedures (e.g. the Mann-Whitney test)
- 1-way and 2-way Analysis of Variance
- Linear and Multiple Regression
- Analysis of Covariance
- Regression and analysis of variance
- Basic data mining functions
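The tests listed in this module map onto functions in base R's `stats` package. The following is a minimal sketch, assuming the built-in `sleep` and `mtcars` datasets as stand-ins for course data; the choice of variables and of the power calculation inputs is purely illustrative.

```r
# Two-sample t-test: does 'extra' sleep differ between the two groups?
t_res <- t.test(extra ~ group, data = sleep)

# Non-parametric alternative: Mann-Whitney (Wilcoxon rank-sum) test
# exact = FALSE requests the normal approximation, since the data contain ties
w_res <- wilcox.test(extra ~ group, data = sleep, exact = FALSE)

# Power of a two-sample t-test with 20 subjects per group,
# a true difference of 1 and a standard deviation of 1
pw <- power.t.test(n = 20, delta = 1, sd = 1)

# Simple linear regression: fuel economy explained by weight
fit_simple <- lm(mpg ~ wt, data = mtcars)

# Multiple regression: add horsepower as a second predictor
fit_multi <- lm(mpg ~ wt + hp, data = mtcars)

# One-way analysis of variance: fuel economy by number of cylinders
fit_aov <- aov(mpg ~ factor(cyl), data = mtcars)

print(t_res)
print(w_res)
print(pw)
print(coef(fit_simple))
summary(fit_multi)
summary(fit_aov)
```

Each test returns an object whose components, such as `t_res$p.value`, can be extracted for use in reports.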
R, Access and Excel (optional module)
- Overview of Excel
- Introduction to the RExcel plugin
- Moving data from Excel to R
- Importing CSV tables into Excel and R
- Exporting data as CSV tables
- Importing / exporting data from Microsoft Access to Excel
- Importing / exporting data from R to Access
R and Relational Databases (optional module)
- Introduction to Relational databases and SQL for R users
- Tables, relations, and normalisation
- SQL as a Data Definition Language
- SQL as a Data Manipulation / Querying language
- Simple SQL queries
- More complex SQL - aggregation and grouping
- Accessing relational databases from R
Data mining and data exploration (optional module)
- Principles of data warehousing and data mining
- Organising and viewing data at various levels of granularity
- Star schemas
- Optimising databases for working with historical data
- Exploratory data analysis (EDA) techniques
- Tables and cross-tabulations
- Plotting strategies for EDA
- Exploring data to discover hypotheses to test
- Some examples and case studies
Using R from the Command Line (optional module)
- Overview of R commands and command line options
- Introduction to Bash shell programming
- Incorporating R commands into Bash shell scripts for automating data analysis and reporting
