Statistics and Analytics for the Social and Computing Sciences
Draft Dated: 2021-08-10
This is a working collection of notes on statistics and data analytics that I am compiling, with two goals:
To serve as a supplement to a course that I teach at the National University of Singapore (BT1101: Introduction to Business Analytics), a statistics course in R targeted at first-year undergraduate students and aspiring data scientists. These sections will cover introductory material, and will be marked with BT1101.
To discuss material that would be useful for graduate-level researchers in the social and computing sciences. This material will build upon the introductory level, and will be marked with Advanced.
I hope to cover a broad range of important and useful statistical tools (such as multivariate regression and simulations), and to discuss issues such as data visualization best practices. I also plan to write several chapters on applying statistics in the computing sciences (for example, proper statistics when analyzing machine learning models). Finally, if I have time, I would like to transition to teaching statistics in a more Bayesian tradition.
My philosophy in teaching statistics and analytics is to focus on helping students to achieve a conceptual understanding, and develop their own intuition for data. Yes, students will need some mathematical background to appreciate statistics, and yes, students will need to learn some programming (in R) to actually implement modern statistical calculations, but these are means to an end. The end is an appreciation of data, and especially, how data exists in the real world. Real data rarely conforms to the assumptions we make in our analyses. The job of the analyst is to understand the data, which involves “troubleshooting” confusing statistical output, modifying statistical models and their underlying assumptions, and perhaps even inventing new ones.
As some background, I am a computational cognitive psychologist with a little training in econometrics, so I tend to favor regression and simulation approaches, and my examples may default to those common in the social sciences.
Disclaimer: For students taking BT1101, please refer to these notes only if you are taking this course under me. If you are taking the course under a different instructor, that instructor’s lecture notes take precedence as to whether something is in the syllabus or not (and hence, testable on assessments/exams). We are always making improvements to the syllabus, and so for different offerings of the course, instructors may cover slightly different material. So if you are taking it under a different instructor, do not assume that concepts covered here will show up on the exam, or that concepts not covered here will not show up on the exam. I’ve indicated sections that were covered the last time I taught BT1101 with a BT1101 label.
This is a work-in-progress that is inspired by Russ Poldrack’s Psych10 book (http://statsthinking21.org/), which was written for another undergraduate Introduction to Statistics course. This set of notes is hosted on GitHub and built using Bookdown.
Feedback can be sent to dco (at) comp (dot) nus (dot) edu (dot) sg.
This material is shared under a Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0) License. What this means is that you are free to copy, redistribute, and even adapt the material in this book, in any format or for any purpose, even commercial. Basically, this is a freely-available educational resource that you can share and use. The only conditions are (i) you must give appropriate credit, and if you made any changes, you must indicate so, but not in any way that suggests that I endorse your changes, and (ii) if you transform or build upon the material here, you must also distribute your contributions under the same license.