COSC 6323: Statistical Methods


To fulfill its mission, AC-DC carries out challenging human studies that accrue big data and require sophisticated analysis. The skills required for such studies are a superset of the skills required for any other data-driven project.

KEY INFORMATION

Class Meetings

Friday 4:00 – 7:00 pm @ Teams

Course Instructor

Prof. Ioannis Pavlidis
Office hours: 3-4 pm on Fridays (@ Teams)

Course TA

Mohammed Emtiaz Ahmed
Office hours: 12-2 pm on Thursdays (@ Teams)

Course Description - COSC 6323

The course covers statistical methods in human and technology studies or experiments, from which the bulk of scientific and engineering data originate. The course starts by situating statistics in the context of data science. Special emphasis is placed on the relationship of statistics to machine learning. Then, instruction proceeds in a stepwise manner, building the student’s background in the statistical tools of the trade, without which an MS thesis or PhD dissertation cannot be complete. The course culminates with sessions on experimental design, one of the cornerstones of modern data science.

Although the introduction and methodological sections of scientific papers differ from discipline to discipline (e.g., algorithms vs. assays), the results sections of papers should conform to a universal pattern, according to currently accepted best practices. The data should be produced according to appropriate study or experimental designs and should be subjected to relevant statistical tests. There is no such thing as statistics for computer scientists or statistics for biologists; statistics is the same for everybody. However, certain disciplines tend to use some tools more than others, and instruction needs to be tailored to students’ educational backgrounds. In computer science in particular, the adoption of statistical analysis of experimental results has been slow. This has changed in the last few years: several computing disciplines have already adopted statistical methods as the analytic standard, while others are bound to follow sooner or later. Among the computer science communities at the forefront of this movement are the Human-Computer Interaction and Computer Vision communities. The Statistical Methods course aims to cover this need and is paced to suit the typical background of graduate students in computer science. It is very practical in its orientation (no proofs), emphasizing the understanding of concepts and the ability to choose the right design or apply the right test.

The first part of the course starts with the delineation between continuous and discrete variables and the enormous implication that this carries for the selection of tests. Then, it proceeds with the description of distributions, probabilities, and error types that are fundamental to the construction of the t-tests, ANOVA tests, and nonparametric tests. Next, the course visits regression in its various forms, completing the coverage of significance and association tests used in almost all scientific papers. Emphasis is placed on multiple regression and linear modeling, a powerful and elegant method for examining the effect of multiple factors in a research problem; it is heavily used nowadays in MS and PhD research. The treatment of symbolic data and the tools of last resort, that is, nonparametric methods, completes the course’s first part.

The course’s second part covers various experimental designs. Before students start analyzing data, they need to know the principle according to which to collect these data in order to address their hypothesis; for this, they need to pick the right experimental design. Even perfect analysis will not save the day if the investigator picked the wrong experimental design (i.e., garbage in, garbage out). Hence, at the end of the course’s second part, students acquire a 30,000-foot view of the scientific and engineering process, solidifying their ability to design studies and to collect and test data.
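As a small illustration of why the design comes before the analysis, the sketch below randomly assigns subjects to treatment groups, as in a completely randomized design. It is written in Python purely for illustration (the course itself uses R), and the function name `assign_treatments`, the subject IDs, and the treatment labels are hypothetical, not part of the course material.

```python
import random


def assign_treatments(subjects, treatments, seed=None):
    """Completely randomized design (illustrative sketch).

    Shuffle the subjects, then deal them out to the treatments in
    round-robin order so that group sizes differ by at most one.
    """
    rng = random.Random(seed)      # local generator; seed makes the run reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)

    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups


if __name__ == "__main__":
    # Hypothetical example: 12 subjects, 3 treatments.
    for treatment, members in assign_treatments(range(1, 13), ["A", "B", "C"], seed=2020).items():
        print(treatment, sorted(members))
```

Randomizing the assignment, rather than letting subjects or the investigator choose groups, is what guards against systematic differences between groups other than the treatment.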

The course has four homework assignments to reinforce the understanding of the concepts and methods. In place of a final exam, the course has a semester-long project, where a problem is defined for the class, and then each group of students is required to come up with a study design, collect and quality-control data, and perform tests, putting everything in the form of a term paper. The homework assignments are individual, while the project is a group assignment; each project group typically consists of 2-3 students.

The students need to know R and RStudio in order to process and plot the data. R is becoming one of the most useful tools for computer scientists in the data analytics business. The instructors provide the students with online educational material and organize an R tutorial class. Importantly, the last hour of each three-hour class session is devoted to R programming, where the students code the theoretical principles covered earlier in the session.

Gradebook

10% Participation

4 × 12.5% Homework

40% Project
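The weights above sum to 100% (10 + 4 × 12.5 + 40). A minimal sketch of how a final score could be computed from them follows, assuming every component is scored on a 0-100 scale; that scale, the function name `final_grade`, and the sample scores are assumptions made for illustration and are not stated in the syllabus. The sketch is in Python for illustration only; the course itself uses R.

```python
# Weights taken from the Gradebook; the 0-100 scoring scale is an assumption.
PARTICIPATION_WEIGHT = 0.10
HOMEWORK_WEIGHT_EACH = 0.125   # four homework assignments, 12.5% each
PROJECT_WEIGHT = 0.40


def final_grade(participation, homeworks, project):
    """Weighted final score on a 0-100 scale (illustrative sketch)."""
    if len(homeworks) != 4:
        raise ValueError("expected exactly four homework scores")
    return (PARTICIPATION_WEIGHT * participation
            + HOMEWORK_WEIGHT_EACH * sum(homeworks)
            + PROJECT_WEIGHT * project)


if __name__ == "__main__":
    # Hypothetical scores for one student.
    print(round(final_grade(90, [80, 85, 90, 95], 88), 2))
```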

COURSE OUTLINE


Lesson 1: Data, Statistics, and Data Science 1/17/2020

Situating Statistics and Machine Learning in Data Science; observations and variables; types of measurements for variables; distributions; numerical descriptive statistics; exploratory data analysis; bivariate data; data collection

Lesson 2: Probabilities and Sampling Distributions 1/24/2020

Probability; discrete probability distributions; continuous probability distributions; sampling distributions

Homework #1 Out on 1/24/2020

Lesson 3: Principles of Inference 1/31/2020

Hypothesis testing; estimation; sample size; assumptions

Assignment of Projects

Lesson 4: Inferences on a Single Population 2/7/2020

Inferences on the population mean; inferences on a proportion; inferences on the variance of one population; assumptions

Homework #1 Due on 2/7/2020

Homework #2 Out on 2/7/2020

Lesson 5: Inferences for Two Populations 2/14/2020

Inferences on the difference between means using independent samples; inferences on variances; inferences on means for dependent samples; inferences on proportions; assumptions

Lesson 6: Inferences for Two or More Means 2/21/2020

Analysis of variance; linear model; assumptions; specific comparisons; random models; unequal sample sizes; analysis of means

Homework #2 Due on 2/21/2020

Homework #3 Out on 2/21/2020

Lesson 7: Linear Regression 3/6/2020

The regression model; estimation of parameters; inferences for regression; correlation; regression diagnostics

Lesson 8: Multiple Regression 3/27/2020

The multiple regression model; estimation of coefficients; inferential procedures; correlations; special models; multicollinearity; variable selection; detection of outliers

Lesson 9: Linear Models 4/3/2020

The dummy variable model; unbalanced data; models with dummy and interval variables; weighted least squares; correlated errors

Homework #3 Due on 4/3/2020

Homework #4 Out on 4/3/2020

Lesson 10: Categorical Data 4/10/2020

Hypothesis test for a multinomial population; goodness of fit; contingency tables; loglinear model

Lesson 11: Nonparametric Methods 4/17/2020

One sample; two independent samples; more than two samples; rank correlation; the bootstrap

Lesson 12: Experimental Designs 4/24/2020

Randomized designs; paired comparison designs; randomized complete block designs; Latin square designs; Greco-Latin square designs; balanced incomplete block designs; two-factor factorial designs; general factorial designs

Homework #4 Due on 4/27/2020

Project Reports Due on 4/29/2020

References

[1] Horton, N.J. and Kleinman, K. Using R and RStudio for Data Management, Statistical Analysis, and Graphics. CRC Press, 2015.

[2] Freund, R.J., Wilson, W.J., and Mohr, D.L. Statistical Methods. 2010.

[3] Montgomery, D.C. Design and Analysis of Experiments. Ninth Edition. John Wiley & Sons, 2017.