KEY INFORMATION
Class Meetings
Friday 4:00 – 7:00 pm @ Teams
Course Instructor
Prof. Ioannis Pavlidis
Office hours: 3-4 pm on Fridays (@ Teams)
Course TA
Mohammed Emtiaz Ahmed
Office hours: 12-2 pm on Thursdays (@ Teams)
Course Description - COSC 6323
The course covers statistical methods in human and technology studies or experiments, from which the bulk of scientific and engineering data originate. The course starts by situating statistics in the context of data science. Special emphasis is placed on the relationship of statistics to machine learning. Then, instruction proceeds in a stepwise manner, building the student’s background in the statistical tools of the trade, without which an MS thesis or PhD dissertation cannot be complete. The course culminates with sessions on experimental design, one of the cornerstones of modern data science.
Although the introduction and methodological sections of scientific papers differ from discipline to discipline (e.g., algorithms vs. assays), the results sections of papers should conform to a universal pattern, according to currently accepted best practices. The data should be produced according to appropriate study/experimental designs and should be subjected to relevant statistical tests. There is no such thing as statistics for computer scientists or statistics for biologists; statistics is the same for everybody. However, certain disciplines tend to use some tools more than others, and instruction needs to be tailored to students’ educational backgrounds. In computer science in particular, the adoption of statistical analysis of experimental results has been slow. This has changed in the last few years: several computing disciplines have already adopted statistical methods as the analytic standard, while others are bound to follow sooner or later. Among the computer science communities at the forefront of this movement are the Human-Computer Interaction and Computer Vision communities. The Statistical Methods course aims to meet this need and is paced to suit the typical background of graduate students in computer science. It is very practical in its orientation (no proofs), emphasizing the understanding of concepts and the ability to choose the right design or apply the right test.
The first part of the course starts with the delineation between continuous and discrete variables and the enormous implications that this carries for the selection of tests. Then, it proceeds with the description of distributions, probabilities, and error types that are fundamental to the construction of the t-tests, ANOVA tests, and nonparametric tests. Next, the course visits regression in its various forms, completing the coverage of significance and association tests used in almost all scientific papers. Emphasis is placed on multiple regression and linear modeling, a powerful and elegant method for examining the effect of multiple factors in a research problem that is heavily used nowadays in MS and PhD research. The treatment of symbolic data and the tools of last resort, that is, nonparametric methods, completes the course’s first part.
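As a rough illustration of why the type of variable drives the choice of test, the sketch below computes Welch's t statistic for a continuous outcome and Pearson's chi-square statistic for a categorical one. It is not course material: Python's standard library is used purely for illustration (the course itself works in R), and the function names and all data values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples of a continuous outcome."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Continuous outcome (hypothetical task completion times in seconds).
times_a = [12.1, 11.4, 13.0, 12.7, 11.9]
times_b = [10.2, 10.9, 11.1, 10.5, 10.8]
t_stat = welch_t(times_a, times_b)

# Categorical outcome (hypothetical success/failure counts for two groups).
counts = [[30, 10], [20, 20]]
chi2_stat = chi_square(counts)
```

Only the test statistics are computed here. Turning them into p-values requires the t and chi-square distributions, which R provides out of the box.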
The course’s second part covers various experimental designs. Before students start analyzing data, they need to know the principles by which to collect these data in order to address their hypothesis; for this, they need to pick the right experimental design. Even perfect analysis will not save the day if the investigator picked the wrong experimental design (i.e., garbage in, garbage out). Hence, at the end of the course’s second part, students acquire a 30,000-foot view of the scientific and engineering process, solidifying their ability to design studies, collect data, and run tests.
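To make the idea of choosing a design before collecting data more concrete, here is a minimal sketch of a randomized complete block design, in which every treatment appears exactly once in each block, in random order. Python is again used only for illustration, and the function name, block labels, and treatment labels are hypothetical.

```python
import random


def randomized_complete_block(blocks, treatments, seed=0):
    """Assign every treatment once per block, in an independently shuffled order."""
    rng = random.Random(seed)  # fixed seed so the plan can be reproduced
    plan = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)
        plan[block] = order
    return plan


# Hypothetical example: three blocks (e.g., sessions) and three treatments.
plan = randomized_complete_block(["block1", "block2", "block3"], ["A", "B", "C"])
for block, order in plan.items():
    print(block, order)
```

Shuffling within each block keeps the treatment comparison balanced across blocks, while the randomization guards against systematic order effects.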
The course has four homework assignments to reinforce the understanding of the concepts and methods. In place of a final exam, the course has a semester-long project, where a problem is defined for the class, and then each group of students is required to come up with a study design, collect and quality-control data, and perform tests, putting everything in the form of a term paper. The homework assignments are individual, while the project is a group assignment; each project group typically consists of 2-3 students.
The students need to know R and RStudio in order to process and plot the data. R is becoming one of the most useful tools for computer scientists in the data analytics business. The instructors provide the students with online educational material and organize an R tutorial class. Importantly, the last hour of each three-hour class session is devoted to R programming, where the students code the theoretical principles covered earlier in the session.
Gradebook
10% Participation
4 x 12.5% Homework
40% Project
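The weights above add up to 100% (10% + 4 x 12.5% + 40%). As a small worked example of how they combine, the sketch below computes a weighted course score. It assumes every component is scored on a 0-100 scale, which the syllabus does not state; the function name and scores are hypothetical.

```python
WEIGHTS = {"participation": 0.10, "homework_each": 0.125, "project": 0.40}


def final_score(participation, homeworks, project):
    """Weighted course score, assuming every component is on a 0-100 scale."""
    if len(homeworks) != 4:
        raise ValueError("the gradebook lists exactly four homework assignments")
    homework_part = sum(WEIGHTS["homework_each"] * score for score in homeworks)
    return (WEIGHTS["participation"] * participation
            + homework_part
            + WEIGHTS["project"] * project)


# The component weights should cover the whole grade.
total_weight = WEIGHTS["participation"] + 4 * WEIGHTS["homework_each"] + WEIGHTS["project"]

# Hypothetical student: full participation, four homework scores, one project score.
example = final_score(100, [80, 90, 70, 100], 85)
```

For the hypothetical scores shown, the components contribute 10 + 42.5 + 34, for a total of 86.5.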
COURSE OUTLINE
Lesson 1: Data, Statistics, and Data Science 1/17/2020
Situating Statistics and Machine Learning in Data Science; observations and variables; types of measurements for variables; distributions; numerical descriptive statistics; exploratory data analysis; bivariate data; data collection
Lesson 2: Probabilities and Sampling Distributions 1/24/2020
Probability; discrete probability distributions; continuous probability distributions; sampling distributions
Homework #1 Out
Lesson 3: Principles of Inference 1/31/2020
Hypothesis testing; estimation; sample size; assumptions
Assignment of Projects
Lesson 4: Inferences on a Single Population 2/7/2020
Inferences on the population mean; inferences on a proportion; inferences on the variance of one population; assumptions
Homework #1 Due on 2/7/2020
Homework #2 Out on 2/7/2020
Lesson 5: Inferences for Two Populations 2/14/2020
Inferences on the difference between means using independent samples; inferences on variances; inferences on means for dependent samples; inferences on proportions; assumptions
Lesson 6: Inferences for Two or More Means 2/21/2020
Analysis of variance; linear model; assumptions; specific comparisons; random models; unequal sample sizes; analysis of means
Homework #2 Due on 2/21/2020
Homework #3 Out on 2/21/2020
Lesson 7: Linear Regression 3/6/2020
The regression model; estimation of parameters; inferences for regression; correlation; regression diagnostics
Lesson 8: Multiple Regression 3/27/2020
The multiple regression model; estimation of coefficients; inferential procedures; correlations; special models; multicollinearity; variable selection; detection of outliers
Lesson 9: Linear Models 4/3/2020
The dummy variable model; unbalanced data; models with dummy and interval variables; weighted least squares; correlated errors
Homework #3 Due on 4/3/2020
Homework #4 Out on 4/3/2020
Lesson 10: Categorical Data 4/10/2020
Hypothesis test for a multinomial population; goodness of fit; contingency tables; loglinear model
Lesson 11: Nonparametric Methods 4/17/2020
One sample; two independent samples; more than two samples; rank correlation; the bootstrap
Lesson 12: Experimental Designs 4/24/2020
Randomized designs; paired comparison designs; randomized complete block designs; Latin square designs; Greco-Latin square designs; balanced incomplete block designs; two-factor factorial designs; general factorial designs
Homework #4 Due on 4/27/2020
Project Reports Due on 4/29/2020
References
[1] Horton, N.J. and Kleinman, K. Using R and RStudio for Data Management, Statistical Analysis, and Graphics. CRC Press, 2015.
[2] Freund, R.J., Wilson, W.J., and Mohr, D.L. Statistical Methods. 2010.
[3] Montgomery, D.C. Design and Analysis of Experiments. Ninth Edition. John Wiley & Sons, 2017.