**Friday 4:00 – 7:00 pm @ HBS – 315**

**Prof. Ioannis Pavlidis (ipavlidis@uh.edu). **
**Office hours:** Friday 3-4 pm (@ HBS - 306)

**George Panagopoulos (gpanagopoulos@uh.edu).**
**Office hours:** Tuesday, Thursday 1pm-2pm (@ HBS - 303)

The course covers statistical methods in human and technology studies or experiments, from where the bulk of scientific and engineering data originate. The course starts with a contrast of hypothesis-driven research supported by statistical inference versus rigorous deduction based on first principles; this is in order to delineate the current from the past mode of science and engineering, motivating the subject. Then, it proceeds in a step-wise manner building the student’s background in the statistical tools of the trade, without which an MS thesis or PhD dissertation cannot be complete. The course culminates with a series of sessions on experimental design, which is the cornerstone of a successful research project or industrial product.

Although the introduction and methodological sections of scientific papers differ from discipline to discipline (e.g., algorithms vs. assays), the results sections of papers should conform to a universal pattern, according to currently accepted best practices. The produced data should be derived according to appropriate study/experimental designs and should be subjected to relevant statistical tests. There is no such thing as statistics for computer scientists or statistics for biologists; statistics is the same for everybody. However, certain disciplines tend to use some tools more than others, and instruction needs to be tailored according to the differing educational backgrounds. In computer science in particular, awakening to standard analysis of study/experimental results has been slow; most of this analysis used to be carried out heuristically. This has changed the last few years and several computer science disciplines have already adopted statistical methods as the standard in results analysis, while others are bound to follow sooner or later. Among the computer science communities that are at the forefront of this movement are the Human-Computer Interaction and Computer Vision communities. The Statistical Methods course aims to cover this need and is paced taking into account the typical background of graduate students in computer science. It is very practical in its orientation (no proofs), emphasizing the understanding of concepts and the ability to apply the right design or test to the right problem.

The main part of the course starts with the delineation between continuous and discrete variables and the enormous implication that this carries for the selection of tests. Then, it proceeds with the description of distributions, probabilities and error types that are fundamental to the construction of the t-tests, ANOVA tests, and non-parametric tests. In the second stage, the course visits regression in its various forms, completing the coverage of significance and association tests used in almost all scientific papers. Emphasis is placed on multiple regression and linear modeling – a powerful and elegant method to examine the effect of multiple factors in a research problem; it is heavily used nowadays in MS and PhD research. The treatment of symbolic data and the tools of last resort, that is, non-parametric methods complete the course’s first part. In the course’s second part, we visit the various experimental designs, including new methods, the so-called Mixed Methods. Before start analyzing data, one needs to know according to which principle to collect these data in order to address her/his hypothesis; for this, s/he needs to pick the right design. Even an impeccable testing will not save the day if the researcher picked the wrong study or experimental design (garbage in – garbage out). Hence, the student acquires towards the end of the course a 30,000 feet view of the scientific and engineering process, solidifying her/his ability to design, collect, and test.

The course has five homework assignments to reinforce the understanding of the concepts and methods. In the place of a final exam, the course has a semester long-project, where a problem is defined for the class, and then each group of students is required to come up with a study design, collect/quality control data, and perform tests, putting everything in the form of a term paper.

The students need to know R in order to process and plot the data. R is becoming one of the most useful tools for computer scientists in the data analytics business. We provide the students with online educational material and organize an R tutorial class.

The homeworks are individual assignments while the project is a group assignment; each project group typically consists of 2-3 students.

10% Participation

5 x 10 % Homework

40% Project

Introduction; observations and variables; types of measurements for variables; distributions; numerical descriptive statistics; exploratory data analysis; bivariate data; data collection

Probability; discrete probability distributions; continuous probability distributions; sampling distributions

**Homework #1 Out**

Hypothesis testing; estimation; sample size; assumptions

**Assignment of Projects**

Inferences on the population mean; inferences on a proportion; inferences on the variance of one population; assumptions

**Homework #1 Due**

**Homework #2 Out**

Inferences on the difference between means using independent samples; inferences on variances; inferences on means for dependent samples; inferences on proportions; assumptions

Analysis of variance; linear model; assumptions; specific comparisons; random models; unequal sample sizes; analysis of means

**Homework #2 Due**

**Homework #3 Out**

The regression model; estimation of parameters; inferences for regression; correlation; regression diagnostics

The multiple regression model; estimation of coefficients; inferential procedures; correlations; special models; multicollinearity; variable selection; detection of outliers

The dummy variable model; unbalanced data; models with dummy and interval variables; weighted least squares; correlated errors

**Homework #3 Due**

**Homework #4 Out**

Hypothesis test for a multinomial population; goodness of fit; contingency tables; loglinear model

One sample; two independent samples; more than two samples; rank correlation; the bootstrap

**Homework #4 Due**

**Homework #5 Out**

Randomized designs; paired comparison designs; fixed effects model; random effects model

Randomized complete block design; Latin square design; Greco-Latin square design; balanced incomplete block designs

Two-factor factorial design; general factorial design; fitting response curves and surfaces; blocking in factorial design

**Homework #5 Due 4/30/2018**

**Project Reports Due **

[1] Boddy, R. and Smith, G. Statistical Methods in Practice for Scientists and Technologists. Wiley, 2009.

[2] Freund, R. J., W. J. Wilson, and D. L. Mohr. Statistical Methods. 2010.

[3] Hinkelmann, Klaus, ed. Design and Analysis of Experiments, Special Designs and Applications. Vol. 3. John Wiley & Sons, 2011.

[4] Friedman, Lawrence M., Curt Furberg, and David L. DeMets. Fundamentals of Clinical Trials. Vol. 4. New York: Springer, 2010.