Prof. Ioannis Pavlidis (ipavlidis@uh.edu).
**Office hours:** Friday 12pm-1pm (HBSC 306)

Dinesh Majeti (dmajeti@uh.edu).
**Office hours:** Tuesday, Thursday 11am-12pm (HBSC 302)

The course covers statistical methods in human and technology studies and experiments. It starts by contrasting hypothesis-driven research supported by statistical inference with rigorous deduction based on first principles, in order to delineate the current mode of science from the past one and motivate the subject. It then proceeds in a step-wise manner, building the student’s background in the statistical tools of the trade. The course culminates in its last two weeks with a connection between sound research methods and sound researcher attitudes, delving into the teachings of Stoic philosophy.

Although the introduction and methodological sections of papers differ from discipline to discipline (e.g., algorithms vs. assays), the results sections of papers should look similar, according to currently accepted best practices. The data produced should be derived from appropriate study/experimental designs and should be subjected to the relevant statistical tests. There is no such thing as statistics for computer scientists or statistics for biologists; statistics is the same for everybody. However, certain disciplines tend to use some tools more than others, and instruction needs to be tailored to the students’ differing educational backgrounds. In computer science in particular, the awakening to standard analysis of study/experimental results has been slow; most of this analysis used to be carried out heuristically. This has changed in the last few years: several computer science disciplines have already adopted statistical methods as the standard for results analysis, while others are bound to follow sooner or later. Among the computer science communities at the forefront of this movement are the Human-Computer Interaction and Computer Vision communities. The Statistical Methods course aims to meet this need and is paced with the typical background of graduate students in computer science in mind. It is very practical in its orientation (no proofs), emphasizing the understanding of concepts and the ability to apply the right design or test to the right problem.

The main part of the course starts with the delineation between continuous and discrete variables and the enormous implications this carries for the selection of tests. It then proceeds with the description of the distributions, probabilities, and error types that are fundamental to the construction of t-tests, ANOVA tests, and non-parametric tests. In the second stage, the course visits regression in its various forms, completing the coverage of the significance and association tests used in almost all scientific papers. The treatment of data collection comes next, although one might expect it earlier; the reason for this delay is the emphasis on quality-control methods in field data collection, which require knowledge of statistical testing and association. In the last stage of the course’s main part, we visit the various experimental designs, including newer approaches, the so-called Mixed Methods. Before starting to analyze data, one needs to know according to which principle to collect those data in order to address one’s hypothesis; for this, s/he needs to pick the right design. Even impeccable testing will not save the day if the researcher picked the wrong study or experimental design (garbage in – garbage out). Hence, toward the end of the course, the student acquires a 30,000-foot view of the scientific process, solidifying her/his ability to design, collect, and test.

The emergence of the statistical design of studies/experiments and the statistical analysis of results coincides with the spread of interdisciplinary research projects. These projects involve many people from many disciplines and last a long time. An example of such a booming discipline in Computer Science is Wireless Health, which brings together computer scientists, medical doctors, and social scientists. Because heuristic analysis of results, where iffy outcomes may be claimed as improvements, is no longer an option, it is entirely possible for such a team to find after several years and millions of dollars that they have wasted their time and resources. This motivates the course’s closing sessions on Stoic philosophy.

The course has three homework assignments to reinforce the understanding of the three main segments of the course, and a short one-page essay to cover the culminating lecture on research attitudes. In place of an exam, the course has a semester-long project: a problem is defined for the class, and each student is required to come up with a study design, collect and quality-control data, and perform tests, putting everything in the form of a term paper. In 2014, the theme of the course was the career quantification of computer science professors and its reflection on departments, based on openly available publication and funding data. The question was whether objective performance data were in step with the departmental rankings reported by U.S. News & World Report. Project themes may change from year to year to keep things interesting.

The students need to know R in order to process the data and gnuplot in order to plot them professionally. If students do not know these tools upon course enrollment, it is easy to learn them within a week of self-study. We will provide students with online educational material and organize an extra tutorial class if necessary. Ideally, students should also know JavaScript, to scrape data from open sources when an API is not available. This is not necessary, though; if students are unable to carry out this step, we will provide them with the raw data, and they will just need to perform data cleaning and analysis.

10% Participation

3 × 15% Homework

5% Essay

40% Project

Observations in science; induction, deduction, and hypothesis; absolute and comparative experiments; drawing inference

Concept vs. variables; types of variables (categorical vs. interval); location, dispersion, and other measures; histograms and boxplots
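The course uses R for analysis; purely as an illustration of the location and dispersion measures named above, here is a minimal Python sketch (standard library only, hypothetical data):

```python
import statistics

# Interval (continuous) variable: reaction times in seconds (hypothetical data)
times = [1.2, 1.5, 1.1, 2.8, 1.4, 1.3, 1.6, 1.2]

mean = statistics.mean(times)        # location: arithmetic mean
median = statistics.median(times)    # location: robust to the 2.8 outlier
sd = statistics.stdev(times)         # dispersion: sample standard deviation
q1, q2, q3 = statistics.quantiles(times, n=4)  # quartiles, as drawn in a boxplot

print(f"mean={mean:.3f} median={median:.3f} sd={sd:.3f} IQR={q3 - q1:.3f}")
```

Note how the single outlier pulls the mean above the median, which is exactly what a boxplot makes visible at a glance.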

Discrete (Uniform, Binomial, Poisson) vs. continuous (Normal) distributions; sampling distributions (χ², t, and F); α and β; Type I and Type II errors; p values; power
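A Type I error rate can be checked empirically: a hedged, standard-library Python sketch (not course material, and the seed, sample size, and trial count are arbitrary choices) that simulates a two-sided z-test under a true null hypothesis and confirms the false-rejection rate sits near α = 0.05:

```python
import random

# Monte Carlo check: a two-sided z-test at alpha = 0.05 should commit a
# Type I error (rejecting a true H0) about 5% of the time.
random.seed(42)
n, trials, z_crit = 30, 10_000, 1.96   # 1.96 = two-sided 5% cutoff of N(0, 1)
rejections = 0
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]   # H0 is true: mu = 0
    z = (sum(sample) / n) / (1.0 / n ** 0.5)              # known sigma = 1
    if abs(z) > z_crit:
        rejections += 1                                   # a Type I error
rate = rejections / trials
print(f"empirical Type I error rate: {rate:.3f}")  # close to 0.05
```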

Hypothesis tests on μ, p, and σ²; assumptions and violations (tests of normality); pooled t test; paired t test
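The pooled t statistic can be computed from first principles. A minimal standard-library Python sketch on hypothetical task-completion times (in the course itself, R's `t.test` would do this in one call):

```python
import math
import statistics

# Two independent samples (hypothetical task-completion times, in seconds)
a = [12.1, 11.4, 13.0, 12.7, 11.9, 12.4]
b = [13.5, 13.1, 14.2, 12.9, 13.8, 13.6]

na, nb = len(a), len(b)
# Pooling assumes the two groups share a common variance
sp2 = ((na - 1) * statistics.variance(a)
       + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
df = na + nb - 2
print(f"t = {t:.3f} on {df} degrees of freedom")
```

The pooled-variance assumption is exactly the kind of assumption whose violation (checked, e.g., with a normality or equal-variance test) sends the analyst toward Welch's correction or a non-parametric alternative.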

Analysis of variance (ANOVA); assumptions and violations; analyzing non-normal data; corrections for multiple comparisons
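The one-way ANOVA F statistic is just a ratio of two variance estimates; a standard-library Python sketch on hypothetical scores makes the between/within decomposition explicit:

```python
import statistics

# One-way ANOVA from first principles for three groups (hypothetical scores)
groups = [
    [4.1, 4.5, 4.3, 4.7],
    [5.0, 5.4, 5.1, 5.3],
    [4.4, 4.6, 4.2, 4.8],
]
k = len(groups)                          # number of groups
n = sum(len(g) for g in groups)          # total observations
grand = sum(sum(g) for g in groups) / n  # grand mean

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
f = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f"F = {f:.2f} on ({k - 1}, {n - k}) degrees of freedom")
```

A large F says the group means spread out far more than the noise within groups would explain; the subsequent pairwise comparisons are where the multiple-comparison corrections come in.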

**Homework #1 Due, Homework #2 Out**

Descriptive statistics; Cohen’s kappa; Kolmogorov–Smirnov, Mann–Whitney, and Wilcoxon tests
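The Mann–Whitney U statistic has a very direct definition that a few lines of standard-library Python can show (hypothetical ordinal ratings; R's `wilcox.test` is the tool actually used in the course):

```python
# Mann-Whitney U computed from its definition: over all cross-group pairs,
# count how often an 'a' value beats a 'b' value (ties count one half).
a = [3, 5, 4, 6, 5]   # hypothetical ordinal ratings, group A
b = [2, 3, 1, 4, 2]   # hypothetical ordinal ratings, group B

u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
print(f"U = {u} out of {len(a) * len(b)} pairs")
```

Because U depends only on order, not on distances between values, the test makes no normality assumption, which is precisely why it serves as the non-parametric fallback for the t test.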

Types of regression analysis (maximum likelihood, least squares, logistic, Poisson); non-linear least squares regression
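For simple least squares, the closed-form solution is short enough to write out. A standard-library Python sketch on hypothetical x/y data (in R, `lm(y ~ x)` gives the same coefficients):

```python
import statistics

# Least-squares line y = b0 + b1*x via the closed-form normal-equation
# solution (hypothetical data)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

xbar, ybar = statistics.mean(x), statistics.mean(y)
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))   # slope
b0 = ybar - b1 * xbar                        # intercept
print(f"y = {b0:.3f} + {b1:.3f} x")
```

Logistic and Poisson regression have no such closed form; their maximum-likelihood fits are found iteratively, which is one reason the course treats the regression family as a unit.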

**Homework #2 Due, Homework #3 Out**

Data collection in qualitative and quantitative research; quality control; sample size; random, non-random, and mixed sampling
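Simple random sampling, the baseline against which the other sampling schemes are judged, is a one-liner; a hedged standard-library Python sketch (the frame of 100 IDs and the sample size are arbitrary illustrations):

```python
import random

# Simple random sampling: draw n subjects from a sampling frame
# without replacement (hypothetical frame of 100 subject IDs)
random.seed(1)
frame = list(range(1, 101))
n = 10
sample = random.sample(frame, n)   # each size-n subset equally likely
print(sample)
```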

Error control designs; treatment designs; combination designs; sampling designs

Completely randomized designs; randomized block designs; Latin square designs; factorial designs
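The defining operation of a completely randomized design, assigning subjects to treatments purely by chance, can be sketched in a few lines of standard-library Python (subject IDs, treatments, and seed are hypothetical):

```python
import random

# Completely randomized design: assign 12 subjects to 3 treatments
# at random, with equal group sizes (hypothetical subjects/treatments)
random.seed(7)
subjects = [f"S{i:02d}" for i in range(1, 13)]
treatments = ["A", "B", "C"]

shuffled = subjects[:]
random.shuffle(shuffled)           # randomization is the only structure
per_group = len(subjects) // len(treatments)
assignment = {t: shuffled[i * per_group:(i + 1) * per_group]
              for i, t in enumerate(treatments)}
for t, group in assignment.items():
    print(t, group)
```

A randomized block or Latin square design would constrain this shuffle, randomizing within blocks so that known nuisance factors are balanced rather than left to chance.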

**Homework #3 Due**

Response surface designs; split-plot designs; designs with repeated measures

New methods for interdisciplinary research (qualitative + quantitative); what and when; mixed designs (convergent parallel design; explanatory sequential design; embedded design)

Emotion vs. rationality in dealing with success and adversity in longitudinal projects

Historical paradigms to follow or avoid

**Essays and Project Reports Due**

[1] Boddy, R. and Smith, G. Statistical Methods in Practice for Scientists and Technologists. Wiley, 2009.

[2] Freund, R. J., W. J. Wilson, and D. L. Mohr. Statistical Methods. 3rd ed. Academic Press, 2010.

[3] Hinkelmann, Klaus, ed. Design and Analysis of Experiments, Special Designs and Applications. Vol. 3. John Wiley & Sons, 2011.

[4] Friedman, Lawrence M., Curt Furberg, and David L. DeMets. Fundamentals of Clinical Trials. 4th ed. New York: Springer, 2010.