Class Meetings:

Friday 4:00 – 7:00 pm @ HBS – 315

Course Instructor

Prof. Ioannis Pavlidis (ipavlidis@uh.edu).
Office hours: Friday 12pm-1pm (@ HBS - 306)

Course Description - COSC 6397-2 (24652)

The course covers statistical methods in human and technology studies or experiments, from which the bulk of scientific and engineering data originate. The course starts by contrasting hypothesis-driven research supported by statistical inference with rigorous deduction based on first principles; this delineates the current from the past mode of science and engineering and motivates the subject. Then, it proceeds step by step, building the student’s background in the statistical tools of the trade, without which no decent MS thesis or PhD dissertation can be complete. The course culminates in the last week by connecting sound research methods to sound research attitudes, delving into the teachings of Stoic philosophy.

Although the introduction and methodological sections of papers differ from discipline to discipline (e.g., algorithms vs. assays), the results sections of papers should look similar, according to currently accepted best practices. The data should be produced according to appropriate study/experimental designs and should be subjected to relevant statistical tests. There is no such thing as statistics for computer scientists or statistics for biologists; statistics is the same for everybody. However, certain disciplines tend to use some tools more than others, and instruction needs to be tailored to differing educational backgrounds. In computer science in particular, the awakening to standard analysis of study/experimental results has been slow; most of this analysis used to be carried out heuristically. This has changed in the last few years, and several computer science disciplines have already adopted statistical methods as the standard in results analysis, while others are bound to follow sooner or later. Among the computer science communities at the forefront of this movement are the Human-Computer Interaction and Computer Vision communities. The Statistical Methods course aims to cover this need and is paced to match the typical background of graduate students in computer science. It is very practical in its orientation (no proofs), emphasizing the understanding of concepts and the ability to apply the right design or test to the right problem.

The main part of the course starts with the distinction between continuous and discrete variables and the enormous implications this carries for the selection of tests. Then, it proceeds with the description of distributions, probabilities, and error types that are fundamental to the construction of the t-tests, ANOVA tests, and non-parametric tests. In the second stage, the course visits regression in its various forms, completing the coverage of significance and association tests used in almost all scientific papers. Emphasis is placed on multiple regression and linear modeling, a powerful and elegant method to examine the effect of multiple factors in a research problem; it is heavily used nowadays in MS and PhD research. The treatment of data collection comes next, although one would expect it earlier. The reason for this delay is the emphasis on quality control methods in field data collection, which requires knowledge of statistical testing and association. In the last stage of the course’s main part, we visit the various experimental designs, including newer approaches, the so-called Mixed Methods. Before starting to analyze data, one needs to know according to which principle to collect these data in order to address her/his hypothesis; for this, s/he needs to pick the right design. Even impeccable testing will not save the day if the researcher picked the wrong study or experimental design (garbage in, garbage out). Hence, towards the end of the course the student acquires a 30,000-foot view of the scientific and engineering process, solidifying her/his ability to design, collect, and test.

The emergence of the statistical design of studies/experiments and the statistical analysis of results coincides with the spread of interdisciplinary research projects. These projects involve many people from many disciplines and last for a long time. An example of such a booming interdisciplinary field within Computer Science is Wireless Health, which involves computer scientists, medical doctors, and social scientists. Because heuristic analysis of results, under which iffy outcomes may be claimed as improvements, is no longer an option, it is entirely possible for a team to find after several years and millions of dollars that it has wasted its time and resources. This motivates the closing session on Stoic philosophy for this course.

The course has three homework assignments to reinforce the understanding of the three main segments of the course and a short one-page essay to cover the culminating lecture on research attitudes. In place of an exam, the course has a semester-long project, where a problem is defined for the class, and then each student is required to come up with a study design, collect and quality-control data, and perform tests, putting everything in the form of a term paper. In 2015 the theme of the course was career quantification of computer-science professors and their reflection on departments, based on openly available publication and funding data. The question was whether objective performance data were in step with the departmental ranking reported by U.S. News & World Report. Project themes may change from year to year to keep things interesting.

The students need to know R in order to process and plot the data. R is becoming one of the most useful tools for computer scientists in the data analytics business. We will provide the students with online educational material and organize an R tutorial class. Ideally, the students also need to know JavaScript to scrape data from open sources when an API is not available. This is not required, though; if students are unable to carry out this step, we will provide them with the raw data, and they will then just need to perform data cleaning and analysis.

Gradebook

10% Participation

3 x 15% Homework

5% Essay

40% Project

Course Outline

Lesson 1: Formulating Research Problems & Constructing Hypotheses

Observations in science; induction, deduction, and hypothesis; absolute and comparative experiments; drawing inference

Lesson 2: Identifying Variables & Performing Exploratory Data Analysis

Concepts vs. variables; types of variables (categorical vs. interval); location, dispersion, and other measures; histograms and boxplots

Lesson 3: Sampling Distributions & Principles of Inference

Assignment of Projects, Homework #1 Out

Discrete (Uniform, Binomial, Poisson) vs. continuous (Normal) distributions; sampling distributions (χ², t, and F); α and β; Type I and Type II errors; p values; power
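As an illustration of what α means in practice, the following minimal sketch simulates many experiments in which the null hypothesis is true and counts how often a two-sided z-test rejects it anyway. It is written in Python with only the standard library purely for illustration (the course itself uses R); the function names, sample size, number of trials, and seed are all assumptions made for this example.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: population mean == mu0, with known sigma."""
    standard_error = sigma / len(sample) ** 0.5
    z = (mean(sample) - mu0) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def type_i_error_rate(alpha=0.05, n=30, trials=5000, seed=1):
    """Fraction of simulated experiments that reject a TRUE null hypothesis."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Data are drawn under H0: mean 0, standard deviation 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p_value(sample, mu0=0.0, sigma=1.0) < alpha:
            rejections += 1
    return rejections / trials


rate = type_i_error_rate()
print(f"Empirical Type I error rate: {rate:.3f}")
```

The printed rate should land close to the chosen α, since rejecting a true null hypothesis is exactly a Type I error.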

Lesson 4: Inferences on Single Population & Two Populations

Hypothesis tests on μ, p, and σ²; assumptions and violations (tests of normality); pooled t test; paired t test
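The difference between the pooled and the paired t test can be sketched as follows. This is a minimal Python illustration using only the standard library (the course itself uses R); the function names are hypothetical, and only the test statistic and degrees of freedom are computed, not the p-value.

```python
from math import sqrt
from statistics import mean, stdev, variance


def pooled_t(a, b):
    """Pooled two-sample t statistic and degrees of freedom.

    Assumes two independent samples with equal population variances.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


def paired_t(before, after):
    """Paired t statistic and degrees of freedom.

    Assumes the i-th values of both samples come from the same subject.
    """
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    diffs = [y - x for x, y in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


print(pooled_t([1, 2, 3], [2, 3, 4]))
print(paired_t([1, 2, 3], [2, 4, 3]))
```

The pooled test compares two independent groups, whereas the paired test reduces the data to within-subject differences and tests whether their mean is zero.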

Lesson 5: Inferences on Two or More Populations

Analysis of variance (ANOVA); assumptions and violations; analyzing non-normal data; corrections for multiple comparisons
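Two common corrections for multiple comparisons, Bonferroni and Holm, can be sketched as below. The syllabus does not say which corrections the lesson covers, so these two are chosen only as examples. The code is Python with the standard library, for illustration (the course itself uses R), and the function names and p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: an adjusted p-value never drops below
        # the adjusted value of a smaller raw p-value.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # made-up p-values from three comparisons
print([round(p, 4) for p in bonferroni(raw)])
print([round(p, 4) for p in holm(raw)])
```

Holm's adjusted p-values are never larger than Bonferroni's, so it rejects at least as many hypotheses while still controlling the family-wise error rate.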

Lesson 6: Non-Parametric Statistics

Homework #1 Due, Homework #2 Out

Descriptive statistics; Cohen’s kappa; Kolmogorov–Smirnov, Mann–Whitney, and Wilcoxon tests

Lesson 7: Regression

Types of regression analysis (maximum likelihood, least squares, logistic, Poisson); non-linear least squares regression

Lesson 8: Multiple Regression and Other General Linear Models

Multiple regression and linear models; examining the effect of multiple factors in a research problem with a single, simple R model formula

Lesson 9: Data Collection & Quality Control

Homework #2 Due, Homework #3 Out

Data collection in qualitative and quantitative research; quality control; sample size; random, non-random, and mixed sampling.

Lesson 10: Experimental Design I

Error control designs; treatment designs; combination designs; sampling designs

Lesson 11: Experimental Design II

Completely randomized designs; randomized block designs; Latin square designs; factorial designs
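Two of the designs named above can be sketched in a few lines. This is a Python illustration using only the standard library (the course itself uses R); the function names, treatment labels, and block labels are hypothetical, and the Latin square is built by shuffling the rows and columns of a cyclic square, which is one simple construction rather than a uniform draw over all Latin squares.

```python
import random


def randomized_block_design(treatments, blocks, seed=None):
    """Randomized complete block design.

    Every block receives every treatment once, in an independent random order.
    """
    rng = random.Random(seed)
    return {block: rng.sample(treatments, len(treatments)) for block in blocks}


def latin_square(treatments, seed=None):
    """Latin square: each treatment appears once per row and once per column."""
    rng = random.Random(seed)
    n = len(treatments)
    # Start from the cyclic square, then permute its rows and columns;
    # both permutations preserve the Latin property.
    square = [[treatments[(r + c) % n] for c in range(n)] for r in range(n)]
    rng.shuffle(square)
    columns = list(range(n))
    rng.shuffle(columns)
    return [[row[c] for c in columns] for row in square]


for row in latin_square(["A", "B", "C", "D"], seed=0):
    print(" ".join(row))
print(randomized_block_design(["A", "B", "C"], ["block1", "block2"], seed=0))
```

In the Latin square, rows and columns stand for two nuisance factors (for example, subject and session order), so that each treatment is balanced against both.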

Lesson 12: Experimental Design III

Homework #3 Due

Response surface designs; split-plot designs; designs with repeated measures

Lesson 13: Mixed Methods

New methods for interdisciplinary research (qualitative + quantitative); what and when; mixed designs (convergent parallel design; explanatory sequential design; embedded design)

Lesson 14: Stoic Philosophy

Emotion vs. rationality in dealing with success and adversity in longitudinal projects. Historical paradigms to follow or avoid

Lesson 15: Project Presentations

Essays and Project Reports Due

References

[1] Boddy, R. and Smith, G. Statistical Methods in Practice for Scientists and Technologists. Wiley, 2009.

[2] Freund, R. J., W. J. Wilson, and D. L. Mohr. Statistical Methods. 2010.

[3] Hinkelmann, Klaus, ed. Design and Analysis of Experiments, Special Designs and Applications. Vol. 3. John Wiley & Sons, 2011.

[4] Friedman, Lawrence M., Curt Furberg, and David L. DeMets. Fundamentals of Clinical Trials. 4th ed. New York: Springer, 2010.
