From the Diaries of a Principal Investigator

My name is Ioannis Pavlidis and I run the Computational Physiology Laboratory at the University of Houston. My path in many respects represents the changing face of research in America over the last 20 years: an immigrant who studied in a public university and started his career in corporate research, while it existed, with emphasis on defense applications; then I moved back to a public university and transitioned to medical research with success. My narration starts from a point that is neither the beginning nor the end of my story.

In the late 90's I had the idea of refocusing attention from the body to the face as the most appropriate place to detect sympathetic responses. Furthermore, I suggested the use of thermal imaging to measure these responses - a radical departure from the contact probes and wires in practice at that time. In collaboration with Dr. James Levine, a research endocrinologist from Mayo Clinic, we demonstrated that sympathetic signals do manifest on the face and they have a thermophysiological signature that can be captured via thermal imaging.

In startle experiments we ran, subjects exhibited instantaneous periorbital warming, likely due to blood flow redistribution towards the orbital muscle. The latter was in apparent support of rapid eye movement and fit very well in the context of the fight or flight syndrome. The result appeared as a clinical picture in The Lancet, a clear indication of its importance.

It was a qualitative result: we did not have a method to quantify it, and it was associated with a single physiological sign. Yet it was landmark work because it opened the way for unobtrusive monitoring of stress in field conditions. Stress, which is defined as a sustained succession of sympathetic responses, could be measured only obtrusively at that time, limiting experimental freedom and interfering with the observed phenomena.

There was a need to radically reform lie detection methods that were based on sympathetic measurements. We thought to apply this new method there, at the instigation of the DOD Polygraph Institute, which designed and executed a mock crime experiment to provide us with the necessary data. There was a lot of excitement and I remember the DOD Polygraph people hanging a large thermal picture of a subject in their hallway, almost as soon as the experiments were completed.

When I returned to Honeywell Labs, where I was working as a corporate researcher, reality set in. We did not have any computational method to extract the periorbital signals from the thermal clips, we had a couple of hours of recordings, which was a huge data set for the time, and we had no budget. I came up with a rudimentary image-processing algorithm and I asked the software engineer in my group to implement it on the side, as a favor. Then, I had to validate the results against ground truth that I produced for a portion of the data set. It was the most labor-intensive task that I have ever done in my life.

After several months, by which time everybody had forgotten about the project, I was able to extract signals that I had some faith in. They represented the evolution of periorbital warming during the subjects' interviews. The question then was how one could tell which signals indicated deceptive and which ones non-deceptive behavior. The answer was not obvious until I thought to concentrate on the most pointed Q&A in the interview and use it as stimulus, splitting the periorbital signal into before and after segments. Two clusters naturally formed and I released the results to the DOD Polygraph Institute, which reported back that the method's predictive power was about 80% - a significant achievement.
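
The before/after split can be illustrated with a small sketch. It is only an assumption about how such a split might be scored, not the procedure used in the original study: the feature (difference of segment means), the widest-gap rule for forming two groups, and the function names `before_after_shift` and `split_two_clusters` are all hypothetical, and the numbers are invented.

```python
from statistics import mean


def before_after_shift(signal, stimulus_index):
    """Mean of the signal after the stimulus minus its mean before it.

    Hypothetical feature: the source only says the signal was split
    into before and after segments around the most pointed question.
    """
    before = signal[:stimulus_index]
    after = signal[stimulus_index:]
    if not before or not after:
        raise ValueError("stimulus_index must leave samples on both sides")
    return mean(after) - mean(before)


def split_two_clusters(values):
    """Label 1-D values as low (False) or high (True).

    The cut is placed in the middle of the widest gap between
    neighbouring sorted values. This is an illustrative stand-in for
    whatever grouping emerged in the real data.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to form two groups")
    ordered = sorted(values)
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, cut = max(gaps)
    threshold = (ordered[cut] + ordered[cut + 1]) / 2
    return [v > threshold for v in values]


# Invented periorbital temperature traces (degrees C), stimulus at sample 3.
traces = [
    [34.0, 34.1, 34.0, 34.1, 34.0, 34.1],  # little change
    [34.0, 34.0, 34.1, 34.6, 34.7, 34.8],  # marked warming
]
shifts = [before_after_shift(t, 3) for t in traces]
labels = split_two_clusters(shifts)
```

With more subjects, the same two steps would yield one shift per interview and a two-way grouping of those shifts; which group corresponds to deceptive behavior would still have to come from ground truth.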

Based on this outcome, James composed a short paper and submitted it to Nature. It passed through the editorial gateway, the importance of which I did not appreciate at the time, and it was sent for review. It came back as a rejection. The main point of the reviewer was that the data set was small, although he was complimentary about the concept. There were no resources to expand the experimental set. We simply re-wrote the manuscript, backtracking on strong claims and emphasizing some under-appreciated aspects of the results. The revised manuscript was sent back along with a convincing rebuttal letter. This time it was accepted.

When the paper appeared in Nature in January 2002, its impact was thunderous. In part this was due to the political conditions at the time - 9/11 was fresh on everybody's mind and there was thirst for new concepts and technologies that would keep us safe from the terrorist threat. This new method clearly had such potential. Every major news organization around the world echoed the story not for days or weeks, but for months.

At the end of this process I was somewhat famous, but exhausted and broke. Pursuing deep science was not favored anymore in corporate America. I was on a lonely path for a couple of years leading to the Nature 2002 paper, and in the process I lost all my funding and corporate connections. I decided to go back to academia and the fall of the same year found me at the University of Houston.

I arrived late, the semester had already started, and I was excused from teaching till January. I had a lot of time on my hands after many years. I was sitting at my desk, thinking what to do next - my thoughts were interrupted only by the occasional noise of a toilet flushing (my university office was next to the bathroom). I realized what we discovered was probably the tip of the iceberg. The human face is the most heavily innervated part of the body. Not only that, but the most important sensory organs are there, too. There had to be more facial signs than the periorbital sign and there had to be more applications than lie detection. The only problem was to keep the attention of the subject in one direction, so that facial imaging is not interrupted. I was staring at my computer screen and it suddenly dawned on me that this is what I was doing all day long. Computer users then, this is it. They always look towards the screen and if we mount a thermal imaging sensor behind the screen - bingo!

I sent a proposal to the National Science Foundation (NSF) casting the problem as a quest for developing new methods to quantify computer user emotions via physiological signs. The applications ranged from software usability studies to preventive health care. The proposal was awarded and for the next several years the Computational Physiology Laboratory that I established was devoted to discovering new sympathetic signs on the face and computational methods to quantify these signs via thermal imaging.

The supraorbital sign associated with mental engagement, the breathing sign on the nostrils, and the cardiovascular sign on the temporal region were among the first to be developed. These signs were quantified with computational methods that featured tracking to cancel facial motion effects. Real-time demos were developed alongside the methods, which enabled the broadest possible outreach, such as at Nextfest 2004. New applications were also opening up, such as sleep studies, with our seminal publication in the journal Sleep.

Circa 2007, we had cracked many of the face's sympathetic secrets, and we had developed good measurement methods usable in realistic experimental conditions. There was a lot of potential in all these but we had never run a longitudinal stress study on a critical and representative application. Our experimental studies were mostly startle studies or Stroop studies or mental arithmetic studies or some other lab variety. Lie detection and sleep studies were the closest to field studies that we had ever done, and although critical, they were not representative. Hence, we were in search of the big one - a stress study on a ubiquitous application using the advantages of the new measurement methods.

I was brought in contact with Dr. Barbara Bass, the new head of the Surgery Department at Methodist Hospital, and an authority in surgical education. She was setting up a research center on surgical training and she was eager to develop new competency metrics based on physiological responses. She immediately saw the potential of our methods, which she convincingly explained to me. Let us say it was love at first sight - I got it and the next day I started writing a proposal to NSF.

It was November and the deadlines for the large and medium awards had passed. Only the deadline for the small awards was still standing. Even if we won, the money would not have been enough, but it was a risk worth taking. By this time, I was working mainly in coffee shops around the Medical Center of Houston, communicating with my lab via Skype. It was the best way to concentrate on conceptual development and data analysis and resist the 'research atherosclerosis' the academic system was promoting in tenured professors.

We won the NSF grant, but we had no idea what we were getting into. The magnitude of challenges in a longitudinal field study of stress, where new methods were brought to bear, was a sobering one. We realized this very early. What saved us was not anticipation of every eventuality but fast correction of errors.

I have to say that the core team in this research effort was first class, not because it was ready from the start, but because it naturally grew into its role. Barbara, who not only provided the test bed, but also waited patiently for three years to see results, while funding the deficit we were creating in the NSF account. Then, Dr. Panagiotis Tsiamyrtzis, my statistician at Athens University, who performed all the statistical analysis that made discovery possible. Dr. Dvijesh Shastri, my Ph.D. student from the early years who decided to stay on as a postdoctoral fellow, delivering the most refined and well validated sympathetic measurement yet, fit for the rigors of such a demanding field study. Dr. Peggy Lindner, who joined the lab as the study was beginning and kept the data collection and quality control process on its feet - a Herculean achievement. And Avinash Wesley, who annotated facial expressions in about 1,000 video clips without ever producing a distressed facial expression of his own. What I will describe next is a real research story that shows how ordinary people can achieve extraordinary things when they are inspired and led well.

We decided early on for one member of the lab to set up camp inside Methodist Hospital's Surgery Department. We had been burnt many times in the past, especially in lie detection experiments, where we left psychologists to collect data with disastrous results. We were seeing our data collection person every Friday, when he was reporting in the lab meeting what transpired during the week in the data collection grounds. It was the day that he was physically bringing to the lab a portable hard disk full of data that was copied to our server. We had people who were immediately looking into the quality of the data and they were recording problems on a data table that was kept current on the project's intranet site.

Doing regular and near-real-time checks helped us to pinpoint issues early on and take corrective actions before it was too late. There were several problems in the beginning, including technical problems. In each training session, we had continuous and simultaneous thermal and visual recordings that lasted more than an hour at a time. This was a heavy-duty process and the computer system was operating at its limit, resulting occasionally in marred recordings that had to be thrown away. Peggy managed to fix these problems by developing smart engineering solutions.

We also had problems recruiting surgeon volunteers who would allow us to follow their training or retraining on laparoscopic procedures per the approved Institutional Review Board protocol. Novice surgeons were particularly reluctant to have their training sessions recorded in so many ways and scrutinized thereafter.

There was also a clash of cultures. The surgical educator, who was grading the trainees, established some hypochondriac rules and insisted that everybody, including our data collection person, follow them to the letter. Whoever was entering the lab had to rub his hands with alcohol. When someone was exiting the lab, he had to do the same! This rule did not make much sense because this was an inanimate surgical training lab. Punishment for whoever was not following the alcohol-rubbing rule was exclusion from the lab for a day. Guess what, the computer scientist was found guilty quite a few times in the beginning and he was expelled from the lab for the specific days, something that was upsetting the experimental schedule and the few volunteers who signed up. Not to mention that the skin on our person's hands was peeling off due to excessive use of rubbing alcohol. I tried to lighten up the atmosphere by cracking jokes. It worked! Deep down it was very funny.

All these problems paled in comparison to the real science problems that were surfacing. It was the first time that we had such high-quality hardware, solid systems software, and well-calibrated and sharply focused facial thermal imagery of maximum spatial resolution. We were looking in the visualizations to find qualitative evidence of sympathetic signs in the familiar places (e.g., periorbital and supraorbital), but to no avail. The only facial region where there was apparent activity in the form of transient perspiration was the perinasal area. This was a physiological sign we had recently discovered and we reported it in an article in the IEEE Transactions on Biomedical Engineering. It was the last sympathetic sign we unearthed and there was a reason for that - the minute flare-ups of the perspiration pores were averaged out by earlier thermal imaging sensors with gross spatial resolution.

Although we knew about this sympathetic sign, we had not studied it in depth and we had no good computational method to quantify it. I made a critical decision at that point. We would not process any other area on the face, and we would concentrate on the perinasal area only. Hence, the first order of the day was to develop a computational method for perspiration quantification and thoroughly validate it. This proved quite difficult and took between one and two years to perfect. Dvijesh literally saved the day here.

In parallel with all this, we had to deal with the processing of the visual streams for annotation of facial expressions. I knew that physiological signals were agnostic to eustress and distress excitations and I was expecting both in a longitudinal study of this kind. We were particularly interested in distress and we needed to know when that was the case. We did not have a lot of experience in annotating facial expressions, but I did not like the process psychology labs were following in lie detection experiments. Basically, they were hiring undergraduate students by the hour and had them code facial expressions after some basic certification. They had two of these students coding independently and then they were reconciling the results. I still found this process non-optimal. It was obvious that many of the coders they were using were low quality and two bad coders were not necessarily adding up to a good one.

I followed a different route. I found a person of extreme patience who was practicing meditation and for some strange reason he really liked this onerous task. Importantly, he was a Ph.D. student and hence, the thinking type. I told him to take his time and do a really good job, a job that may last as long as two years (which it did). We were performing random quality control on about 5% of what he was producing. As far as we can tell, we never found a problem. The approach appeared to work.

Three years into the project we had fully validated physiological signals, annotated facial expressions in the visual streams, and carefully checked psychometric scores. It was an impressive amount of tabulated information that was patiently distilled to perfection out of 5 TB of raw data. In the process we lost lab members who could not withstand the stress, we were nearly broke (again), and anxious as to what was going to happen next.

Studying the pattern of good science through historical examples, I realized that invariably it has three ingredients: a) New measurement methods that give an advantage to the science group. b) An application set of significance. c) A conceptual leap. I was confident we had the first two, but we were missing the third.

We sent the tabulated data to Panagiotis in Europe for statistical analysis and we were waiting on the results. To his question of what I was looking for, my answer was that I did not really know. This was an exploratory study - the first of its kind - and the mandate I gave him was to perform an exhaustive combinatorial analysis. A few months later I received through Skype a 500-page analysis document. It was overwhelming. We had a perfect statistician and jokingly I thought that this might not be a good thing. I threw myself into studying all the different interactions and significance tests trying to abstract away basic behavioral patterns and their physiological sources. The only conclusion that was clear early on was that novices had a lot higher stress than experienced surgeons across the board. This was logical and an additional qualitative indicator that the new stress measurement method was doing its job.

I put aside a lot of information that apparently was not leading anywhere and I managed to reduce the initial 500-page report to about 50 pages. I focused on the possible effect the dramatic difference in stress between the two surgeon groups might have had on their performance. It was not very clear. There was a significant difference in error, but this could be ascribed to skill disparity and not necessarily to stress disparity.

At some point it intrigued me that novice surgeons were finishing the tool transferring and cutting drills at the same time as experienced surgeons. This was not the case for the suturing drill. Studying carefully the architecture of the drills, I realized that in the first two tasks errors were not costing time; only in the third task was this the case. Deep down novice surgeons were attempting to execute all drills at a fast pace, which was clearly above their level. This fast pace was evident in the first two tasks and hidden in the third one. That was it, fight or flight again, where people needed it the least, and well camouflaged.

There were some loose ends that we could not explain, but this was a fascinating lead and I decided to put it in writing. What would be the title of the paper, my colleagues asked? Fast by nature, my answer was. I had already formed the opinion that in dexterous tasks under stress humans seemed 'to act with their feet'.

It took us some time to compose the manuscript and once ready we started submitting it to the big journals. It was one editorial rejection after the other, and worse, the rejection letters were non-informative boilerplates. I decided to send this to some well-known people in the field, with whom I was not associated, and solicit honest feedback. I received a couple of very constructive reviews. The most useful was from Prof. Robert Sapolsky who pushed us to strengthen the validation exposure of the measurement method. To do this we undertook additional research that lasted three months. The manuscript was also revised and the submission process resumed anew, but without any luck.

There was no apparent home for this work, although it was bringing to the fore a fundamental human behavior that was affecting performance in professions critical to society. At that time, I saw an ad for a new open access journal from Nature, called Scientific Reports, where editors were promising to send more informative decision letters. I decided to submit and get at least the feedback that we were always craving. A month later, a brief, but fair and highly informative rejection letter came back from the Associate Editor. It was clear that he read the manuscript very carefully and he was bringing up all the loose ends, which we sort of knew about. However, he did it in such a way that unlocked my mind.

There was a major contradiction - faster in the first two tasks and slower in the third. We were trying to explain this against the differing task architecture, but this intuitive explanation fell far short of good science. The surgeons grade themselves on the wrong metric, I thought. Instead of total time let us quantify attempt time - i.e., how much time it takes them until they commit the error and before they backtrack to try again. For tasks 1 and 2 there was no backtracking of course, but there was for task 3. We recovered the video recordings from inside the surgical box and we started patiently analyzing each subtask by counting number of attempts and attempt pace. I put the same person I used for the facial expression analysis to do this grueling job. He was back in his element. Dvijesh was performing the 5% quality control check on a random sample, as usual.

It took several months to finish the job. Newly tabulated data of attempt pace (not time) were sent to Panagiotis for statistical analysis. This time, I gave him a hint. I believe the novices have dice throwing behavior, I told him. We will see, he said. He applied a geometric model, which indeed confirmed my suspicion. Novices uniformly attempted to perform subtasks as fast as experienced surgeons, irrespective of the type of task. In tasks with no error recovery this was clear; in tasks with error recovery, this was hidden behind multiple attempts to do it right that deceptively made novices appear slow overall. This was the conceptual leap we were missing. We now had it and it felt good. I totally rewrote the manuscript and I sent it back to Nature Scientific Reports, along with an explanatory cover letter - it was summer 2011.
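
The dice-throwing idea can be made concrete with a minimal sketch. It assumes the simplest reading of a geometric model, in which each attempt at a subtask succeeds independently with the same probability; it is not a reconstruction of the actual statistical analysis, the function names are hypothetical, and the attempt counts are invented.

```python
from statistics import mean


def geometric_mle(attempt_counts):
    """Estimate the per-attempt success probability.

    Assumes the number of attempts needed to complete a subtask is
    geometric (attempts are independent with equal success chance).
    The maximum-likelihood estimate is then 1 / (mean attempts).
    """
    if not attempt_counts or any(k < 1 for k in attempt_counts):
        raise ValueError("each subtask needs at least one attempt")
    return 1.0 / mean(attempt_counts)


def expected_total_time(attempt_time, p_success):
    """Expected time to finish a subtask when failed attempts are redone.

    The expected number of attempts under a geometric model is
    1 / p_success, so the total is attempt_time / p_success.
    """
    return attempt_time / p_success


# Invented attempt counts for four suturing subtasks.
novice_attempts = [4, 1, 2, 1]
expert_attempts = [1, 1, 2, 1]

# Same attempt pace for both groups (hypothetical 10 seconds per attempt).
attempt_time = 10.0

p_novice = geometric_mle(novice_attempts)   # 0.5
p_expert = geometric_mle(expert_attempts)   # 0.8

novice_total = expected_total_time(attempt_time, p_novice)
expert_total = expected_total_time(attempt_time, p_expert)
```

Under these assumptions both groups work at the same pace per attempt, yet the novice's expected total time is longer purely because more attempts are needed, which is how an equally fast pace can hide behind a slower overall time in a task with error recovery.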

A month later I received a message notifying me that they were sending the manuscript for review. I cracked a smile, because I knew we had passed the editorial gateway. Several months passed without any word. Then in late October 2011 I received an apologetic note for the delay. It was a decision for major revision along with two of the most constructive reviews that I had in my career.

It took three months of additional research to answer all the points raised by the reviewers. While in the previous round we scored a conceptual leap regarding the time performance measure, in this last review round we scored a conceptual leap regarding the error performance measure. The surgeons grade themselves not only on the wrong time metric, but also on the wrong error metric, I thought. Error misrepresents accuracy performance and error propensity is the way to go. For example, in the suturing tasks, the errors recorded by the surgical educator were telling only part of the story. Novices committed many latent errors that did not show up in the final count, because they were corrected after multiple attempts in the process.

We now had a set of orthogonal performance measures, that is, attempt pace and error propensity, that could correlate with the stress measurements, suggesting that stress was instigating fast pace in novices, which precipitated error propensity well beyond that justified by their lack of skill. It was a full account of a human's agonizing 'behind the scenes' effort to overcome his biological nature in the face of grand challenges: stay and subtly act with his hands versus flee with his feet. There were no gaps or loose ends anymore: interlocking, quantifiable, and reproducible entities spawned out of a good mix of experiment and theory - the unmistakable mark of the scientific art. We sent the revision in mid-January 2012; Peggy beautifully condensed everything in four figures. The acceptance letter arrived on Valentine's Day.

It is spring 2012 and we are sitting in the same coffee shop where all this started in my 2007 meeting with Barbara. We are pondering how many things could have gone wrong and did not, by the thinnest of margins. I review the critical calls that in retrospect look wise, but they involved nothing short of unacceptable risk and were taken with the lightest heart. I have been lucky. We have been lucky. I guess the history of scientific progress resembles the history of ancient battles, where the odds were stacked even, because the opponents used similar technology and tactics. In pursuing nature's secrets, accepting risk is the best you can do. Actually, it is best if you do not even think of risk, opiating yourself with the beauty of research.

We are the core people left in the Computational Physiology Laboratory and this is neither the beginning nor, we hope, the end of our story.