Statistics: Screening and Data Summary

Brenda Rae Lunsford, MS, PT

ABSTRACT

To gain meaningful results from a research effort, data must be collected and then analyzed correctly. While rigorous research methods must be adhered to during the data collection process, similar efforts must be made to assure data are handled correctly during analysis.

The importance of thorough screening and proofing cannot be overestimated. In computerized data management there are numerous opportunities to err. If one does not examine the data carefully, valid results may not be obtained (1).

This article will present the first two steps in the statistical management of data: proofing and screening. Proofing refers to the clerical steps required, such as checking for recording errors and making sure data recorded manually are attached to the correct subjects, observations and variables when entered into a computer. Screening refers to technical data management such as checking that the data conform to the mathematical assumptions necessary for subsequent analysis.

Introduction

Each statistical test relies on mathematical assumptions that, if not adhered to, will render the data analysis invalid (2,3). Specifically, it is important to know whether the data are normally distributed. Since most analyses performed in clinical-medical studies rely on the normal distribution, the bell-curve shape will be discussed in detail. There are other distributions of statistical importance that are beyond the scope of this article (2).

Extreme values (outliers) need to be identified, and the researcher should decide if such occurrences are part of the population about which generalizations will be made. Frequently, outliers are the result of poor control during the early stages of a project and should be discarded. Undefined strata or subgroups (e.g., percentage of burn, diagnostic classification) also need to be identified and defined.

A thorough proofing/screening process is followed by the first step in data analysis, the data summary or description (1). The logical sequence in the statistical process is to:

-  summarize data using single-variable (univariate) summaries such as a measure of central tendency (mean, median, mode), spread (range, minimum, maximum), variability (standard deviation or variance), and the shape of the distribution (e.g., normal, skewed);

-  proceed with two-sample (bivariate) comparisons, such as Student's t-test, or two-sample correlations; followed by

-  multivariate comparisons (ANOVA); and

-  correlations, regression analysis, etc. (A brief computational sketch of this sequence follows the list.)
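The sketch below, which is not part of the original article, illustrates this sequence in Python with NumPy and SciPy; the group names and values are invented for demonstration only.

# Sketch of the analysis sequence described above, using invented
# walking-velocity data (m/min) for two hypothetical prosthesis groups.
import numpy as np
from scipy import stats

group_a = np.array([42.0, 38.5, 51.2, 45.0, 40.3, 47.8, 44.1])  # invented
group_b = np.array([35.2, 30.8, 39.5, 33.0, 37.1, 31.9, 36.4])  # invented

# 1. Univariate summary: central tendency, spread, variability, shape.
for name, g in (("A", group_a), ("B", group_b)):
    print(name, g.mean(), np.median(g), g.min(), g.max(),
          g.std(ddof=1), stats.skew(g))

# 2. Bivariate comparison: Student's t-test between the two groups.
t, p = stats.ttest_ind(group_a, group_b)
print("t =", t, "p =", p)

# 3. Multivariate comparison: one-way ANOVA across three groups.
group_c = np.array([28.0, 25.5, 31.2, 27.4, 29.9, 26.8, 30.1])  # invented
f, p = stats.f_oneway(group_a, group_b, group_c)
print("F =", f, "p =", p)

# 4. Correlation/regression, e.g., heart rate regressed on velocity.
heart_rate_a = np.array([110, 105, 122, 115, 108, 118, 113])    # invented
slope, intercept, r, p, se = stats.linregress(group_a, heart_rate_a)
print("r =", r, "slope =", slope)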

Proofing

Assuming that manually recorded data are entered into a computer, it is important to print the data set and proof it for accuracy. This step is often omitted. A number entered as 1,000 instead of 10 can change an important result. The data set should also be checked for observations that have become misaligned with their subjects. Inconsistencies, such as the number of subjects with below-knee (BK) amputations being greater than the total sample of amputees, can later prove embarrassing.

For example, consider the highlighted areas in the hypothetical data set of Table A. The age of 210 years for subject "CD" is obviously a recording error that must be corrected, as is the "G" in the first column (gender) for subject "AB." This correction is important because later, during analysis, you may wish to sort by sex ("M" and "F"), and this subject's data would not be included.


Table A. Data set proofing is essential. Highlighted errors will invalidate the project's outcome if not corrected.


 

With the errant age of 210 years, the average age of the subjects is 67.75 years with a standard deviation of 95.57 years. Correcting the 210 to 21 gives an average age of 20.5 years and a standard deviation of 11.8 years. Had one not proofed the data set and simply run the computer analysis, this error would have invalidated the outcome.

The other two areas of concern are the single burn diagnosis mixed within a sample of spinal cord injuries, and the single 6-year-old subject amid 20- to 35-year-olds.

If the entries are in error, they must be corrected. However, if the recording is correct, then decisions need to be made about the sample. The data for the 6-year-old subject were removed because he was not part of the population to which this researcher wished to generalize. The subject with the burn diagnosis was kept because the variables of interest were related to range-of-motion limitations similar to those of the remaining subjects, not to the diagnosis.

Correcting the erroneous age and deleting the 6-year-old subject results in the data set shown in Table B. The standard deviations of age and velocity become smaller, 8.4 vs. 95.6 and 9.0 vs. 11.4, respectively, as the data become more homogeneous. The importance of this will become apparent as the analyses of the data are discussed. For more complex data sets there are computerized statistical techniques that allow detection of outliers (4). This also illustrates the value of the researcher being intimately involved with the proofing/screening process.
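The arithmetic above can be verified with a few lines of Python. Because Tables A and B are not reproduced here, the four ages below are hypothetical values chosen only to be consistent with the means and standard deviations reported in the text.

# Hypothetical ages consistent with the statistics quoted in the text;
# the actual Table A values are not reproduced in this copy.
from statistics import mean, stdev

ages_raw = [35, 20, 6, 210]       # 210 is the recording error
ages_fixed = [35, 20, 6, 21]      # 210 corrected to 21
ages_screened = [35, 20, 21]      # 6-year-old removed as well

for label, ages in (("raw", ages_raw),
                    ("corrected", ages_fixed),
                    ("corrected, outlier removed", ages_screened)):
    print(f"{label}: mean = {mean(ages):.2f}, SD = {stdev(ages):.2f}")

# raw: mean = 67.75, SD = 95.57
# corrected: mean = 20.50, SD = 11.85
# corrected, outlier removed: mean = 25.33, SD = 8.39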


Table B. Revised data set after proofing.


 

Screening

To provide a realistic example, data from a study of 80 spinal cord-injured subjects will be used. The study evaluated the relationship between heart rate and velocity of locomotion by wheelchair and by walking.

Descriptive Statistics

The first analysis of the data included calculation of the descriptive statistics (i.e., mean, median, standard deviation, minimum, maximum and skewness). There are two reasons for doing this: first, to screen the data; second, to provide a quick summary of the results. For example, if you had just finished collecting data on the walking velocity of a group of patients with a new style of prosthesis, how would you describe how they performed? Reciting each patient's velocity, heart rate, etc., would not allow meaningful inferences to be made. Therefore, the first task in data analysis is to organize the data into some meaningful arrangement. One of the easiest and most useful steps is to produce a summary table as shown in Table C.
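A summary table of this kind could be produced along the following lines. This is only a sketch; the WCHR and WALKVEL arrays below are invented stand-ins, not the study's data.

# Sketch of a descriptive summary (n, mean, median, SD, min, max, skewness)
# for each variable; the arrays are invented stand-ins for the study data.
import numpy as np
from scipy import stats

data = {
    "WCHR": np.array([96.0, 104.0, 88.0, 112.0, 120.0, 99.0, 130.0, 92.0]),
    "WALKVEL": np.array([12.0, 25.4, 30.1, 55.0, 60.2, 18.7, 50.3, 58.8]),
}

print(f"{'Variable':<10}{'n':>4}{'Mean':>9}{'Median':>9}{'SD':>8}{'Min':>8}{'Max':>8}{'Skew':>8}")
for name, x in data.items():
    print(f"{name:<10}{len(x):>4}{x.mean():>9.2f}{np.median(x):>9.2f}"
          f"{x.std(ddof=1):>8.2f}{x.min():>8.1f}{x.max():>8.1f}"
          f"{stats.skew(x):>8.3f}")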


Table C. Summary tables organize data into meaningful arrangements.


 

To examine this table, we will first define the terms and then examine the data.

Variable: " . . . anything that takes on different values from time to time . . . " (5). Specifically:

WCHR: Heart rate achieved while propelling a wheelchair.

WALKHR: Heart rate achieved while walking.

WCVEL: Velocity attained while propelling a wheelchair.

WALKVEL: Velocity attained while walking.

n: Number of subjects

Central Tendency: There are three measures of central tendency: the mean, median and mode. When a distribution is normal, they are equal.

-  Mean: The mean is the average value of a given variable in the sample and the most common measure of central tendency for sample distributions (Equation 1) (3,5,6,7). It is precisely defined and the most stable of the three measures. When extreme values are present, however, the mean is not the best representative of the data (3). For example, in a group of spinal cord-injured patients the mean age was 28.4 years while the median age was 24.2 years; because two older subjects in the sample pulled the mean upward, the median was the better measure of the average age in this case. (A short numeric illustration follows this list.)

-  Median: The median is the middle value of a set of sample data. Since the median is not affected by extreme values, it is a better measure when a distribution is not balanced (3). If data are perfectly symmetrically distributed, the mean and median are the same (3,6,7). In the example, 3 is the median value (see Figure 1).

-  Mode: The mode is the most common value in a set of sample data. It is the least useful measure of central tendency in biomedical research since it is sensitive only to counts (3). The mode may not exist in continuous data where the measurement instrument is sensitive and there are no duplicate data values (3). In the example, 4 is the mode (see Figure 2).
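Because Figures 1 and 2 are not reproduced in this copy, the short sketch below uses an invented data set (chosen to match the quoted median of 3 and mode of 4) to show how an extreme value pulls the mean away from the median and mode.

# Invented data set with median 3 and mode 4, as quoted in the text.
from statistics import mean, median, mode

values = [1, 2, 3, 4, 4]
print(mean(values), median(values), mode(values))   # 2.8  3  4

# Adding one extreme value shifts the mean but barely moves the median,
# and leaves the mode unchanged.
values_with_outlier = values + [50]
print(mean(values_with_outlier), median(values_with_outlier),
      mode(values_with_outlier))                    # ~10.67  3.5  4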

Standard Deviation: To understand standard deviation, a brief discussion of the normal distribution is warranted. When data are distributed "normally," they are distributed symmetrically about the mean, forming a bell-shaped curve. The partitions of the bell curve are such that approximately 68 percent of the data fall within one standard deviation of the mean, 95 percent within two standard deviations, and more than 99 percent within three standard deviations (see Figure 3) (6,7,8).
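These percentages can be checked numerically by drawing a large sample from a standard normal distribution, as in the sketch below (not part of the original article).

# Numerical check of the 68 / 95 / 99+ percent partitions of a normal curve.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

for k in (1, 2, 3):
    pct = np.mean(np.abs(x) <= k) * 100
    print(f"within {k} SD of the mean: {pct:.1f}%")   # about 68.3, 95.4, 99.7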


Figure 3. Normal distribution.


 

The sample standard deviation is the positive square root of the sample variance (see Equation 2). It is the most commonly used measure of variability (2,6,7,8). Because its values are in the same units as the data, it is an easier number to relate to than the variance.


 

Variance: The variance is the mean of the squared differences from the mean of the distribution; equivalently, it is the square of the standard deviation (see Equation 3). This number gives information about the distribution (spread) of the sample data. The variance is the quantity commonly used in the mathematical calculations performed in statistical testing (6,7,8).
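Equations 1 through 3 are not reproduced in this copy of the article. The standard sample formulas they refer to, consistent with the definitions above and with the standard deviations computed earlier (which use the n - 1 denominator), are:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i    (Equation 1: sample mean)

s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} = \sqrt{s^2}    (Equation 2: sample standard deviation)

s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}    (Equation 3: sample variance)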


 

Minimum: The smallest value of a variable.

Maximum: The largest value of a variable.

For the purposes of screening, it is preferable to use the minimum and maximum instead of the range, since they make it possible to spot extreme values that might be erroneous, such as the age of 210 in the earlier example. Many prefer to summarize using the range; however, the range is not useful in screening for extreme values.

Skewness: Skewness occurs when data are distributed unevenly about the mean, with a higher concentration at one end (5,6,7). If the tail of the distribution extends to the left, the data are said to be skewed to the left, and vice versa if the tail is to the right (see Figure 4) (1,5,9). The skew value for normally distributed data is zero (9). Values greater than zero indicate a skew to the right; values less than zero indicate a skew to the left (9).
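A skew value such as those reported in Table C might be computed as in the sketch below, which uses SciPy's skewness coefficient on invented data.

# Invented data: one long tail to the right versus a roughly symmetric set.
import numpy as np
from scipy import stats

right_skewed = np.array([78, 82, 85, 88, 90, 92, 95, 99, 104, 188])
symmetric = np.array([78, 82, 85, 88, 90, 92, 94, 96, 100, 104])

print(stats.skew(right_skewed))   # positive: tail extends to the right
print(stats.skew(symmetric))      # close to zero: little skew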


Figure 4. Curve skewed to the right.


 

Now return to the data set shown in Table C. Those familiar with the diagnoses of incomplete paraplegia or quadriplegia are aware that it is possible to experience the extremes of having very little preservation of neuromuscular function or of remaining mostly intact with little loss. For a patient with weak muscular control and sensory deprivation, attempting to walk can be such a struggle that a velocity of 4 meters per minute (MIN column) is not surprising. However, examination of the MAX column reveals a heart rate of 188 during wheelchair propulsion. That value is of concern, since it is greater than 90 percent of the predicted maximum heart rate for a 20-year-old subject (10)!



Next, the 1.659 skewness score for WCHR is high, since a value of 0 corresponds to a normal distribution (i.e., a symmetrical bell curve) (9). Another hint of trouble is the difference between the mean and median for both WCHR and WALKVEL. The fact that the WCHR distribution is skewed to the right and that its mean and median differ by 4.2 units gives one cause to examine the data further.

The case is not so clear for the variable WALKVEL, whose mean and median are 11.2 units apart even though the data are not substantially skewed (skew -.065). Also, the standard deviation of WALKVEL is quite large, 59 percent of the mean, which suggests a very wide distribution. The next step is to evaluate the cause of the extreme skew of WCHR and the marked difference between the mean and median of WALKVEL.

Distribution

The first step in screening a variable is to analyze the distribution of the data with a frequency listing and/or a histogram. Perusal of the "value" column indicates there is a reasonable continuum of data from the upper 70s through the low 130s, and a gap between 132 and 188. It is obvious there is an extreme value at the upper end causing the skew (see Figure 5). Plotting a frequency histogram of these data in five-unit groups gives a more visual impression of the shape of the distribution (see Figure 6).
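A frequency listing and a five-unit grouping could be generated as sketched below; the values are invented stand-ins for WCHR, chosen to include a gap before an extreme value.

# Frequency listing and five-unit histogram bins for invented heart-rate data.
from collections import Counter
import numpy as np

wchr = [78, 82, 82, 90, 95, 95, 101, 108, 112, 120, 126, 132, 188]  # invented

# Frequency listing: each distinct value with its count.
for value, count in sorted(Counter(wchr).items()):
    print(value, count)

# Histogram in five-unit groups; the empty bins before 188 expose the gap.
counts, edges = np.histogram(wchr, bins=np.arange(75, 195, 5))
for left, c in zip(edges[:-1], counts):
    print(f"{left:3.0f}-{left + 4:3.0f}: {'*' * int(c)}")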


Figure 5. Frequency distribution of WCHR.


 

After this discovery, it was learned that the subject with the heart rate of 188 was having a medical problem that caused erratic heart rates. Since the cause of this high rate was unrelated to the purpose of the research, this subject's data were removed. Table D shows the improvement in the data once this outlier was removed.

As a result, there was considerable improvement in skewness, the standard deviation became smaller, and the median moved closer to the mean. These data parameters are now more acceptable with respect to the requirement of a normal distribution (2,6).

Now the data parameters for WALKVEL will be analyzed using the same techniques. In perusing the value column for the WALKVEL data set (see Figure 7), the relatively small values of 4 and 5 might be cause for concern; however, there is no significant skew to these data (i.e., -0.65). The only remarkable observation in the raw data is the discontinuity between the walking velocities of 37 and 50. The 11.2-unit discrepancy between the mean (38.8) and the median (50) walking velocity is also cause for concern (see Figure 7). This is an example of where a graphic representation is helpful.


Figure 7. Frequency distribution of WALKVEL.


 



The frequency histogram of WALKVEL clearly shows a bimodal distribution (see Figure 8). In other words, an underlying factor is causing these data to separate into subgroups.


Figure 8. Histogram of WALKVEL.


 

Another way to identify a subgrouping is to plot WALKHR vs. WALKVEL (see Figure 9). The plot reveals two distinct groups, which should be evaluated to see if there is a significant difference between them. This finding sheds a completely different light on these data and was discovered only through careful screening. If one had obtained the means and then moved directly to analysis, a significant error would have been made and an important result overlooked.
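Such a plot could be drawn with matplotlib as sketched below; the two arrays are invented stand-ins for WALKHR and WALKVEL that form two separated clusters.

# Scatter plot of walking heart rate vs. walking velocity; two subgroups
# appear as separate clouds of points. The data are invented.
import matplotlib.pyplot as plt

walkvel = [4, 6, 9, 12, 15, 18, 22, 25, 50, 55, 58, 62, 66, 70]        # m/min
walkhr = [128, 125, 130, 122, 118, 120, 116, 114, 102, 99, 104, 98, 96, 95]

plt.scatter(walkvel, walkhr)
plt.xlabel("WALKVEL (m/min)")
plt.ylabel("WALKHR (beats/min)")
plt.title("WALKHR vs. WALKVEL")
plt.show()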

This project revealed that the subgrouping resulted from the combined impact of the variables muscle strength and proprioception. The trend continued even when a larger sample was obtained. It is important to remember, however, that when working with small samples, apparent differences may disappear or be reduced as the number of subjects is increased (3,11).

Summary

When data are transcribed, error is possible. To comply with the mandate of valid research, it is imperative that all data entry be checked for accuracy. The value of clerical proofing for this type of error cannot be stressed enough.

For correct inferences to be made, it is critical that the data follow the assumptions of the statistic to be applied. Very often one is left with data that either have a very large spread, do not look normally distributed, are skewed or have apparent outliers. While these conditions may be the result of poor control, large variability and/or a very small sample, not all is lost. There are two primary categories of statistical testing, parametric and non-parametric.

The parametric category of tests requires full and rigorous adherence to the assumption of normality, i.e., that the data follow a normal distribution with the mean at its center. The non-parametric category is also frequently called distribution-free statistics (3,12).

The implication is that data do not have to be normally distributed for a non-parametric test to provide valid results. Screening, therefore, does not determine whether you can test but rather how you will test and whether legitimate measures must be taken in managing the data prior to testing. Since research establishes professional standards and is used to guide others in patient care, ethical conduct requires that published information be of the highest integrity, with the best efforts made toward establishing correct information.
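As one hedged illustration of how screening shapes the choice between a parametric and a distribution-free test (the Shapiro-Wilk normality check used here is a common convention, not something prescribed by the article):

# Choose a parametric or a distribution-free two-sample test after screening.
# The normality check and the data are illustrative assumptions only.
from scipy import stats

group_1 = [42.0, 38.5, 51.2, 45.0, 40.3, 47.8, 44.1, 39.6]   # invented
group_2 = [35.2, 30.8, 39.5, 33.0, 37.1, 31.9, 36.4, 90.0]   # invented; outlier

_, p1 = stats.shapiro(group_1)
_, p2 = stats.shapiro(group_2)

if p1 > 0.05 and p2 > 0.05:
    stat, p = stats.ttest_ind(group_1, group_2)      # parametric: Student's t
    print("Student's t-test: p =", p)
else:
    stat, p = stats.mannwhitneyu(group_1, group_2)   # distribution-free
    print("Mann-Whitney U test: p =", p)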

Brenda Rae Lunsford, MS, PT, is a visiting assistant professor in the School of Physical Therapy at Texas Woman's University in Houston.

References:

1.       Hill MA. Annotated computer output for data screening. BMDP Technical Report 77, UCLA 1981.

2.       Sokal RR, Rohlf FJ. Biometry. 2nd ed. San Francisco: WH Freeman & Co., 1981:400-14.

3.       Currier DP. Elements of research in physical therapy. 2nd ed. Baltimore: Williams & Wilkins, 1984:152, 278-94.

4.       Afifi AA, Azen SP. Statistical analysis: A computer-oriented approach. New York: Academic Press Inc., 1972:281-3.

5.       Dominowski RL. Research methods. Englewood Cliffs, N.J.: Prentice-Hall Inc., 1980:4:124-63.

6.       Dunn OJ. Basic statistics: A primer for the biomedical sciences. 2nd ed. New York: John Wiley and Sons, 1977:5:38-49.

7.       Brown FL, Amos JR, Mink OG. Statistical concepts: A basic program. 2nd ed. New York: Harper & Row, 1975:28-32.

8.       Goldstein A. Biostatistics: An introductory text. New York: Macmillan, 1964:3444.

9.       Bostrom A, Kahn T. Crunch Statistical Package. Vol. I & II, 1991:590.

10.    Lunsford BR. Clinical indicators of endurance. Phys Ther 1978;58:6:704-9.

11.    Schlesselman JJ. Planning a longitudinal study: I. Sample size determination. J Chron Dis 1973;26:553-60.

12.    Siegel S. Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill, 1956:1-34.

 
