The importance of statistical analysis and measurement for research can hardly be overestimated, and this holds for the social sciences as well. Proper data processing and analysis are key elements of any quantitative research and of determining causal relations between variables (Agresti & Finlay 1997, p.24). The purpose of this essay is to describe a data set drawn from the National Longitudinal Survey of Youth (NLSY97) in terms of four concepts: SAT (math), SAT (verbal), the total number of residences the respondent has lived in, and father’s educational attainment. The essay also analyzes the frequencies and measures of central tendency for these variables (including for combinations of different variables) and offers an interpretation of the data.

The chosen dataset consists of the following variables: gender (initially coded as 1 for males and 2 for females), region of residence, SAT score for math, SAT score for verbal, the number of residences the respondent has lived in, and father’s educational attainment (highest grade).

**Concepts and Data**

First of all, it is necessary to outline the concepts used in the survey and the way each of them was measured. The key concepts are the educational attainment of the respondent and the educational attainment of the respondent’s father. The other variables are mainly used to distinguish between groups of respondents. The variables, their measurement and their coding are described below.

- SAT score for math (of the respondent) is measured in points received on this SAT test and coded by the number of points received, ranging from 200 to 800; measured by survey question “Math score r received on SAT test?”
- SAT score for verbal (of the respondent) is measured in points received on this SAT test and coded by the number of points received, ranging from 200 to 800; measured by survey question “Verbal score r received on SAT test?”
- Total number of residences the respondent has lived in (since age 12) is measured in whole numbers ranging from 1 to 12 (the value “12” in fact means “12 or more”) and is coded by its value as well; measured by survey question “Number of different residences since r age 12”
- Father’s educational attainment (highest grade completed of respondent’s father); the values of the variable represent a list of school grades (from “3^{rd} grade” to “12^{th} grade”) followed by a list of college grades (from “1^{st} year college” to “8^{th} year college and more”), measured by survey question “Biological father’s highest grade completed”; this variable is coded according to the table:

| Code | Highest grade completed |
|---|---|
| 0 | None |
| 1 | 1 grade |
| 2 | 2 grade |
| 3 | 3 grade |
| 4 | 4 grade |
| 5 | 5 grade |
| 6 | 6 grade |
| 7 | 7 grade |
| 8 | 8 grade |
| 9 | 9 grade |
| 10 | 10 grade |
| 11 | 11 grade |
| 12 | 12 grade |
| 13 | 1 year college |
| 14 | 2 year college |
| 15 | 3 year college |
| 16 | 4 year college |
| 17 | 5 year college |
| 18 | 6 year college |
| 19 | 7 year college |
| 20 | 8 year college or more |
| 95 | ungraded |

- Sex, measured in whole integers: 1 = male, 2 = female, 0 = no information; after eliminating missing data the variable was coded by the given integers and then recoded into the dummy variable “sex2”, where 0 meant “male” and 1 meant “female”; the question in the survey was “Sex, rs gender”
- Region; the variable originally had 4 values: North East, North Central, South, West. It was then recoded into a dummy variable “region2”, where South was coded as 1 and all other regions were coded as 0. The corresponding question in the survey was “Census region of residence”.

After eliminating missing data, the sample size was reduced to 537.
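The recoding described above can be illustrated with a minimal Python sketch. The record layout, the field names (`sex`, `region`, `sat_math`) and the sample values are assumptions made for illustration only; they are not the actual NLSY97 variable names or data, and the essay does not state which software was used.

```python
# Hypothetical raw records; field names and values are illustrative, not NLSY97 data.
raw = [
    {"sex": 1, "region": "South", "sat_math": 550},
    {"sex": 2, "region": "West", "sat_math": 610},
    {"sex": 0, "region": "North East", "sat_math": 480},  # sex: no information
    {"sex": 2, "region": "South", "sat_math": None},       # SAT score missing
]


def recode(record):
    """Return a copy of the record with dummy variables, or None if data are missing."""
    if record["sex"] not in (1, 2) or record["sat_math"] is None:
        return None
    out = dict(record)
    out["sex2"] = record["sex"] - 1  # 1 (male) -> 0, 2 (female) -> 1
    out["region2"] = 1 if record["region"] == "South" else 0
    return out


# Keep only complete records, as in listwise deletion of missing data.
clean = [r for r in map(recode, raw) if r is not None]
print(len(clean), [(r["sex2"], r["region2"]) for r in clean])
```

In this toy input two of the four records are dropped, which mirrors how eliminating missing data shrinks the working sample.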

**Descriptive analysis**

First of all, let us analyze the percentages for all categories in the frequency distribution table (each value of the chosen variables is treated here as a separate category). The SAT test results, both verbal and math, show an interesting pattern: some values, among both the high and the low scores, have much higher frequencies than the others, namely 250, 350, 450, 550 and 650 points. For the math test, some middle values also have higher frequencies. This may be connected with the structure of the test, or it may reflect some other tendency in respondents’ results (such as peculiarities of thinking or questions grouped according to difficulty level).

For the number of residences the respondent has lived in, we can see an exponential decline of frequencies: 1 place is the most common result, and the frequencies fall sharply as the value of this variable increases. This is quite logical and shows the standard picture of people tending to live in one place or to relocate seldom.

As for the highest grade completed by the respondent’s father, the frequencies increase slowly starting from 1 grade and rise dramatically at the value corresponding to the 12^{th} grade. There are also notable peaks at “2 year college”, “4 year college” and (to a lesser degree) “8 year college”. These data suggest the following picture: most fathers finish school; some go to college, study for about 2 years and then quit; the rest of those who started college finish a 4-year or an 8-year program.
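A frequency distribution with percentages of the kind discussed above can be computed as in the following sketch. The list `residences` is made-up illustrative data, not the survey values, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical values of "number of residences"; not the actual survey data.
residences = [1, 1, 1, 2, 1, 3, 2, 1, 4, 1]


def frequency_table(values):
    """Return (value, count, percent) rows sorted by value."""
    counts = Counter(values)
    n = len(values)
    return [(v, counts[v], 100.0 * counts[v] / n) for v in sorted(counts)]


table = frequency_table(residences)
for value, count, pct in table:
    print(f"{value:>5} {count:>5} {pct:6.1f}%")
```

Reading down the percent column of such a table is what reveals patterns like the sharp decline in frequencies as the number of residences grows.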

For the whole sample, the mean math SAT score was ~522.73 and the mean verbal SAT score was ~524.86. For both SAT tests the median and mode coincided and equaled 550.0 points. Therefore, the general SAT level is more or less homogeneous and slightly above the centre of the scale (500). For the total number of residences the respondent has lived in, the mean for the whole sample is 1.79, and the median and mode are both 1.00. Therefore, most respondents have lived in one place since age 12. For the highest grade completed by the respondent’s father, the whole-sample values are: mean 14.23, median 14.00, mode 12. Thus the most frequent case is a father who graduated from school and did not go to college; however, the “higher segment” of those who went to college and finished it pulls the mean and median up to about 14.
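The three measures of central tendency used here can be obtained with Python’s standard `statistics` module, as the sketch below shows. The scores are invented for illustration and were chosen so that, as in the essay, the mean falls below a shared median and mode of 550; `central_tendency` is a hypothetical helper name.

```python
from statistics import mean, median, mode

# Hypothetical SAT scores; not the actual survey data.
scores = [450, 550, 550, 600, 500, 550, 650, 350]


def central_tendency(values):
    """Return the mean, median and mode of a list of numbers."""
    return {"mean": mean(values), "median": median(values), "mode": mode(values)}


summary = central_tendency(scores)
print(summary)
```

A mean below the median and mode, as in this toy sample, indicates that the lower scores lie further from the centre than the higher ones.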

Turning to the results separated by sex, there are 43% of female respondents and 57% of male respondents. For SAT verbal results, the median and mode for males are the same as for the whole sample (550), and the mean was ~519.42, which is slightly lower than the result for the whole sample. For females, the mode and median were the same as well, and the mean was ~528.97. All in all, men and women show almost the same results, but there were slightly more women with high SAT verbal results (women are usually considered to be better than men in verbal tests, and this result is consistent with that view). For SAT math results, the situation is the following: for males, the mode and median are the same (550) and the mean was ~533.57; for females, the mode and median were again the same, but the mean equaled ~514.54. This again means that men and women show similar results, but men tend to be slightly better in math than women.
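Group-wise comparisons like the one above amount to splitting the sample by a dummy variable and summarizing each part separately. The following sketch assumes records stored as `(sex2, score)` pairs with made-up values; `summarize_by_group` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (sex2, SAT score) pairs: 0 = male, 1 = female; not survey data.
records = [(0, 540), (0, 550), (0, 500), (1, 520), (1, 550), (1, 480), (1, 510)]


def summarize_by_group(pairs):
    """Return group size, share of the sample, mean and median for each group code."""
    groups = defaultdict(list)
    for group, score in pairs:
        groups[group].append(score)
    total = len(pairs)
    return {
        group: {
            "n": len(scores),
            "share": len(scores) / total,
            "mean": mean(scores),
            "median": median(scores),
        }
        for group, scores in groups.items()
    }


by_sex = summarize_by_group(records)
for group in sorted(by_sex):
    print(group, by_sex[group])
```

The same function applies unchanged to the region dummy, since it only relies on the group code.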

Turning to the results separated by region, there are 61.1% of respondents from non-South regions and 38.9% of respondents from the South. The central tendencies of the SAT math scores by region are the following: for the non-South regions, the mode and median are again the same as for the whole sample, and the mean is 531.12; for the South, the mean equals ~509.55, the median is 510 and the mode is 450. In general, these statistics show that the South is doing worse than average on the SAT math test. For SAT verbal scores, the non-South regions show the same mode and median as the whole sample, and the mean is ~536.19; the South has a mean of ~507.08, with the median and mode the same as for the whole sample. On the verbal SAT test the South again shows worse results, though the majority is doing similarly to the whole sample. On the math SAT test, the situation for the South was worse because the median and mode dropped there as well compared to the whole sample.
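As a quick consistency check, subgroup means weighted by subgroup shares should reproduce the whole-sample mean. The sketch below uses the figures reported in this essay and assumes that the 61.1% share refers to the non-South group and the 38.9% share to the South; `pooled_mean` is a hypothetical helper name.

```python
def pooled_mean(shares_and_means):
    """Combine subgroup means into an overall mean, weighting by subgroup share."""
    total_share = sum(share for share, _ in shares_and_means)
    return sum(share * m for share, m in shares_and_means) / total_share


# (share, mean) pairs as reported in the essay.
# Assumption: 0.611 is the non-South share and 0.389 is the South share.
math_pooled = pooled_mean([(0.611, 531.12), (0.389, 509.55)])
verbal_pooled = pooled_mean([(0.611, 536.19), (0.389, 507.08)])

print(round(math_pooled, 2), round(verbal_pooled, 2))
```

Under that assumption both pooled values land within rounding distance of the whole-sample means reported earlier (~522.73 for math and ~524.86 for verbal).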