About the NRC's Quality Scores

Why Two Sets of Rankings?

You get two for the price of one in the latest comparisons of graduate-degree programs from the National Research Council. For each field of study the NRC has published two parallel sets of overall rankings, which give somewhat different assessments of how the various schools stack up against one another. The two lists are based on the same underlying data; that is, the same evaluations of program features such as faculty publication rates and time-to-degree. Where the lists differ is in the relative importance they assign to each such program variable.

How did we wind up with two sets of rankings? It's a complicated story, but it begins with a problem that should be familiar to anyone who has ever tried to compile a top-ten list. If you are ranking your favorite restaurants, for example, you may well find that one place has the most adventurous menu, while another excels in service, and a third offers the best prices. In order to compare the restaurants, you have to decide how much weight to give each of these factors. It's much the same when you're shopping for a new apartment (convenient location vs. roomy interior) or rating your favorite movies (cast vs. script vs. special effects). The basic problem is that a ranking requires things to be put in order along a one-dimensional scale of measurement, but most of the things we want to compare differ across many dimensions.

In the case of the NRC's ranking of graduate-school programs, there are 20 attributes to be reduced to a linear ordering. The 20 factors include various characteristics of the faculty (number of publications and citations, number of awards) and the students (GRE scores, percent with fellowships) as well as broader features of the doctoral programs (average time to complete a degree, measures of ethnic and gender diversity). The NRC has adopted a three-stage procedure for combining all these disparate factors into a single score of program quality.

  • The first step is to standardize all the measurements in a way that suppresses differences in units of measure and numerical range. All the variables are transformed so that the data have a mean of 0 and a standard deviation of 1. This step makes it possible to compare such diverse quantities as GRE scores, numbers of citations and percentages of international students, all on the same scale.
  • Second, each of the normalized values is multiplied by a weight coefficient, a number that determines the relative importance of that variable in the composite score. The weights are numbers between 0 and 1, adjusted so that the sum of all the weights is exactly 1.
  • Finally, the 20 products formed by multiplying the program variables by their corresponding weights are summed to yield the overall score. The scores are sorted from highest to lowest to produce a ranking. (A schematic version of all three steps appears in the sketch after this list.)
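Here is a minimal sketch of those three steps in Python, assuming a small table of invented program data with only three variables (the real calculation runs over all 20 variables for every program in a field):

```python
import numpy as np

# Hypothetical raw data: rows are programs, columns are program variables.
# The three columns stand in for the NRC's 20.
raw = np.array([
    [4.2, 710.0, 35.0],   # publications per faculty, mean GRE, % students with fellowships
    [2.8, 680.0, 55.0],
    [6.1, 730.0, 20.0],
    [3.5, 695.0, 45.0],
])

# Step 1: standardize each variable to mean 0 and standard deviation 1,
# so that quantities with different units can be compared on one scale.
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Step 2: weight coefficients between 0 and 1 that sum to 1
# (chosen arbitrarily here; the NRC derives them from faculty input).
weights = np.array([0.5, 0.3, 0.2])

# Step 3: the overall score is the weighted sum of the standardized values;
# sorting the scores from highest to lowest gives the ranking.
scores = standardized @ weights
ranking = np.argsort(-scores)   # indices of programs, best first
print(scores, ranking)
```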

(Some further complications are glossed over in this description. To account for uncertainties in the data, the measurements and the weights are both randomly perturbed and resampled some hundreds of times, yielding ranges of ranks rather than specific rankings; this process is described elsewhere.)
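The sketch below conveys the flavor of such a resampling loop rather than the NRC's exact procedure: the data and the weights are jittered with invented noise levels, the ranking is recomputed each time, and a range of ranks is read off at the end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in standardized data and weights (shapes match the earlier sketch; values invented).
standardized = rng.normal(size=(4, 3))
weights = np.array([0.5, 0.3, 0.2])

def rank_once(noise=0.1):
    # Randomly perturb both the measurements and the weights,
    # then recompute the ranking of the programs (rank 1 = best).
    data = standardized + rng.normal(0.0, noise, size=standardized.shape)
    w = np.clip(weights + rng.normal(0.0, noise, size=weights.shape), 0.0, None)
    w = w / w.sum()                     # keep the weights summing to 1
    order = np.argsort(-(data @ w))
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

# Repeat some hundreds of times; the spread of ranks, not a single rank, gets reported.
all_ranks = np.array([rank_once() for _ in range(500)])
low, high = np.percentile(all_ranks, [5, 95], axis=0)   # e.g. the middle 90% of ranks
```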

Walking and Talking

The values of the 20 program variables mentioned in the algorithm above come from the NRC's massive data-gathering project, in which multipage questionnaires were filled out at hundreds of institutions. Where do the weight coefficients come from? The NRC panel experimented with two methods of determining appropriate weights, and ultimately decided to report the results of both procedures.

The first method was simply to ask. Faculty evaluators were invited to identify the features or attributes of doctoral programs that contribute most to the programs' success. The 20 variables were grouped in three broad categories (representing characteristics of the faculty, of the students, and of the programs as a whole), and the evaluators were asked to select four important factors in each group, and then to name the two most important. The evaluators also assigned numerical measures of importance to the three categories. From the data gathered in this way, the NRC distilled a set of weight coefficients for the 20 variables. These weights are called the survey-based weights, since the faculty evaluators stated their preferences explicitly in a survey.
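The arithmetic of that distillation isn't spelled out here, but one plausible illustration (not the committee's actual formula; the tallies and category importances below are invented) is to count how often each variable was singled out, scale the counts by the stated importance of the variable's category, and normalize so the weights sum to 1:

```python
# Invented survey tallies: how many evaluators named each variable as important.
choice_counts = {
    "publications per faculty": 140,
    "citations per publication": 95,
    "mean GRE score": 80,
    "percent with fellowships": 60,
    "time to degree": 70,
    "student diversity": 30,
}
# Invented importances that evaluators assigned to the three broad categories.
category_importance = {"faculty": 0.5, "students": 0.3, "program": 0.2}
category_of = {
    "publications per faculty": "faculty",
    "citations per publication": "faculty",
    "mean GRE score": "students",
    "percent with fellowships": "students",
    "time to degree": "program",
    "student diversity": "program",
}

# Scale each tally by its category's importance, then normalize to a sum of 1.
scaled = {v: n * category_importance[category_of[v]] for v, n in choice_counts.items()}
total = sum(scaled.values())
survey_weights = {v: s / total for v, s in scaled.items()}
```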

The other set of weights came from an indirect, inferential process. Instead of asking evaluators to state what factors they believe to be most important, the survey panel asked faculty members to rate a selection of doctoral programs, then looked at which characteristics of the rated programs correlated most strongly with high or low scores. In each discipline about 50 faculty members rated a random selection of 15 programs on a scale of 1 to 6, without stating the criteria they applied in judging the programs. (All of the programs were in the evaluator's field of study, but they did not include programs at the evaluator's own university.)

How are weight coefficients inferred from a set of ratings? At a qualitative level, the process is easy to understand. If an evaluator gives consistently high marks to programs where the faculty have a strong record of research publication, and lower marks to programs that are weaker in this respect, then it's reasonable to infer that publications have a major influence on the evaluator's judgments. If there's little correlation between some variable and the evaluator's scores, then that variable is assigned a low weight.

The mathematical procedure for getting quantitative results of this kind is called linear regression, and hence these weights are called the regression-based weights. In essence, the regression finds the set of weights that most accurately predicts the evaluator's ratings of the programs, in a process analogous to fitting a curve to a series of data points. Because the number of parameters is quite large (20 weight coefficients, one for each program variable), the fitting has to be done with care. The NRC devised a stepwise method, which at each stage eliminates some of the less-important variables. (For more detail on the mathematical techniques, see "A Guide to the Methodology of the National Research Council Assessment of Doctorate Programs.")
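At its core this is an ordinary least-squares fit of the evaluators' ratings against the standardized program variables. The sketch below simulates some ratings and recovers weights from them; it omits the NRC's stepwise elimination, and all the numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: 50 rated programs, 5 standardized variables
# (the real problem has 20 variables plus the stepwise elimination).
X = rng.normal(size=(50, 5))
true_weights = np.array([0.4, 0.3, 0.2, 0.1, 0.0])            # invented "true" influences
ratings = X @ true_weights + rng.normal(0.0, 0.2, size=50)    # simulated evaluator scores

# Ordinary least-squares fit: the coefficients that best predict the ratings.
coef, *_ = np.linalg.lstsq(X, ratings, rcond=None)

# One simple way (not necessarily the NRC's adjustment) to turn the fitted
# coefficients into nonnegative weights that sum to 1.
w = np.clip(coef, 0.0, None)
regression_weights = w / w.sum()
```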

An Embarrassment of Rankings

Both the direct and the indirect methods of assigning weight coefficients have plausible arguments in their favor. The survey-based weights reflect the stated preferences of experts in the field, who should know what factors matter most in a successful graduate program. The regression-based weights are derived from the actual judgments of such experts. In short, between the two sets of weights, the experts are both talking the talk and walking the walk.

Initially, the NRC planned to combine the survey-based and regression-based weights and issue a single set of program rankings, but ultimately the committee decided not to blend the weights; instead they released separate rankings based on each system. In addition, there are three more sets of rankings based on narrower subsets of the weights: those that focus on faculty, on students, and on program traits. All these alternatives offer some intriguingly diverse views of the graduate-school universe. On the other hand, if you are merely trying to figure out which doctoral program is best for you, having two sets of rankings (or five sets) is not necessarily better than just one!

How do you choose? Which rankings should you believe? Is it better to do as the experts say, or as they do? If there were simple answers to these questions, the NRC would not have published both sets of rankings. But for users of PhDs.org there is a way to sidestep making the choice. You can define your own set of weight coefficients, emphasizing those factors that are most important in your personal search for a doctoral program.
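In spirit, this is the same weighted-sum recipe with weights of your own choosing. A tiny illustration (variable names, numbers, and stand-in data are all invented; the PhDs.org site does the arithmetic for you):

```python
import numpy as np

# One prospective student's priorities, normalized so they sum to 1 (invented).
my_weights = np.array([
    0.05,  # publications per faculty
    0.10,  # citations per publication
    0.15,  # mean GRE score
    0.35,  # percent of students with fellowships
    0.35,  # short time to degree
])

# Stand-in standardized data for six programs and the five variables above.
standardized = np.random.default_rng(2).normal(size=(6, 5))

# Programs ordered by this personal set of criteria, best first.
my_ranking = np.argsort(-(standardized @ my_weights))
```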

About the Graduate School Guide