On Those 'Stunning' Shanghai Test Scores

The most popular item on the NYT site at this moment concerns the "stunned" reaction of U.S. educators to the high scores of students in Shanghai, in the latest PISA results. PISA is the Programme for International Student Assessment; it's run by the OECD and covers 15-year-olds in dozens of countries around the world; it compares their achievement on a variety of reading, math, and science tests; and once again this year it shows U.S. students in the middle of the pack or worse. The shock was how well students in Shanghai, the only test site in China, scored on the tests. For more on PISA, see "Your Child Left Behind" in the new issue of our magazine--subscribe! If you'd like to try some questions yourself, go here.

As with just about everything concerning modern China, the results should be taken seriously. Chinese schools are full of bright and motivated students; many of them work and study exceptionally hard; most of them are aware of the make-or-break importance of tests in their own life prospects -- and by extension for the country's continued development. No doubt these results reflect something real.

But as with just about everything concerning modern China, the results should also be viewed with some distance and possible skepticism. The 5000+ students who were tested in China's biggest and most modern city may or may not be indicative of broader progress throughout the country (as the NYT story points out). Anyone who has had experience with schools and testing in China will want to know more about how these tests were administered, supervised, and scored.

From a reform-minded American perspective, I'm happy for people to be as startled as possible by these results. Anything that will direct attention to American fundamentals -- education, infrastructure, research, that sort of tedious thing -- is fine with me. We need every spur to action we can get! But on the merits, it's worth applying a version of Reagan's old "trust, but verify" approach toward the Soviet Union. Pay attention, and assume that the general pattern shown here is right and significant. (The best Chinese students working hard and doing well; not enough Americans bearing down hard enough.) But don't take this too literally as the next sign of Inevitable "Chinese Professor" Dominance (source of image above).

Below and after the jump, an overnight reaction from a scientist at a major U.S. university, who explains some detailed cautions against giving too much weight to the results. He writes:

>>I am not at all sure what these numbers mean. But I am pretty sure that the breathless interpretation given by US/Microsoft Secretary of Education Arne Duncan -- i.e. "the brutal truth" of the Chinese "out educating us" is quite overblown.

Let us start by acknowledging that standardized tests can provide meaningful information on individual students, scholastic programs, and even different student populations. And such information, if used properly, can guide school and teaching reforms. For example, the battery of tests administered by Educational Testing Service (ETS), which grew out of Vannevar Bush and James Conant's postwar evaluation of American higher education, initially played a major role in helping open up American universities to a much broader group of talented students. But really good tests are hard and costly to write and to evaluate. And all my years of experience in education lead me to believe that no program of standardized testing has much meaning if test taking becomes an end in itself rather than a measure of other things.

Never mind these big questions regarding the use of standardized tests. It seems to me that even if we take these numbers at face value, there are significant numerical context issues which have not been addressed by the NY Times article.

First, all numerical measurements have intrinsic variance or uncertainty. So if we split the test-taking sample (according to this article, ~5000 students) into several random subgroups, we would expect that the averages would vary. This would allow an estimate of the intrinsic random uncertainty. No error estimates are quoted here.
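To make the split-sample idea concrete, here is a minimal sketch in Python. The scores are simulated, not actual PISA data; the mean of 600, the spread of 100, and the choice of 10 subgroups are all assumptions made for illustration, and the function name is hypothetical. It also treats the students as a simple random sample, which a real survey design may not be.

```python
import random
import statistics

def split_sample_standard_error(scores, n_groups=10, seed=0):
    """Estimate the standard error of the overall mean from random subgroups."""
    rng = random.Random(seed)
    shuffled = list(scores)
    rng.shuffle(shuffled)
    # Deal the shuffled scores into n_groups subgroups of near-equal size.
    groups = [shuffled[i::n_groups] for i in range(n_groups)]
    group_means = [statistics.mean(g) for g in groups]
    # The spread of the subgroup means, shrunk by sqrt(n_groups), approximates
    # the random uncertainty of the mean over the full sample.
    standard_error = statistics.stdev(group_means) / n_groups ** 0.5
    return group_means, standard_error

# Hypothetical data: 5,000 simulated scores, NOT actual PISA results.
data_rng = random.Random(42)
scores = [data_rng.gauss(600, 100) for _ in range(5000)]

group_means, se = split_sample_standard_error(scores)
print(f"overall mean: {statistics.mean(scores):.1f}")
print(f"subgroup means: {min(group_means):.1f} to {max(group_means):.1f}")
print(f"estimated standard error of the mean: {se:.2f}")
```

With an error estimate of this kind in hand, a gap of a few points between two averages could be judged against the random scatter rather than taken at face value.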

Second, how were the representative groups of 5,000 students selected in each study? In a city of 20 million, there might be a population of half a million 15-year-olds enrolled in secondary education. That means there would be many ways to select a "representative" group.

Third, to what extent were students trained specifically for this test? Even in the case of the SAT, which was designed to be an "Aptitude" test and for which preparation was supposed to be useless, Kaplan and other test training services claimed that they could improve scores by an average of 30-50 points on the old 800-point scale (much more significant in upgrading a student's percentile ranking if the student moved from 450 to 500 or even moved from 730 to 780).
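The percentile arithmetic can be sketched roughly as follows. This assumes, purely for illustration, that scores on the old scale follow a normal distribution with mean 500 and standard deviation 100; neither figure comes from the article, and the function name is hypothetical.

```python
import math

def percentile(score, mean=500.0, sd=100.0):
    """Percent of test takers scoring below `score` under a normal model.

    The mean and sd defaults are assumed values, not published statistics.
    """
    z = (score - mean) / sd
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))

for before, after in [(450, 500), (730, 780)]:
    gain = percentile(after) - percentile(before)
    print(f"{before} -> {after}: "
          f"{percentile(before):.1f} -> {percentile(after):.1f} "
          f"(gain of {gain:.1f} percentile points)")
```

Under these assumed parameters, the 50-point move in the middle of the scale is worth about 19 percentile points. The same move near the top is worth less than one point, though it cuts the share of test takers ahead of the student from about 1.1 percent to about 0.3 percent.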

Now some comments on the numerical data with respect to these basic statistical ideas.

The spread for the approximately 30 countries (where are the other 35 mentioned in the article?) is around 110 points. The bottom-scoring half of these cluster in an approximate 20-point range between 490 and 510; the top half spread out over an approximate 90-point range between 510 and 600; the top 3 spread out over an approximate 30-point range. If the table only gives the top countries and the distribution is Gaussian, an estimate for the distribution variance suggests that there is no statistically significant contrast among the "bottom" two thirds in the table. This idea is supported by the jumbling of the ordering between the three test categories.

The gap between Shanghai and the second-place finisher for the math scores is somewhat more spread out than the corresponding segments in reading and science. The rest of the distributions appear fairly constant. So the most likely explanation for this would be that the Shanghai schools were specifically trained for the "PISA" exam and this training was most effective for the math subject test. Once again, the jumbling and narrow range of scores in the top third in all test categories (except Shanghai, which appears to be an outlier) is also supportive of the idea that Shanghai is "gaming" the exam.

What we are ultimately left with, from a strict statistical point of view, are pretty modest distinctions. Correlations between rankings in the three subject-area tests support the idea that there might be statistically significant outcome differences, as measured by this test, between countries in the top and bottom thirds of the rankings. In my view even this relatively modest inference would be weakened, and might not stand up, under real critical examination of the sampling (how the representative groups were chosen for comparison).

In this vein there are some other notable puzzles which argue against attaching any validity whatsoever to these rankings. For example, why are Denmark, Finland, Norway, Iceland, Sweden, and Estonia so different? And what about Canada, New Zealand, and Australia? The variance within these similar groupings supports the idea that the underlying cause of score differentiation has more to do with sampling variance than national educational effectiveness.

It is certainly arguable that the Chinese educational system and culture lead the world in training students how to take tests. But it is not clear whether this type of training prepares students for much else other than taking tests. Certainly I have seen much evidence for this proposition in the Chinese graduate students that I have worked with. My favorite examples were the Chinese students with perfect TOEFL scores who could neither read nor write English in any meaningful way.

With regard to this type of study in general: Statistical inference strongly depends on the quality of study design and methods, and on critical analysis of data. Without these aspects this type of effort just recapitulates the sum of all prejudices. So here we have the consistent Tom Friedman/NYT/Chattering class message. Chinese: good: smart, better educated, harder working, etc. Americans: bad: dumb, poorly educated, lazy, and fat. We deserve to be losers.

With regard to educational policy in particular: Standardized tests may have their uses, but structuring national educational systems around such tests seems to me to be a prescription for disaster. This is the road that we are on, and in my view it is destroying public education. No one benefits from this unless they want public education to produce conformity and narrowly trained individuals with limited "shelf-life". So who else other than corporations like Microsoft benefits (or at least people like Bill Gates and Tom Friedman think they do) from this mad view of education?

Perhaps the real educational deficiency here is in the sophistication with which our chattering and leadership classes understand statistics and the limitations of standardized tests in measuring student, school, and national educational system achievement. Not to mention what constitutes a good education and how it serves broader goals of national development.<<

On the timeless principle that "I'd rather make a point too often than not make it often enough," I'll say one more time: take this seriously, and recognize that China is moving ahead in many, many ways. But recognize the fallibilities in this study, and don't go nuts.