A Complete History of IQ Testing: From Binet to Today

    Intelligence testing is one of psychology's most consequential inventions — and one of its most controversial. Over 120 years, it has evolved from a simple diagnostic tool designed to help struggling schoolchildren into a multi-billion-dollar industry that influences academic placement, military assignments, hiring decisions, and immigration policy. Understanding where IQ tests came from — and the sometimes troubling uses to which they were put — is essential context for interpreting what they actually measure today.

    Before Binet: Early Attempts to Measure Intelligence

    The scientific attempt to measure human mental ability predates Alfred Binet by decades. Francis Galton, Charles Darwin's cousin and the founder of the eugenics movement, was arguably the first to try to quantify intelligence empirically. In the 1880s, Galton established an anthropometric laboratory in London where he measured thousands of visitors on sensory acuity, reaction time, grip strength, and other physical characteristics, believing these would correlate with mental ability.

    Galton's key insight was that intelligence could be measured objectively and varied normally across the population. His key error was believing that simple sensory-motor measures would capture it — they largely don't. American psychologist James McKeen Cattell introduced the term "mental test" in 1890 and continued in Galton's tradition, but his tests also failed to show meaningful correlations with academic performance.

    The theoretical foundation for modern intelligence testing came from Charles Spearman, who in 1904 published his landmark analysis demonstrating that scores on diverse cognitive tasks tend to correlate positively with one another — suggesting a common underlying factor he called "g," or general intelligence. Spearman's "g" remains at once the most debated and the most replicated finding in all of intelligence research.
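    To make the logic concrete, here is a toy simulation — our own illustration, not Spearman's actual method — showing how a single shared factor produces the all-positive correlation pattern he observed. The loading value, task count, and sample size are arbitrary.

```python
import random
import statistics

random.seed(1)

def simulate_scores(n=2000, loading=0.6, n_tasks=3):
    """Simulate examinees whose task scores all load on one shared
    latent factor g, plus independent task-specific noise."""
    rows = []
    for _ in range(n):
        g = random.gauss(0, 1)  # the shared latent factor
        rows.append([loading * g + (1 - loading**2) ** 0.5 * random.gauss(0, 1)
                     for _ in range(n_tasks)])
    return rows

scores = simulate_scores()
# Every pairwise correlation comes out positive (near loading^2 = 0.36):
# the "positive manifold" Spearman documented in real test data.
for i in range(3):
    for j in range(i + 1, 3):
        r = statistics.correlation([row[i] for row in scores],
                                   [row[j] for row in scores])
        print(f"task {i + 1} vs task {j + 1}: r = {r:.2f}")
```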

    Alfred Binet and the Birth of IQ Testing (1905)

    The practical history of IQ testing begins in France in 1904. The French Ministry of Public Instruction was grappling with a challenge that felt urgent: children were entering Parisian public schools with vastly different levels of readiness, and teachers needed a systematic way to identify those who required extra support. They commissioned psychologist Alfred Binet and his collaborator Théodore Simon to develop a diagnostic tool.

    The 1905 Binet-Simon Intelligence Scale marked a sharp break with the Galton tradition. Rather than measuring sensory acuity, Binet tested higher-order functions: vocabulary, reasoning, memory, understanding of directions, and practical problem-solving. Crucially, Binet organized the tests by age level — he established which tasks a typical 5-year-old, 7-year-old, or 10-year-old could successfully complete. A child's "mental age" was the highest age level at which they could pass the tasks.

    A child with a mental age of 8 who was chronologically 10 years old was clearly behind; a child with a mental age of 12 who was chronologically 10 was clearly ahead. This was the foundation of what would eventually become the intelligence quotient. Binet himself was careful and humble about his scale's limitations. He explicitly rejected the idea that intelligence was a fixed, inherited quantity — he saw his test as a practical diagnostic tool to identify children needing help, and he believed intelligence could be developed with the right educational interventions.

    Binet died in 1911 — before he could see how his careful, modest diagnostic tool would be transformed in the United States into something far grander and more politically charged.

    Lewis Terman and the Stanford-Binet (1916)

    The American transformation of Binet's scale was led by Lewis Terman, a psychologist at Stanford University who was deeply influenced by Galton's eugenics movement. Terman translated and substantially revised the Binet-Simon Scale, standardizing it on American children, adding new test items, and extending its range. He published the result in 1916 as the Stanford-Binet Intelligence Scale — the name it carries to this day.

    Terman also popularized the term "intelligence quotient," coined by German psychologist William Stern in 1912. The original IQ formula was straightforward: mental age divided by chronological age, multiplied by 100. A 10-year-old with a mental age of 12 had an IQ of 120 (12/10 × 100). This ratio IQ has since been replaced by deviation IQ — which compares performance to same-age peers — but the term "IQ" stuck.
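    As a quick sketch of that arithmetic (the function name and the adult example are ours, for illustration):

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Terman-era ratio IQ: mental age over chronological age, times 100."""
    return mental_age / chronological_age * 100

# The example from the text: a 10-year-old with a mental age of 12.
print(ratio_iq(12, 10))  # 120.0

# Why the ratio broke down for adults: mental age plateaus around the
# late teens, so the quotient shrinks as chronological age keeps rising.
print(ratio_iq(16, 40))  # 40.0 -- plainly not a meaningful adult IQ
```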

    Terman, unlike Binet, was an ardent hereditarian who believed intelligence was primarily genetic and that high-IQ individuals were natural leaders who should guide society. He used his Stanford-Binet to track gifted children in his famous "Genetic Studies of Genius" — the longest longitudinal study of intellectually gifted children ever conducted, begun in 1921 and continued for decades after his death. The study revealed much about the lives of high-IQ individuals, but it also reflected the racial biases of Terman's era — his sample was almost entirely white, and he interpreted group score differences through a hereditarian lens that has since been thoroughly critiqued.

    Army Alpha and Beta: Intelligence Testing Goes to War (1917–1918)

    When the United States entered World War I in 1917, the Army faced the challenge of classifying nearly two million recruits quickly. Robert Yerkes, then president of the American Psychological Association, saw an opportunity to demonstrate the utility of psychological testing and lobbied the military to let psychologists develop classification tests.

    The result was the Army Alpha and Army Beta tests — the first large-scale group intelligence tests ever administered. The Alpha was a verbal test for literate recruits, covering analogies, arithmetic, general knowledge, and following directions. The Beta used pictures and non-verbal items for the 30% of recruits who were illiterate or didn't speak English.

    These tests were administered to 1.75 million men and produced an enormous dataset that psychologists eagerly analyzed after the war. The conclusions drawn — that the average mental age of American adults was only 13, that recent immigrant groups from Southern and Eastern Europe scored lower than Northern Europeans, that Black Americans scored lower than white Americans — were taken by many as scientific confirmation of racial hierarchy.

    Carl Brigham, who worked on the Army testing program, published A Study of American Intelligence in 1923, arguing that immigration from Southern and Eastern Europe was lowering American intelligence. This work directly influenced the Immigration Act of 1924, which drastically restricted immigration from those regions. Brigham, who went on to develop the Scholastic Aptitude Test (SAT), later recanted his racial conclusions entirely — but the damage was done.

    The Army testing experience exposed a recurring problem with IQ tests: the distance between what they actually measure and how their results are interpreted and used. The low scores of immigrant groups almost certainly reflected language barriers, cultural unfamiliarity, and lack of formal education — not innate intelligence — but this was widely ignored in the service of pre-existing prejudices.

    The Eugenics Era: Intelligence Testing's Dark Chapter

    The early 20th century saw intelligence testing become entangled with the eugenics movement — the pseudoscientific program to "improve" humanity by encouraging reproduction among those deemed genetically superior and discouraging or preventing reproduction among those deemed inferior. IQ tests became a primary tool for identifying the supposedly "feebleminded."

    The consequences were devastating. Over 60,000 Americans — disproportionately poor, immigrant, minority, and disabled — were forcibly sterilized under state eugenics laws during the 20th century. The Supreme Court upheld this practice in the infamous 1927 Buck v. Bell decision (which has never been formally overturned). IQ tests provided the pseudo-scientific veneer that made these abuses appear rational and systematic.

    Henry Goddard, who translated the Binet-Simon Scale into English and introduced it to American psychology, used his version of the test at Ellis Island to classify incoming immigrants as "feebleminded" — concluding (absurdly, given the obvious confounds of language and cultural unfamiliarity) that 83% of Jews, 79% of Italians, and 87% of Russians were feebleminded. This chapter remains a cautionary tale about the misuse of psychometric tools in service of ideology.

    The scientific community's eventual repudiation of eugenics — accelerated by the Nazi Holocaust, which revealed where eugenic ideology led in practice — did not end intelligence testing, but it forced a reckoning with the assumptions embedded in early tests.

    David Wechsler and the Modern Intelligence Test (1939)

    The most important figure in reshaping intelligence testing for the modern era was David Wechsler, a Romanian-born psychologist working at Bellevue Hospital in New York. Wechsler was dissatisfied with the Stanford-Binet for adult populations — it had been designed primarily for children, used ratio IQ (which becomes meaningless for adults, since mental ability plateaus while chronological age keeps rising), and yielded only a single score.

    In 1939, Wechsler published the Wechsler-Bellevue Intelligence Scale, which made several important innovations. First, it was designed specifically for adults. Second, it replaced ratio IQ with deviation IQ — comparing an individual's score to the distribution of scores in their same-age peer group, with 100 as the average and 15 as the standard deviation. This is the scoring system all major IQ tests use today.
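    A minimal sketch of deviation scoring, assuming hypothetical age-group norms (the raw-score numbers below are invented for illustration):

```python
def deviation_iq(raw_score: float, norm_mean: float, norm_sd: float) -> float:
    """Deviation IQ: standardize a raw score against same-age norms,
    then rescale to the IQ metric (mean 100, SD 15)."""
    z = (raw_score - norm_mean) / norm_sd  # standing within the age group
    return 100 + 15 * z

# Hypothetical norms: suppose 40-year-olds average 52 raw points (SD 8).
# A raw score of 60 sits one SD above the age-group mean -> IQ 115,
# no matter how raw performance itself shifts across the lifespan.
print(deviation_iq(60, norm_mean=52, norm_sd=8))  # 115.0
```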

    Third, and most importantly, the Wechsler scale provided separate Verbal and Performance IQ scores in addition to a Full-Scale IQ, explicitly recognizing that intelligence is multidimensional. This was a significant conceptual advance over tests that reduced everything to a single number.

    Wechsler went on to develop the Wechsler Intelligence Scale for Children (WISC) in 1949 and the Wechsler Preschool and Primary Scale of Intelligence (WPPSI) in 1967. Today, the WAIS-IV (fourth edition of the adult scale, 2008) and WISC-V (fifth edition of the children's scale, 2014) are the gold-standard clinical intelligence tests worldwide. See our What Is IQ? guide for more on how modern tests are structured.

    Raven's Progressive Matrices: A Culture-Reduced Alternative

    One of the most influential and widely used intelligence tests of the 20th century was developed outside the Binet-Wechsler tradition. John C. Raven, a British psychologist, developed his Progressive Matrices in 1936 and first published them in 1938, designing the test specifically to measure fluid intelligence — abstract reasoning ability — with minimal verbal content and cultural loading.

    Raven's test presents a series of visual patterns with a missing piece; the test-taker must identify which of several options correctly completes the pattern. The patterns progress from simple to highly complex, requiring increasingly sophisticated visual reasoning. Because the test requires no language, no cultural knowledge, and minimal educational background, it became widely used for cross-cultural research and for measuring cognitive ability in populations where verbal tests would be inappropriate.

    The Flynn Effect was originally documented most clearly using Raven's Matrices, because the non-verbal, relatively culture-neutral format made cross-generational comparisons more meaningful. Today, the Standard Progressive Matrices (SPM), Advanced Progressive Matrices (APM), and Colored Progressive Matrices (CPM, for children and the elderly) remain widely used in research and clinical settings.

    The Cattell-Horn-Carroll Model: Modern Theoretical Foundations

    The theoretical framework underlying modern intelligence tests has become increasingly sophisticated. Raymond Cattell's distinction between fluid intelligence (Gf) and crystallized intelligence (Gc), introduced in 1941 and elaborated by his student John Horn, provided a richer conceptual vocabulary than Spearman's single "g."

    John Carroll's 1993 landmark work, Human Cognitive Abilities, synthesized 60 years of factor-analytic research into a hierarchical model with three strata: specific narrow abilities at the bottom, broad abilities in the middle (including fluid and crystallized intelligence, processing speed, and memory), and general intelligence "g" at the top.

    The integration of Cattell-Horn and Carroll's models produced the Cattell-Horn-Carroll (CHC) theory, which now serves as the dominant theoretical framework for major intelligence tests. The WAIS-IV, Woodcock-Johnson IV, Stanford-Binet 5, and Kaufman Assessment Battery for Children all explicitly map their subtests to CHC broad and narrow abilities.
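    As a rough sketch of how that hierarchy is organized — the broad and narrow abilities below are a small, illustrative subset of the full CHC taxonomy, with our own abbreviated labels:

```python
# Stratum III (top): general intelligence "g".
# Stratum II (middle): broad abilities -- only four of roughly sixteen shown.
# Stratum I (bottom): narrow abilities, heavily abbreviated here.
chc_sketch = {
    "Gf (fluid reasoning)":        ["induction", "sequential reasoning"],
    "Gc (crystallized knowledge)": ["vocabulary", "general information"],
    "Gwm (working memory)":        ["memory span", "attentional control"],
    "Gs (processing speed)":       ["perceptual speed", "rate of test-taking"],
}

# Test designers map each subtest onto this tree: a vocabulary subtest
# loads on Gc, a matrix-reasoning subtest on Gf, digit span on Gwm.
for broad, narrow in chc_sketch.items():
    print(f"g -> {broad} -> {', '.join(narrow)}")
```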

    Controversies: Cultural Bias and Group Differences

    The most persistent controversy in the history of IQ testing concerns group differences in average test scores — particularly the well-documented gap between Black and white Americans, which has averaged approximately 10–15 points depending on the test and sample.

    The causes of this gap are vigorously debated. Hereditarian researchers have argued (most controversially in Richard Herrnstein and Charles Murray's 1994 book The Bell Curve) that genetic factors contribute to group differences. The mainstream scientific consensus, reflecting the American Psychological Association's 1996 task force report Intelligence: Knowns and Unknowns, is that the causes of group differences are not well understood but likely involve a complex mixture of environmental factors: historical and ongoing educational inequities, socioeconomic disparities, stereotype threat, test bias, and differential exposure to environmental hazards like lead.

    Importantly, the Black-white gap has narrowed substantially since the 1970s — falling from roughly 15 points to approximately 10 points in recent decades, coinciding with reductions in socioeconomic inequality, educational improvements, and declining lead exposure in predominantly Black communities. This narrowing strongly suggests environmental causes are dominant, since genetic makeup doesn't change on this timescale.

    Test bias in the strict psychometric sense — that tests measure different things, or measure them with different accuracy, in different groups — has been extensively studied. Most researchers conclude that well-constructed modern IQ tests show minimal differential item functioning (DIF) across racial groups, and that they predict academic and occupational performance similarly for different groups. This does not eliminate concerns about cultural loading in the content of tests, but it distinguishes measurement bias from differences in the construct being measured.
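    To give a flavor of how DIF screening works, here is a stripped-down sketch of the Mantel-Haenszel procedure commonly used for this purpose. Real analyses rely on dedicated psychometric software, continuity corrections, and significance tests that we omit, and the data format here is our own.

```python
from collections import defaultdict

def mantel_haenszel_odds_ratio(responses):
    """Screen one item for DIF via the Mantel-Haenszel common odds ratio.

    `responses` is an iterable of (group, total_score, item_correct)
    tuples, with group "ref" or "focal". Examinees are stratified by
    total test score, so the groups are compared at matched ability
    levels. A ratio near 1.0 suggests the item behaves similarly in
    both groups; large departures flag the item for review.
    """
    # Per score stratum: [pass, fail] counts for each group.
    strata = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for group, total, correct in responses:
        strata[total][group][0 if correct else 1] += 1

    num = den = 0.0
    for cell in strata.values():
        a, b = cell["ref"]    # reference group: pass, fail
        c, d = cell["focal"]  # focal group: pass, fail
        n = a + b + c + d
        if n:
            num += a * d / n
            den += b * c / n
    return num / den if den else float("nan")
```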

    Modern IQ Tests: The State of the Art

    Today's major IQ tests represent 120 years of refinement. The leading instruments include:

    • WAIS-IV (Wechsler Adult Intelligence Scale, 4th edition, 2008) — The gold standard adult clinical IQ test, measuring four index scores: Verbal Comprehension, Perceptual Reasoning, Working Memory, and Processing Speed, plus a Full-Scale IQ. Used for neuropsychological assessment, disability evaluation, and research.
    • WISC-V (Wechsler Intelligence Scale for Children, 5th edition, 2014) — The corresponding children's test (ages 6–16), now measuring five primary index scores and widely used for educational placement and giftedness assessment.
    • Stanford-Binet 5 (2003) — Measures five factors: Fluid Reasoning, Knowledge, Quantitative Reasoning, Visual-Spatial Processing, and Working Memory across both verbal and non-verbal domains. Suitable from age 2 through adulthood.
    • Kaufman Assessment Battery for Children (KABC-II, 2004) — Designed with special attention to reduced cultural bias; explicitly grounded in CHC theory and Luria's neuropsychological model.
    • Woodcock-Johnson IV Tests of Cognitive Abilities (2014) — The most comprehensive CHC-based battery, measuring the broadest range of cognitive abilities and widely used in educational and clinical settings.
    • Raven's Advanced Progressive Matrices — Still widely used in research as a relatively culture-neutral measure of fluid reasoning.

    All modern tests use deviation IQ scoring (mean 100, SD 15), are periodically re-normed to account for the Flynn Effect, and are scrutinized for evidence of differential item functioning across demographic groups. Explore our IQ score ranges guide to understand what different scores mean on these tests.

    The Future of Intelligence Testing

    The field is not static. Several directions are reshaping intelligence assessment in the 21st century:

    Dynamic Assessment

    Traditional IQ tests are static — they measure what you can do without help. Dynamic assessment, rooted in Vygotsky's concept of the "zone of proximal development," tests how much someone can improve with structured hints and guidance. This approach may better capture learning potential in populations whose limited prior knowledge or test-taking experience depresses performance on static tests.

    Process-Oriented Assessment

    Rather than focusing only on whether an answer is right or wrong, some researchers are developing methods to analyze the cognitive processes used to solve problems — capturing information about working memory, attention allocation, and strategy use that traditional scoring misses.

    Cognitive Neuroscience Integration

    Advances in neuroimaging are allowing researchers to correlate IQ test performance with structural and functional brain measures. Studies consistently find that full-scale IQ correlates with measures of neural efficiency, white matter integrity, and processing speed at the neural level. This convergent validation strengthens the case that IQ measures something real about underlying cognitive biology.

    Computer Adaptive Testing

    Modern computerized tests can adapt difficulty in real time based on responses, allowing precise ability estimation with fewer items. This improves measurement efficiency while reducing floor and ceiling effects that limit traditional fixed-format tests.
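    Here is a toy sketch of the adaptive loop under a one-parameter (Rasch) IRT model — our own simplification, not any test vendor's algorithm. The item bank, update rule, and simulated examinee are all illustrative.

```python
import math
import random

def rasch_p(theta, b):
    """P(correct) for ability theta on an item of difficulty b,
    under the one-parameter (Rasch) IRT model."""
    return 1 / (1 + math.exp(-(theta - b)))

def adaptive_test(bank, true_theta, n_items=15, step=0.7):
    """Minimal CAT loop: present the unused item whose difficulty is
    closest to the current ability estimate (the most informative item
    under the Rasch model), then nudge the estimate toward the response
    with a gradient step on the log-likelihood."""
    theta_hat, used = 0.0, set()
    for _ in range(n_items):
        b = min((d for d in bank if d not in used),
                key=lambda d: abs(d - theta_hat))
        used.add(b)
        correct = random.random() < rasch_p(true_theta, b)  # simulated examinee
        theta_hat += step * ((1.0 if correct else 0.0) - rasch_p(theta_hat, b))
    return theta_hat

random.seed(0)
bank = [d / 10 for d in range(-30, 31)]  # item difficulties from -3.0 to +3.0
# Prints a rough estimate of the simulated examinee's true ability (1.0);
# precision improves as more items are administered.
print(round(adaptive_test(bank, true_theta=1.0), 2))
```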

    Despite 120 years of controversy and refinement, the core findings of intelligence research remain: performance on diverse cognitive tasks is positively correlated, that correlation reflects something meaningful about cognitive capacity, and scores on well-constructed tests predict important life outcomes, including academic achievement, job performance, and health. To see where you fall on modern cognitive assessments, take our free IQ test or explore our profiles of famous people's IQ scores.

    Frequently Asked Questions

    Who invented the IQ test?

    Alfred Binet, working with Théodore Simon in France, created the first practical intelligence test in 1905. The term "intelligence quotient" was coined by German psychologist William Stern in 1912 and popularized by Lewis Terman's 1916 Stanford-Binet adaptation. Binet himself never intended his test to measure a fixed, inherited quantity — he saw it as a practical diagnostic tool for identifying children needing educational support.

    What is the difference between Stanford-Binet and WAIS?

    The Stanford-Binet descended from Binet's original scale via Lewis Terman's 1916 adaptation. The WAIS was independently developed by David Wechsler in 1939 specifically for adults, introducing deviation IQ scoring and separate verbal/performance components. Today, both are widely used clinical instruments grounded in CHC theory, though they differ in structure and emphasis. See our What Is IQ? page for more details.

    What were the Army Alpha and Beta tests?

    Developed during WWI to classify 1.75 million recruits, the Army Alpha (verbal) and Beta (non-verbal) tests were the first mass-administered group intelligence assessments. Their results were widely misinterpreted and used to justify discriminatory immigration policies in the 1920s — a cautionary chapter in the history of psychological testing.

    Are IQ tests culturally biased?

    Early IQ tests were substantially culturally loaded, reflecting the knowledge and linguistic norms of dominant groups. Modern tests have made significant efforts to reduce this, and psychometric analyses generally show minimal differential item functioning across contemporary demographic groups. Even so, culture-reduced tests like Raven's Matrices remain important alternatives for cross-cultural research, and group differences in average scores remain a subject of active research. Explore IQ score distributions for more context.

