The Entry of Randomized Assignment into the Social Sciences

Although the concept of randomized assignment to control for extraneous factors reaches back hundreds of years, the first empirical use appears to have been in an 1835 trial of homeopathic medicine. Throughout the 19th century, there was primarily a growing awareness of the need for careful comparison groups, albeit often without the realization that randomization could be a particularly clean method to achieve that goal. In the second and more crucial phase of this history, four separate but related disciplines introduced randomized control trials within a few years of one another in the 1920s: agricultural science, clinical medicine, educational psychology, and social policy (specifically political science). Randomized control trials brought more rigor to fields that were in the process of expanding their purviews and focusing more on causal relationships. In the third phase, the 1950s through the 1970s saw a surge of interest in more applied randomized experiments in economics and elsewhere, in the lab and especially in the field.

Development of Western science is based on two great achievements: the invention of the formal logical system (in Euclidean geometry) by the Greek philosophers, and the discovery of the possibility to find out causal relationships by systematic experiment (during the Renaissance).
-Albert Einstein (1953)

Introduction
The quote above appears in Pearl (2000), a comprehensive reference on the statistics of causality.
In an entertaining history of the "art and science of cause and effect", Pearl refers to the randomized experiment as "the only scientifically proven method of testing causal relations from data, and to this day, the one and only causal concept permitted in mainstream statistics." Interestingly, although Einstein dates the idea of causal experiments to the Renaissance (any relationship to randomization goes unspoken by him), Pearl claims that it waited upon Fisher in the 1930s. As we shall see, they were both correct: the explicit idea appeared hundreds of years ago, but only in an isolated fashion; it did not become a social construct until the 1920s.
Einstein and Pearl agreed upon the central role of rigorous experiments in determining causality, which has long been understood and accepted in the physical and biological sciences but has undergone a more recent rise in the social sciences. 1 The basic idea is straightforward: Suppose you wish to test the relative effect of treatment (or intervention, broadly construed) A versus treatment B, one of which could be a null treatment or the status quo. Take a large number of subjects (individuals, schools, firms, villages, etc.) and divide them randomly into two groups. The first group gets A and the second group gets B; other than that, their experiences are identical. Since the division was random and the sample size was large, we can be highly confident that the two groups started out with approximately the same average levels of all relevant characteristics, both observable and unobservable. Therefore, any aggregate differences between the groups measured after the experiment can be causally identified with the corresponding treatments.
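The basic idea described above can be illustrated with a short simulation. This is a minimal sketch rather than anything drawn from the studies discussed in this paper: the function name, the sample size, the unobserved "hidden" trait, and the true effect of 2.0 are all assumptions chosen purely for illustration.

```python
import random
import statistics


def simulate_trial(n_subjects=10_000, true_effect=2.0, seed=0):
    """Simulate a two-arm randomized experiment (illustrative assumptions only)."""
    rng = random.Random(seed)

    # Hypothetical unobserved trait that influences every subject's outcome.
    hidden = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]

    # Randomized assignment: shuffle the subjects and give the first half treatment A.
    order = list(range(n_subjects))
    rng.shuffle(order)
    group_a = set(order[: n_subjects // 2])

    outcomes_a, outcomes_b, hidden_a, hidden_b = [], [], [], []
    for i in range(n_subjects):
        noise = rng.gauss(0.0, 1.0)
        if i in group_a:
            outcomes_a.append(hidden[i] + true_effect + noise)
            hidden_a.append(hidden[i])
        else:
            outcomes_b.append(hidden[i] + noise)
            hidden_b.append(hidden[i])

    return {
        # Difference in the average unobserved trait between the two groups.
        "imbalance": statistics.mean(hidden_a) - statistics.mean(hidden_b),
        # Difference in average outcomes, i.e. the estimated treatment effect.
        "estimated_effect": statistics.mean(outcomes_a) - statistics.mean(outcomes_b),
    }


result = simulate_trial()
print(result)
```

With a large sample, the imbalance in the unobserved trait should be close to zero and the difference in mean outcomes close to the assumed true effect, even though the analysis never looks at the hidden trait.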
Naturally there are assumptions to be made, and there are many complications that arise in specific instantiations of this approach. The goal of this paper is not to enter into the debate about the relative merits of randomization, although certain elements of that debate will make appearances throughout. However, it is abundantly clear that randomized experiments are not always the right tool for the job. Imagine the idea of testing the efficacy of parachutes versus a control group on mortality rates when disembarking an airplane at a height of 10,000 feet (3,000 meters) above ground level. Not only would a randomized experiment be unethical, it would be completely unnecessary. That experiment has never been undertaken (see Smith and Pell 2003 for a review of the literature), and yet we are convinced from theory and from analogy and from common sense that we know the actual relative efficacy of the two approaches. 2
Let us begin with a definition for our central concept of randomized assignment. An empirical research study consists of one or more observations, where an observation consists of measured conditions (what was done, when, to whom or what, etc.) and measured outcomes (what resulted). For instance, an observation might be that a 29-year-old male with a ponytail and a red jacket was made to sit for four minutes alone in a dark room in the morning (the conditions), after which he coughed and ran away (the outcomes). Randomized assignment occurs when the value of at least one condition is assigned randomly across observations. In the example above, the researcher might randomly assign some subjects to sit for four minutes and others for ten minutes. Or the researcher might assign the same subject, arriving on various occasions at different times of day and wearing different clothes, to sometimes sit in a dark room and sometimes in a brightly lit one.
The value of randomized assignment is that it implies that the measured status of the condition which is randomized is uncorrelated (in expectation) with any of the other conditions, and hence that any variation in outcomes as a function of that status must be due to the influence of that randomized condition, at least within the set of potential conditions examined. In the second example above, the researcher can causally determine the impact of light on any measurable outcomes of interest, but only for that subject and only when sitting alone in a room at times of day when the experiment is carried out. Randomization yields 'internal validity' (comparing "like with like" in the phrase of Chalmers 2001) but not 'external validity'. Of course, the larger the relevant population or expanse of observed conditions, the more widespread and robust is the conclusion. This is distinct from the random sampling of subjects from a larger population in order to draw conclusions that are representative of the entire population. That too can rightly be called randomization, and it has an important place in social science, but the rationale and history are not the same; see Fienberg and Tanur (1987). 3 Although the two are quite different in purpose and application, there has been some confusion over the years in the philosophy of science literature. 4
The articulation above implicitly assumes that the goal of randomized assignment is precisely to draw causal inferences, but there are other possible rationales, in particular fairness. In many applied contexts (such as medicine and social policy) with limited resources, it seems that nothing could be fairer than randomizing opportunities across all eligible individuals; on the other hand, nothing could use less information about where to optimally allocate those limited resources than randomization. If the interventions are fairly well understood, and the default behavior involves allocation that is relatively efficient from a social perspective, then randomization is harder to defend. One response that is common in medical trials is to use current best practice for the control group (rather than e.g. a placebo). If there is less information in the first place, or if the default allocation mechanism is suboptimal (e.g. via corruption and nepotism), then randomized assignment starts to look much more attractive on its own merits. There are more cases with evidence suggesting that randomization was adopted for the purpose of determining causality than for the purpose of fairness, 5 though undoubtedly fairness has been used to sway policy makers in particular.
2 Note also that sometimes causality is irrelevant in science, e.g. when estimating the speed of light or how quickly a feather falls in a vacuum.
3 They mention a creative and early use of randomization carried out by Mahalanobis (1946) in India while surveying factory workers. Instead of assigning one enumerator to each area, he divided the areas into five independent random samples and had each enumerator work in every area. This is a nice example of embedding experimental design (involving randomized assignment) into survey design.
4 Urbach (1985) claims that from a Bayesian perspective, randomization can be of no use in testing statistical hypotheses; Papineau (1994) rightly responds that this only holds true for random sampling and not for randomized experimentation of the type considered here.
Another use of randomization, also closely tied to the one that is the focus of the present paper, is for the purpose of ensuring the validity of specific statistical tests. Many formal analyses are predicated on certain assumptions regarding the data-generating process which can only be satisfied, or are more easily satisfied, when there has been random allocation into treatments. Thus, in practice the two notions are correlated and overlapping, but they are conceptually distinct and can diverge in practice. Indeed, randomized assignment serves a deeper purpose than simply impartially dividing a sample into subsamples, and it may apply even when causality is not a central concern: it guarantees us that any two observations we collect are comparable (in expected terms) across dimensions other than those we know about and vary in a controlled manner.
The primary focus of this paper is randomized assignment, but since no history of randomization would be complete without mention of Ronald Fisher, the originator and proselytizer of statistical randomization, let us pause for a moment to summarize his contributions. Fisher took a position in 1919 as statistician at Rothamsted agricultural research station, where his main job was to analyze the piles (literally) of existing data from previous 'experiments'. He started to develop theories of his own about how to optimally run experiments, culminating in the publication of his classic book (Fisher 1935). Although he had advocated randomization (in the sense of e.g. randomly allocating different seeds or fertilizers to different plots of land) as a theoretical concept in 1925, his first empirical publication that used randomization as a technique came two years later (Eden and Fisher 1927).
Apart from Fisher's statistical contributions, his main role in the history of randomization is that he explicitly and tirelessly advocated for rigorous experimentation and evaluation, including randomization, and that he gave more applied researchers the tools and techniques they needed to make this happen. Given that the basic idea of randomized assignment had arisen many years earlier but had not been particularly influential, this was an important step. In fact, even then not everyone liked the idea of randomization: Fisher's great statistical contemporary, Gosset (who published as "Student" due to restrictions imposed by his employer, the Guinness brewery), felt that it was better to match data points on as many observable characteristics as possible, with randomization simply adding unwanted and unnecessary noise to the data. For small sample sizes this may well be true, 6 but Fisher's approach has mostly carried the day, albeit more quickly in some fields than in others.
The goal of this paper is to chart the introduction of randomized assignment, both in actual practice and as a conceptual construct, into various intersecting and intertwining branches of medical and social science, especially economics, psychology, and policy. One motivation for doing so is the growing success of this approach (at least in terms of relative popularity and claimed standing as a 'gold standard'). However, the focus is on the narratives and conditions surrounding the entry (and occasional re-entry) in and across disciplines, including especially the intellectual environment of the early adoptions, rather than on the factors that did or did not lead to later success and their relative merits. The main contributions to the existing literature are: corrections of various misstatements regarding initial appearances of randomization; earlier examples of randomized assignment in a variety of disciplines; bringing together in one place the discussion across medicine and multiple social sciences; and given all these elements being able to draw inferences about patterns regarding the viability and acceptance of randomization when it was a novel scientific research construct.
The remainder of the paper proceeds as follows. Section 2 provides the early background, tracing various isolated instances of both randomized assignment and not-quite-randomized assignment. Section 3 briefly discusses the history of randomized assignment in clinical medicine, the field with which it is most closely associated. Then we turn to social science proper, beginning with psychology in Section 4, followed by economics in Section 5 and, finally, social policy (including public health) in Section 6. Section 7 provides concluding remarks.

Prelude
In the Bible, Proverbs 18:18 reads "The lot causeth disputes to cease, and it decideth between the mighty." 7 If only! But it is a nice thought, although presumably this refers to randomization as a way to solve disputes directly rather than as a way to help determine who is actually right. Meanwhile, in the Book of Daniel (1:8-16), Daniel does not wish to consume the royal fare and suggests a test: he and his friends will eat only pulse and drink only water for ten days, after which the official can compare their health to that of the young men consuming the royal fare. 8 Although this episode nicely captures the idea of a comparison group, there is an obvious problem with endogeneity and selection bias. Hence not only is randomization in any form missing, but there is no sense of a controlled or fair experiment. Furthermore, in terms of chronology, although Charles Darwin did not advance his theory of natural selection until the mid-19th century (Darwin 1859), Nature had conveniently begun to experiment via randomization in the context of allopatric speciation after vicariance to test his theory some millions of years earlier. 9
It is likely that scholars in antiquity understood the basic idea of comparing two similar groups in order to reliably test interventions. However, the first written documentation of which I am aware is by the poet Petrarch (1364) in a letter to Boccaccio:
I solemnly affirm and believe, if a hundred or a thousand men of the same age, same temperament and habits, together with the same surroundings, were attacked at the same time by the same disease, that if one half followed the prescriptions of the doctors of the variety of those practicing at the present day, and that the other half took no medicine but relied on Nature's instincts, I have no doubt as to which half would escape.
Although there is no mention of randomization and no concrete suggestion to collect data, it is clear that the goal was to devise two groups that were as similar as possible. It is also clear what Petrarch thought of doctors. However, this example serves to illustrate that the general idea was in circulation and yet simultaneously that it was not part of regular practice in terms of implementation, implying that it held no special place in convincing physicians or governments of efficacy.
Thus, we arrive at the generally accepted first surviving mention of randomized assignment, due to Flemish chemist and physician Jan Baptist van Helmont. Everyone at the time, including van Helmont, believed that bloodletting was a fantastic cure for most ailments. However, he believed that evacuation (i.e. inducing vomiting and defecation) was an even better approach, and he proposed a simple way to settle the argument once and for all:
Let us take out of the Hospitals, out of the Camps, or from elsewhere, 200 or 500 poor People, that have Fevers, Pleurisies, etc. Let us divide them in halfes, let us cast lots, that one half of them may fall to my share, and the other to yours; I will cure them without bloodletting… we shall see how many Funerals both of us shall have.
For better or worse, there is no evidence that this test was ever put into practice, but the idea is up to modern standards. 10 When was this written? Nobody knows precisely. Many articles cite van Helmont (1662), but that is the first English translation (from which the above quote is taken) of the original Latin publication (van Helmont 1648). Even that is clearly too late, since van Helmont died in 1644; some of his writings were controversial, so the corpus did not see the light of day until his son brought them out posthumously.
Despite van Helmont's mistaken (but typical) views on clinical practice, he was an inquisitive and thoughtful researcher, a Renaissance man befitting Einstein's quote above. This will be a theme for many of those who intersect the origins of randomization, suggesting that each successive development was not nearly as simple as it appears in retrospect. Along those lines, we proceed by mentioning two more notables in the history of clinical trials, albeit unrandomized. James Lind was a Scottish naval surgeon (which did not require extensive medical training, although he later earned an MD) who was an early believer in the theory that citrus fruits could help cure scurvy, which is indeed caused by a deficiency of vitamin C. He provided a partial test of this claim on a voyage in 1747 (published in Lind 1753), when he divided 12 afflicted sailors into six pairs and gave each pair a different treatment, one of which was two oranges and one lemon daily. 11 He made a point of the fact that the men were similar to begin with and were treated identically in all ways apart from the experimental variation:
Their cases were as similar as I could have them. They all in general had putrid gums, the spots and lassitude, with weakness of their knees. They lay together in one place, being a proper apartment of the sick in the fore-hold; and had one diet common to all.
While Lind did not include an untreated control group, Watson (1768) did exactly that in a study of smallpox variolation: as he put it, "it was proper also to be informed of what nature unassisted, not to say undisturbed, would do for herself." Although both men explicitly attempted to perform their tests on a homogeneous population, as well as to maintain parity apart from the treatments of interest, neither of them suggests randomization or any other method to approximate parity in this way.
Finally, it is worth mentioning a somewhat flamboyant experiment in the arena of animal husbandry performed by famous microbiologist Louis Pasteur in 1881. He was attempting to publicly prove that he had developed an animal anthrax vaccine (which may not have been his to begin with), so he asked for 60 sheep and split them into three groups: 10 would be left entirely alone; 25 would be given his vaccine and then exposed to a deadly strain of the disease; and 25 would be untreated but also exposed to the same strain. It is unclear whether sheep have more or less natural variation than Fisher's plots of land, but there is no mention of randomization or selection bias in the paper (Pasteur 1881). Perhaps this was not a major issue given the stark results: all of the exposed but untreated sheep died, while all of the vaccinated sheep survived healthily.

Medicine
Many people associate the RCT (randomized control trial, which involves randomization into a control group and one or more 'treatment' groups for comparison) with medicine, where it has come to be viewed as the 'gold standard.' 12 Partly for this reason; partly because (as described above and below) it was primarily clinicians who took the first steps along this path; and partly because the timing of randomized assignment entering the establishment in medicine so closely coincided with that in other fields, it makes sense to include some discussion of medicine in this context although it is not properly a social science. Of course, many of the same factors around human behavior are at play.
After the early empirical approaches of Lind and Watson, the next big step was taken by "a society of truth-loving men" in Nuremberg in 1835 (see Löhner 1835 and discussion in Stolberg 2006). In order to evaluate the effect of a salt-based homeopathic treatment, 100 local citizens were recruited to volunteer. One hundred vials were numbered consecutively, mixed together, and then separated into two groups of 50. All vials were filled with pure snow water, and the salt potentiation was added to one of the groups. The experimenters noted which numbered vials this corresponded to, and the resulting list was sealed and kept secret until the end of the trial. After the vials were once again well mixed with one another, the participants each ingested the contents of one vial, reporting their symptoms two weeks later "in order to compare the effect with the cause". The results suggested no effect of the homeopathic remedy, although since outcomes were self-reported it is possible that there was bias introduced at that stage, namely reporting nothing so as to match the control.
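The allocation procedure described above can be sketched in a few lines of code. This is only an illustrative reconstruction under stated assumptions: the function name, the labels "remedy" and "water", and the seed value are hypothetical, and a pseudo-random shuffle stands in for the physical mixing of vials.

```python
import random


def nuremberg_style_allocation(n_vials=100, seed=1835):
    """Sketch of a blinded allocation with a sealed key (hypothetical names)."""
    rng = random.Random(seed)

    # Number the vials consecutively, mix them, and split them into two halves.
    vials = list(range(1, n_vials + 1))
    rng.shuffle(vials)
    treated = set(vials[: n_vials // 2])

    # The sealed list: which numbered vial holds what, kept secret until the end.
    sealed_key = {
        vial: ("remedy" if vial in treated else "water")
        for vial in range(1, n_vials + 1)
    }

    # Mix the vials once more and hand exactly one to each participant.
    handout = list(range(1, n_vials + 1))
    rng.shuffle(handout)
    participants = {person: vial for person, vial in enumerate(handout, start=1)}

    return sealed_key, participants


sealed_key, participants = nuremberg_style_allocation()
n_remedy = sum(1 for arm in sealed_key.values() if arm == "remedy")
print(n_remedy, "of", len(sealed_key), "vials contain the remedy")
```

Because neither the participants nor anyone handing out vials consults the sealed key until outcomes are reported, treatment status is concealed from both sides, which is the double-blind property of the design.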
On the one hand this is a remarkable event: it clearly constituted randomized assignment (the first instantiation of which I am aware) to treatment and control, as well as being double blind and remarkably transparent about procedures (prospectively!) and about attrition. On the other hand, it does not seem to have made much impact on the general practice of medical trials, and even now it is neither widely known nor appreciated. That being said, it was not an entirely isolated incident: dating to van Helmont in the early 17th century and Mesmer in the late 18th century, 13 much of the drive for rigorous testing was due to the high-stakes battle between homeopathy and allopathy. In the Nuremberg case, perhaps one of the reasons it had less impact was that the participant subjects were not in need of a cure; they were simply being tested to see if they noticed any effects.
Although randomization did not become common practice for another century, the idea of demanding a proper comparison group was gaining adherents. For instance, later in the 19th century, we find examples of doctors using alternation "to avoid the imputation of selection" (Balfour 1854) or to induce "an equally large number of randomly selected patients treated as usual" (Fibiger 1898). 14 Note that although Fibiger obviously believed that what he did was equivalent to random allocation, which was indeed his goal, what he actually did was to alternate treatment based on the day the patient arrived at the hospital. From a modern perspective this looks importantly distinct, but at the time these were all simply methods to produce a valid control group (and in practice alternation likely worked quite well in most instances).
13 Mesmer (1781) proposed but did not carry out a challenge to his colleagues regarding his theory of 'animal magnetism', in which he writes: "In order to avoid any later argument and all the questions that could be raised about differences in age, in temperament, in diseases, in their symptoms etc. the assignment of the patients shall be made by the method of lots." His ideas were later tested, though without explicitly randomized assignment, by a 1784 commission led by Benjamin Franklin for the king of France (see Kaptchuk 1998).
14 This was a study of diphtheria; Fibiger later won the Nobel Prize for his work on cancer.
The modern era of RCTs in medicine begins with Colebrook (1929), in which "drawing lots" was used to decide which children would be irradiated (it's not as bad as it sounds). However, if the parents refused consent then those children were added to the control group, which undoes much of the point of randomization but still [re]introduces the concept. The rigorous version appears two years later in Doull (1931), a study of the effect of ultra-violet light on the common cold. Doull worked at the Johns Hopkins School of Public Health and needed to figure out how to allocate his subjects into three groups in a manner that would allow for valid comparisons and analysis. According to Marks (2008), he consulted with a local biostatistician with a doctorate in mathematics, who suggested using colored dice to randomly allocate patients. Note the similar timing for these randomizations in clinical medicine as in agriculture (Eden and Fisher 1927).
The final piece of the medical puzzle falls into place with the famous streptomycin trial for tuberculosis (Medical Research Council 1948). 15 This is probably the most famous RCT in history, and many people have erroneously claimed that it was in fact the first RCT in history. The design was the brainchild of Austin Bradford Hill, whose degree was in economics (earned while recovering from tuberculosis himself) but who worked as a biostatistician and epidemiologist. 16 In addition to the important step of highlighting the need for randomization and of promoting it (he later wrote down influential formal criteria for imputing causality), Bradford Hill also promulgated another key aspect in the 1948 paper: the explicit idea of using randomization to consciously conceal foreknowledge, i.e. to "blind" the experimenter to treatment status whenever possible.

Psychology
Human sensation, like psychical phenomena and mesmerism, was for most of history not considered a domain susceptible to quantitative scientific analysis. That began to change with the work of Gustav Fechner in the mid-19th century, who initiated the field of psychophysics (Fechner 1860) along with Ernst Weber. In particular, Fechner studied sensitivity of physical perception: e.g. how finely can a subject distinguish two masses, as a function of the base weight and the marginal difference between them? Although he deserves much credit for introducing concepts such as empirical experimentation and mathematical data analysis to this entire field, his methods were far from perfect. In particular, Fechner experimented on himself; for example in the perception experiments he knew all the relative weights in advance. He believed that he could consciously control for any resulting bias.
Müller (1879) took the next step, splitting the roles of subject and experimenter. He concurrently emphasized the notion of presenting stimuli in an irregular order (in buntem Wechsel; see Dehue 1997), but neither he nor Fechner employed randomization -although Müller did eventually start to promote the use of explicit randomization around the turn of the century. Meanwhile randomization was used by Richet (1884) but only as an inherent component of the stimulus itself. This is because he was testing telepathy, a topic that was all the rage in Europe at the time and which was eminently suitable for rigorous evaluation. 17 Randomly chosen playing cards were studied intently by one person, who tried to mentally pass the information to another. Thus, the randomization was not carried out in order to compare different treatments.
We turn now to one of the more well-known protagonists in this arc, and indeed the proponent of what is likely the first instance of randomized assignment in social science, namely Charles S. Peirce. According to Stigler (1992), Peirce was educated at home by his father, a mathematics professor at Harvard. He was ambidextrous and had the habit of writing questions with his left hand while writing the answers with his right hand. By December 1883, when he began the series of experiments described below, he was on the faculty at Johns Hopkins, where he was primarily known as a philosopher but also worked in physics, mathematics, cartography… and psychology.
Fechner had postulated that for any given base weight, there was a minimum additional weight below which it was impossible to perceive any difference, i.e. where the two felt exactly the same. Peirce disagreed, believing that even for very small differences, if subjects were forced to choose which one they thought was heavier, 18 they would be correct slightly more often than they were wrong. Along with a student of his named Jastrow, he proceeded to test his hypothesis in a series of experiments in 1884. They took turns as experimenter and subject, which Fechner was naturally unable to do while working alone, with the experimenter drawing playing cards to determine which weight came first on any given trial: if red the base weight came first; if black the supplemented weight was first. As Peirce and Jastrow (1885) note in their paper:
A slight disadvantage in this mode of proceeding arises from the long runs of one particular kind of change, which would occasionally be produced by chance and would tend to confuse the mind of the subject. But it seems clear that this disadvantage was less than that which would have been occasioned by his knowing that there would be no such long runs if any means had been taken to prevent them.
This is precisely the type of concern that Fisher and Gosset would argue about almost 50 years later in a different context: trading off the reduction of noise via regularity where possible, versus using randomization to equalize everything but only in expectation. We still argue about such things today. 19
17 Hacking (1988) provides illuminating historical details on this development. As far as results go, Richet was the first of many authors not to find evidence for supernatural powers.
18 Forced choice was an innovation along with randomization, albeit not as momentous. Additionally, subjects were asked to express confidence in their choice on a scale of 0-3, which was a further innovation that is still underutilized today.
19 Prominent behavioral economist Matthew Rabin has even suggested that sometimes it might be better not to randomize over long sequences, precisely in order to convince typical subjects that the sequence is random, since most people do not expect long runs of the same value in nature. See Jamison et al. (2008) for related discussion.
Was this an example of randomized allocation into treatment and control groups of subjects? Clearly not. Forsetlund et al. (2007) argue that Peirce's randomization served only to blind the subject and not to assess the effect of an intervention on an outcome. But this seems like a false dichotomy: Peirce was randomizing not merely to blind the subject (as Richet had) but also to allow for comparisons of "like with like". Because of the structure of the experiment, there were two possible conditions (base weight first or supplemented weight first), and Peirce wanted to ensure that the two corresponding sets of observations differed only in this respect. Among other attributes, this required the subject not to know which one came first; but even if the subject didn't know, it could have been the case that one condition was systematically different from the other (e.g. maybe it is easier to perceive increasing than decreasing weights). Randomization solves this problem neatly in a way that no deterministic ordering, however carefully balanced and thought out, can do. 20 For practical reasons Peirce randomized over stimuli rather than over subjects (the analogy is that one group of individuals would always receive the base weight first, while the other group would receive the supplemented weight first), but the purpose and the implications are the same. This is why we focus here on "randomized assignment" rather than "randomized allocation", and it is clear that Peirce understood the importance of this approach -although it did not immediately catch on with others. Fortunately, as it happens, Peirce's hypothesis was at least confirmed in the data.
Early efforts to apply experimental techniques in controlled settings outside the lab also lay with psychologists, although in this case it was educational psychology at the forefront. Starting around the turn of the century there were many studies of learning in classrooms, and a book by McCall (1923) on experimental design in education highlights randomization as a particularly efficient approach for avoiding selection bias and other spurious influences. However, no empirical studies cited by McCall involving actual randomization have been found; all extant sources are either silent on the matter or use some form of matching to create a control group for comparison. By the early 1920s, the importance of a rigorously equivalent (in expectation) comparison group had become clear. Dearborn and Lincoln (1922) divided pupils "arbitrarily according to the seating arrangements" but not explicitly randomly; indeed, the seating was unlikely to have been random. The earliest definitive examples that I have located (predating what has been found in the existing literature on this topic) appear in the Journal of Educational Psychology: Shaffer (1927) writes that "five experimental groups were made up by random selection" and Clark (1928) writes that "subjects were placed at random in four groups of eight each." There is no particular reason to believe that these were the absolute first such uses of the technique in this field, but it is at least highly suggestive that the first conscious use was between 1923 (given that there are no examples in McCall's book that year, despite the author being particularly interested in the method) and 1927. Perhaps more importantly, we observe that, by the time of its casual mention in these two publications, randomization was methodologically unremarkable within that field.

Economics
Unlike their non-laboratory brethren, experimental economists took to randomization very quickly, as had their counterparts in psychology. Although somewhat late to the game in the grand scheme of things, these researchers tended to be deeply careful about their hypotheses and assumptions, which led to multiple distinct uses of randomization, some but not all of which fall under the category of randomized assignment. In addition, like some of the early agricultural experimenters, they tended to focus on the role of theory in their models and analysis; sometimes there was no need for a control group because theoretical predictions provided the point of comparison. Chamberlin (1948) is often considered the first laboratory experiment in economics, although Roth (1993) points to an even earlier paper by Thurstone (1931), published in a psychology journal, in which indifference curves were studied by asking subjects hypothetical questions about consumption tradeoffs between everyday goods. As was typical of individual choice experiments where the concern was within-subject consistency of choices rather than selection bias or comparison across conditions, Thurstone did not randomize. 21 However, in an interesting link to the early psychology literature discussed above, Thurstone suggested that the motivation to consume followed Fechner's Law regarding least perceptible differences (see Moscati 2007 and Lenfant 2012).
However, we find a creative and early use of randomization in consumer choice in Davidson et al. (1955): in the context of measuring utilities and subjective probabilities, they made their own dice with nonsense syllables (such as "ZEJ") on which subjects were asked to bet. In order to be absolutely certain that the results were not driven by people choosing on the basis of e.g. innate preference or familiarity for a particular sequence of letters, "…the choice of winning nonsense syllable was randomized." This is precisely the idea of randomization in order to control for unobservable factors.
Meanwhile Chamberlin (1948) reported on a market experiment with demand and supply curves induced by assigning separate values to individuals who served as either buyers or sellers. Implicit in his procedure was that this was done randomly; Smith (1962) reports on a series of market experiments from the late 1950s in which the separation is explicit: "The group of subjects is divided at random into two subgroups, a group of buyers and a group of sellers." This certainly constitutes randomized assignment, but note that the purpose was not to compare buyers against sellers or to avoid selection bias. In many ways it is reminiscent of Peirce and Jastrow (1885): randomization is consciously used to control for any potential bias or asymmetry, including on the part of the experimenter, but it is not used to specifically compare treatments or interventions.
The third major topic within early experimental economics, in addition to individual choice and competitive markets, was game theory: models of strategic interaction. Kalisch et al. (1952) studied multiplayer games of cooperation, comparing the predictive ability of various equilibrium solution concepts. They were interested in one-shot games rather than the effects of repeated coalitions, so they "rotated" the players after each trial; this was not quite randomization but it served a related purpose. In terms of disciplinary background, this was a collaboration of mathematicians turned game theorists. A few years later Atkinson and Suppes (1957), also not economists by training, 22 analyzed different learning models in two-person zero-sum games, and they explicitly "randomly assigned" pairs of subjects into one of three different treatment groups. This is the earliest instance of random assignment in experimental economics, for purposes of comparing treatments, that has been found to date.
The mix of disciplines in the early years of experimental economics was broad and clearly invigorating. In addition to mathematicians and philosophers (with both Davidson and Suppes in the latter camp) bringing experience in mathematical decision theory, there were importantly the psychologists such as Atkinson and especially Sidney Siegel, a coauthor in the Davidson et al. (1955) paper. Economics was more often interested in testing the implications and predictions of specific theories, which does not necessarily require any comparison at all, or in comparing and contrasting the fidelity of various theories to data. In order to optimally organize all these experiments, there were a large number of methodological procedures that came from psychology. Siegel was a proponent of many of them, although with no special focus on randomization, and he worked hard to make these new techniques available to the world of economics, including a fruitful collaboration with economist Lawrence Fouraker on studies of bargaining and cooperation (Siegel and Fouraker 1960).
Although Siegel and others were publishing in psychology journals, most of the economics papers discussed here ended up as unpublished manuscripts or book chapters. Chamberlin (1948) appeared in an economics journal, but does not explicitly mention randomization. On the other hand, Smith (1962) is in an economics journal, discusses randomized assignment, and became highly influential in the development of the field. 23 Although Smith always gave much general methodological credit to Siegel (see Smith 2008), who unfortunately died prematurely, it is not clear whether the notion of randomized assignment was directly borrowed from psychology or was instituted independently as a natural reaction to the environment. What is clear is that he and the rest of the first generation of economists who were full-time experimentalists, such as Charles Plott, continued to use randomization not only for basic division into treatment groups but also (as many others mentioned in this paper) to control for anything unexpected that may have caused different outcomes in different trials. 24
22 Remarkably, Patrick Suppes was also a coauthor in Searle et al. (1978), the first RCT in development economics, which is discussed below; he was also a coauthor in the Davidson (1955) article mentioned just above. Suppes was an analytic philosopher who worked in fields as diverse as quantum mechanics, decision theory, and psychology.
23 Suppes and Carlsmith (1962) came out slightly earlier that year, albeit in a less widely-read economics journal, and also explicitly randomized subjects into one of two experimental groups. Partly because the topic of that paper and related ones above did not flourish to the same extent, and partly because the authors went on to other work, it has not had the same impact within experimental economics as the oeuvre of Smith, who went on to win a Nobel Prize for his contributions in this area.

6. Social Policy
Many attempts have been made to analyze the development of rigorous experimentation in social policy, 25 and some of this work points to randomized evaluations going back well into the first half of the 20th century. Unfortunately, as we saw regarding the field of educational psychology, most such claims turn out to be incorrect (typically involving instead careful but nonrandomized choice of the control group) or simply unverifiable. A perhaps surprising candidate for the position of first RCT in social science comes from the field of political science.
Leading up to the US presidential election of 1924, Harold Gosnell worked on a project whose goal was to increase voting rates in Chicago. The primary intervention was a mailed post-card (sent not just in English but also in Polish, Czech, and Italian) describing the necessity of registration prior to voting, and the results were encouraging. However, there have been conflicting opinions in the scholarly literature as to whether he used randomization to achieve those results. In the full report (Gosnell 1927), he himself writes: "The second step in the process of sampling was the division of the citizens in each of the districts canvassed into two groups, one of which was to be experimented upon while the other was not. It was assumed that the non-experimental groups could be used as a sort of control. […] In order to avoid possible contacts between the experimental and the control groups, the dividing lines between the two groups were as sharply drawn as possible."
This strongly suggests that each of the 12 districts where the study was carried out was divided into two parts, one of which was somehow chosen as treatment and one as control. Forsetlund et al. (2007) acknowledge that Gosnell mentioned using "random sampling" as a method to control for nonexperimental variables, but they conclude from the description above (and from the lack of any explicit affirmative discussion of how randomization was introduced) that "random allocation is very unlikely to have been used to create the comparison groups." Indeed there is no irrefutable proof, but that conclusion seems overly pessimistic. In particular, they and others may have been unaware of Gosnell's original short report on the project (Gosnell 1926), in which he states: "In order to set up this experiment it was necessary to keep constant, within reasonable limits, all the factors that enter into the electoral process except the particular stimuli which were to be tested. […] The method of random sampling was used to control these factors during the testing of the particular stimuli used in the experiment."
Although the phrase "random sampling" refers in modern parlance to choosing a representative subset of a population, which is as we have seen distinct from randomized allocation, this was not true at the time. There are multiple examples of random sampling being used in the context of randomly choosing between subsets of subjects, including Walters (1931). Indeed it is clear from Gosnell's own description that he was not referring to sampling in the modern sense, since he had no need for that: "Special efforts were made to list all the eligible voters in these areas." The most likely conclusion is that Gosnell did indeed randomize but at the "cluster" level, i.e. in order to determine which of the previously determined halves of the district (which had themselves been matched across treatment and control on baseline demographics and other observables) would receive the intervention and which would not. Whether Gosnell's study was randomized or not, two things are clear: First is that he did not immediately influence others to randomize, within political science or social experimentation broadly. Second, however, is that like more and more others at that time he clearly understood the need for a rigorous control group in order to isolate causal factors, which is what kept driving scholars toward randomization. The timing is not coincidental here: social policy, educational psychology, clinical medicine, and agriculture all used randomized assignment within a few years of each other (seemingly for the first time in each case, excepting the medical homeopathy trial of 1835) in the mid to late 1920s.
Turning back to political science, Gosnell himself did not pursue this methodological approach. Eldersveld (1956) explicitly randomizes in a similar get-out-the-vote experiment, but the approach did not really become popular or mainstream in political science until the turn of the 21st century (see Green and Gerber 2003). However, this strand of literature does provide another example of the close interaction between social science fields. The second RCT on this topic was conducted by a social psychologist in Pennsylvania in 1935: Hartmann (1936) randomly divided city wards into two treatment arms and a control group. Considerably later, lab experimentalists in formal political theory (e.g. Fiorina and Plott 1978) studied issues such as majority rule, using random assignment across conditions and even within positions on a committee.
A major and fascinating early experiment in industrial psychology took place at the Hawthorne factory of the Western Electric Company, near Chicago. From the mid-1920s to the early 1930s, various environmental factors (such as lighting level) were systematically, though not randomly, varied and analyzed in terms of productivity. In an elegantly titled book, Mayo (1933) reports some early results, which are often described as consisting of increases in productivity every time an external factor is varied, whatever the nature of the change! This pattern has been interpreted as arising from the novelty of being studied, which is now referred to as the Hawthorne (or observer) effect, although re-analysis of the original data (see List and Rasul 2011) casts doubt on whether that conclusion was accurate for the actual experiments at the Hawthorne plant. 26
The first clearly and individually randomized social experiment was the Cambridge-Somerville youth study. This was devised by Richard Clarke Cabot, a physician and pioneer in advancing the field of social work. Running from 1942-45, the study randomized approximately 500 young boys who were at risk for delinquency into either a control group or a treatment group, the latter receiving counseling, medical treatment, and tutoring. Results (Powers and Witmer 1951) were highly disappointing, with no differences reported. Sociology and criminology continued to be early adopters in the use of random experimentation, with studies by Reimer and Warren (1957) on parole caseload levels, by Hanson and Marks (1958) on interviewer accuracy in the 1950 US Census, and by Ares (1963) on the large-scale Manhattan bail project.
In many ways public health acts as medicine on a social scale, and we find a similar trend for attempting to use randomization when possible even in large-scale interventions, starting a couple of decades later. A noteworthy early example involved testing the effectiveness of Jonas Salk's polio vaccine in the early 1950s, when there was a debate about whether to implement comprehensive vaccination. In order to evaluate its efficacy (and its potential risks), given that the disease had a relatively low incidence rate (especially for so-called paralytic polio), a large sample was needed: in this case over a million children. Some local health departments were hesitant to randomize and preferred an approach in which all second-graders would be vaccinated, with the first- and third-graders serving as controls. Other health departments felt that this would not be sufficiently rigorous and therefore not sufficiently compelling, thereby wasting all the money and effort spent, so they preferred a randomized (double-blind) placebo approach. In the end about half of the participants ended up using each method, which illustrates some common difficulties of using RCTs in the field. Results (see Francis 1955, but also Meier 1972 for a broader perspective) were highly encouraging, and polio vaccination has been standard ever since. 27 Unfortunately but perhaps unsurprisingly, even convincing evidence of an effective and inexpensive vaccine has not yet led to the complete eradication of polio, which as of today still exists in the wild in Afghanistan, Nigeria, and Pakistan.
Another impressive early randomized experiment in the realm of public health involved family planning in Taiwan, China (see Population Council 1963 for the experimental design and Takeshita 1964 for results). The city of Taichung was divided into three roughly matched sectors, each of which consisted of hundreds of neighborhoods (of 25-30 families). Individual neighborhoods were randomized into either a control treatment; a treatment involving information by mail only; or one of two more exhaustive treatments (which included group meetings and personalized home visits), either with the wife only or with both spouses. The relevance of the sectors is that the percentage randomized into the exhaustive treatments differed across sectors, from 20% up to 50%. This allowed the researchers to look at intensity of treatment and to examine what they called "circulation effects" (i.e. spillovers), an extraordinarily sophisticated protocol for the time. A slightly later family planning experiment (Chang et al. 1972), also in Taiwan, randomized ten experimental counties in which field workers received a monetary bonus for every woman who accepted birth control, 28 versus ten control counties. Testing marginal financial incentives for health or similar workers has returned as an active (and supposedly cutting-edge) area of research.
Although psychologists had been doing randomized applied work since the 1920s in studies of learning, and in the laboratory even longer, they tended to do less directly policy-relevant research. However, Campbell (1969) gives an overview of social experimentation from a psychological perspective, with a hierarchy that lists "true" experiments involving a randomized control group at its top. Bridging the lab and the field, Deci (1971) studied whether external rewards (pecuniary or otherwise) 'crowd out' intrinsic motivation. The lab studies were randomized, while the field study used two pre-existing groups as treatment and control. Later, of course, marketing research (the home of so-called 'A/B testing') applied psychological principles to advertising, consumer interfaces, and more.
Meanwhile Heather Ross, an MIT graduate student at the time, initiated in the 1960s what is considered the first field experiment in economics, after but not long after the corresponding research in public health, psychology, and sociology. She proposed to study the effects of a negative income tax (i.e. phased income supplementation by the government for very low incomes) in what became the New Jersey Income Maintenance Experiment. The experiment randomized, at the household level, both the level of guaranteed minimum income and the (negative) tax rate. Ross (1970) finds little evidence of a concomitant reduction in labor supply, although later analysis of the data suggested that it does in fact exist.
This project was followed by several other major randomized social experiments in economics, for instance Brooke et al. (1983) comparing outcomes of free as opposed to merely low cost health care in the RAND Health Insurance Experiment. Around the same time as Ross, Peter Bohm conducted a field experiment to test theoretical principles rather than direct policy questions. Bohm (1972) reports the results of an experiment involving willingness-to-pay for a Swedish closed-circuit television program, in which subjects were randomized into one of six possible treatment groups. The link to policy was made particularly effectively over the succeeding decades in the area of welfare, e.g. in the case of the Supported Work program evaluation (Hollister, Kemper, and Maynard 1984), as described fully in Gueron and Rolston (2013).
The one arena in which economists can perhaps claim to have been at the forefront of randomization, and which continues to be one of the most fruitful areas of application, is in the field of international development. The Radio Mathematics Project in Nicaragua began in 1974 as an effort to study the efficacy of teaching math skills via radio, initiated by education economists. The first publication from this study, comparing test scores and finding generally positive effects, was Searle et al. (1978). A later paper vividly highlighted the importance of randomized evaluation, this time in the context of students repeating school years, by showing that the full (rigorous) results ran contrary to early results reported using only the original pilot data which had not been randomized; see Jamison (1980). Although there was quite a gap in time between these early efforts and the new post-millennial age of 'randomistas' in development economics, there is a clear link between them in the work of such scholars as Michael Kremer.

Conclusion
This paper has argued that a single notion of randomized assignment captures not only the usual application of random allocation into treatment and control groups, but also more broadly any randomization that controls for observable and unobservable factors. This allows for the legitimate direct comparison of empirical observations across conditions in a broad range of environments, and hence for ascriptions of causality. Such an approach appears to be novel in the literature on this topic, and it allows a more holistic vision of the development of the concept over time. In particular we expand beyond a focus on clinical medicine, or indeed any single discipline (e.g. limited previous work on experimental economics, psychology, and social policy), and look at some of the interconnections between them.
Several specific examples that have not been highlighted in the existing literature are also resurrected en route.
In what can be called the first phase of the introduction of randomization to empirical social science, a scattering of 19th-century research studies consciously employed the technique to good effect. Notable examples include the 1835 Nuremberg salt trial and the 1884 Peirce psychophysics experiment. Although these were completely rigorous even by modern standards, they did not immediately spawn imitators or enjoy influence. One possible reason is that they were simply unknown and unlucky, but there are two more plausible explanations. A purely practical possibility is that these experiments did not formally involve the allocation of subjects into groups, but rather randomization across treatments. It may not originally have occurred to practitioners that randomization could just as easily be used for allocation, which was the most typical need. But the most likely explanation is that the main goal was to provide a valid comparison, and no particular distinction was made at the time between randomization and other methods for doing so, such as matching and alternation. We see support for this in the work of Fibiger (1898), who equates alternation with randomization, and in the multiple educational psychology studies of the early 20th century.
What had been a purely practical concern for learning (and debating one's colleagues, as in homeopathy versus allopathy) became a more conceptual or theoretical concern in the 20th century. Issues surrounding causality and epistemology raised the bar for social and clinical science. It became increasingly necessary to be able to claim that certain discrete factors led to certain outcomes with a high degree of confidence in that knowledge, in order to convince one's peers as well as policy makers. Although these issues were explicitly articulated in the late 1940s and 1950s, the more explanatory transition period that gave rise to them displays a remarkable convergence across fields. In particular, one of the main contributions of the paper is the argument that four related but distinct disciplines experienced the introduction of randomized assignment within just a few years of one another around the late 1920s: political science (which was still closely tied to economics and political philosophy at the time); agricultural research; educational psychology; and of course medicine. This constituted the second phase of randomized assignment entering the social and related sciences.
The third and final phase, from the 1950s to the 1970s, was the application of randomization to larger-scale and policy-oriented problems. We find occurrences in public health as a natural analogue of clinical medicine, as well as in criminology and sociology. Contemporaneously, lab psychology moved to the field with randomized applications in marketing science and industrial psychology. Psychologists also started collaborating with economists to undertake lab studies that were motivated by and informed 'real-world' topics such as bargaining, consumer demand, and market efficiency. Eventually, but still within this formative period, economists moved to the field themselves: studying tax systems and economic development. The recent (post-2000) rise of RCTs in economics could almost be said to embody a fourth phase, although one that no longer concerns the entry of the concept into the intellectual environment. How and why that happened, as well as the strenuous pushback that it has received, 29 is a tantalizing arena for future study.
Of course, there are many environments where randomized assignment is simply infeasible (imagine nation-wide health systems), although even in such settings creative experimental designs can begin to nibble around the edges. Even when feasible, in addition to the Hawthorne effect mentioned earlier, there are occasions when randomization per se can make research more difficult. Subjects may be unwilling to be 'experimented upon', and in fact Kramer and Shapiro (1984) claim that it is much harder to recruit subjects for randomized than for nonrandomized drug trials. By definition this potential effect is difficult to test, although one can try to compare characteristics of subjects who respond to varied recruitment approaches. 30 Sometimes purely qualitative work will be less disruptive and allow greater fidelity to subjects' intrinsic behaviors; other times quantitative randomized evaluations in the field will leave subjects entirely noncognizant that there is even an experiment taking place. What remains clear is that over time the intellectual community has assigned value to specific attributes that obtain when randomization is employed. The particular attributes (blinding, equivalent comparisons, causal attributions, practicality and expediency, fairness, rigor as a social construct, and more) have varied over time and across subfields of social science. But whenever value is assigned, researchers have stepped in to fill the gap and will continue to do so.