Things I enjoyed learning about in August 2017

Table of contents

  1. Table of contents
  2. Piketty et al: US inequality after accounting for taxes and government transfers
  3. Bayesianism eats frequentism
  4. RCT with clusters of 60,000 population measures general equilibrium effects of Indian job guarantee
  5. Why is the destructiveness of wars power-law distributed?
  6. Does Solmomoff induction solve the new riddle of induction?
  7. The information-theoretic reason we use the words we do
  8. A cool example of evolutionary biology and cryptography applied to existential risks from nanotechnology
  9. Post-partum depression as an evolutionary adaptation?
  10. Animal protection offered legal precedent for children’s rights

Piketty et al: US inequality after accounting for taxes and government transfers

As far as I can tell, this paper is a reply to a persistent and strong critique of Capital in the 21st Century, namely that the book only shows that inequality has increased pre-tax.

Figure 3:

Table 2:

In the United States, not only has the rising tide not lifted all boats, the only thing that has made the median American richer has been government transfers.

Bayesianism eats frequentism

From Eliezer Yudkowsky (emphasis mine):

The ancient war between the Bayesians and the accursèd frequentists stretches back through decades, and I’m not going to try to recount that elder history in this blog post.

But one of the central conflicts is that Bayesians expect probability theory to be… what’s the word I’m looking for? “Neat?” “Clean?” “Self-consistent?”

As Jaynes says, the theorems of Bayesian probability are just that, theorems in a coherent proof system. No matter what derivations you use, in what order, the results of Bayesian probability theory should always be consistent - every theorem compatible with every other theorem. […]

Math! That’s the word I was looking for. Bayesians expect probability theory to be math. That’s why we’re interested in Cox’s Theorem and its many extensions, showing that any representation of uncertainty which obeys certain constraints has to map onto probability theory. Coherent math is great, but unique math is even better.

And yet… should rationality be math? It is by no means a foregone conclusion that probability should be pretty. The real world is messy - so shouldn’t you need messy reasoning to handle it? Maybe the non-Bayesian statisticians, with their vast collection of ad-hoc methods and ad-hoc justifications, are strictly more competent because they have a strictly larger toolbox. It’s nice when problems are clean, but they usually aren’t, and you have to live with that.

After all, it’s a well-known fact that you can’t use Bayesian methods on many problems because the Bayesian calculation is computationally intractable. So why not let many flowers bloom? Why not have more than one tool in your toolbox?

That’s the fundamental difference in mindset. Old School statisticians thought in terms of tools, tricks to throw at particular problems. Bayesians - at least this Bayesian, though I don’t think I’m speaking only for myself - we think in terms of laws. […]

No, you can’t always do the exact Bayesian calculation for a problem. Sometimes you must seek an approximation; often, indeed. This doesn’t mean that probability theory has ceased to apply, any more than your inability to calculate the aerodynamics of a 747 on an atom-by-atom basis implies that the 747 is not made out of atoms. Whatever approximation you use, it works to the extent that it approximates the ideal Bayesian calculation - and fails to the extent that it departs.

Bayesianism’s coherence and uniqueness proofs cut both ways. Just as any calculation that obeys Cox’s coherency axioms (or any of the many reformulations and generalizations) must map onto probabilities, so too, anything that is not Bayesian must fail one of the coherency tests. This, in turn, opens you to punishments like Dutch-booking (accepting combinations of bets that are sure losses, or rejecting combinations of bets that are sure gains).

You may not be able to compute the optimal answer. But whatever approximation you use, both its failures and successes will be explainable in terms of Bayesian probability theory. You may not know the explanation; that does not mean no explanation exists.

So you want to use a linear regression, instead of doing Bayesian updates? But look to the underlying structure of the linear regression, and you see that it corresponds to picking the best point estimate given a Gaussian likelihood function and a uniform prior over the parameters.

You want to use a regularized linear regression, because that works better in practice? Well, that corresponds (says the Bayesian) to having a Gaussian prior over the weights.

Sometimes you can’t use Bayesian methods literally; often, indeed. But when you can use the exact Bayesian calculation that uses every scrap of available knowledge, you are done. You will never find a statistical method that yields a better answer. You may find a cheap approximation that works excellently nearly all the time, and it will be cheaper, but it will not be more accurate. Not unless the other method uses knowledge, perhaps in the form of disguised prior information, that you are not allowing into the Bayesian calculation; and then when you feed the prior information into the Bayesian calculation, the Bayesian calculation will again be equal or superior.

When you use an Old Style ad-hoc statistical tool with an ad-hoc (but often quite interesting) justification, you never know if someone else will come up with an even more clever tool tomorrow. But when you can directly use a calculation that mirrors the Bayesian law, you’re done - like managing to put a Carnot heat engine into your car. It is, as the saying goes, “Bayes-optimal”.

RCT with clusters of 60,000 population measures general equilibrium effects of Indian job guarantee

That’s big. Karthik Muralidharan, Paul Niehaus, and Sandip Sukhtankar’s amazing paper.


Public employment programs play a major role in the anti-poverty strategy of many developing countries. Besides the direct wages provided to the poor, such programs are likely to affect their welfare by changing broader labor market outcomes including wages and private employment. These general equilibrium effects may accentuate or attenuate the direct benefits of the program, but have been difficult to estimate credibly. We estimate the general equilibrium effects of a technological reform that improved the implementation quality of India’s public employment scheme on the earnings of the rural poor, using a large-scale experiment which randomized treatment across sub-districts of 60,000 people. We find that this reform had a large impact on the earnings of low-income households, and that these gains were overwhelmingly driven by higher private-sector earnings (90%) as opposed to earnings directly from the program (10%). These earnings gains reflect a 5.7% increase in market wages for rural unskilled labor, and a similar increase in reservation wages. We do not find evidence of distortions in factor allocation, including labor supply, migration, and land use. Our results highlight the importance of accounting for general equilibrium effects in evaluating programs, and also illustrate the feasibility of using large-scale experiments to study such effects.

Also, Paul Niehaus reports on Twitter:

the cool thing is it mattered - study helped convince gov’t not to scrap the program, putting $100Ms annually into hands of the poor

Why is the destructiveness of wars power-law distributed?

An almost too elegant theory: they are attrition games where the players committ the sunk cost fallacy.

Steven Pinker, The Better Angels of our Nature, Chapter 5:

John Maynard Smith, the biologist who first applied game theory to evolution, modeled this kind of standoff as a War of Attrition game. Each of two contestants competes for a valuable resource by trying to outlast the other, steadily accumulating costs as he waits. In the original scenario, they might be heavily armored animals competing for a territory who stare at each other until one of them leaves; the costs are the time and energy the animals waste in the standoff, which they could otherwise use in catching food or pursuing mates. A game of attrition is mathematically equivalent to an auction in which the highest bidder wins the prize and both sides have to pay the loser’s low bid. And of course it can be analogized to a war in which the expenditure is reckoned in the lives of soldiers.

The War of Attrition is one of those paradoxical scenarios in game theory (like the Prisoner’s Dilemma, the Tragedy of the Commons, and the Dollar Auction) in which a set of rational actors pursuing their interests end up worse off than if they had put their heads together and come to a collective and binding agreement. One might think that in an attrition game each side should do what bidders on eBay are advised to do: decide how much the contested resource is worth and bid only up to that limit. The problem is that this strategy can be gamed by another bidder. All he has to do is bid one more dollar (or wait just a bit longer, or commit another surge of soldiers), and he wins. He gets the prize for close to the amount you think it is worth, while you have to forfeit that amount too, without getting anything in return. You would be crazy to let that happen, so you are tempted to use the strategy “Always outbid him by a dollar,” which he is tempted to adopt as well. You can see where this leads. Thanks to the perverse logic of an attrition game, in which the loser pays too, the bidders may keep bidding after the point at which the expenditure exceeds the value of the prize. They can no longer win, but each side hopes not to lose as much. The technical term for this outcome in game theory is “a ruinous situation.” It is also called a “Pyrrhic victory”; the military analogy is profound.

One strategy that can evolve in a War of Attrition game (where the expenditure, recall, is in time) is for each player to wait a random amount of time, with an average wait time that is equivalent in value to what the resource is worth to them. In the long run, each player gets good value for his expenditure, but because the waiting times are random, neither is able to predict the surrender time of the other and reliably outlast him. In other words, they follow the rule: At every instant throw a pair of dice, and if they come up (say) 4, concede; if not, throw them again. This is, of course, like a Poisson process, and by now you know that it leads to an exponential distribution of wait times (since a longer and longer wait depends on a less and less probable run of tosses). Since the contest ends when the first side throws in the towel, the contest durations will also be exponentially distributed. Returning to our model where the expenditures are in soldiers rather than seconds, if real wars of attrition were like the “War of Attrition” modeled in game theory, and if all else were equal, then wars of attrition would fall into an exponential distribution of magnitudes.

Of course, real wars fall into a power-law distribution, which has a thicker tail than an exponential (in this case, a greater number of severe wars). But an exponential can be transformed into a power law if the values are modulated by a second exponential process pushing in the opposite direction. And attrition games have a twist that might do just that. If one side in an attrition game were to leak its intention to concede in the next instant by, say, twitching or blanching or showing some other sign of nervousness, its opponent could capitalize on the “tell” by waiting just a bit longer, and it would win the prize every time. As Richard Dawkins has put it, in a species that often takes part in wars of attrition, one expects the evolution of a poker face.

Now, one also might have guessed that organisms would capitalize on the opposite kind of signal, asign of continuing resolve rather than impending surrender. If a contestant could adopt some defiant posture that means “I’ll stand my ground; I won’t back down,” that would make it rational for his opposite number to give up and cut its losses rather than escalate to mutual ruin. But there’s a reason we call it “posturing.” Any coward can cross his arms and glower, but the other side can simply call his bluff. Only if a signal is costly—if the defiant party holds his hand over a candle, or cuts his arm with a knife—can he show that he means business. (Of course, paying a self-imposed cost would be worthwhile only if the prize is especially valuable to him, or if he had reason to believe that he could prevail over his opponent if the contest escalated.)

In the case of a war of attrition, one can imagine a leader who has a changing willingness to suffer a cost over time, increasing as the conflict proceeds and his resolve toughens. His motto would be: “We fight on so that our boys shall not have died in vain.” This mindset, known as loss aversion, the sunk-cost fallacy, and throwing good money after bad, is patently irrational, but it is surprisingly pervasive in human decision-making. People stay in an abusive marriage because of the years they have already put into it, or sit through a bad movie because they have already paid for the ticket, or try to reverse a gambling loss by doubling their next bet, or pour money into a boondoggle because they’ve already poured so much money into it. Though psychologists don’t fully understand why people are suckers for sunk costs, a common explanation is that it signals a public commitment. The person is announcing: “When I make a decision, I’m not so weak, stupid, or indecisive that I can be easily talked out of it.” In a contest of resolve like an attrition game, loss aversion could serve as a costly and hence credible signal that the contestant is not about to concede, preempting his opponent’s strategy of outlasting him just one more round.

I already mentioned some evidence from Richardson’s dataset which suggests that combatants do fight longer when a war is more lethal: small wars show a higher probability of coming to an end with each succeeding year than do large wars. The magnitude numbers in the Correlates of War Dataset also show signs of escalating commitment: wars that are longer in duration are not just costlier in fatalities; they are costlier than one would expect from their durations alone. If we pop back from the statistics of war to the conduct of actual wars, we can see the mechanism at work. Many of the bloodiest wars in history owe their destructiveness to leaders on one or both sides pursuing a blatantly irrational loss-aversion strategy. Hitler fought the last months of World War II with a maniacal fury well past the point when defeat was all but certain, as did Japan. Lyndon Johnson’s repeated escalations of the Vietnam War inspired a protest song that has served as a summary of people’s understanding of that destructive war: “We were waist-deep in the Big Muddy; The big fool said to push on.”

The systems biologist Jean-Baptiste Michel has pointed out to me how escalating commitments in a war of attrition could produce a power-law distribution. All we need to assume is that leaders keep escalating as a constant proportion of their past commitment—the size of each surge is, say, 10 percent of the number of soldiers that have fought so far. A constant proportional increase would be consistent with the well-known discovery in psychology called Weber’s Law: for an increase in intensity to be noticeable, it must be a constant proportion of the existing intensity. (If a room is illuminated by ten lightbulbs, you’ll notice a brightening when an eleventh is switched on, but if it is illuminated by a hundred lightbulbs, you won’t notice the hundred and first; someone would have to switch on another ten bulbs before you noticed the brightening.) Richardson observed that people perceive lost lives in the same way: “Contrast for example the many days of newspaper-sympathy over the loss of the British submarine Thetis in time of peace with the terse announcement of similar losses during the war. This contrast may be regarded as an example of the Weber-Fechner doctrine that an increment is judged relative to the previous amount.” The psychologist Paul Slovic has recently reviewed several experiments that support this observation. The quotation falsely attributed to Stalin, “One death is a tragedy; a million deaths is a statistic,” gets the numbers wrong but captures a real fact about human psychology.

If escalations are proportional to past commitments (and a constant proportion of soldiers sent to the battlefield are killed in battle), then losses will increase exponentially as a war drags on, like compound interest. And if wars are attrition games, their durations will also be distributed exponentially. Recall the mathematical law that a variable will fall into a power-law distribution if it is an exponential function of a second variable that is distributed exponentially. My own guess is that the combination of escalation and attrition is the best explanation for the power-law distribution of war magnitudes.

Chapter 6 applies the same model to terrorism:

Though conventional terrorism, as John Kerry gaffed, is a nuisance to be policed rather than a threat to the fabric of life, terrorism with weapons of mass destruction would be something else entirely. The prospect of an attack that would kill millions of people is not just theoretically possible but consistent with the statistics of terrorism. The computer scientists Aaron Clauset and Maxwell Young and the political scientist Kristian Gleditsch plotted the death tolls of eleven thousand terrorist attacks on log-log paper and saw them fall into a neat straight line.261 Terrorist attacks obey a power-law distribution, which means they are generated by mechanisms that make extreme events unlikely, but not astronomically unlikely.

The trio suggested a simple model that is a bit like the one that Jean-Baptiste Michel and I proposed for wars, invoking nothing fancier than a combination of exponentials. As terrorists invest more time into plotting their attack, the death toll can go up exponentially: a plot that takes twice as long to plan can kill, say, four times as many people. To be concrete, an attack by a single suicide bomber, which usually kills in the single digits, can be planned in a few days or weeks. The 2004 Madrid train bombings, which killed around two hundred, took six months to plan, and 9/11, which killed three thousand, took two years. But terrorists live on borrowed time: every day that a plot drags on brings the possibility that it will be disrupted, aborted, or executed prematurely. If the probability is constant, the plot durations will be distributed exponentially. (Cronin, recall, showed that terrorist organizations drop like flies over time, falling into an exponential curve.) Combine exponentially growing damage with an exponentially shrinking chance of success, and you get a power law, with its disconcertingly thick tail. Given the presence of weapons of mass destruction in the real world, and religious fanatics willing to wreak untold damage for a higher cause, a lengthy conspiracy producing a horrendous death toll is within the realm of thinkable probabilities.

A statistical model, of course, is not a crystal ball. Even if we could extrapolate the line of existing data points, the massive terrorist attacks in the tail are still extremely (albeit not astronomically) unlikely. More to the point, we can’t extrapolate it. In practice, as you get to the tail of a power-law distribution, the data points start to misbehave, scattering around the line or warping it downward to very low probabilities. The statistical spectrum of terrorist damage reminds us not to dismiss the worst-case scenarios, but it doesn’t tell us how likely they are.

Does Solmomoff induction solve the new riddle of induction?

TL;DR: No.

Hutter and Rathmanner 2011, Section 5.9 “Andrey Kolmogorov”

Natural Turing Machines. The final issue is the choice of Universal Turing machine to be used as the reference machine. The problem is that there is still subjectivity involved in this choice since what is simple on one Turing machine may not be on another. More formally, it can be shown that for any arbitrarily complex string as measured against the UTM there is another UTM machine for which has Kolmogorov complexity . This result seems to undermine the entire concept of a universal simplicity measure but it is more of a philosophical nuisance which only occurs in specifically designed pathological examples. The Turing machine would have to be absurdly biased towards the string which would require previous knowledge of . The analogy here would be to hard-code some arbitrary long complex number into the hardware of a computer system which is clearly not a natural design. To deal with this case we make the soft assumption that the reference machine is natural in the sense that no such specific biases exist. Unfortunately there is no rigorous definition of natural but it is possible to argue for a reasonable and intuitive definition in this context.

Vallinder 2012, Section 4.1 “Language dependence”:

In section 2.4 we saw that Solomonoff’s prior is invariant under both reparametrization and regrouping, up to a multiplicative constant. But there is another form of language dependence, namely the choice of a uni- versal Turing machine.

There are three principal responses to the threat of language dependence. First, one could accept it flat out, and admit that no language is better than any other. Second, one could admit that there is language dependence but argue that some languages are better than others. Third, one could deny language dependence, and try to show that there isn’t any.

For a defender of Solomonoff’s prior, I believe the second option is the most promising. If you accept language dependence flat out, why intro- duce universal Turing machines, incomputable functions, and other need- lessly complicated things? And the third option is not available: there isn’t any way of getting around the fact that Solomonoff’s prior depends on the choice of universal Turing machine. Thus, we shall somehow try to limit the blow of the language dependence that is inherent to the framework. Williamson (2010) defends the use of a particular language by saying that an agent’s language gives her some information about the world she lives in. In the present framework, a similar response could go as follows. First, we identify binary strings with propositions or sensory observations in the way outlined in the previous section. Second, we pick a UTM so that the terms that exist in a particular agent’s language gets low Kolmogorov complexity.

If the above proposal is unconvincing, the damage may be limited some- what by the following result. Let be the Kolmogorov complexity of relative to universal Turing machine , and let be the Kolmogorov complexity of relative to Turing machine (which needn’t be universal). We have that That is: the difference in Kolmogorov complexity relative to and rela- tive to is bounded by a constant that depends only on these Turing machines, and not on . (See Li and Vitanyi (1997, p. 104) for a proof.) This is somewhat reassuring. It means that no other Turing machine can outperform infinitely often by more than a fixed constant. But we want to achieve more than that. If one picks a UTM that is biased enough to start with, strings that intuitively seem complex will get a very low Kolmogorov complexity. As we have seen, for any string it is always possible to find a UTM such that . If , the corresponding Solomonoff prior will be at least . So for any binary string, it is always possible to find a UTM such that we assign that string prior probability greater than or equal to . Thus some way of discriminating between universal Turing machines is called for.

The information-theoretic reason we use the words we do

“Mutual information and density in thingspace” (my emphasis):

In Empty Labels and then Replace the Symbol with the Substance, we saw the technique of replacing a word with its definition - the example being given:

All [mortal, ~feathers, bipedal] are mortal. Socrates is a [mortal, ~feathers, bipedal]. Therefore, Socrates is mortal.

Why, then, would you even want to have a word for “human”? Why not just say “Socrates is a mortal featherless biped”?

Because it’s helpful to have shorter words for things that you encounter often. If your code for describing single properties is already efficient, then there will not be an advantage to having a special word for a conjunction - like “human” for “mortal featherless biped” - unless things that are mortal and featherless and bipedal, are found more often than the marginal probabilities would lead you to expect.

In efficient codes, word length corresponds to probability—so the code for Z1Y2 will be just as long as the code for Z1 plus the code for Y2, unless P(Z1Y2) > P(Z1)P(Y2), in which case the code for the word can be shorter than the codes for its parts.

And this in turn corresponds exactly to the case where we can infer some of the properties of the thing, from seeing its other properties. It must be more likely than the default that featherless bipedal things will also be mortal.

Of course the word “human” really describes many, many more properties - when you see a human-shaped entity that talks and wears clothes, you can infer whole hosts of biochemical and anatomical and cognitive facts about it. To replace the word “human” with a description of everything we know about humans would require us to spend an inordinate amount of time talking. But this is true only because a featherless talking biped is far more likely than default to be poisonable by hemlock, or have broad nails, or be overconfident.

Having a word for a thing, rather than just listing its properties, is a more compact code precisely in those cases where we can infer some of those properties from the other properties. (With the exception perhaps of very primitive words, like “red”, that we would use to send an entirely uncompressed description of our sensory experiences. But by the time you encounter a bug, or even a rock, you’re dealing with nonsimple property collections, far above the primitive level.)

So having a word “wiggin” for green-eyed black-haired people, is more useful than just saying “green-eyed black-haired person”, precisely when:

  1. Green-eyed people are more likely than average to be black-haired (and vice versa), meaning that we can probabilistically infer green eyes from black hair or vice versa; or
  2. Wiggins share other properties that can be inferred at greater-than-default probability. In this case we have to separately observe the green eyes and black hair; but then, after observing both these properties independently, we can probabilistically infer other properties (like a taste for ketchup).

One may even consider the act of defining a word as a promise to this effect. Telling someone, “I define the word ‘wiggin’ to mean a person with green eyes and black hair”, by Gricean implication, asserts that the word “wiggin” will somehow help you make inferences / shorten your messages.

If green-eyes and black hair have no greater than default probability to be found together, nor does any other property occur at greater than default probability along with them, then the word “wiggin” is a lie: The word claims that certain people are worth distinguishing as a group, but they’re not.

In this case the word “wiggin” does not help describe reality more compactly—it is not defined by someone sending the shortest message—it has no role in the simplest explanation. Equivalently, the word “wiggin” will be of no help to you in doing any Bayesian inference. Even if you do not call the word a lie, it is surely an error.

And the way to carve reality at its joints, is to draw your boundaries around concentrations of unusually high probability density in Thingspace.


And so even the labels that we use for words are not quite arbitrary. The sounds that we attach to our concepts can be better or worse, wiser or more foolish. Even apart from considerations of common usage!

A cool example of evolutionary biology and cryptography applied to existential risks from nanotechnology

Courtesy of Eliezer Yudkowsky:

The Foresight Institute suggests, among other sensible proposals, that the replication instructions for any nanodevice should be encrypted. Moreover, encrypted such that flipping a single bit of the encoded instructions will entirely scramble the decrypted output. If all nanodevices produced are precise molecular copies, and moreover, any mistakes on the assembly line are not heritable because the offspring got a digital copy of the original encrypted instructions for use in making grandchildren, then your nanodevices ain’t gonna be doin’ much evolving.

I find this aesthetically pleasing because it takes two very theoretical abstract ideas (evolution, and cryoptography) and uses them to come up with an actionable, potentially impactful recommendation.

Post-partum depression as an evolutionary adaptation?

Better Angels, Chapter 7:

The “natural affection” is far from automatic. Daly and Wilson, and later the anthropologist Edward Hagen, have proposed that postpartum depression and its milder version, the baby blues, are not a hormonal malfunction but the emotional implementation of the decision period for keeping a child. (Postpartum depression as adaptation: Hagen, 1999; Daly & Wilson, 1988, pp. 61–77.) Mothers with postpartum depression often feel emotionally detached from their newborns and may harbor intrusive thoughts of harming them. Mild depression, psychologists have found, often gives people a more accurate appraisal of their life prospects than the rose-tinted view we normally enjoy. The typical rumination of a depressed new mother—how will I cope with this burden?—has been a legitimate question for mothers throughout history who faced the weighty choice between a definite tragedy now and the possibility of an even greater tragedy later. As the situation becomes manageable and the blues dissipate, many women report falling in love with their baby, coming to see it as a uniquely wonderful individual.

Hagen examined the psychiatric literature on postpartum depression to test five predictions of the theory that it is an evaluation period for investing in a newborn. As predicted, postpartum depression is more common in women who lack social support (they are single, separated, dissatisfied with their marriage, or distant from their parents), who had had a complicated delivery or an unhealthy infant, and who were unemployed or whose husbands were unemployed. He found reports of postpartum depression in a number of non-Western populations which showed the same risk factors (though he could not find enough suitable studies of traditional kin-based societies). Finally, postpartum depression is only loosely tied to measured hormonal imbalances, suggesting that it is not a malfunction but a design feature.

Many cultural traditions work to distance people’s emotions from a newborn until its survival seems likely. People may be enjoined from touching, naming, or granting legal personhood to a baby until a danger period is over, and the transition is often marked by a joyful ceremony, as in our own customs of the christening and the bris. Some traditions have a series of milestones, such as traditional Judaism, which grants full legal personhood to a baby only after it has survived thirty days.

Animal protection offered legal precedent for children’s rights

Better Angels, Chapter 7:

We have seen that during periods of humanitarian reform, a recognition of the rights of one group can lead to a recognition of others by analogy, as when the despotism of kings was analogized to the despotism of husbands, and when two centuries later the civil rights movement inspired the women’s rights movement. The protection of abused children also benefited from an analogy—in this case, believe it or not, with animals.

In Manhattan in 1874, the neighbors of ten-year-old Mary Ellen McCormack, an orphan being raised by an adoptive mother and her second husband, noticed suspicious cuts and bruises on the girl’s body. They reported her to the Department of Public Charities and Correction, which administered the city’s jails, poorhouses, orphanages, and insane asylums. Since there were no laws that specifically protected children, the caseworker contacted the American Society for the Protection of Animals. The society’s founder saw an analogy between the plight of the girl and the plight of the horses he rescued from violent stable owners. He engaged a lawyer who presented a creative interpretation of habeas corpus to the New York State Supreme Court and petitioned to have her removed from her home. The girl calmly testified:

Mamma has been in the habit of whipping and beating me almost every day. She used to whip me with a twisted whip—a rawhide. I have now on my head two black-and-blue marks which were made by Mamma with the whip, and a cut on the left side of my forehead which was made by a pair of scissors in Mamma’s hand…. I never dared speak to anybody, because if I did I would get whipped.

The New York Times reprinted the testimony in an article entitled “Inhumane Treatment of a Little Waif,” and the girl was removed from the home and eventually adopted by her caseworker. Her lawyer set up the New York Society for the Prevention of Cruelty to Children, the first protective agency for children anywhere in the world. Together with other agencies founded in its wake, it set up shelters for battered children and lobbied for laws that punished their abusive parents. Similarly, in England the first legal case to protect a child against an abusive parent was taken up by the Royal Society for the Prevention of Cruelty to Animals, and out of it grew the National Society for the Prevention of Cruelty to Children.

Written on August 30, 2017