The Netherlands in the British press

As part of our research into the media's impact on the Brexit vote, we analyzed the coverage of different European Union member states in the British press. Specifically, we looked at references to these countries in the years 2005, 2010, and 2015 (i.e. prior to the main referendum campaign) in leading broadsheets and tabloid newspapers: The Guardian, the Observer, the Telegraph/Sunday Telegraph, the Evening Standard, the Mail, the Sun and the News of the World, and the Mirror.

Over the course of those 3 years, these papers published a little over 42,000 articles mentioning the Netherlands in some form (including references to Holland, Amsterdam, and The Hague). The table below shows the 20 words most associated with the Netherlands overall, and in individual years. In order to obtain these words, I looked at words that appear at least 100 times in a sentence referencing the Netherlands, and that appear in such sentences at twice as high a rate (or more) as they do in the overall corpus of articles published in these papers. Words are sorted in descending order of how much more frequently they appear next to the Netherlands than in the overall corpus.
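The selection rule just described can be sketched in a few lines of Python. This is only a minimal illustration: the function names, the simple tokenizer, and the toy thresholds in the usage note are assumptions, not the code actually used for the analysis.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into simple word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9'-]+", sentence.lower())


def associated_words(target_sentences, all_sentences, min_count=100, min_ratio=2.0):
    """Return (word, ratio) pairs for words over-represented in the target sentences.

    A word qualifies if it appears at least `min_count` times in the target
    sentences and its rate there is at least `min_ratio` times its rate in the
    overall corpus. Results are sorted by ratio, highest first.
    """
    target = Counter(w for s in target_sentences for w in tokenize(s))
    overall = Counter(w for s in all_sentences for w in tokenize(s))
    target_total = sum(target.values())
    overall_total = sum(overall.values())

    scored = []
    for word, count in target.items():
        if count < min_count or overall[word] == 0:
            continue
        ratio = (count / target_total) / (overall[word] / overall_total)
        if ratio >= min_ratio:
            scored.append((word, ratio))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

On a real corpus, `target_sentences` would be the sentences referencing the Netherlands and `all_sentences` every sentence in the papers; on toy data, `min_count` needs to be lowered to get any output.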


Overall: constitution, van, winger, treaty, rejection, merger, friendly, qualifier, tribunal, striker, qualification, defender, cruise, goalkeeper, ace, coach, qualify, referendum, tournament, hamstring

2005: constitution, rejection, referendum, treaty, van, winger, merger, friendly, striker, ace, defender, goalkeeper, giant, vote, coach, international, tournament, keeper, debut, oil

2010: van, winger, tournament, cruise, coach, striker, friendly, final, joint, trial, de, defender, squad, flight, international, bid, beat, injury, defeat, scored

2015: van, winger, text, friendly, cruise, defender, striker, tonight, airport, assistant, coach, answer, finance, manager, flight, oil, international, squad, boss, loan


As we see in the first column, the Netherlands appears to be associated with three things in the British press: last names that begin with 'van', football (soccer), and the Dutch public's rejection of the European Union's Constitutional Treaty in a referendum in 2005. Results in the individual years add to this basic pattern, but do not eliminate it. In 2005, the referendum accounts for the top four entries; in the other years it fades out of view. 2010 boasts another common last name prefix ('de'), and in 2015, some financial references pop up.

If we filter out football-related terms and last name prefixes, the top 20 overall becomes: constitution, treaty, rejection, merger, friendly, tribunal, cruise, referendum, cycling, international, master, giant, joint, tonight, structure, airport, flew, flight, ship, politician. The main additions here are terms related to international travel, along with another sport associated with the Dutch (cycling). However, the overall impression does not change much.

Most of these words are nouns which do not convey a particular sentiment. Indeed, if we generate a word cloud of the texts mentioning the Netherlands, and strip out most football terms, the remainder is fairly generic, as the image below shows. (Here words are sized by their total appearance count, not by how strongly they are associated with the Netherlands, as was the case in the table above).

To get a closer look at how the British press thinks about the Netherlands, I took only those words that appear unusually often next to very positive adjectives (awesome, exceptional, etc.) or very negative adjectives (abhorrent, wretched, etc.). I then repeated the analysis just presented, but filtering down to just those words. Since these may not be as common, I changed the minimum number of appearances of a word to 50. Moreover, since we want to be sure that the negative word doesn't appear at the far opposite end of a sentence from the Dutch reference, I changed the analysis from the sentence level to a window of 10 words on either side of our key terms.
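To make the switch from sentences to a word window concrete, here is a minimal Python sketch. The function name and the assumption that key terms are single lowercase tokens are illustrative only; a multi-word term such as 'The Hague' would need extra handling.

```python
def window_words(tokens, key_terms, window=10):
    """Return the tokens within `window` positions of any key term.

    The key terms themselves are left out of the result. Overlapping windows
    are merged, so no token is returned twice for the same position.
    """
    keep = set()
    for i, tok in enumerate(tokens):
        if tok in key_terms:
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            keep.update(range(start, stop))
    return [tokens[j] for j in sorted(keep) if tokens[j] not in key_terms]
```

For example, with `window=2` the token list `a b c netherlands d e f g` yields `b c d e`.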

The most positive words associated with the Netherlands are: basten, cruyff, rembrandt, rhine, canals, 17th-century, xavi, budapest, cruises, cruise, destinations, bulbs, courage, goalkeeper, paintings. The top two are footballers, as is Xavi, who is, interestingly, Spanish rather than Dutch. In addition, we see references to Dutch 17th-century painting (Rembrandt, 17th-century, paintings), to the Netherlands as a travel destination (cruise, which also appeared in the table above) and to canals and bulbs.

On the negative side, the top words are abn, amro, far-right, crimes, nazis, prosecutors, prosecutor, defeat, and arrested (no other words meet the cut-off conditions). ABN-Amro was a large Dutch bank which failed; other than that it appears that nazi/far-right crimes committed in the Netherlands make the news in Britain too.

Finally, I conducted sentiment analysis on the UK press' coverage of the Netherlands. I used a lexicon-based method developed here at the STAIR lab, which takes the average sentiment values across 7 prominent sentiment analysis lexica, and scales them so that the average sentiment for a newspaper article in our broader corpus (all articles published in these newspapers in 2005, 2010, or 2015) is exactly 0. This allows us to see whether references to the Netherlands appear in a context of positive or negative sentiment overall.
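A rough Python sketch of this kind of lexicon averaging and re-centering follows. The toy lexica, the per-article scoring rule (mean word score, with unknown words counted as 0), and the choice to re-center by subtracting the corpus mean are all assumptions made for illustration; the lab's actual method may differ in each of these respects.

```python
def raw_score(tokens, lexica):
    """Average, across lexica, of the mean word score of an article (unknown words score 0)."""
    if not tokens:
        return 0.0
    per_lexicon = [sum(lex.get(t, 0.0) for t in tokens) / len(tokens) for lex in lexica]
    return sum(per_lexicon) / len(per_lexicon)


def centered_scores(articles, lexica):
    """Shift raw scores so that the average article in the corpus scores exactly 0."""
    raw = [raw_score(tokens, lexica) for tokens in articles]
    corpus_mean = sum(raw) / len(raw)
    return [score - corpus_mean for score in raw]


# Two hypothetical mini-lexica standing in for the seven real ones.
toy_lexica = [{"good": 1.0, "bad": -1.0}, {"good": 0.5, "bad": -0.5}]
toy_articles = [["good", "day"], ["bad", "day"], ["good", "good"]]
scores = centered_scores(toy_articles, toy_lexica)
```

After centering, a positive score means an article is more positive than the average article in the corpus, and a negative score means it is more negative.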



The rise of Uber

In just a few years, Uber and its service model have become inescapable. Compared to taxis, Uber rides are often significantly cheaper and easier to arrange. However, Uber's popularity has also caused apparent dissatisfaction among taxi drivers and companies, and some people worry about their safety when using Uber. In Europe, Uber has run into numerous challenges, most notably in the form of the French taxi strike aimed at Uber in June 2015.

How were Uber's rise and its challenges covered in the media in 2015? More generally, does the news coverage of Uber and regular taxis reflect their respective strengths and weaknesses?

To analyze these questions, I set up two categories of words. The first category consisted of the word 'Uber' alone. The second category consisted of the words 'taxi' and 'cab', because those two words are largely interchangeable (I included singular and plural, as well as capitalized and uncapitalized versions of the words). I then looked for sentences containing these words in the corpus of all articles published in the New York Times in 2015.
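One way to express these two categories is with regular expressions, as in the minimal Python sketch below. The exact patterns are assumptions for illustration, not the matching rules actually used; in particular, the word boundaries (`\b`) are one way to keep 'cab' from matching inside longer words.

```python
import re

# Assumed patterns: 'Uber' as a capitalized word; taxi/cab in singular and
# plural, capitalized or not. \b restricts matches to whole words.
UBER = re.compile(r"\bUber\b")
TAXI = re.compile(r"\b(?:[Tt]axis?|[Cc]abs?)\b")


def categories(sentence):
    """Return the set of categories ('uber', 'taxi') that a sentence mentions."""
    found = set()
    if UBER.search(sentence):
        found.add("uber")
    if TAXI.search(sentence):
        found.add("taxi")
    return found
```

With these patterns, 'I took a cab to the cabinet meeting' counts as a taxi sentence only because of the standalone word 'cab', and 'The Cabernet was nice' matches nothing.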

In total, 2008 sentences in this corpus mention 'Uber' (0.08% of all sentences), and 1511 mention 'taxi' or 'cab' (0.06%). Perhaps surprisingly, Uber is the more frequently mentioned of the two categories, with 1.3 mentions for every sentence that mentions taxis or cabs. The words most strongly associated with each category (compared to a baseline of all sentences not mentioning either) are shown below:


Associated with Uber: black-car, car-hailing, verifications, for-hire, ride-booking, ride-hailing, best-in-class, ride-sharing, misrepresenting, app-based, delves, drivers, on-demand, livery, rides, summon, mapping, sobering, verification, vigilant, checks, wheelchair-accessible, contemplating, intimidation, unicorns, valuation, robotics, stricter, contractors, medallion, congestion, driver, self-driving, third-party, stringent, background, endeavor, fares, rape, kidnapping, class-action, confessed, raped, start-ups, screened, riders, classification, capitalists, operates

Associated with taxis/cabs: black-car, dashboard-mounted, livery, tuk-tuks, wheelchair-accessible, limousine, tuk-tuk, for-hire, medallion, medallions, narrates, microcosm, naturalist, hailing, complimentary, ride-sharing, impressively, drivers, ride-hailing, hail, driver, app-based, cocoa, storyteller, boundary, rides, yellow, flagged, stringent, vans, carts, slippery, songwriters, fares, wry, adventures, buses, dashboard, passengers, pretend, hailed, contested, motorcycle, riders, ride, fleet, groceries, fare, crashing, passenger


As we would expect, given that Uber is intended to compete directly with regular taxis, many of the words are the same. However, discussions of Uber are more likely to talk about particular features -- ride-booking, app-based (also on the taxi list, but further down), on-demand, contractors -- and specific concerns -- kidnapping, rape -- that are particularly salient for Uber riders. In addition, words associated with discussions of Uber's business model also appear: unicorns, valuation, capitalists.

In contrast, the list of words for taxis and cabs is centered more on the taxis and their riders themselves: medallions, drivers, hail, hailing, rides, yellow, flagged, fares, riders, fleet, passenger, etc.

We can also generate a word cloud of the sentences in question. These will show all words in these sentences, not just those that stand out relative to a baseline of unrelated text.


Taxis and cabs


The word cloud for Uber further demonstrates the fact that it gets mentioned both as a competitor of regular taxis and as a model of a particular business approach. In the latter category, Lyft and Google now appear, as well as words like employee, data, growth, company/companies, and billions. Moreover, we get a sense of some key cities where Uber is active: New York and San Francisco, but also around the world, such as in New Delhi, India.

Meanwhile, the second word cloud appears more focused both on what taxis do and on New York City: city, street, hotel, driving, people, airport, vehicle, etc., plus New York, Manhattan, mayor, and the Taxi & Limousine Commission.

To improve this analysis, I would look for ways to make sure 'cab' captures only references to taxicabs, and not to cabinets or Cabernet.



U.S. media coverage of the Rwandan and Bosnian genocides

In November 2015, ISIS launched several terrorist attacks targeting Beirut and Paris only days apart, and killing 43 and 129, respectively. Although the media covered both attacks, the headlines and narratives skewed the severity of the events. On one hand, headlines on the Beirut attack reported: "Suicide Bombing Kills at least 37 in Hezbollah Stronghold of Southern Beirut." On the other hand, headlines on the Paris attacks stated "Paris Terror Attacks Leave Awful Realization: Another Massacre." Across the board, media outlets emphasized the Hezbollah stronghold in Beirut and the massacre in Paris.

As Nadine Ajaka comments in The Atlantic, affiliating the victims in the Beirut attacks with a Hezbollah stronghold makes ISIS' attacks in Lebanon seem "expected" and even "retaliatory." Meanwhile, news sources were providing live updates reporting detailed human rights violations in Paris, allowing for greater exposure of the terrorist attacks. Thus, it is obvious that in the Western media, "white victims are being humanized in a way Arab victims aren't."

Unfortunately, the recent terrorist attacks are not the only time when the Western media has covered two similar attacks on human rights under different lights. The Rwandan and Bosnian genocides were two of the worst human rights catastrophes in the late 20th century. Despite the large difference in the death toll between the two cases, around 800,000 in Rwanda and around 8,000 in Srebrenica, President Clinton emphasized the U.S.' interests in the Balkans over those in Africa. Although Rwanda's genocide death toll was 100 times larger than Bosnia's, I would expect that the U.S. media covered Bosnia's genocide more extensively.

To test this hypothesis, I used a large corpus of articles mentioning human rights from three leading U.S. newspapers: The Washington Post, The New York Times, and USA Today. For the Rwandan case, I searched all articles in 1994 (the year of the Rwandan genocide) for articles containing both "genocide" and "Rwanda." This produced a sub-corpus of 305 articles. For the Bosnian case, I searched all articles in 1995 (the year of the Srebrenica genocide) for articles containing both "genocide" and "Bosnia." This search captured 443 articles.

Western media thus covered the Srebrenica genocide more extensively than the Rwandan genocide, at least when explicitly framing these episodes in human rights terms. (But see the earlier post on this blog about mentioning human rights in articles about genocide.) Although the death toll in Rwanda was 100 times greater than the one in Srebrenica, leading U.S. newspapers reported on the Bosnian genocide 45% more.
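For readers who want to check the arithmetic, the comparison can be reproduced with a short Python sketch. The helper names are hypothetical, and the case-insensitive substring test is an assumption about how the search worked; only the two article counts (305 and 443) come from the analysis above.

```python
def mentions_both(text, term_a, term_b):
    """True if an article's text contains both search terms (case-insensitive substring match)."""
    lowered = text.lower()
    return term_a.lower() in lowered and term_b.lower() in lowered


def percent_more(larger, smaller):
    """How much larger one count is than another, as a percentage."""
    return (larger / smaller - 1) * 100


# 443 articles from the 1995 Bosnia search versus 305 from the 1994 Rwanda search.
print(round(percent_more(443, 305), 1))  # 45.2
```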

These findings hint at a Western media bias towards white victims in an international context. This is not the first time that the media has treated tragedies in this way and it certainly will not be the last time. In the future, it is important to keep this coverage bias in mind, as we learn more about recent attacks.



A democracy bias? Using text-mining to explore word choices in the American media

Do the American media consistently differentiate between systems of government with their language? Are non-democratic regimes consistently treated more negatively than democratic regimes, or are democratic regimes more subject to criticism because of their close ideological proximity to the United States? Is there any difference at all?

Scholarly studies suggest that the media are not only responsible for setting and altering the public agenda, as described in agenda-setting theory, but also for the formation of public opinion on the proclaimed relevant topics of a particular time period. Since democracy is one of the most quintessentially "American" values, it would be easy to imagine that 'democracy' continuously carries a positive connotation in Western media. Conversely, non-democratic forms of government like autocracies, monarchies, or dictatorships might have negative connotations, since they do not directly reflect American values.

To explore the potential bias for or against different forms of government in the media, I compiled two lists of words (actually: word stems) to reflect all of the non-democratic forms of government and democracies, respectively. The former list consisted of 'autocra', 'monarch', 'dictator', 'oligarch', 'technocra', 'theocra', 'aristocra', 'junta', and 'totalitarian'. Democracies were represented with the prefix 'democra'.

In the New York Times in 2015, 0.2% of all sentences included one of the aforementioned word stems. In total, 2,916 sentences were about democracies and 2,195 sentences were about one of the nine other forms of government. Consequently, for every one sentence containing a 'non-democratic' word, there were 1.3 sentences written about democracies / democrats (Note: I excluded 'Democra-' with a capital 'D', to avoid incorporating references to the U.S. Democratic Party). These results suggest that the difference between how often democracies and other regime types are spoken about in the media is hardly significant.
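The stem matching described above might look roughly like the following Python sketch. The patterns are assumptions for illustration: the democracy stem is matched case-sensitively so that 'Democrat' and 'Democratic' are skipped, which also skips a sentence-initial 'Democracy', while the non-democratic stems are matched regardless of case.

```python
import re

NON_DEM_STEMS = ["autocra", "monarch", "dictator", "oligarch", "technocra",
                 "theocra", "aristocra", "junta", "totalitarian"]

# Lowercase-only match: deliberately excludes 'Democra-' with a capital 'D'.
DEM = re.compile(r"\bdemocra\w*")
NON_DEM = re.compile(r"\b(?:" + "|".join(NON_DEM_STEMS) + r")\w*", re.IGNORECASE)


def classify(sentence):
    """Return (mentions_democracy, mentions_non_democracy) for a sentence."""
    return bool(DEM.search(sentence)), bool(NON_DEM.search(sentence))
```

A sentence such as 'Monarch butterflies migrate south' is still picked up by the 'monarch' stem, so stem matching alone cannot separate forms of government from other uses of these words.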

Noteworthy words used in sentences about democracies (the complete list appears below) included 'aggression', 'consolidation', 'reliable', 'whites', 'jails', 'eccentricity', 'icon', 'effectively', 'destroying', 'shameful', 'undermined', 'turnout', 'inaugural', 'corporations', 'privacy', 'diverse', 'tolerance', 'champion', 'peoples', 'expense' and 'campaigns'. As a whole, the list of words on democracies appeared political in nature and close to what one might have expected to see from liberal newspaper articles.

In contrast, the list of words on non-democratic forms of government was not as clearly tied to politics. Although several of the words most commonly used in sentences about non-democratic governments were political ('fascist', 'provincial', 'impeached', 'allied', 'tortured', 'queen', 'suppressed', 'leftists', 'el-qaddafi', and 'right-wing'), the rest of the list was composed of words like 'mansion', 'rhyme', 'plays', 'spooky', 'playwright', 'iron', 'butterflies', 'gospel', 'household', 'lovingly', 'leapfrogs', and 'nightmarish'.

The prominence of such non-political terms underscores the difficulty of precisely specifying a topic of interest. Thus, the word stems 'monarch' and 'aristocra' don't only refer to non-democratic forms of government, but also to individuals (monarchs and aristocrats) prominent in literature and the arts (Downton Abbey, for instance!), and even to butterflies (monarch).

To improve the analysis, the word associations found could be further analyzed using manual coding to determine the actor-subject identification of certain negative words found in the lists, such as 'destroying', 'undermined', 'aggression', 'suppressed', and 'tortured'. Determining who is the actor has major implications for the interpretation of each list. Additionally, further analysis could look into the relationship between private (media) bias and public (national) bias, and expand both the newspaper sources and years analyzed.

The table below provides the list of words most strongly associated with each category.


Associated with the word stem 'democra': aggression, eligible, consolidation, sentimental, reliable, inevitably, shared, separation, wherever, eccentricity, whites, strategies, effectively, jails, destroying, inaugural, undermined, processes, shameful, mainland, consolidate, viable, campaigns, turnout, exposing, creates, promotion, reflected, corporations, privacy, advocated, screenplay, strengthen, topics, genuine, promotes, opposite, icon, diverse, session, tolerance, procedures, numerous, relevant, champion, undermines, developing, characteristics, peoples, expense

Associated with word stems representing non-democratic forms of government: mansion, imagines, rhyme, treating, plays, history-skimming, watchable, fascist, proxy, spooky, compulsively, reign, recall, leapfrogs, household, provincial, flat-out, executed, lovingly, destroyed, burial, reader, impeached, nightmarish, ministries, butterflies, allied, signature, playwright, infused, interviews, cousin, linked, evasion, audacious, collaboration, tortured, queen, vicariously, gospel, migration, resigned, iron, suppressed, license, leftists, adapted, el-qaddafi, von, right-wing



Identifying human rights-related news coverage of genocide

Scholars interested in the coverage of human rights issues in the media often identify articles by searching for the term "human rights". However, not all articles about human rights issues necessarily use that term. As part of ongoing research in the STAIR lab, we decided to look into this problem a bit more.

We selected a number of different human rights violations to examine. In this first post, I discuss our findings for genocide. Arguably one of the most serious crimes against humanity, genocide has occurred several times in the past few decades, with the targeting of Yazidis by the Islamic State merely the latest example. (Other recent cases have been Darfur in the early 2000s and Rwanda in 1994.)

For this analysis we focused on three major American newspapers: the New York Times, USA Today, and the Washington Post. We downloaded all articles containing the search term 'human rights' over the past 35 years (1981-2015), and did the same for the word stem 'genocid' (in order to capture genocide, genocidal, genocidaire, etc.). More than 125,000 articles met the first search criterion; almost 21,000 met the second. Of the 20,858 genocide articles, however, only 4,271 include the search term human rights; nearly four times as many articles do not.

Not all human rights news mentions human rights!

Still, it is conceivable that articles not mentioning human rights as such are in fact not about genocide as a human rights violation. For example, sometimes people are described as looking or acting like "a genocidal maniac" without having anything to do with an actual genocide.

My next step was to divide the genocide articles into two categories: those that do and those that do not mention human rights. We then used machine learning techniques on the latter sub-corpus in order to identify those articles that are about human rights without mentioning the term. Specifically, I constructed a corpus of human rights articles not mentioning genocide and a corpus of general (non-human rights) articles also not mentioning genocide, and asked the computer to learn the difference.

The computer learned to distinguish the two categories quite well, scoring 93% accuracy and a Krippendorff alpha of 0.86 on a held-out test set. This gives us confidence that the genocide articles it classified as about human rights actually are so. Accordingly, we can add to our corpus of human rights articles about genocide almost 60% of those that do not mention human rights. In fact, looking through the articles classified by the computer suggests that the computer is being quite conservative, and likely we are still omitting quite a few human rights-related articles. Nonetheless, this process effectively triples our corpus of human rights-related media coverage of genocide.
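For readers unfamiliar with the two evaluation measures, here is a minimal Python sketch of accuracy and of Krippendorff's alpha for nominal labels, treating the true labels and the classifier's predictions as two "coders" with no missing values. That framing, and the function names, are assumptions for illustration; the exact evaluation code used in the analysis is not shown in this post.

```python
from collections import Counter


def accuracy(truth, predicted):
    """Share of items on which the two label sequences agree."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def krippendorff_alpha_nominal(labels_a, labels_b):
    """Krippendorff's alpha for two coders, nominal labels, no missing data."""
    n_values = 2 * len(labels_a)                    # total number of labels assigned
    totals = Counter(labels_a) + Counter(labels_b)  # how often each label was used
    # Each disagreeing item adds two off-diagonal entries to the coincidence matrix.
    mismatches = 2 * sum(a != b for a, b in zip(labels_a, labels_b))
    observed = mismatches / n_values
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)
    expected /= n_values * (n_values - 1)
    if expected == 0:  # only one label ever used: agreement is trivially perfect
        return 1.0
    return 1.0 - observed / expected
```

Unlike accuracy, alpha corrects for the agreement that would be expected by chance, which is why it is lower than the accuracy figure even for a well-performing classifier.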

As an additional verification step, I combined the genocide articles mentioning human rights with those that do not but were classified as human rights-related, and asked the computer to try to tell them apart (after removing the term human rights from the first group). This proved to be a very difficult task, with a Krippendorff alpha on the held-out test set of just 0.32. The results support our confidence in the computer's decisions on the classification of human rights-related articles.

Mentioning human rights is not arbitrary!

Nevertheless, it is interesting to take a closer look at what might distinguish the two groups: What might make a journalist decide to mention human rights in an article about genocide? To examine this question more closely, we look specifically at the sentences in these articles that actually mention genocide, and identify words more likely to appear in sentences from either group of articles.

The table below provides the list of words most strongly associated with each category.


No mention of human rights Human rights mentioned
histories, oral, emergency, deception, techniques, sway, documenting, reveals, lie, fascism, fascist, remembrance, annihilation, centennial, recognizing, commemorate, surviving, propaganda, racist, demonstrators, millions, condemning, condemn, historians, catastrophe, notably, empire, accuses, racism, conspiracy, priest, complicity, proof, mention, acquitted, acted, starvation, oppression, shame, horrors, minorities, aggression, tragedy, participated, cultural, scale, editor, century, atrocity, killers rights, abuses, gross, human, violations, advocates, disappearances, jurisdiction, activists, unwilling, penal, watch, prosecutions, universal, investigate, coined, responding, hoc, conventions, charter, violation, prosecute, human-rights, accountability, adviser, establishing, tribunals, conclusion, internationally, advocate, massive, commission, amounted, warnings, offenses, charging, expert, torture, treaties, determine, ratified, statute, ad, individuals, junta, large-scale, adopted, cases, lawsuit, unfolding


These two word lists suggest that articles that do not mention human rights are more likely to be about studying or remembering past genocides: in the first list, words like (oral) histories, documenting, remembrance, centennial, commemorate, historians, editor, and century all point in this direction.

In contrast, the second list contains more words focusing on the crime itself (and, relatedly, on prosecuting those who commit it): abuses, violation(s), advocates, jurisdiction, activists, penal, investigate, etc. In addition, although we cannot see this from the table, articles that do mention human rights contain, on average, almost twice as many occurrences of the word stem 'genocid' as do articles that do not.

It is important to note that the words listed above are specifically selected for being more prominent in one group or the other: if we look at the two groups of articles overall, they look quite similar, as the following two word clouds show:


No mention of human rights
Human rights mentioned


The word clouds reference largely the same genocides: Armenia, Rwanda, Bosnia, Darfur. The first word cloud does hint at 2 older genocides less obvious in the second one: 'nazi' (for the Holocaust), and 'khmer' (for Cambodia).

Finally, I looked at whether this same pattern emerges when we look for the mention of proper names more systematically (the word lists presented above deliberately excluded proper names). If we do so, the list for the articles that do not mention human rights now includes a number of names associated with the Holocaust ('holocaust', 'jews', 'elders' & 'zion', 'nuremburg', 'adolf', 'hitler'), Ukraine ('ukrainian' and 'stalin'), and Armenia ('turkey', 'armenian', 'ankara', 'ottoman'). In contrast, the other word list now features names associated with Rwanda ('alison' & 'des' & 'forges'), Guatemala ('efrain' & 'montt' & 'rios', 'guatemala', 'guatemalan', 'mayan'), and Chile ('pinochet', 'chilean').

In sum, media coverage of genocide is not always associated with the term 'human rights' (indeed, only about 1 in 3 articles mention that term). If we wish to understand human rights coverage in the media, therefore, searching on that term alone is not satisfactory. (On the other hand, simply selecting all uses of the word root 'genocid' is no less problematic, as it will include many uses that are not germane to human rights).

Moreover, and of equal importance, the inclusion of the term 'human rights' is not random. Instead, use of the term is associated with the occurrence and short-term aftermath of a genocide, while omission of the term is more common in articles reporting on longer-term effects and reactions. This pattern means that any attempt to derive systematic conclusions about media coverage from just the subset of articles mentioning human rights will likely give rise to unwarranted conclusions.



The British media and Trump

How do other countries view Donald Trump's presidential campaign? Working with a complete set of all articles published in 2015 by The Guardian and its sister publication The Observer, I recreated the same two analyses as in the previous two blog posts (word clouds, word associations).

Together, the two papers published more than one and a half times as many articles as The New York Times in 2015, but not surprisingly fewer of these articles were about the U.S. presidential campaign. Only 0.14% of all sentences contained one or more names of the top Republican contenders (the ratio for the NYT was four times higher).

Still, given the unusual nature of Trump's candidacy, he received plenty of coverage: 2604 sentences contained his name and no names of other top candidates. This is the set of sentences I use here. A further 3942 sentences mentioned one or more of the other candidates but not Trump.

Note: A considerable fraction of the articles analyzed here appear to be online-only articles. It is possible to filter these out, but since I'm interested in overall coverage of Trump, not just printed coverage, I decided to leave them in.

For the first analysis, I once again identified those words that disproportionately appear in sentences about Trump (as compared to sentences about the other Republican candidates). The table below shows the top 50 such words in The Guardian/Observer, side-by-side with the same list for the New York Times.


The Guardian: hateful, disinvite, schlonged, dirty, exults, jerk, swallows, radicalised, reverberate, magnate, venue, low-rent, nation-building, disablism, generator, pairings, bloody, dumbest, wealth, passport, retweets, riding, laugh, pageant, wore, universe, laughter, rapists, fascist, idiocy, register, appreciate, wind, developer, petition, elephant, incoherent, courses, sketches, troll, eccentric, offline, outburst, bald, monster, carnival, outcry, throws, enltrwow, xenophobic

The New York Times: wherever, licensing, pageant, vulgar, gates, pageants, contestants, helicopter, slobs, anti-immigrant, protester, apprentice, golf, rapists, disgusting, mosques, realdonaldtrump, ministers, outlets, towers, hell, dealers, stone, pigs, celebrity, anchor, offended, signatures, sexist, stupid, plaque, incendiary, insulting, mogul, third-party, hatred, universe, courses, pollsters, pen, insults, petition, outsize, deposition, tweets, afraid, database, bigotry, murderers, deporting


As was the case for the NYT, about one quarter of the top words in the Guardian are markedly negative. However, the lists are quite distinct (in fact there is an overlap of just 5 words). Where the American coverage appears more focused on specific, often personal insults/statements by Trump (slobs, pigs, sexist, stupid), the British paper's words indicate a greater emphasis on Trump's overall ideology (or lack thereof): fascist, xenophobic, incoherent.

Note: 'enltrWow' represents an attempt at rendering a Tweet whose text begins 'Wow'. Some of the online articles in the Guardian report Tweets made by prominent people in reaction to the news. It appears Trump elicits many such reactions.

Next, I generated word clouds from the sentences about Trump, keeping only those words that appeared in them at least twice as often as in other sentences. The table below compares the result to the same set-up in the U.S. case.


The Guardian / The Observer
The New York Times


Trump's frontrunner status in a succession of polls is clearly one of the most salient issues to The Guardian. They are also more focused on his wealth than is the NYT, with both mogul and billionaire quite prominent. Finally, as with the word lists above, the British paper emphasizes ideological aspects more, with racist and xenophobia both showing up, along with bigotry (in small print on the right).

It will be interesting to see whether and how coverage changes in both papers as the presidential campaigns enter the primary stages.



Picturing media coverage of Trump and Cruz

In the previous blog post, I looked at the New York Times coverage of Republican presidential candidates Trump and Cruz; in this post, I'll examine the same data using word clouds.

As a starting point, I took the 3552 sentences referring to Trump but not to any of the other top 7 Republican candidates, and the 1425 sentences referring to Cruz but not any of the others (see the previous post for more detail). I used these to generate the following two word clouds.



These clouds are not uninteresting, but they contain lots of words that don't really set Cruz or Trump apart. In order to home in on those aspects of the coverage, I filtered out all words whose relative frequency in Trump (or Cruz) sentences was not at least twice as high as in sentences about other Republican candidates, as well as in general sentences. This produced the following result:
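The filtering step can be sketched as follows in Python. The function names and the whitespace tokenizer are assumptions for illustration, not the code behind the word clouds; the rule itself (at least twice the relative frequency found in both comparison sets) follows the description above.

```python
from collections import Counter


def rel_freq(sentences):
    """Relative frequency of each lowercase, whitespace-separated word in a set of sentences."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def distinctive_words(target, others, general, min_ratio=2.0):
    """Keep words at least `min_ratio` times as frequent in `target` as in both other sets."""
    f_target, f_others, f_general = rel_freq(target), rel_freq(others), rel_freq(general)
    return {
        word: freq
        for word, freq in f_target.items()
        if freq >= min_ratio * f_others.get(word, 0.0)
        and freq >= min_ratio * f_general.get(word, 0.0)
    }
```

The surviving words and their frequencies can then be handed to whatever tool draws the cloud, with the frequency controlling the size of each word.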



This is much better. For Trump, we see the centrality of the immigration issue to his campaign — immigrant, border, wall — as well as the focus on his statements: remark, attention, statements, and language are all prominent. For Cruz, meanwhile, we see much more of an emphasis on religion — evangelical, religious, pastor — and on his ideological leaning: conservative is the most prominent word by far, and Tea (Party) shows up too.

In fact, for someone interested in a quick snapshot of the differences between the two candidates, these two word clouds do a pretty good job. The viscerally negative terms identified in the previous post barely appear, since their frequency is not as high as that of the more common/neutral words we see here, but even without them it is clear what sets Trump apart.



Trump, Cruz, and the media

With the Iowa caucuses nearly upon us, the two clear Republican front-runners are Donald Trump and Ted Cruz. Few would have expected this half a year ago, when both candidates' chances were dismissed due to their unlikeability, with the predicate "most hated" often applied to both men (Trump, Cruz).

Most voters do not meet any presidential candidate in person, and their impressions are heavily shaped by media coverage. It thus seems timely to see how these two candidates have been covered. In particular: How, if at all, has their coverage differed from that of the other Republican contenders?

Working with a corpus (dataset) containing all articles published in the New York Times in 2015 (about 72,000 articles), I identified all sentences in these articles referring to one (or more) of the top Republican presidential candidates: Bush, Carson, Christie, Cruz, Kasich, Paul, Rubio, and Trump.

I did so by searching for the most likely phrases to refer to each candidate specifically: "firstname lastname", "mr/ms lastname", and "office lastname". Thus, for Jeb Bush I searched for "Jeb Bush", "Mr Bush", and "Governor Bush". This search is of course not perfect; for example, there are at least two other "Mr Bush"es that are likely to have appeared in news stories. Still, it seems a reasonable starting point.
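To make the search concrete, here is a minimal sketch of how the three phrase types could be turned into a single regular expression in Python. This is an illustration only, not the code used for this post; the function name `candidate_pattern`, the optional period after "Mr"/"Ms", and the case-insensitive matching are all assumptions.

```python
import re

def candidate_pattern(first, last, office):
    """Build one regex covering the three phrase types (hypothetical helper)."""
    alternatives = [
        rf"{re.escape(first)}\s+{re.escape(last)}",    # "Jeb Bush"
        rf"(?:Mr|Ms)\.?\s+{re.escape(last)}",          # "Mr Bush" or "Mr. Bush"
        rf"{re.escape(office)}\s+{re.escape(last)}",   # "Governor Bush"
    ]
    # Word boundaries keep the phrases from matching inside longer words.
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)

bush = candidate_pattern("Jeb", "Bush", "Governor")
print(bool(bush.search("Mr. Bush spoke in Iowa.")))   # a phrase match
print(bool(bush.search("The bush was on fire.")))     # the bare word is not enough
```

Note that a pattern like this cannot tell one "Mr Bush" from another, which is exactly the imperfection mentioned above.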

I divided all sentences in the dataset into four categories: those referring only to the category of interest (e.g. Jeb Bush), those referring only to the category to compare against (e.g. all other Republican candidates), those referring to both, and those referring to neither.
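The four-way split can be sketched as follows. This is a simplified illustration under stated assumptions: it uses plain substring matching on hypothetical search terms rather than the full phrase search, and the function name `categorize` and the example sentences are made up.

```python
from collections import Counter

def categorize(sentence, interest_terms, comparison_terms):
    """Assign a sentence to one of the four categories (hypothetical helper)."""
    text = sentence.lower()
    has_interest = any(t.lower() in text for t in interest_terms)
    has_comparison = any(t.lower() in text for t in comparison_terms)
    if has_interest and has_comparison:
        return "both"
    if has_interest:
        return "interest only"
    if has_comparison:
        return "comparison only"
    return "neither"

# Made-up example sentences, one per category.
sentences = [
    "Jeb Bush spoke in Iowa.",
    "Mr Rubio and Mr Cruz disagreed.",
    "Jeb Bush criticized Mr Rubio.",
    "The weather was mild.",
]
counts = Counter(
    categorize(s, ["Jeb Bush"], ["Mr Rubio", "Mr Cruz"]) for s in sentences
)
print(counts)
```

Only the "interest only" and "comparison only" sentences feed into the comparison; the other two categories are set aside.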

After filtering out proper names, I identified those words most closely associated with the category of interest and those most closely associated with the comparison category. For each of these associations, I included only words that appear in either category at least twice as frequently as they occur in the rest of the corpus. This means that the word 'the', which appears a lot everywhere, is not going to get selected, while the word 'president' might.
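As a rough sketch, the "at least twice as frequent" filter might look like this in Python. The function name `distinctive_words` and the frequency values are invented for illustration, and treating a word that is missing from the rest of the corpus as passing the filter is my assumption here.

```python
def distinctive_words(category_freq, rest_freq, min_ratio=2.0):
    """Keep words whose relative frequency in the category is at least
    min_ratio times their relative frequency in the rest of the corpus."""
    return sorted(
        word
        for word, freq in category_freq.items()
        # A word absent from the rest of the corpus has frequency 0 there.
        if freq >= min_ratio * rest_freq.get(word, 0.0)
    )

# Made-up relative frequencies, for illustration only.
category_freq = {"the": 0.060, "president": 0.004, "said": 0.010}
rest_freq = {"the": 0.058, "president": 0.001, "said": 0.009}

print(distinctive_words(category_freq, rest_freq))  # ['president']
```

With these toy numbers, 'the' and 'said' are about equally common everywhere and drop out, while 'president' is four times as frequent in the category and stays.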

How do we capture 'close association', especially relative to another category? For this, we need frequency information. Our frequency measure is the number of times we saw a word in a set of sentences divided by the total number of words in the same set of sentences.
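In code, this frequency measure could be computed roughly as below. The tokenizer (lowercasing and keeping runs of letters, apostrophes, and hyphens) is an assumption made for this sketch, as is the function name `word_frequencies`.

```python
import re
from collections import Counter

def word_frequencies(sentences):
    """Relative frequency of each word: its count divided by the total
    number of words in the same set of sentences."""
    # Simplistic tokenizer, assumed for illustration.
    tokens = [t for s in sentences for t in re.findall(r"[a-z'-]+", s.lower())]
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

freqs = word_frequencies(["He built a wall.", "A wall is a wall."])
print(freqs["wall"])  # 3 of the 9 words
```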

One intuitive measure of relative closeness is to divide the frequency of a word in the first set of sentences by its frequency in the second set; however, this causes problems if the word doesn't occur in the second set at all. So instead I divide its frequency by the sum of its frequencies in the two sets. In other words, if a is the frequency in the first category and b is the frequency in the second, I calculate x = a/(a+b) (to get the ratio a/b from this, simply take x/(1-x)).
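A small sketch of this score, with hypothetical function names and made-up frequencies. It also shows how the plain ratio a/b can be recovered from the score as x/(1-x), and that a word missing from the second set simply gets a score of 1 instead of causing a division by zero.

```python
def association_score(a, b):
    """x = a / (a + b), where a and b are a word's relative frequencies
    in the first and second category."""
    if a + b == 0:
        raise ValueError("word occurs in neither category")
    return a / (a + b)

def ratio_from_score(x):
    """Recover the plain ratio a / b from the score x (undefined at x = 1)."""
    return x / (1 - x)

x = association_score(0.002, 0.001)    # twice as frequent in the first category
print(x, ratio_from_score(x))          # score near 2/3, ratio near 2
print(association_score(0.002, 0.0))   # absent from the second set: score of 1
```

The score always falls between 0 and 1, with 0.5 meaning the word is equally frequent in both categories.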

(As an alternative, we could simply measure differences in frequency. However, the advantage of the ratio-based measure is that if a word occurs ten times as frequently in one category as in the other, this will produce a very different value than if the word occurs only twice as frequently, even if the difference between the two frequencies is the same in these two cases (for example: 0.2 and 0.02 vs. 0.36 and 0.18).)
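To check the example numbers, this short snippet computes both measures for the two pairs of frequencies. Variable names are arbitrary.

```python
pairs = [(0.2, 0.02), (0.36, 0.18)]

results = []
for a, b in pairs:
    difference = a - b      # the simple difference in frequency
    score = a / (a + b)     # the ratio-based measure
    results.append((difference, score))
    print(f"a={a}, b={b}: difference={difference:.2f}, score={score:.3f}")
```

Both pairs differ by 0.18, but the score is about 0.909 for the first pair (ten times as frequent) and about 0.667 for the second (twice as frequent).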

The corpus contains 3552 sentences referring to Trump but not to any of the other 7 candidates (based on the search phrases used), and 10,977 referring to one or more of those 7 but not to Trump. (For Cruz, the comparable figures are 1425 and 13,221 -- Trump received about 2.5 times as many mentions!) Combined, these account for 0.54% of all sentences. Add in sentences referring to Trump as well as one or more of the other candidates, and 1 in 175 sentences in the New York Times in 2015 was about one or more Republican presidential candidates (across all sections of the paper)!

The table below shows the top 50 words most associated with Trump and Cruz when compared to the other Republican candidates. Unsurprisingly, since Trump's background is not in politics, there are more references to that background (The Apprentice, pageant, pageants, (Miss) Universe, celebrity, mogul) than is the case for Cruz. The most striking difference between the two lists, however, is the prominence of viscerally negative words for Trump.

More than 1 in 4 of the words most strongly associated with Trump fall into this category; I have highlighted them in the table. Some of these are words Trump himself has used; others are words that have been used to describe Trump or his statements. In contrast, Cruz's list contains just 2 words that are similarly negative (hated and maligned), and in both cases they are what you might call 'passive' words, with less of a visceral connotation, in my opinion.


Donald Trump: wherever, licensing, pageant, vulgar, gates, pageants, contestants, helicopter, slobs, anti-immigrant, protester, apprentice, golf, rapists, disgusting, mosques, realdonaldtrump, ministers, outlets, towers, hell, dealers, stone, pigs, celebrity, anchor, offended, signatures, sexist, stupid, plaque, incendiary, insulting, mogul, third-party, hatred, universe, courses, pollsters, pen, insults, petition, outsize, deposition, tweets, afraid, database, bigotry, murderers, deporting

Ted Cruz: marching, cell, vortex, civilians, passport, suggestive, carpet-bomb, upstate, calves, bombing, roommate, buffering, fled, pro-monopoly, debaters, hoteliers, pun, partyers, tournaments, blonds, carpet-bombing, aligning, fireside, maligned, leftist, marches, risking, gated, hated, anti-rubio, nude, oblivion, chat, informant, abolish, scream, firebrand, admirer, viewer, neutrality, clerk, flooded, pais, glow, chats, sand, realizes, pictures, courageous, casualties


So what does this mean? Do the findings simply reflect the different rhetoric of the two candidates? Are other people more willing to express negative opinions about Trump than about Cruz when interviewed? Do New York Times reporters dislike Trump more than they dislike Cruz, and is this dislike shining through? I'm not sure. Whatever the explanation, the difference is starker than I had expected.



This website's design is a work in progress

I learned basic HTML a long time ago, and for the longest time that sufficed. Last summer I decided it would be worth learning how to use cascading style sheets (CSS). For this website I wanted to be able to add a blog.

It turns out to be surprisingly difficult to have a blog fully integrated with your website. For example, WordPress is great, but comes with a lot of extra machinery, and requires a heroic effort to make it look exactly the same as the non-blog pages of the website.

After searching around for a bit, I decided that I wanted a static site generator. A growing number of these are available. For a variety of reasons, I chose Nikola. Although I struggled a bit with installation & getting everything set up, I have been very happy with Nikola overall.

For now, the site is using the Bootswatch Lumen theme, with some minimal modifications to make the STAIR logo fit the navigation bar and to make the front page image work. I mostly like the general design, but I plan to make numerous tweaks over the coming months. Stay tuned!



Explaining the image on the front page

The image on the front page is of an issue of the Virginia Gazette, an early American newspaper printed right here in Williamsburg. The feature article in this issue was the first of several written by Thomas Paine under the rubric "The American Crisis". These supported the colonial side in the war for independence from Britain. In other words, it is a text about international relations.

Not only that: Paine's words illustrate just how rich texts can be. The opening paragraph consists of one eminently quotable line after another: "These are the times that try men's souls." "Tyranny, like hell, is not easily conquered." "What we obtain too cheap, we esteem too lightly."

All in all, it's a pretty good image to use as the opening backdrop for our site.