With the Iowa caucuses nearly upon us, the two clear Republican front-runners are Donald Trump and Ted Cruz. Few would have expected this half a year ago, when both candidates' chances were dismissed due to their unlikeability, with the predicate "most hated" often applied to both men (Trump, Cruz).
Most voters do not meet any presidential candidate in person, and their impressions are heavily shaped by media coverage. It thus seems timely to see how these two candidates have been covered. In particular: How, if at all, has their coverage differed from that of the other Republican contenders?
Working with a corpus (dataset) containing all articles published in the New York Times in 2015 (about 72,000 articles), I identified all sentences in these articles referring to one (or more) of the top Republican presidential candidates: Bush, Carson, Christie, Cruz, Kasich, Paul, Rubio, and Trump.
I did so by searching for the phrases most likely to refer to each candidate specifically: "firstname lastname", "mr/ms lastname", and "office lastname". Thus, for Jeb Bush I searched for "Jeb Bush", "Mr Bush", and "Governor Bush". This search is of course not perfect; for example, there are at least two other "Mr Bush"es who are likely to have appeared in news stories. Still, it seems a reasonable starting point.
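To make the search concrete, here is a minimal sketch of how such phrase patterns could be built with regular expressions. The `CANDIDATES` table, the function names, and the choice to match case-insensitively and allow an optional period after "Mr" are my assumptions for illustration, not a record of the code actually used.

```python
import re

# Hypothetical lookup table: last name -> (first name, honorific, office titles).
# Only three candidates are shown; the others would follow the same shape.
CANDIDATES = {
    "Bush": ("Jeb", "Mr", ["Governor"]),
    "Cruz": ("Ted", "Mr", ["Senator"]),
    "Trump": ("Donald", "Mr", []),
}

def build_pattern(last, first, honorific, offices):
    """Compile a regex for 'first last', 'honorific last', or 'office last'."""
    prefixes = [re.escape(first), re.escape(honorific) + r"\.?"]
    prefixes += [re.escape(office) for office in offices]
    body = r"\b(?:" + "|".join(prefixes) + r")\s+" + re.escape(last) + r"\b"
    return re.compile(body, re.IGNORECASE)

PATTERNS = {last: build_pattern(last, *info) for last, info in CANDIDATES.items()}

def mentioned(sentence):
    """Return the set of candidate last names that the sentence refers to."""
    return {last for last, pattern in PATTERNS.items() if pattern.search(sentence)}
```

Calling `mentioned("Mr. Bush and Ted Cruz spoke.")` would return both names, while a sentence that merely contains the word "bush" would return an empty set. Like the search described above, this cannot tell one "Mr Bush" from another.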
I divided all sentences in the dataset into four categories: those referring only to the category of interest (e.g. Jeb Bush), those referring only to the category to compare against (e.g. all other Republican candidates), those referring to both, and those referring to neither.
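The four-way split can be sketched as follows, assuming each sentence has already been reduced to the set of candidates it mentions. The function name and the category labels are hypothetical.

```python
def categorize(mentions, target, others):
    """Assign a sentence to one of four categories.

    mentions: set of candidate names found in the sentence
    target:   the candidate of interest, e.g. "Trump"
    others:   set of candidates to compare against
    """
    has_target = target in mentions
    has_other = bool(mentions & others)
    if has_target and has_other:
        return "both"
    if has_target:
        return "target_only"
    if has_other:
        return "comparison_only"
    return "neither"
```

For example, `categorize({"Trump", "Cruz"}, "Trump", {"Cruz", "Bush"})` falls into `"both"`, so that sentence would count toward neither word list.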
After filtering out proper names, I identified those words most closely associated with the category of interest and those most closely associated with the comparison category. For each of these associations, I included only words that appear in either category at least twice as frequently as they occur in the rest of the corpus. This means that the word 'the', which appears a lot everywhere, is not going to get selected, while the word 'president' might.
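One way to read this filter is sketched below. It assumes word counts are held in `collections.Counter` objects and that a word which never occurs in the rest of the corpus passes automatically; both the names and that edge-case choice are my assumptions.

```python
from collections import Counter

def rel_freq(counts):
    """Convert raw counts to relative frequencies (count / total words)."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def distinctive_words(target, comparison, rest, factor=2.0):
    """Keep words whose frequency in either category is at least
    `factor` times their frequency in the rest of the corpus."""
    f_t, f_c, f_r = rel_freq(target), rel_freq(comparison), rel_freq(rest)
    keep = set()
    for word in set(f_t) | set(f_c):
        best = max(f_t.get(word, 0.0), f_c.get(word, 0.0))
        # A word absent from the rest of the corpus has frequency 0 there,
        # so it always passes (an assumption, see the note above).
        if best >= factor * f_r.get(word, 0.0):
            keep.add(word)
    return keep
```

With toy counts in which "the" is common everywhere and "mogul" is rare outside the candidate sentences, only "mogul" survives the filter.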
How do we capture 'close association', especially relative to another category? For this, we need frequency information. Our frequency measure is the number of times we saw a word in a set of sentences divided by the total number of words in the same set of sentences.
One intuitive measure of relative closeness is to divide the frequency of a word in the first set of sentences by its frequency in the second set; however, this causes problems if the word doesn't occur in the second set at all. So instead I divide its frequency in the first set by the sum of its frequencies in the two sets. In other words, if a is the frequency in the first category and b is the frequency in the second, I calculate x = a/(a+b), which always lies between 0 and 1 (to recover the ratio a/b from this, simply take x/(1-x)).
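As a minimal sketch of the measure, assuming sentences are plain strings split on whitespace (a simplification; the function names are hypothetical):

```python
from collections import Counter

def word_frequencies(sentences):
    """Occurrences of each word divided by the total words in the set."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def association(word, freq_first, freq_second):
    """Score a/(a+b): 1.0 means the word occurs only in the first set,
    0.0 only in the second, 0.5 equally often in both."""
    a = freq_first.get(word, 0.0)
    b = freq_second.get(word, 0.0)
    if a + b == 0:
        raise ValueError(f"{word!r} occurs in neither set")
    return a / (a + b)
```

A word missing from the second set simply scores 1.0, which is the division-by-zero case that the plain ratio cannot handle.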
(As an alternative, we could simply measure differences in frequency. However, the advantage of the ratio-based measure is that if a word occurs ten times as frequently in one category as in the other, this will produce a very different value than if the word occurs only twice as frequently, even if the difference between the two frequencies is the same in both cases (for example: 0.2 and 0.02 vs. 0.36 and 0.18).)
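The two example pairs can be checked directly. This short sketch uses a hypothetical `association` helper for the score a/(a+b):

```python
def association(a, b):
    """Frequency in the first category over the sum of both frequencies."""
    return a / (a + b)

# (a, b) pairs from the example: same difference, different ratios.
pairs = [(0.2, 0.02), (0.36, 0.18)]
for a, b in pairs:
    print(f"difference={a - b:.2f}  score={association(a, b):.3f}")
# difference=0.18  score=0.909
# difference=0.18  score=0.667
```

Both pairs differ by 0.18, yet the score separates the tenfold case (0.909) from the twofold case (0.667).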
The corpus contains 3,552 sentences referring to Trump but not to any of the other 7 candidates (based on the search phrases used), and 10,977 referring to one or more of those 7 but not to Trump. (For Cruz, the comparable figures are 1,425 and 13,221; by this count, Trump received roughly 2.5 times as many mentions!) Combined, these account for 0.54% of all sentences. Add in sentences referring to Trump as well as one or more of the other candidates, and 1 in 175 sentences in the New York Times in 2015 was about one or more Republican presidential candidates (across all sections of the paper)!
The table below shows the top 50 words most associated with Trump and Cruz when compared to the other Republican candidates. Unsurprisingly, since Trump's background is not in politics, there are more references to that background (The Apprentice, pageant, pageants, (Miss) Universe, celebrity, mogul) than is the case for Cruz. The most striking difference between the two lists, however, is the prominence of viscerally negative words for Trump.
More than 1 in 4 of the words most strongly associated with Trump fall into this category; I have highlighted them in the table. Some of these are words Trump himself has used; others are words that have been used to describe Trump or his statements. In contrast, Cruz's list contains just 2 words that are similarly negative (hated and maligned), and in both cases they are what you might call 'passive' words, with less of a visceral connotation, in my opinion.
| Donald Trump | Ted Cruz |
|---|---|
| wherever, licensing, pageant, vulgar, gates, pageants, contestants, helicopter, slobs, anti-immigrant, protester, apprentice, golf, rapists, disgusting, mosques, realdonaldtrump, ministers, outlets, towers, hell, dealers, stone, pigs, celebrity, anchor, offended, signatures, sexist, stupid, plaque, incendiary, insulting, mogul, third-party, hatred, universe, courses, pollsters, pen, insults, petition, outsize, deposition, tweets, afraid, database, bigotry, murderers, deporting | marching, cell, vortex, civilians, passport, suggestive, carpet-bomb, upstate, calves, bombing, roommate, buffering, fled, pro-monopoly, debaters, hoteliers, pun, partyers, tournaments, blonds, carpet-bombing, aligning, fireside, maligned, leftist, marches, risking, gated, hated, anti-rubio, nude, oblivion, chat, informant, abolish, scream, firebrand, admirer, viewer, neutrality, clerk, flooded, pais, glow, chats, sand, realizes, pictures, courageous, casualties |
So what does this mean? Do the findings simply reflect the different rhetoric of the two candidates? Are other people more willing to express negative opinions about Trump than about Cruz when interviewed? Do New York Times reporters dislike Trump more than they dislike Cruz, and is this dislike shining through? I'm not sure. Whatever the explanation, the difference is starker than I had expected.