Topic Modeling of Thailand in Transition
Do newspaper articles that are published before and after the occurrence of a regime transition differ systematically in terms of their topic coverage and content? Conventional wisdom holds that media patterns are likely to be responsive to changes in government and political institutions as a state undergoes a regime transition. However, it is difficult to make well-informed inferences about how such process occurs in practice and to understand its implications without a comprehensive content analysis of a large sample of newspaper articles that are published during periods of transitions.
Utilizing an original dataset containing all political news published from 2013 to 2015 by The Nation, an English-language daily newspaper in Thailand, this post explores the relationship between regime transition and the media. Specifically, I analyze newspaper articles that were published under democratic rule (pre-May 2014 coup) and those that were published under military rule (post-May 2014 coup), using latent Dirichlet allocation (LDA) to discover latent topics that are associated with the corpus and to trace the prevalence of these topics amidst a regime transition. Then, leveraging the topic labels generated by the model, I conduct a series of quantitative content analyses for assessing the frequencies of words and measuring the sentiments associated with a given topic over time and across the cutoff point of the May 2014 coup d’état.
A Flow Chart of Methodology
All newspaper articles published by The Nation from 2013 to 2015 were collected from the Nexis Uni database via Rselenium, which extracted relevant information such as the headline, content, and date. The search query used in this process of data collection is “thailand”. The rationale behind this data collection strategy is to gather as many articles by The Nation as possible since in theory there is no a priori justification to regard one type of articles as more or less relevant to the analysis of the effects of regime type than another. However, in practice, due to time constraints and computational efficiency, only articles that contain a set of keywords that are predictive of their political content were retained for the analysis.
These keywords include: “politic*”, “gov*”, “democ*”, “Yingluck”, “junta”, “Prayut”, “Pheu”, and “NCPO”. Any articles that contain at least one of these words or word stems are selected. One problem with this strategy is that the keyword selection process is potentially arbitrary, inadequate for minimizing false positives and lacking in terms of intercoder reliability. An alternative strategy would be to use a computer-assisted approach offered by King, Lam and Roberts (2017), relying on an unsupervised selection of documents in search set, S, into target set, T, based on a hand-coded reference set R. However, in this project, the dictionary-based approach is selected for computational efficiency. This process yields a total of 20,703 articles (D) where the average length is 580 words.
# Set working directory
setwd("~/polisci490/Project")
# Load packages
packages <- c("xml2","rvest", "dplyr", "tm", "tidytext", "ggplot2", "SnowballC", "tidyverse", "lubridate", "stringr", "httr", "SnowballC", "wdman", "RSelenium", "tcltk", "XML", "topicmodels", "stringi", "LDAvis", "slam", "ldatuning", "kableExtra", "widyr", "igraph", "ggraph", "fmsb", "pander")
load.packages <- function(x) {
if (!require(x, character.only = TRUE)) {
install.packages(x, dependencies = TRUE)
library(x, character.only = TRUE)
}
}
lapply(packages, load.packages)
########## Data Collection (Don't Run this!!)##########
# Run Selenium Server
rD <- rsDriver()
remDr <- remoteDriver(remoteServerAddr = "localhost"
, port = 4567L
, browserName = "chrome"
)
# Log in
remDr$open()
remDr$navigate("https://advance-lexis-com.turing.library.northwestern.edu/api/permalink/920dc8d1-6625-47c3-bdd4-b1c073f28f78/?context=1516831")
username <- remDr$findElement(using = 'id', value = "IDToken1")
password <- remDr$findElement(using = 'id', value = "IDToken2")
username$sendKeysToElement(list("id"))
password$sendKeysToElement(list("pass", "\uE007"))
# click
continue <- remDr$findElement(using = "xpath", "//*[@id='33Lk']/div/div/div/section/div/menu/input[1]")
continue$clickElement()
# Initializing an empy dataframe
# df <- data.frame(title=NULL, text=NULL, date=NULL, length=NULL)
# Scraping
remDr$setTimeout(type = "page load", milliseconds = 100000)
remDr$setImplicitWaitTimeout(milliseconds = 100000)
for(i in 1:20000){
tryCatch({
title <- remDr$findElement(using = "id", "SS_DocumentTitle")
date <- remDr$findElement(using = "xpath", "//*[@id='document']/section/header/p[3]")
length <- remDr$findElement(using = "xpath", "//*[@id='document']/section/span/div[2]")
body <- remDr$findElement(using = "xpath", "//*[@id='document']/section/span")
preptext <- function(x){ data.frame(matrix(unlist(x), nrow=length(x), byrow=T))}
title_txt <- preptext(title$getElementText())
date_txt <- preptext(date$getElementText())
length_txt <- preptext(length$getElementText())
body_txt <- preptext(body$getElementText())
docdf <- data.frame(title=title_txt, text=body_txt, date=date_txt, length=length_txt)
df <- rbind(df, docdf)
Sys.sleep(sample(1:4, 1))
next_page <- remDr$findElement(using="xpath", "//*[@id='_gLdk']/div/form/div[1]/div/div[14]/nav/span/button[2]")
next_page$highlightElement()
next_page$clickElement()
}, error=function(e){
Sys.sleep(sample(1:10, 1))
title <- remDr$findElement(using = "id", "SS_DocumentTitle")
date <- remDr$findElement(using = "xpath", "//*[@id='document']/section/header/p[3]")
length <- remDr$findElement(using = "xpath", "//*[@id='document']/section/span/div[2]")
body <- remDr$findElement(using = "xpath", "//*[@id='document']/section/span")
preptext <- function(x){ data.frame(matrix(unlist(x), nrow=length(x), byrow=T))}
title_txt <- preptext(title$getElementText())
date_txt <- preptext(date$getElementText())
length_txt <- preptext(length$getElementText())
body_txt <- preptext(body$getElementText())
docdf <- data.frame(title=title_txt, text=body_txt, date=date_txt, length=length_txt)
df <- rbind(df, docdf)
Sys.sleep(sample(1:4, 1))
next_page <- remDr$findElement(using="xpath", "//*[@id='_gLdk']/div/form/div[1]/div/div[14]/nav/span/button[2]")
next_page$highlightElement()
next_page$clickElement()
})}
# Assign variable names
names(df) <- c("title", "text", "date", "wordcount")
# Set date
df$date <- mdy(as.character(df$date))
# Clean content
df$text <- gsub("^.*Body\n|Classification.*$","", as.character(df$text))
df$text <- gsub( " *@.*?; *", "", as.character(df$text))
df$text <- gsub( "*â€", "", as.character(df$text))
df$text <- gsub( "*–", "", as.character(df$text))
df$title <- gsub( " *@.*?; *", "", as.character(df$title))
df$title <- gsub( "*â€", "", as.character(df$title))
df$title <- gsub( "*–", "", as.character(df$title))
df$title <- gsub( "*#124", "", as.character(df$title))
df$text <- gsub( "*#124", "", as.character(df$text))
df$text <- gsub("*Prayuth", "Prayut", as.character(df$text)) # Prayut & Prayuth
df$title <- gsub("*Prayuth", "Prayut", as.character(df$title)) # Prayut & Prayuth
# Get word count
df$wordcount <- as.numeric(gsub("Length: | words", "", df$wordcount))
# Filter for politics
keyword <- c("politic*", "gov*", "democ*", "Yingluck", "junta", "Prayut", "Pheu", "NCPO")
pattern <- grepl(paste(keyword, collapse = "|"), df$text)
df <- df[pattern, ]
# select date Jan 2013 - Sep 2015
df.2013 <- filter(df, date >= "2013-01-01" & date < "2016-01-01")
# Remove duplicate rows
df.2013 <- distinct(df.2013)
# Add doc_id
df.2013 <- df.2013 %>% mutate(doc_id = seq.int(nrow(df.2013)))
# Finalize dataset
df.2013 <- df.2013 %>% select(doc_id, text, everything())
Table 1: Data Structure
Using the tm package, I converted the data to a corpus object, transformed words to lower cases and removed stop words, numbers, punctuations, followed by stemming and removing white spaces. Then, I parsed the corpus to a document-term matrix with articles as documents in the rows and terms in the columns. Finally, sparse terms that appear in less than 0.01 of all documents are removed.
This process yields 2,946 unique word stems (W) for the analysis, producing a 20,703 by 2,946 input matrix which contains 3,384,956 non-sparse entities for the topic model. One consideration that must be taken into account is that the topic models may be sensitive to these pre-processing choices (Denny and Sperling 2017). In theory, transforming to lower cases, removing numbers and removing punctuations should not significantly alter the findings of this project, whereas stemming and removing stop words may result in a loss of information or generate noise in the model. Although a comparison of findings with different preprocessing choices is not presented here, this can be implemented in future work to test the robustness of the topic models.
# Create corpus
nation <- VCorpus(DataframeSource(df.2013))
# Pre-process + generate document-term matrix
nation.dtm <- nation %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeWords, stopwords("english")) %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stemDocument) %>%
tm_map(stripWhitespace) %>%
DocumentTermMatrix() %>%
removeSparseTerms(sparse = 0.99)
Following Blei, Ng and Jordan (2003), I employ latent Dirichlet allocation (LDA) as an unsupervised method for learning the topics associated with the newspaper articles. One advantage that LDA offers over unigram models is that it treats documents as “random mixtures over latent topics, where each topic is characterized by a distribution over words” (996). An important implication of this probabilistic feature is that each document is allowed to represent multiple topics instead of being restricted as discrete entities to a single topic. It must be noted, however, that utilizing LDA for dimensionality reduction entails making a number of modeling assumptions, two of which will be discussed here.
First, LDA assumes “bag-of-words” such that words in a document are treated as exchangeable. Hence, not only are the words themselves treated as infinitely exchangeable across documents and topics, word ordering is also ignored, despite the fact that such ordering may be highly informative and meaningful for the generative process through which words are assigned probability of being associated to different topics.
Second, it assumes that topics are static across time, while the occurrence of topics may vary with time. That is, the number of topics is assumed to be fixed and the parameters which characterize these topics are treated as time-invariant, when in fact topics may emerge and disappear or constitute themselves differently over time. It is important to recognize these limitations because they inform the extent to which the LDA can be relied upon for representing the documents in the corpus. I also considered using structural topic modeling (STM) as an alternative strategy. However, given the lack of available covariates on which to condition the topic model on, LDA is chosen for the study. One disadvantage of this is that the model is likely to be noisy given that informative metadata such as news category is not included as part of the modeling. Given that this problem could be exacerbated in a model that poorly fits the data, it is crucial to determine the optimal number of topics for fitting the LDA.
For LDA, the dimensionality of the Dirichlet distribution, k, is assumed to be known and fixed. Hence, selecting an optimal value of k is critical for obtaining a topic model that best describes the data. Among the possible strategies for determining the optimal parameters for fitting LDA models, I use LDAtuning, an R package that estimates the performance of LDA models using Gibbs sampling and reports the performance based on four different metrics: harmonic means (Griffith and Steyvers 2004), cosine distance (Cao et. al. 2009), Kullback-Leibler divergence (Arun et. al. 2010) and Jensen-Shannon divergence (Deveaud et. al. 2014). Note, however, that this method does not involve validation, cross-validation, or splitting the data into train and test set, which is typically the case for perplexity-based methods. I fit 9 models from 20 to 100 topics, increasing k by 10.
Figure 1 shows the effects on the four metrics when varying parameter k by 10. According to these metrics, the performance of LDA models improves most significantly when moving to 40 and 60 topics. In addition to these objective measures, I also evaluate the models based on their interpretability. Among models with k=20, 30, 40 and 60, the model with 60 topics is least perplexing to the human coder and most informative in terms of generating topic labels that are helpful for making an informed guess as to what each topic is about. Only the findings based on this model is reported in this post.
result <- FindTopicsNumber(
nation.dtm,
topics = seq(from = 20, to = 200, by = 10),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
FindTopicsNumber_plot(result)
Figure 1: LDA Tuning
LDA is performed on the entire corpus comprising of political news from 2013 to 2015 with parameter k = 60. This generates a 60 by 2,946 matrix containing the log odds of word being associated with a given topic, and a 20,703 by 60 matrix containing the probabilities of a document being associated with a given topic.
The objective of the topic modeling is to discover latent topics in the corpus that characterize a broad range of issues that have major political implications such as “Human Rights,” “Corruption,” and “Political Conflict,” in other words, topics that are likely to be affected by or correlated with the transition observed in May 2014. Hence, the topics must be sufficiently fine-grained, capturing beyond linguistic or broad categorical differences, but also general enough to maintain salience over time. Most importantly, the topics should contain high semantic validity such that the relationship between words that are generated by each topic are meaningful.
Table 2 provides a list of topics and keywords associated with each topic. The substantive labels for each topic are provided after reviewing the keywords which best distinguish one topic from another. These keywords are selected and sorted based on their rank as determined by the value of the log odds of word being associated with a given topic. The output of the model is also parsed through R package ldavis, creating an interactive web-based visualization which is accessible below.
### LDA
mod.out.60 <- LDA(nation.dtm, k=60, control = list(seed=6))
class(mod.out.60) <- "LDA"
tidy(mod.out.60)
### Generate topic labels (top 30 terms)
nation.topics <- topics(mod.out.60, 1)
nation.terms <- as.data.frame(terms(mod.out.60, 30), stringsAsFactors = FALSE)
### Table
topic_list <- data.frame(t(nation.terms))
topic_list <- within(topic_list, x <- paste(X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, sep=", ")) %>% select(x)
colnames(topic_list) <- NULL
topic_list <- tibble::rownames_to_column(topic_list)
colnames(topic_list) <- c("Topic", "Keys")
Topic | Keys |
---|---|
Topic 1 | health, hospit, medic, patient, treatment, doctor, diseas, drug, healthcar, peopl, also, medicin, cancer, can, servic |
Topic 2 | rice, farmer, price, tonn, govern, scheme, agricultur, million, export, rubber, said, farm, market, pledg, crop |
Topic 3 | channel, digit, broadcast, licenc, nbtc, servic, oper, auction, telecom, nation, advertis, station, content, telecommun, true |
Topic 4 | myanmar, lao, countri, cambodia, border, yangon, vietnam, mekong, neighbour, foreign, open, local, group, develop, year |
Topic 5 | asean, region, econom, countri, cooper, aec, thailand, communiti, member, trade, also, develop, will, integr, agreement |
Topic 6 | provinc, chiang, water, mai, district, flood, local, river, dam, nakhon, buri, area, rai, mae, thani |
Topic 7 | will, year, new, next, thailand, also, expect, countri, thai, futur, take, come, hope, can, first |
Topic 8 | asia, region, southeast, asian, countri, world, global, vietnam, east, asiapacif, thailand, india, across, singapor, pacif |
Topic 9 | polit, thailand, countri, thai, state, militari, nation, democraci, govern, intern, year, right, power, conflict, peopl |
Topic 10 | media, social, facebook, post, news, onlin, peopl, thai, page, report, messag, websit, inform, also, video |
Topic 11 | meet, said, will, yesterday, month, thailand, day, week, report, two, last, announc, offici, schedul, committe |
Topic 12 | per, cent, year, quarter, growth, increas, expect, month, price, last, averag, rate, first, forecast, drop |
Topic 13 | malaysia, singapor, indonesia, fish, philippin, boat, malaysian, indonesian, sea, ship, vessel, said, jakarta, marin, countri |
Topic 14 | trade, thai, thailand, export, industri, product, invest, countri, said, import, busi, manufactur, year, will, market |
Topic 15 | attack, secur, bomb, kill, incid, death, polic, peopl, fire, accid, violenc, bangkok, injur, also, victim |
Topic 16 | busi, custom, servic, compani, thailand, provid, new, manag, market, oper, growth, store, retail, cloud, industri |
Topic 17 | mobil, use, technolog, internet, servic, devic, onlin, user, phone, data, card, can, app, applic, smart |
Topic 18 | peac, muslim, group, south, talk, thai, insurg, govern, islam, secur, leader, region, offici, deep, local |
Topic 19 | say, like, one, just, time, want, get, peopl, can, work, think, now, make, year, know |
Topic 20 | ministri, said, govern, minist, plan, nation, state, committe, budget, will, polici, financ, propos, offic, privat |
Topic 21 | economi, econom, growth, rate, thailand, invest, year, polici, domest, global, export, govern, bank, polit, market |
Topic 22 | polic, said, arrest, suspect, rohingya, yesterday, investig, thailand, report, two, nation, thai, alleg, found, offici |
Topic 23 | china, chines, south, hong, korea, kong, india, beij, korean, taiwan, japan, indian, thailand, mainland, includ |
Topic 24 | court, right, case, law, legal, nation, rule, public, also, issu, violat, human, alleg, justic, report |
Topic 25 | team, player, thai, footbal, club, game, match, cup, play, nation, coach, fan, leagu, world, thailand |
Topic 26 | film, music, perform, festiv, show, thai, danc, movi, theatr, year, stage, also, audienc, play, song |
Topic 27 | hotel, offer, call, wine, bangkok, room, resort, night, includ, beach, two, book, spa, restaur, ticket |
Topic 28 | tour, said, play, win, golf, round, last, shot, hole, year, open, two, birdi, week, first |
Topic 29 | japan, japanes, tokyo, thailand, thai, yen, visit, also, oversea, bangkok, open, first, year, zone, interest |
Topic 30 | bill, amnesti, law, senat, amend, hous, govern, parliament, draft, legisl, propos, will, pass, thai, debat |
Topic 31 | food, restaur, drink, coffe, tea, can, shop, fresh, fruit, dish, serv, thai, also, open, tast |
Topic 32 | market, product, said, sale, brand, year, compani, thailand, consum, new, thai, will, launch, also, local |
Topic 33 | elect, reform, constitut, polit, vote, parti, charter, member, new, nation, draft, peopl, power, candid, propos |
Topic 34 | billion, year, million, said, last, first, worth, total, month, revenu, target, new, will, plan, next |
Topic 35 | design, use, fashion, can, collect, also, new, colour, bag, brand, look, camera, light, black, featur |
Topic 36 | develop, work, sustain, technolog, communiti, innov, programm, organis, help, research, support, thailand, also, creat, world |
Topic 37 | said, nation, yesterday, peopl, thailand, countri, also, want, problem, help, need, believ, told, presid, meanwhil |
Topic 38 | prayut, minist, militari, prime, general, junta, ncpo, coup, chanocha, order, nation, govern, peac, thailand, countri |
Topic 39 | park, villag, anim, eleph, peopl, live, forest, area, local, nation, communiti, land, water, tree, year |
Topic 40 | project, construct, will, infrastructur, develop, invest, plan, transport, govern, thailand, railway, road, rail, line, train |
Topic 41 | tax, fund, pay, money, incom, million, cost, peopl, financi, will, fee, age, scheme, save, thailand |
Topic 42 | event, thailand, presid, bangkok, award, organis, recent, intern, thai, held, left, campaign, centr, execut, group |
Topic 43 | bank, market, stock, loan, rate, thai, investor, bond, set, fund, index, billion, per, capit, foreign |
Topic 44 | cambodia, sea, disput, cambodian, thailand, territori, court, templ, thai, area, border, minist, maritim, rule, foreign |
Topic 45 | compani, busi, invest, per, cent, group, firm, profit, oper, share, insur, list, execut, manag, revenu |
Topic 46 | govern, countri, corrupt, polici, public, need, reform, thailand, nation, system, sector, problem, must, transpar, develop |
Topic 47 | rate, star, young, girl, famili, direct, woman, mother, thai, also, man, father, english, daughter, boy |
Topic 48 | energi, power, plant, electr, oil, gas, ptt, use, product, generat, natur, produc, suppli, thailand, fuel |
Topic 49 | educ, student, school, univers, children, teacher, learn, studi, thai, teach, languag, institut, also, english, graduat |
Topic 50 | thailand, one, can, peopl, thai, will, like, letter, mani, just, must, even, right, make, countri |
Topic 51 | protest, polit, yingluck, govern, thaksin, shinawatra, democrat, minist, thai, parti, prime, peopl, leader, ralli, support |
Topic 52 | tourism, tourist, travel, thailand, thai, airport, airlin, flight, visitor, year, destin, intern, oper, foreign, number |
Topic 53 | properti, develop, project, land, bangkok, build, area, condominium, market, unit, hous, locat, squar, new, residenti |
Topic 54 | art, king, cultur, thai, artist, royal, exhibit, nation, templ, tradit, paint, museum, monk, majesti, show |
Topic 55 | can, time, may, chang, need, one, howev, even, new, must, mani, still, make, like, now |
Topic 56 | germani, australia, franc, usa, australian, german, yesterday, new, itali, result, group, french, leagu, open, zealand |
Topic 57 | worker, labour, thailand, work, migrant, traffick, illeg, human, report, thai, employ, govern, wage, problem, job |
Topic 58 | depart, thailand, agenc, ministri, foreign, regul, thai, inform, standard, issu, intern, oper, offici, requir, law |
Topic 59 | car, vehicl, model, drive, driver, auto, motor, engin, new, also, toyota, road, honda, offer, bike |
Topic 60 | gold, women, thailand, game, world, win, thai, men, team, final, medal, second, first, point, match |
Table 2: Fifteen most probable words for the topics
### LDAvis https://github.com/cpsievert/LDAvis & https://www.r-bloggers.com/a-link-between-topicmodels-lda-and-ldavis/
topicmodels_json_ldavis <- function(fitted, corpus, doc_term){
## Required packages
library(topicmodels)
library(dplyr)
library(stringi)
library(tm)
library(LDAvis)
## Find required quantities
phi <- posterior(fitted)$terms %>% as.matrix
theta <- posterior(fitted)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
temp <- paste(corpus[[i]]$content, collapse = ' ')
doc_length <- c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(doc_term[1:nrow(doc_term), 1:ncol(doc_term)])
freq_matrix <- data.frame(ST = colnames(temp_frequency),
Freq = colSums(temp_frequency))
rm(temp_frequency)
## Convert to json
json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
vocab = vocab,
doc.length = doc_length,
term.frequency = freq_matrix$Freq)
return(json_lda)
}
nation.jason <- topicmodels_json_ldavis(mod.out.60, nation, nation.dtm)
serVis(nation.jason, out.dir = "~/polisci490/Project")
For the purpose of this study, which concerns the role of regime transition, it is important that we can observe the topic variation over time. To do so, I group the data by topic and month and plot the average probability of topics. Higher values indicate greater association of documents, and words in the documents, with a given topic. For the purpose of this analysis, I select 6 topics which provide insights into the relationship between regime transition and the media (See Figure 3). These include topic 2 (Agriculture), topic 24 (Law), topic 33 (Political Reform), topic 38 (Junta), topic 46 (Government) and topic 51 (Political Conflict).
### Selected Topic Frequencies Over Time
dft.select <- dft %>% select(id, date, "2","24", "33", "38", "46", "51")
M <- gather(dft.select,topic,value,-id,-date) %>%
group_by(topic,month = floor_date(date, "month")) %>%
summarize(value=mean(value))
ggplot(M,aes(x=month,y=value, col=factor(topic, labels=c("Agriculture", "Law", "Political Reform", "Junta", "Government", "Political Conflict")))) +
geom_point() +
geom_line() +
scale_x_date(date_breaks= "1 month", date_labels = "%B-%Y") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
geom_vline(xintercept = as.numeric(as.Date("2014-05-22")), linetype=4) +
#ggtitle("Topic Prevalence") +
labs(color = "Topic")
Figure 3: The dotted vertical line represents the May 2014 coup which removed the Pheu Thai government and installed a military government under the authority of the National Council for Peace and Order (NCPO)
Overall, the Political Conflict topic undergoes the most dramatic shift, peaking in the months leading up to the the coup in May 2014. This represents the height of the conflict during the anti-government and anti-election protest by the People’s Democratic Reform Committee (PDRC) which effectively shut down Bangkok in an attempt to oust the Yingluck government and to boycott the election in February. The salience of this topic declines substantially after the military intervened and established a junta government. In contrast, the Junta topic surges after the coup, signaling the media’s attention to the NCPO and the military’s involvement in politics.
Other notable shifts during this transition of power include: (1) a gradual increase in Political Reform which reflects reform-related activities associated with the National Reform Council, National Legislative Assembly and Constitution Drafting Committee, all of which were appointed by the NCPO after the coup; (2) an overall decrease in the topic of Agriculture, reflecting reduced coverage on the rice-pledging scheme of the Yingluck government; (3) a rise in the topic of Law, denoting heightened coverage of anti-corruption and corruption related activities; (4) an increase in Government which denotes greater coverage of government-related activities.
It is plausible that some of these shifts may be attributable to the regime transition. However, without a reliable control group for comparison, it is difficult to make a causal statement regarding the relationship between the regime change and the topic variation observed. Nevertheless, it is possible to conduct a more fine-grained analysis into each topic in order to obtain potentially meaningful descriptive inferences of how topic prevalence and content vary amidst a regime change. See below for a closer look at the frequencies of words over time in documents classified by the model as being highly associated with Political Conflict (Figure 4) and Junta (Figure 5).
### Attach most associated topic to each document
nation.topics.df <- as.data.frame(nation.topics)
nation.topics.df <- transmute(nation.topics.df, doc_id = rownames(nation.topics.df), topic = nation.topics)
nation.topics.df$doc_id <- as.integer(nation.topics.df$doc_id)
df.out <- inner_join(df.2013, nation.topics.df, by = "doc_id")
### Tidy
tidy_df <- df.out %>%
group_by(topic) %>%
ungroup() %>%
unnest_tokens(word, text)
### Word frequencies for selected topics
tidy_df %>% filter(topic == "51") %>%
group_by(date) %>%
summarize(value=n()) %>%
ggplot(aes(x=date,y=value)) +
geom_bar(stat = "identity", fill="skyblue") +
scale_x_date(date_breaks= "3 month", date_labels = "%B %d, %Y") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
geom_vline(xintercept = as.numeric(as.Date("2014-05-22")), linetype=4) +
ggtitle("The Number of Words Written on the ‘Political Conflict’ Topic Per Day")
tidy_df %>% filter(topic == "38") %>%
group_by(date) %>%
summarize(value=n()) %>%
ggplot(aes(x=date,y=value)) +
geom_bar(stat = "identity", fill="skyblue") +
scale_x_date(date_breaks= "3 month", date_labels = "%B %d, %Y") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
geom_vline(xintercept = as.numeric(as.Date("2014-05-22")), linetype=4) +
ggtitle("The Number of Words Written on the ‘Junta’ Topic Per Day")
Figure 4: The Number of Words Written on the "Political Conflict" Topic Per Day
Figure 5: The Number of Words Written on the "Junta" Topic Per Day
How do sentiments differ prior to and after the coup? Using the bing lexicon which classifies words as positive or negative, I compute unigram sentiment scores for the following topics: Agriculture, Law, Political Reform, Junta, Government and Political Conflict. For corpus-wide sentiments, see Figure (6). In terms of the variation in sentiments over time, the sentiments associated with the six selected topics are on average more positive after the coup than prior to the coup (Figure 7). This is most clearly illustrated in the plot for Agriculture, Political Conflict, Political Reform and Junta. On the other hand, sentiments on Government are more ambiguous, whereas sentiments on Law are consistently negative.
topic.sentiment1 <- tidy_df %>% filter(topic == c("2","24", "33", "38", "46", "51")) %>%
inner_join(get_sentiments("bing")) %>%
count(topic, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
topic.sentiment1 %>%
ggplot(aes(factor(topic, labels=c("Agriculture", "Law", "Political Reform", "Junta", "Government", "Political Conflict")), sentiment)) +
geom_bar(stat="identity", fill="skyblue3") +
ylim(-700, 700) +
xlab("Topic") +
ylab("Sentiments")
Figure 6: Sentiment Scores for Selected Topics
topic.sentiment <- tidy_df %>% filter(topic == c("2","24", "33", "38", "46", "51")) %>%
inner_join(get_sentiments("bing")) %>%
count(topic, date, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
ggplot(topic.sentiment, aes(date, sentiment, fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(~factor(topic, labels=c("Agriculture", "Law", "Political Reform", "Junta", "Government", "Political Conflict")), ncol = 2, scales = "free_x") +
ylab("Sentiments") +
xlab("Date") +
ggtitle("Sentiment Analysis")
Figure 7: Sentiment Variation Over Time
On average, sentiments for the six selected topics are more positive after the coup than prior to the coup. This is most clearly illustrated in the plot for Agriculture, Political Conflict, Political Reform, and Junta. On the other hand, sentiments on Government are more ambiguous, whereas sentiments on Law are consistently negative.
In this part, I focus on two most prominent political actors, the now-exiled former prime minister Yingluck Shinawatra and the current prime minister and NCPO leader Gen. Prayut Chan-O-Cha. First, I examine the topics that are associated with the names of these two political actors. I find that Yingluck, on the one hand, is associated with Political Conflict, Agriculture, Deep South, Junta and Ministry. Prayut, on the other hand, is associated with Junta, Thailand, Political Reform, Ministry, and Government. Similar to the previous section, I also conduct sentiment analysis on the words that are closely related with the occurrence of the names of Yingluck and Prayut. To do so, I break the newspaper articles into sections of 30 words. Then, I conduct a sentiment analysis on the words within sections that mention Prayut and Yingluck. Overall, sentiments associated with Yingluck became more negative after the coup, whereas the result is more ambiguous for sentiments associated with Prayut.
### Sentiment Analysis
# Break doc into sections
section_words <- tidy_df %>%
group_by(doc_id) %>%
mutate(section = row_number() %/% 30) %>%
filter(section > 0) %>%
filter(!word %in% stop_words$word)
# Yingluck
yingluck_sections <- section_words %>%
group_by(doc_id, section) %>%
filter(word=="yingluck") %>%
select(doc_id, section) %>%
inner_join(section_words)
yingluck.sentiments <- yingluck_sections %>%
inner_join(get_sentiments("bing")) %>%
count(date, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
group_by(week = floor_date(date, "week")) %>%
summarize(value=mean(sentiment))
# Prayut
prayut_sections <- section_words %>%
group_by(doc_id, section) %>%
filter(word == "prayut"|word == "prayuth") %>%
select(doc_id, section) %>%
inner_join(section_words)
prayut.sentiments <- prayut_sections %>%
inner_join(get_sentiments("bing")) %>%
count(date, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
group_by(week = floor_date(date, "week")) %>%
summarize(value=mean(sentiment))
# Graph
yingluck.sentiments$label <- "Yingluck"
prayut.sentiments$label <- "Prayut"
yingluck.vs.prayut <- full_join(yingluck.sentiments, prayut.sentiments)
ggplot(yingluck.vs.prayut, aes(week, value, col=label)) +
geom_line() +
theme_classic() +
scale_x_date(date_breaks= "3 month", date_labels = "%b %Y") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
geom_vline(xintercept = as.numeric(as.Date("2014-05-22")), linetype=4) +
geom_hline(yintercept = 0, size=0.1) +
theme(legend.position = "bottom", legend.title=element_blank()) +
xlab("Date") +
ylab("Sentiment") +
scale_color_manual(values=c("#9999CC", "#CC6666"))
Figure 8: Sentiment Scores for Prayut and Yingluck (Weekly Averages)
### Top Topics associated with Yingluck and Prayut
tidy(mod.out.60) %>%
filter(term == "yingluck") %>%
top_n(10, beta) %>%
arrange(-beta) %>%
ggplot(aes(x=reorder(topic, -beta), y=beta)) +
geom_bar(stat="identity", fill="tomato2") +
coord_flip()
tidy(mod.out.60) %>%
filter(term == "prayut"|term == "prayuth") %>%
top_n(20, beta) %>%
arrange(-beta) %>%
ggplot(aes(x=reorder(topic, -beta), y=beta)) +
geom_bar(stat="identity", fill="darkgreen")
### Table
topic_list <- data.frame(t(nation.terms))
topic_list <- within(topic_list, x <- paste(X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, sep=", ")) %>% select(x)
colnames(topic_list) <- NULL
topic_list <- tibble::rownames_to_column(topic_list)
colnames(topic_list) <- c("Topic", "Keys")
yingluck.pattern <- tidy(mod.out.60) %>%
filter(term == "yingluck") %>%
top_n(5, beta) %>%
arrange(-beta) %>%
select(topic)
prayut.pattern <- tidy(mod.out.60) %>%
filter(term == "prayut") %>%
top_n(5, beta) %>%
arrange(-beta) %>%
select(topic)
Topic | Keys |
---|---|
Topic 51 | protest, polit, yingluck, govern, thaksin, shinawatra, democrat, minist, thai, parti |
Topic 2 | rice, farmer, price, tonn, govern, scheme, agricultur, million, export, rubber |
Topic 18 | peac, muslim, group, south, talk, thai, insurg, govern, islam, secur |
Topic 38 | prayut, minist, militari, prime, general, junta, ncpo, coup, chanocha, order |
Topic 20 | ministri, said, govern, minist, plan, nation, state, committe, budget, will |
Top 5 Topics Associated with Yingluck Shinawatra
Topic | Keys |
---|---|
Topic 38 | prayut, minist, militari, prime, general, junta, ncpo, coup, chanocha, order |
Topic 50 | thailand, one, can, peopl, thai, will, like, letter, mani, just |
Topic 33 | elect, reform, constitut, polit, vote, parti, charter, member, new, nation |
Topic 20 | ministri, said, govern, minist, plan, nation, state, committe, budget, will |
Topic 46 | govern, countri, corrupt, polici, public, need, reform, thailand, nation, system |
Top 5 Topics Associated with Prayut Chan-o-cha
To conclude, the study reveals striking differences between newspaper articles that were published prior to and after the military takeover in May 2014 with regards to both topic prevalence and content within the topics identified by the model. However, it must be emphasized that this finding is merely descriptive, not causal. Further work is required in order to falsify or substantiate a causal interpretation of the observed differences which, in theory, could be due to a range of other factors beyond regime transition. Future studies should aim to incorporate a control group that is suitable for comparison and for ruling out potential confounding variables. Furthermore, even when the findings are robust, the difficulty of interpreting the variation in topic prevalence and content remains. Thus, theory development with regards to the relationship between regime transition and the media is necessary to spell out the mechanisms through which changes in political institutions and rules of the game translate into changes in public opinions and sentiments as reflected by the media.
In terms of the methodology, future work should improve upon the current dictionary-based method for document selection in order to reduce the number of false positives and produce a more reliable topic model with greater semantic validity. In addition, the topic model in this study makes modeling assumptions regarding the exchangeability of words and the static nature of topics over time. Implementing a topic model that accounts for the importance of word ordering (Wallach 2006) and the dynamic nature of topics (Blei and Lafferty 2006) would contribute to a more realistic representation of the latent topics and their variation in response to the regime transition. Finally, incorporating other news outlet, in particular those with different political leanings and those that are free from media censorship in Thailand, would offer additional analytical leverage both for inferring whether variation in topics and contents can be attributed to regime transition and for generating externally valid claims regarding the media and regime change.
Arun, Rajkumar, Venkatasubramaniyan Suresh, CE Veni Madhavan, and MN Narasimha Murthy. “On Finding the Natural Number of Topics with Latent Dirichlet Allocation: Some Observations.” In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 391–402. Springer, 2010.
Blei, David M., and John D. Lafferty. “Dynamic Topic Models.” In Proceedings of the 23rd International Conference on Machine Learning, 113–120. ACM, 2006.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” J. Mach. Learn. Res. 3 (March 2003): 993–1022.
Cao, Juan, Tian Xia, Jintao Li, Yongdong Zhang, and Sheng Tang. “A Density-Based Method for Adaptive LDA Model Selection.” Neurocomputing 72, no. 7–9 (2009): 1775–1781.
Denny, Matthew J., and Arthur Spirling. “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It.” Unpublished Manuscript, Dep. Polit. Sci, Stanford Univ and Inst. Quant. Soc. Sci., Harvard Univ. Https://Ssrn. Com/Abstract 2849145 (2017).
Deveaud, Romain, Eric SanJuan, and Patrice Bellot. “Accurate and Effective Latent Concept Modeling for Ad Hoc Information Retrieval.” Document Numérique 17, no. 1 (2014): 61–84.
Griffiths, Thomas L., and Mark Steyvers. “Finding Scientific Topics.” Proceedings of the National Academy of Sciences 101, no. suppl 1 (2004): 5228–5235.
King, Gary, Patrick Lam, and Margaret Roberts. “Computer-Assisted Keyword and Document Set Discovery from Unstructured Text.” American Journal of Political Science, In Press 2017.
Wallach, Hanna M. “Topic Modeling: Beyond Bag-of-Words.” In Proceedings of the 23rd International Conference on Machine Learning, 977–984. ACM, 2006.