Introduction

The main goal of this project was to collect news titles related to the foreign policy dispute surrounding the installation of THAAD, a missile defense system whose deployment announcement on July 8, 2016 triggered a foreign policy crisis between South Korea and China.

This notebook consists of two sections. The first section is a crawler that collected media publications on China found on Naver, the largest web portal in South Korea. The second section analyzes several dimensions of the collected texts: first, a comparison of word counts in news titles published before and after the crisis time point; second, a comparison of news source outlets before and after the crisis time point.

Modules

I used the KoNLPy module to tokenize Korean texts. This module includes an open-source tokenizer named Open Korean Text (Okt), which can extract nouns, phrases, and part-of-speech (POS) tags. KoNLPy requires the JPype package. Documentation for KoNLPy is here (http://konlpy.org/en/latest/) and for JPype here (https://github.com/tcalmant/jpype-py3/).

pip3 install konlpy jpype1 jpype1-py3

OR

git clone https://github.com/tcalmant/jpype-py3.git
cd jpype-py3
python3.7 setup.py install

pip3 install konlpy
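
As a quick check of what the tokenizer produces, here is a minimal sketch (the Korean sample sentence is an illustrative example, not taken from the collected data):

from konlpy.tag import Okt

okt = Okt()
sample = "중국발 미세먼지가 한반도를 덮었다"  # illustrative sentence: "Fine dust from China covered the Korean peninsula"
print(okt.nouns(sample))    # noun tokens only
print(okt.phrases(sample))  # noun phrases
print(okt.pos(sample))      # (token, part-of-speech) pairs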

Section I

Departing from my original plan, which was to collect all news links returned by querying 'China' on Naver, I collected only the titles, publication dates, news sources, and body texts of news published in Naver's news section for the time frame 2015-01-01 to 2017-07-31, which resulted in 24,165 news items. Naver news pages share the same HTML structure and href pattern (http://news.naver.com), which allowed me to collect the fields above. I concluded that this was the better plan because each news outlet has a different HTML structure, and it would have been time consuming to collect the texts and titles of every news item that appears in Naver's search results.

Contribution

I found many code snippets online that crawl Naver news for a query word, but they all had problems. The biggest problem was that although Naver returns a large number of news results, its pagination is capped at 4,000, and this limit is not enough to cover more than one day of news: results for a single day alone can reach the cap. As a result, relying on pagination and a query alone does not allow collection of news across a longer time frame. To fix this, I concatenated each date into the search URL and made the function collect a fixed number of pages per day.

Python Package

The script for crawling Naver news used in this notebook was published as an open-source Python package, navernewscrawler, which can be found here: https://pypi.org/project/navernewscrawler/
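
Assuming the package name on PyPI matches the project page above, it can be installed with:

pip3 install navernewscrawler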

In [1]:
import requests
from bs4 import BeautifulSoup
import json
import re
import sys
import time, random


def get_news(n_url):
    """
    Uses requests to access the URL, parses it with BeautifulSoup, and appends the title,
    date, body text, and company of the linked news item to a list

    Parameters
    ----------
    n_url: an http(s):// URL of a Naver news article

    Returns
    -------
    list
    """
    news_detail = []
    breq = requests.get(n_url)
    bsoup = BeautifulSoup(breq.content, 'html.parser')
    title = bsoup.select('h3#articleTitle')[0].text
    news_detail.append(title)
    pdate = bsoup.select('.t11')[0].get_text()[:11]
    news_detail.append(pdate)
    _text = bsoup.select('#articleBodyContents')[0].get_text().replace('\n', " ")
    btext = _text.replace("// flash 오류를 우회하기 위한 함수 추가 function _flash_removeCallback() {}", "")
    news_detail.append(btext.strip())
    pcompany = bsoup.select('#footer address')[0].a.get_text()
    news_detail.append(pcompany)
    return news_detail

def get_dates():
    """
    Creates a list of dates as strings in 'Y.M.D' format

    Returns
    -------
    list

    Note
    ----
    To change the collection time frame, edit start_date and end_date
    """
    import datetime
    start_date = datetime.date(2018, 12, 26)
    end_date   = datetime.date(2018, 12, 27)
    date_range_list = []
    date_range = [ start_date + datetime.timedelta(n) for n in range(int ((end_date - start_date).days))]
    for date in date_range:
        date_range_list.append(str(date).replace("-","."))
    return date_range_list


def output(query,page,max_page):
    """
    Queries a word and collects the title, date, company, and text of each matching news
    item as a dictionary, appending all dictionaries to a list

    Parameters
    ----------
    query: a string
    page: start page
    max_page: maximum number of result pages to crawl per date

    Returns
    -------
    list of dictionaries with keys title, date, company, text
    """
    news_dicts = []
    # iterate over each date and crawl up to max_page result pages per day
    date_range = get_dates()
    for date in date_range:
        start_page = page
        s_date = date.replace(".","")
        while start_page < max_page:
            url = "https://search.naver.com/search.naver?where=news&query=" + query + "&sort=0&ds=" + date + "&de=" + date + "&nso=so%3Ar%2Cp%3Afrom" + s_date + "to" + s_date + "%2Ca%3A&start=" + str(start_page)
            header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
            req = requests.get(url,headers=header)
            cont = req.content
            soup = BeautifulSoup(cont, 'html.parser')
            for urls in soup.select("._sp_each_url"):
                try:
                    if urls["href"].startswith("https://news.naver.com"):
                        news_detail = get_news(urls["href"])
                        adict = dict()
                        adict["title"] = news_detail[0]
                        adict["date"] = news_detail[1]
                        adict["company"] = news_detail[3]
                        adict["text"] = news_detail[2]
                        news_dicts.append(adict)
                except Exception:
                    # skip items that fail to download or parse
                    continue
            start_page += 10
    return news_dicts
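
A minimal usage sketch of the crawler defined above (the query word, page count, and output filename are illustrative assumptions, not the exact values used to build the dataset analyzed in Section II):

# crawl up to 3 result pages (start=1, 11, 21) for each date returned by get_dates()
articles = output("중국", 1, 30)

# each element of articles is a dict with keys: title, date, company, text
with open("naver_news_sample.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, ensure_ascii=False)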

Section II

I conducted several comparisons of the news published before and after the crisis time point. First, I retrieved word counts for titles published before and after the crisis time point and compared the common and the differing words. Second, I counted news sources and checked whether there is a difference in the outlets that reported on China-related issues before and after the crisis time point.

Cleaning Text and Using KoNLPy Module to Return Word Counts

Before conducting the analysis, I needed to do several things. The first was to convert the string dates in each dictionary back to datetime objects so that texts can be selected within a time frame; the converted date was added as a new key to each news item dictionary. The second was to join the titles or texts into one string so that words could be counted over the entire set of titles or texts for a given period. The last step was to canonicalize the text by removing punctuation, digits, and Latin characters, and tokenizing it with Okt(), the Open Korean Text class, a tokenizer for Korean text in the KoNLPy module.

In [1]:
# Import JSON Collected Using Naver Crawler
import json
filename = '20150101_news_scrape_20170731.json'
with open(filename) as json_data:
    dt = json.load(json_data)

# Make Date Comparable
def ConvertDates(string_date):
    '''
    Convert a string date ("Y-M-D") to a datetime.date object

    Parameters
    ----------
    string_date: a string

    Returns
    -------
    datetime.date object
    '''
    import datetime
    string_date = string_date.replace("-","")
    string_date = string_date.strip()
    date = datetime.datetime.strptime(string_date,"%Y%m%d").date()
    return date

# Add Converted Date as Key
def ChangeDicts(data):
    '''
    Add a 'date_conv' key (a datetime.date object) to each news item dictionary
    '''
    for d in data:
        d['date_conv'] = ConvertDates(d['date'])
    return data

def delete_dup(alist):
    '''
    Remove duplicate items from a list, keeping the last occurrence of each
    '''
    final = [i for n, i in enumerate(alist) if i not in alist[n + 1:]]
    return final

# Create String Before Conducting Word Count
def PreCorpus(data,key,s_date,e_date):
    """
    Join the text stored under `key` for all items whose publication dates fall within [s_date, e_date]

    Parameters
    ----------
    data : a list of dictionaries
    key: the key in the dictionary within the list
    s_date: start date, in string "Y-M-D"
    e_date: end date, in string "Y-M-D"

    Returns
    -------
    a string
    """
    data = ChangeDicts(data)  # adds 'date_conv' keys in place; NewsSourceCounter relies on them
    text_list = []
    for d in data:
        if (d['date_conv'] >= ConvertDates(s_date)) & (d['date_conv'] <= ConvertDates(e_date)):
            text_list.append(d[key])
    all_text = ' '.join(text_list)
    return all_text

class NaverNewsCorpus:
    '''
    Methods
    -------
    __init__: stores the raw text string to be cleaned
    text_cleaning: strips Latin letters and punctuation, returning a list of characters
    getPreWordCorpus: joins the results of text_cleaning back into one string
    getWordCorpus: drops remaining non-alphanumeric characters and digits
    getWordCounts: returns a dictionary of the k most common nouns
    '''
    def __init__(self, text):
        self.text = text
    def text_cleaning(self):
        import re
        result_list = []
        for item in self.text:
            cleaned_text = re.sub('[a-zA-Z]', '', item)
            cleaned_text = re.sub('[\{\}\[\]\/?.,;:|\)*~`!^\-_+<>@\#$%&\\\=\(\'\"...]',
                              '', cleaned_text)
            result_list.append(cleaned_text)
        return result_list
    def getPreWordCorpus(self):
        result=self.text_cleaning()
        doc = (''.join(result))
        return doc
    def getWordCorpus(self):
        doc = self.getPreWordCorpus()
        text = ''.join(c for c in doc if c.isalnum() or c in '+, ')
        text = ''.join([i for i in text if not i.isdigit()])
        return text
    def getWordCounts(self,k):
        from konlpy.tag import Okt
        from collections import Counter
        nouns_tagger = Okt()
        word_corpus = self.getWordCorpus()
        nouns = nouns_tagger.nouns(word_corpus)
        count = Counter(nouns)
        view_count = count.most_common(k)
        return dict(view_count)

# Counts News Sources
def NewsSourceCounter(data,n,s_date,e_date):
    '''
    Count how many articles each news source published within the date range

    Parameters
    ----------
    data: list of dictionaries (must already contain the 'date_conv' key added by ChangeDicts)
    n: number of most common news sources to return
    s_date: start date of news publication
    e_date: end date of news publication

    Returns
    -------
    a dictionary
    '''
    from collections import Counter
    company_list = []
    for d in data:
        if (d['date_conv'] >= ConvertDates(s_date)) & (d['date_conv'] <= ConvertDates(e_date)):
            company_list.append(d['company'])
    count = Counter(company_list)
    view_counts = count.most_common(n)
    return dict(view_counts)


# Output as Word Counts
def output(data,key,s_date,e_date,n):
    """
    Parameters
    ----------
    data: list of dictionaries
    key: the key to retrieve text from (either "title" or "text")
    s_date: start date of news publication
    e_date: end date of news publication
    n: number of most common words to return
    
    Returns
    -------
    a dictionary
    
    """
    text_corpus = PreCorpus(data,key,s_date,e_date)
    corpus = NaverNewsCorpus(text_corpus)
    return corpus.getWordCounts(n)

Analyses

The functions below return, respectively, the words common to the word count dictionaries of the two periods, and the words that appear in one period's dictionary but not the other's.

In [2]:
# Get Common Words in Respective Word Count Results

def GetCommonWords(first,second):
    '''
    Compare two word count dictionaries and return the overlapping words
    with their counts in each period

    Parameters
    ----------
    first: a word count dictionary
    second: a word count dictionary

    Returns
    -------
    a pandas DataFrame
    '''
    import pandas as pd
    common_keys = list(set(first).intersection(set(second)))
    first_new = {}
    second_new = {}
    for k,v in first.items():
        if k in common_keys:
            first_new[k] = v
    for k,v in second.items():
        if k in common_keys:
            second_new[k] = v
    CommonWordsDf = pd.DataFrame({'pre_crisis': pd.Series(first_new), 'post_crisis': pd.Series(second_new)})
    return CommonWordsDf


def GetCounterDifference(first,second):
    '''
    Compare two word count dictionaries and return the difference:
    the keys of the second dictionary that do not appear in the first, with their counts

    Parameters
    ----------
    first: a word count dictionary
    second: a word count dictionary

    Returns
    -------
    a pandas DataFrame
    '''
    import pandas as pd
    counts_difference = set(second) - set(first)
    new = {}
    for k,v in second.items():
        if k in counts_difference:
            new[k] = v
    ContrastWordsDf = pd.DataFrame.from_dict(new, orient='index',columns = ['word_count'])
    return ContrastWordsDf

Results

I compared news titles published between 2015-07-31 and 2016-07-07 with titles published between 2016-07-08 and 2017-07-31. The two periods are split at July 8, 2016, the crisis time point.

In [4]:
word_first = output(dt,'title','2015-07-31','2016-07-07',100)
word_second = output(dt,'title','2016-07-08','2017-07-31',100)

news_first = NewsSourceCounter(dt,30,'2015-07-31','2016-07-07')
news_second = NewsSourceCounter(dt,30,'2016-07-08','2017-07-31')
In [6]:
# Dictionaries to Pandas

import pandas as pd
pre_crisisDf = pd.DataFrame.from_dict(word_first, orient='index',columns = ['word_count'])
post_crisisDf = pd.DataFrame.from_dict(word_second, orient='index',columns = ['word_count'])

# Common Words Df:
CommonWordsDf = GetCommonWords(word_first,word_second)

# Contrast Words Df:
ContrastWordsDf = GetCounterDifference(word_first,word_second)

Common Frequent Words in News Titles Between Pre-Crisis and Post-Crisis Period

To compare the frequency of common words in the news titles, I compared the words ranked 8 through 55 in the returned dictionaries. I skipped the first seven because they were expected words, such as China, Chinese, and North Korea.

In [9]:
subset = CommonWordsDf.iloc[7:55].copy()

eng = {'미국':'USA', '년':'year', '중국산':'made in China', '기업':'enterprise', '제재':'restrain', '수출':'export', '시장':'market', '배치':'install', '불법':'illegal', '발':'start', '중':'middle', '첫':'first',
       '관광객':'tourist', '시진핑':'Xi Jinping', '일':'work', '방문':'visit', '조업':'fishing industry', '정부':'government', '것':'that', '제주':'Jeju Island', '사망':'death', '대북':'North Korea', '등':'etc', '척':'pretend',
       '해경':'maritime police', '세계':'world', '일본':'Japan', '종합':'comprehensive', '국내':'domestic', '월':'month', '투자':'investment', '대만':'Taiwan', '남중국해':'South China Sea', '나포':'seizure', '위':'above', '경제':'economy',
       '관광':'tourism','내':'inside','강화':'reinforce', '미사일':'missile', '검거':'arrest', '대통령':'president', '중국군':'Chinese army','스모그':'smog', '최대':'maximum','판매':'sell','국제':'international',
       '중국어':'Chinese'}
subset.rename(index=eng, inplace=True)
mplot = subset[subset.columns[::-1]].sort_values(['post_crisis'], ascending = True).plot.barh(figsize=(30,50),fontsize=20)

handles, labels = mplot.get_legend_handles_labels()
mplot.legend(handles[::-1], labels[::-1],fontsize = 20, loc='upper right')
Out[9]:
<matplotlib.legend.Legend at 0x11821af98>

Contrasting Frequent Words in News Titles Between Pre-Crisis and Post-Crisis Period

The graph below plots frequent words in post-crisis news headlines that do NOT appear among the pre-crisis title words.

In [107]:
# subset ContrastWordsDf to the top 15
ContrastWordsDf2 = ContrastWordsDf.iloc[:15].copy()
In [109]:
ContrastWordsDf2
Out[109]:
word_count
트럼프 321
미세먼지 128
외교부 122
차이나 119
수입 117
하나 111
압박 109
국제 96
롯데 91
인도 91
중단 85
차 85
홍콩 84
한반도 84
무역 79
In [110]:
eng = {'트럼프':'Trump','미세먼지':'fine dust','외교부':'foreign ministry','차이나':'China','수입':'import','하나':'one','압박':'pressure','국제':'international','롯데':'Lotte','인도':'India','중단':'stop','차':'car','홍콩':'Hong Kong','한반도':'peninsula','무역':'trade'}
In [111]:
ContrastWordsDf2.rename(index=eng, inplace=True)
cplot = ContrastWordsDf2.sort_values(['word_count'], ascending = True).plot.barh(figsize=(20,20),fontsize = 15)
cplot.legend(fontsize = 20, loc='upper right')
Out[111]:
<matplotlib.legend.Legend at 0x10fab5748>

How Frequently Does "Fine Dust" Appear in News Titles Pre and Post-Crisis Period?

The graph above shows, among the most frequent words in the two periods, those that did not appear in news headlines during the pre-crisis period. Some of the words are unrelated to the THAAD crisis and instead reflect world events. The word "Trump" appears often due to his inauguration as president of the United States in early 2017; "India" appeared frequently in association with "China" due to heightened tensions at the India-China border involving the establishment of a missile base there. "Hong Kong" also appeared numerous times in connection with political developments there during the period.

Beyond world events, words related to trade reflect the economic retaliation carried out by China as well as party-level debate over the negative effects of the conflict with China. "Lotte" appeared frequently due to its involvement in providing the site for the missile system; after this was announced, Chinese state media warned directly that Lotte would face repercussions in the Chinese market. The word "trade" reflects party-level debate and media attention on the negative effects of the retaliation on the Korean economy.

The most noticeable word unrelated to the missile issue is 'fine dust' (미세먼지), which refers to fine particulate pollution. In the next section, I examine whether the word appeared more frequently in news titles during the post-crisis period, and whether those headlines attributed blame to China.

The graph below visualizes the frequency of the word 'fine dust' in the pre-crisis versus post-crisis period. The word appeared more than twice as often during the post-crisis period as during the pre-crisis period (144 versus 66 titles).

In [116]:
import datetime
post_fine_dust = []
pre_fine_dust = []
for d in dt:
    # strict inequalities: titles dated exactly 2016-07-08 fall into neither group
    if d['date_conv'] > datetime.date(2016, 7, 8):
        if '미세먼지' in d['title']:
            post_fine_dust.append(d)
    if d['date_conv'] < datetime.date(2016, 7, 8):
        if '미세먼지' in d['title']:
            pre_fine_dust.append(d)

print(len(pre_fine_dust),len(post_fine_dust))
66 144
In [117]:
import numpy as np
import matplotlib.pyplot as plt
 
# Make dataset
height = [144,66]
bars = ('post-crisis','pre-crisis')
y_pos = np.arange(len(bars))
 
# Create horizontal bars
plt.barh(y_pos, height, color=['r', 'g'])  # one color per bar
 
# Create names on the y-axis
plt.yticks(y_pos, bars)
 
# Show graphic
plt.show()

How Frequently Does China Attribution Occur Among Articles that Mention "Fine Dust"?

To check whether articles blaming China increased during the post-crisis period, I checked whether titles containing '중국발' ('[originating] from China'), a phrase attributing the pollution to China, became more common. Of the news items whose titles mention "fine dust", approximately 52% (34 of 66) attributed blame to China during the pre-crisis period, while approximately 62% (89 of 144) did so during the post-crisis period.

In [118]:
post = []
pre = []

for d in pre_fine_dust:
    if '중국발' in d['title']:
        pre.append(d)

for d in post_fine_dust:
    if '중국발' in d['title']:
        post.append(d)
        
print(len(pre),len(post))
34 89
In [119]:
# graphical notation of the percentage above

import numpy as np
import matplotlib.pyplot as plt
 
# dataset
height = [89/144,34/66]
bars = ('post-crisis','pre-crisis')
y_pos = np.arange(len(bars))
 
# Create horizontal bars
plt.barh(y_pos, height, color=['pink', 'darkgrey'])
 
# Create names on the y-axis
plt.yticks(y_pos, bars)
 
# Show graphic
plt.show()

News Source Counter Difference

Below is a graph comparing the top 30 news sources that reported on China during the pre- and post-crisis periods. Newsis reported slightly more on China during the post-crisis period than during the pre-crisis period, but the difference does not seem significant. Overall, there is no significant difference in the frequency of reporting between the pre- and post-crisis periods for the respective news outlets.

In [16]:
news_all = pd.DataFrame({'pre_crisis': pd.Series(news_first), 'post_crisis': pd.Series(news_second)})
news_all.index  # inspect the source names to build the translation map below
eng = {'JTBC':'JTBC', 'KBS':'KBS', 'MBC':'MBC', 'MBN':'MBN', 'MoneyToday':'MoneyToday', 'SBS & SBSi ':'SBS & SBSi ', 'SBS CNBC':'SBS CNBC','The Internet Hankyoreh':'The Internet Hankyoreh', 'YTN':'YTN', 'edaily':'edaily', 'financial news':'financial news',
       'media KHAN':'media KHAN', '국민일보':'Kookmin Ilbo', '노컷뉴스':'Nocut News', '뉴스1':'News1', '뉴시스':'Newsis', '디지털타임스':'Digital Times', '매경닷컴':'Maekyung Dotcom', '부산일보':'Busan Ilbo',
       '서울경제':'Seoul Economy', '서울신문':'Seoul Journal', '세계닷컴':'World Dotcom', '아시아경제신문':'Asia Economy Journal', '연합뉴스':'Yeonhap News', '연합뉴스TV':'Yeonhap News Tv', '조선비즈':'Chosun Biz', '채널A':'Channel A',
       '한경닷컴':'Hankyung Dotcom', '한국경제TV':'Korean Economy TV', '한국일보':'Hankook Ilbo', '헤럴드경제':'Herald Economy'}
news_all.rename(index=eng, inplace=True)

fplot = news_all[news_all.columns[::-1]].sort_values(['post_crisis'], ascending = True).plot.barh(figsize=(30,50),fontsize=20)

handles, labels = fplot.get_legend_handles_labels()
fplot.legend(handles[::-1], labels[::-1],fontsize = 20, loc='upper right')
Out[16]:
<matplotlib.legend.Legend at 0x13febcef0>