Text Analysis and Sentiment Detection Algorithms Extended for Chinese Language Processing.
Over the last few months we have worked intensively on adding a whole new region and language to our Natural Language Processing (NLP) engine. Having already covered English and German, we can now proudly announce that we have extended our text analysis algorithms to the Chinese language (Mandarin, written in both traditional and simplified characters). This opens up a new dimension of social media and big data analytics of communication about financial markets. People in Asia, and in China in particular, are widely known to be heavy users of digital and social networks. On average, they are also much more active on the stock markets than people in Europe and, to some extent, in the U.S.
Monitoring Chinese Social Media for 3,000 Chinese Securities
We monitor more than 3,000 securities from the Shanghai and Shenzhen stock exchanges. This includes the major indices of these exchanges, for example the Shanghai Composite Index, the Shanghai Stock Exchange 50, and the Shenzhen Component Index, as well as several hundred further indices from these regions, among them many sector indices. Additionally, we monitor all major U.S. stocks and other important financial instruments (e.g. global indices, gold, oil) on Chinese message boards. Apple and Tesla, for example, are of great interest to Chinese traders and receive a lot of attention in online discussions.
Challenges of Chinese Language Processing
One of the major challenges was understanding how Chinese differs from English and German. Without a native speaker trained in the specific context of analyzing social media content about financial markets, this would clearly not have been possible.
Advantage for the Computer
One helpful aspect is the standardization of Chinese company names. People usually refer to a company by a standardized abbreviation consisting of four characters. Some names are shorter and a few are longer, but most company names follow this structure.
格力电器 : Gree Electric Appliances
In an English text, people might write “Gree”, “Gree Electric”, the entire name, or some other variation when they talk about the company. In a Chinese text or comment that is stock market related and discusses “Gree Electric Appliances”, you will always find this string of four characters: 格力电器.
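Because the short name is a fixed string, spotting a company in a message can be reduced to a plain substring lookup. The following is a minimal sketch of that idea, not the actual engine: the lookup table, the function name count_mentions, and the sample messages are hypothetical and made up for illustration. Only the 格力电器 entry comes from this post.

```python
from collections import Counter

# Hypothetical lookup table: standardized Chinese short name -> English name.
# Only the Gree entry is taken from the post; a real table would hold
# one entry per monitored security.
COMPANY_NAMES = {
    "格力电器": "Gree Electric Appliances",
}


def count_mentions(messages, names=COMPANY_NAMES):
    """Count how many messages mention each company by its standardized short name."""
    counts = Counter()
    for text in messages:
        for short_name, english_name in names.items():
            # Exact substring match: no tokenization or fuzzy matching needed,
            # because the short name is written the same way every time.
            if short_name in text:
                counts[english_name] += 1
    return counts


# Illustrative messages, invented for this example.
messages = [
    "格力电器今天涨停了",
    "我看好格力电器的长期表现",
    "今天大盘震荡",
]

print(count_mentions(messages))
```

An English-language equivalent would need a list of aliases per company (“Gree”, “Gree Electric”, and so on), which is exactly the extra work the standardized Chinese names avoid.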
No Twitter – Far better
Twitter does not play an important role in China. The landscape of social networks is more fragmented in the Chinese online space. There is no single dominant hub comparable to Twitter in the Western online sphere.
The following list shows some highly frequented stock message boards we are also monitoring:
Some news items on xueqiu.com are read by 30 to 40 million people within just a few days. Almost every news item on the front page of the website is read by more than 10 million people. See below a current screenshot of the xueqiu.com homepage from April 27th, 2017.
Our NLP engine was able to process the full history of messages and comments from these websites. This is important because statistically significant results can only be drawn from long historical data series. Luckily, we were able to produce these series, which will be very helpful for analyzing historical patterns and finding correlations between communication data and stock prices.
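One simple way to look for such a relationship is a Pearson correlation between a daily communication series and a daily price series. The sketch below is an assumption about how that could be done, not a description of the analysis behind this post. The function name pearson and both sample series are placeholders with invented numbers, not real data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length and at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Placeholder series, invented for illustration only:
# daily number of messages about one stock and that stock's daily return in percent.
daily_message_counts = [120, 340, 90, 410, 150, 380]
daily_returns_pct = [0.2, 1.1, -0.4, 1.6, 0.1, 0.9]

print(round(pearson(daily_message_counts, daily_returns_pct), 3))
```

With only a handful of points, a coefficient like this says very little, which is why long historical series matter: the more days the series cover, the more reliable the estimate becomes.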
We have the feeling that we have just started to scratch the surface…
Dataset Featured in this Post:
China 50 Top Stocks
Blue Chip Companies