Big Brother is Watching You… So join in.
This is the first in a series I’m doing on Sentiment Analysis (SA) and working with Python. It’s everywhere at the moment: the National Security Agency is watching us online. Why are they watching? How might they do it? And can we do it?
Part 1 is the introduction. Ultimately we’ll be creating a Python program to classify documents (strings), that is, to determine the sentiment behind opinions. We’ll also be creating a simple web interface that interacts with our Python program through the Web Server Gateway Interface (WSGI). This time, though, we’ll be covering some of the terminology and introducing the tools and techniques.
The end goal could be used to monitor social media, such as Twitter, for phrases or patterns and determine whether the sentiment is positive or negative. A similar technique is used by IBM and the FBI (see “FBI Monitoring Social Media”).
Interest in online sentiment analysis has increased dramatically in recent years, largely due to social media and its impact. The microblogging service Twitter lets users share their thoughts at any given moment, and most common mobile devices and applications integrate the service, making it even easier to do so. With over 500 million accounts and over 400 million tweets per day, the potential is great.
Knowing and understanding the sentiments behind other people’s opinions is of high interest to many groups, including corporate entities, research institutions, government agencies and individuals. This information can prove valuable for marketing campaigns, whether political or consumer based, for identifying potential trends, positive or negative, and even for identifying threats.
In recent years, with the advent of low-cost computing power, complex computational models and analyses have become possible in areas such as Natural Language Processing (NLP), computational linguistics and text analytics.
Using computers to parse text in order to identify, classify and measure the strength of an opinion has benefited greatly from these advances. This has come to be termed Sentiment Analysis (SA), a multifaceted and multidisciplinary problem.
The main areas of interest within SA are:
- Object identification
- Orientation classification
- Feature extraction
Machine Learning and Naïve Bayes Classification
In our example we will be implementing a Naïve Bayes classifier, as it’s generally quite common and loved in some circles. Here we’ll cover the very basics of what it is, and what it can do for us.
Machine learning (ML) is a branch of Artificial Intelligence (AI); at its core, ML uses algorithms that allow the computer to learn. The algorithm analyses data and learns its characteristics, allowing it to identify similar properties in other pieces of data that were not part of its learning dataset.
So essentially we feed the computer a dataset (I’ll be using tweets, but we’ll cover this later) and it will look to identify patterns and linguistic usage, thus allowing it to identify similar traits in documents it has not seen before. Clever, eh?
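To make “learning from a dataset” a little more concrete, here is a minimal sketch in plain Python. It is not the classifier we will build later, and it is not naïve Bayes: the toy tweets, the labels and the function names train and guess are all made up for illustration. It simply counts which words appear under each label, then scores an unseen tweet by how often its words were seen under each label.

```python
from collections import defaultdict

# Hypothetical, hand-labelled toy dataset (not real tweets).
TRAINING_TWEETS = [
    ("i love this phone", "positive"),
    ("what a lovely day", "positive"),
    ("i love sunny weather", "positive"),
    ("i hate this traffic", "negative"),
    ("what a terrible day", "negative"),
    ("this phone is terrible", "negative"),
]


def train(labelled_tweets):
    """Count how often each word appears under each label."""
    counts = defaultdict(lambda: defaultdict(int))
    for text, label in labelled_tweets:
        for word in text.lower().split():
            counts[label][word] += 1
    return counts


def guess(counts, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {}
    for label, word_counts in counts.items():
        scores[label] = sum(word_counts.get(word, 0) for word in words)
    return max(scores, key=scores.get)


model = train(TRAINING_TWEETS)
print(guess(model, "i love this weather"))
print(guess(model, "terrible traffic today"))
```

Neither test tweet appears in the training data, yet the word counts are enough to label them. A real classifier weighs the evidence more carefully, which is where Bayes comes in.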
Naïve Bayes is probability based. Bayesian probability can be seen as an extension of propositional logic and takes its name from the work of Thomas Bayes. A Bayesian classifier works by identifying common attributes or patterns and presenting a classification based upon how likely that classification is given what it has seen. I recently worked on a project for identifying sarcastic utterances in social media. One of the classifiers implemented for testing was a naïve Bayes, so I’ll explain briefly how it works if we were looking to identify a sarcastic tweet.
A basic implementation of the equation is:
• P(sarcasm|“great”) = x%
This equation translates into: what is the probability of sarcasm given that “great” is present? So if “great” were used in a tweet, what is the chance that the tweet is sarcastic? Conditional probability notation reads right to left, so “great” is taken as true, and the probability of sarcasm is conditioned on “great” being true.
In order for Bayes to work, it needs a training set of pre-classified tweets. Taking the example further, consider a supplied dataset of tweets:
Suppose 40% of tweets are sarcastic and 60% are not, 30% of the sarcastic tweets contain “great”, and 10% of the non-sarcastic tweets also contain “great”. What is the probability that a tweet containing “great” is sarcastic? In probability notation this reads:
- P(sarcastic) = 40%
- P(“great”|sarcastic) = 30%
- P(“great”|~sarcastic) = 10%
- P(sarcastic|“great”) = ?%
(~ is shorthand for “not”.)
Working this through: 40% × 30% = 12% of all tweets are sarcastic and contain “great”, while 60% × 10% = 6% are not sarcastic and contain “great”. A total of 18% of tweets therefore contain “great”, meaning the probability that a tweet is sarcastic given that it contains “great” is 12/18, or approximately 67%.
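The same arithmetic can be sketched in a few lines of Python. The figures are the ones from the example; the function name posterior and the variable names are my own, chosen for illustration. It applies Bayes’ theorem for a single word: the share of tweets that have both the class and the word, divided by the share of all tweets that have the word.

```python
# Figures taken from the worked example.
p_sarcastic = 0.40              # P(sarcastic)
p_great_given_sarcastic = 0.30  # P("great"|sarcastic)
p_great_given_not = 0.10        # P("great"|~sarcastic)


def posterior(prior, likelihood, likelihood_not):
    """Return P(class|word) using Bayes' theorem."""
    both = prior * likelihood                    # class and word together
    word_only = (1.0 - prior) * likelihood_not   # word without the class
    return both / (both + word_only)


p = posterior(p_sarcastic, p_great_given_sarcastic, p_great_given_not)
print("P(sarcastic|great) = %.1f%%" % (p * 100))
```

Changing the three inputs is a quick way to get a feel for how the prior and the two likelihoods pull the result around.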
Don’t worry if this doesn’t make a whole lot of sense right now.
Python 2.6.6 has been chosen mainly because NLTK and the libraries used alongside it support it; Python 3 may bring enhancements, but a possible lack of support in several libraries is an issue. This was also the version installed by Debian. Newer versions should be fine, though NLTK 3.0 (which targets Python 3) is still in alpha.
Python does provide benefits regardless: it has very good file parsing abilities and, being a high-level language, allows for rapid deployment of a prototype system with a relatively shallow learning curve.
Python does have its drawbacks in speed, which may become more of an issue when dealing with large datasets. Being interpreted, as opposed to compiled like lower-level languages such as C, Python can suffer from the overhead of its bytecode interpreter. PyPy, an alternative Python implementation with a Just In Time (JIT) compiler, is one possibility for increasing speed and can give other benefits, namely security. These options were not explored due to the use of a relatively small dataset.
Natural Language Toolkit
The Natural Language Toolkit (NLTK) is a Python library with many modules for Natural Language Processing (NLP). NLTK has been used to implement the naïve Bayes classifier. NLTK allows for Part of Speech (PoS) tagging, tokenisation and many more common NLP tasks, including other classifiers such as Maximum Entropy.
Attempting the same tasks without NLTK is possible, though NLTK reduces deployment time considerably. NLTK is very simple to install and is available from nltk.org, with simple instructions at nltk.org/install.html.
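As a small taste of how little code NLTK needs, here is a hedged sketch of its built-in naïve Bayes classifier, assuming NLTK is already installed. The training data is invented and the feature name contains_great is my own, not something NLTK defines. Each training item is a pair of a feature dictionary and a label, which is the shape that NaiveBayesClassifier.train expects.

```python
import nltk

# Hypothetical training data: each item is (feature dict, label).
# "contains_great" is a made-up feature name, not an NLTK built-in.
training_set = (
    [({"contains_great": True}, "sarcastic")] * 3
    + [({"contains_great": False}, "sarcastic")] * 1
    + [({"contains_great": True}, "not_sarcastic")] * 1
    + [({"contains_great": False}, "not_sarcastic")] * 5
)

# Train NLTK's built-in naive Bayes classifier on the labelled examples.
classifier = nltk.NaiveBayesClassifier.train(training_set)

# Ask the classifier for the most likely label of each feature dict.
with_great = classifier.classify({"contains_great": True})
without_great = classifier.classify({"contains_great": False})
print(with_great)
print(without_great)
```

With a single yes/no feature this is far too crude for real tweets; the feature extraction that makes it useful is a topic for later in the series.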
At this stage we want to make sure Python is running and NLTK can successfully be imported. You could do this via the Python interactive interpreter, started with the command ‘python’. However, I won’t be, so use any editor you’re comfortable with. I’m on OS X, so I’ll be using Coda 2. Save the following as tnt.py:
# Program to test NLTK and Python for TNT
# sydneydarnay 2013
import nltk

hw_token = nltk.word_tokenize('Hello World!')
print(hw_token)
The above code takes our input, in this case Hello World! (hence the hw in the variable name), and tokenizes it. For now we are splitting this string into tokens for no other reason than to test NLTK. Run the program (python tnt.py) and you should see a Python list containing the tokens:
['Hello', 'World', '!']
Okay, if all went well that should have worked for you, so you have Python working nicely with NLTK.
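If something went wrong instead, a quick check along these lines can help narrow down whether the problem is Python itself or the NLTK install. This is only a sketch: the function name check_environment is my own, and it reports nothing more than the interpreter version and whether nltk can be imported.

```python
import sys


def check_environment():
    """Report the Python version and whether NLTK can be imported."""
    python_version = "%d.%d.%d" % sys.version_info[:3]
    try:
        import nltk
        nltk_version = getattr(nltk, "__version__", "unknown")
    except ImportError:
        nltk_version = None
    return python_version, nltk_version


python_version, nltk_version = check_environment()
print("Python %s" % python_version)
if nltk_version is None:
    print("NLTK is not installed; see nltk.org/install.html")
else:
    print("NLTK %s" % nltk_version)
```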
In part 2 we will jump into the programming of our first Bayes program, including a classification of our input. Hopefully part 1 has introduced the topic and given you the requirements to continue. At this stage all that is required is a development environment running Python and NLTK. Other modules and libraries will be discussed during their implementation.