Over the summer, blogger burrsettles published a graph entitled On Geek versus Nerd. Based on mining tweets, it shows which words people associate with Geeks and which words are associated with Nerds. I thought I’d have a play with doing an audio equivalent, and hence this is about Headphones vs Loudspeakers.
What I did
Between July and December 2013, every week or so I used the Twitter Search API to retrieve tweets from the previous 7 days that included either the word ‘loudspeaker’ or the word ‘headphone’. In the end I had 17838 headphone tweets and 7050 loudspeaker tweets. I separated out all the other words mentioned in the tweets and cleaned them up, e.g. collating synonyms like “amplifier” and “amp”. I also removed rare words, those that appeared in fewer than 0.1% of tweets. This left me with 857 words.
I calculated the probability of each word appearing in tweets containing ‘loudspeaker’, and likewise for tweets containing ‘headphone’, then plotted the two probabilities against each other. This graph was difficult to read, with too many words sitting on top of each other, so I produced a transformed infograph based on the probabilities. You’ll have to click on it to enlarge it.
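As a rough illustration of the probability step, here is a minimal Python sketch. It is not the MATLAB code used for the post. It assumes that "probability" means the fraction of tweets in a set that contain the word at least once, and the function name `word_probabilities` and the toy tweets are made up for the example.

```python
from collections import Counter

def word_probabilities(tweets):
    """Fraction of tweets in which each word appears at least once.

    `tweets` is a list of tweets, each already split into cleaned words.
    """
    counts = Counter()
    for words in tweets:
        counts.update(set(words))  # count a word once per tweet
    n = len(tweets)
    return {word: c / n for word, c in counts.items()}

# Toy, made-up data standing in for the two sets of mined tweets.
headphone_tweets = [["music", "earbud"], ["music", "sound"], ["sound"], ["music"]]
loudspeaker_tweets = [["stadium", "sound"], ["sound"]]

p_head = word_probabilities(headphone_tweets)
p_loud = word_probabilities(loudspeaker_tweets)

# One (x, y) point per word: x = headphone probability, y = loudspeaker probability.
vocab = sorted(set(p_head) | set(p_loud))
points = {w: (p_head.get(w, 0.0), p_loud.get(w, 0.0)) for w in vocab}
```

Plotting `points` as a scatter gives the raw graph described above, where words used equally often in both sets fall on the diagonal.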
How to read the graph:
- To the right are words that are more commonly used in the tweets, e.g. “audio”, “music” and “sound”
- To the left are words less commonly used in the tweets, e.g. “luck” and “child”
- The dashed line indicates words used equally often in tweets with “headphone” and “loudspeaker”
- Towards the top are words more often exclusively paired with “loudspeaker”, for example “mosque” and “stadium”.
- Towards the bottom are words more often exclusively paired with “headphone”, e.g. “headband” and “earbud”.
For me, the main thing this graph illustrates is the trend towards headphones being fashion items: many of the tweets about headphones are to do with marketing and include prices and brand names such as “dre”, “sennheiser”, “skullcandy”, “mac” and “philips”. The only brand I’ve spotted so far in the loudspeaker half of the plot is “Walmart”.
What can you see in the infograph? Feel free to comment below if you spot anything I can look into further.
Next I used a machine learning algorithm on Easy Text Classification to look at the sentiments portrayed by each of the tweets. This great tool predicts whether a tweet is positive, neutral or negative. So for example this tweet: “finally new headphones #yes #headphones #music #life #pink #great #fun #funny #abouttime #awesome” was classified as being positive, whereas this one: “people who knot headphone and charger cables should just die #annoying” was classified as being negative. Overall, there were more positive tweets about headphones (25%) than about loudspeakers (20%). Maybe a sign of more marketing tweets about headphones? Or do people tweet more positive things about fashion brands they buy into? What do you think?
More detail on the method
The tweets were mined in MATLAB using Twitty. I looked for plurals of the keywords and also searched with and without a hash e.g. ‘#headphone’ and ‘headphone’. Retweets and duplicates were removed. When I cleaned up the list of words, this is what I did:
- I ignored short words.
- I removed all punctuation e.g. “don’t” became “dont”.
- I removed hashes e.g. “#music” became “music”.
- Using a lexicon of common English words I removed uninteresting common words e.g. “the”.
- I removed URLs and any other stray characters, except simple smileys.
- I removed any word that didn’t appear in more than 0.1% of tweets.
- I removed all words that were just digits.
- Using a list of synonyms, I reduced the number of words e.g. “amplifier” became “amp”
- I collated together plurals and singular words e.g. “amps” and “amp”, being careful not to change meaning e.g. “beat” and “beats”
- I collated together the same word but different tenses e.g. “pass” and “passed”
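Several of the clean-up steps above can be sketched in a few lines of Python. This is only an illustration, not the MATLAB code used for the post: the stop-word set, the synonym map and the three-letter cutoff for "short" words are all made-up assumptions, and the sketch skips the smiley handling and the merging of tenses.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the real lexicon and synonym list.
STOPWORDS = {"the", "and", "with", "this", "that"}
SYNONYMS = {"amplifier": "amp", "amps": "amp"}
MIN_LENGTH = 3  # assumed cutoff for "short words"

URL_RE = re.compile(r"https?://\S+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")

def clean_tweet(text):
    """Lower-case a tweet and return its cleaned list of words."""
    text = URL_RE.sub(" ", text.lower())   # drop URLs
    text = re.sub(r"['’]", "", text)       # "don't" -> "dont"
    text = NON_WORD_RE.sub(" ", text)      # drop hashes and other punctuation
    words = []
    for word in text.split():
        if len(word) < MIN_LENGTH or word.isdigit() or word in STOPWORDS:
            continue
        words.append(SYNONYMS.get(word, word))  # collate synonyms and plurals
    return words

def drop_rare_words(tweets, min_fraction=0.001):
    """Keep only words that appear in more than `min_fraction` of tweets."""
    counts = Counter(w for words in tweets for w in set(words))
    threshold = min_fraction * len(tweets)
    return [[w for w in words if counts[w] > threshold] for words in tweets]
```

For example, `clean_tweet("Don't knot the #music amplifier cables!! http://t.co/abc 2013")` returns `["dont", "knot", "music", "amp", "cables"]`.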
It might have been better to have used pointwise mutual information as was used in Geek vs Nerd. But I didn’t know the background probability of the words in Twitter. The infograph uses a rank order of the probabilities to space out the words onto the image so that words don’t overlap. In the centre of the graph, there were just too many words to be placed on one single line so the vertical difference isn’t significant between adjacent lines near the dashed line.
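To give a feel for what spacing words by rank order means, here is a small Python sketch. The exact transform behind the infograph is not spelled out here, so the choices below are assumptions: horizontal position comes from the rank of a word's combined probability, vertical position from the rank of the difference between its loudspeaker and headphone probabilities, and the numbers are invented.

```python
def rank_scale(scores):
    """Map each word to an evenly spaced position in [0, 1] by rank of its score."""
    # Ties are broken alphabetically so the ordering is deterministic.
    ordered = sorted(scores, key=lambda w: (scores[w], w))
    last = max(len(ordered) - 1, 1)
    return {w: i / last for i, w in enumerate(ordered)}

# Made-up probabilities: word -> (p_headphone, p_loudspeaker).
probs = {
    "music": (0.30, 0.20),
    "earbud": (0.05, 0.00),
    "stadium": (0.00, 0.04),
    "sound": (0.10, 0.10),
}

# Assumed axes: x = overall commonness, y = how far the word leans to loudspeaker.
x = rank_scale({w: ph + pl for w, (ph, pl) in probs.items()})
y = rank_scale({w: pl - ph for w, (ph, pl) in probs.items()})
```

Because only the ordering survives, words end up evenly spaced rather than piled on top of each other, but the distance between neighbours no longer reflects how different their probabilities are. That is the same caveat as for the lines near the dashed line in the infograph.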