So, I made a Twitter bot. Yes, another Twitter bot, another data science kid doing sentiment analysis for their final. But this project ends in a fully automated livestream that uses that same sentiment analysis to start and manage conversations with Twitter users. A little data-science-meets-the-app-world, if you will. This project in its entirety began as my capstone at Flatiron School, but I’ve already got some updates coming. This blog will address what got us here.
If you’ve played any table-top roleplaying games (TTRPGs), it’s likely they were built around one of two fundamental dice mechanics: you were rolling a d20 (a 20-sided die) or a d100 (100-sided, you get it). If you were rolling that sweet, sweet d20, it was probably in the context of Dungeons & Dragons, the world’s foremost roleplaying game. If it was a d100, well, my guess is gonna be less accurate, but I’d wager it was either Call of Cthulhu or Mythras. At least, those are the two d100 systems I’ve rolled with.
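To make the two mechanics concrete, here’s a minimal sketch in plain Python. The modifier, target number, and skill rating are placeholders of my own for illustration, not any particular game’s official rules:

```python
import random

def roll(sides: int) -> int:
    """Roll a single die with the given number of sides."""
    return random.randint(1, sides)

# d20 mechanic (D&D-style): roll high, add a modifier, try to beat a target.
# The +3 modifier and target of 15 are made-up example values.
check = roll(20) + 3
print("d20 check:", "success" if check >= 15 else "failure")

# d100 mechanic (Call of Cthulhu / Mythras style): roll low,
# trying to land at or under your percentile skill rating.
skill = 45  # a hypothetical 45% skill
print("d100 check:", "success" if roll(100) <= skill else "failure")
```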
I just finished a project on natural language processing (NLP) that looked at predicting the emotion of a tweet from its word choice. Three labels were available: positive, negative, and no emotion. I trained the neural network on about nine thousand tweets, and the results were just alright, with a validation accuracy shy of a measly 80%. Towards the end of the project (the time when this always tends to happen), I had an insight: I had probably scrubbed my data too thoroughly. Or at least, the data was missing something vital: emojis.
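To show what I mean by over-scrubbing, here’s a minimal sketch of the kind of regex cleaning step I’m describing. The example tweet and the patterns are illustrative, not my actual preprocessing code:

```python
import re

tweet = "new phone who dis 😍🔥 #blessed"

# A typical aggressive scrub: keep only ASCII letters and whitespace.
# This silently deletes emojis right along with punctuation and hashtags.
aggressive = re.sub(r"[^a-zA-Z\s]", "", tweet)
print(aggressive)  # -> "new phone who dis  blessed"

# A gentler pass that also whitelists the main emoji Unicode blocks,
# so the model still sees some of the strongest sentiment signal in the tweet.
keep_emoji = re.sub(r"[^a-zA-Z\s\U0001F300-\U0001FAFF\u2600-\u27BF]", "", tweet)
print(keep_emoji)  # -> "new phone who dis 😍🔥 blessed"
```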
Have you ever dug a well? I sure haven’t. And consequently, I probably couldn’t tell you when one was broken. Or about to break. Or fully functional for that matter.
While there’s much to be said about how socially problematic machine learning models can be when used thoughtlessly (such as these examples of systematically mis-diagnosing and failing to provide medical service to Black folks and failing to represent women in the hiring process), this little post is only going to tackle a specific instance of the problem. A machine learning model I developed and recently used to assess this data from Kaggle (which contains information on homes sold in and around Seattle from May 2014 to May 2015) reflected a racial bias present in the city’s contemporary housing market. In other words, the linear regression model I built had the goal of recommending cheap homes for first-time buyers, and a large portion of those homes clustered in areas that were predominantly Black. These areas were: