Twitter-Data-Analysis-on-COVID19-using-Hadoop-Flume-Hive-and-Spark

Project done by: Rakesh Nagaraju, Raj Maharjan, Vy Tran, as part of the CS257 Database System Principles project, SJSU.

This project aims to use the Hadoop framework to analyze unstructured data obtained from Twitter and to perform sentiment and trend analysis on the keyword "COVID19" using Hive on MapReduce and Spark. We then compare the Hive and Spark approaches to determine which performs better.

Social media has changed the way people get updates about existing or new information by providing real-time data. Twitter has more than 330 million active users and receives millions of tweets every day. Because of this volume, analyzing tweets in real time to find trends or patterns is an interesting but challenging problem. The main objective of this project is to collect Twitter data and use it to find out people's opinions and views on a wide range of topics. In addition, we analyze each user's tweets to get an opinion (positive, negative, or neutral) on a topic. This is done using Hadoop concepts, Flume, Hive, and Spark. Hadoop is well suited to this kind of analysis because it works with different types of data, such as streaming data or distributed big data.

In this project, we extract data from Twitter and then perform sentiment and trend analysis on it. We evaluate the popular hashtags related to COVID19 that are currently trending. To achieve this, we plan to use Apache Flume for data extraction, HDFS to store the data, and Hive and Spark to analyze it. After this, we also compare the performance of both approaches.

By analyzing tweets, sentiment, and hashtags, we can understand people's views on a specific topic of interest. However, the amount of data that has to be processed and stored in a database is challenging, and analyzing that data for sentiment with acceptable performance is challenging as well. The direct beneficiary of this project is any Twitter account holder who wants to analyze other people's sentiment, or trends, on any topic, whether or not it relates to COVID19.

Sentiment analysis is the process of determining whether the attitude of the masses toward a subject of interest is positive, negative, or neutral. First, using Hive, we move the unstructured data obtained in JSON format into structured Hive tables. Then, we execute queries to create multiple views by combining the data with the dictionary words to determine each tweet's polarity.

Positive: the tweet mentions positive connotations or has a positive/happy attitude. If the tweet includes more than one sentiment, the positive sentiment has to be the more dominant one.

Negative: the tweet mentions negative connotations or has a negative/sad attitude. If the tweet includes more than one sentiment, the negative sentiment has to be the more dominant one.
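The dictionary-based polarity rule and the hashtag trend count described in this write-up can be illustrated with a minimal plain-Python sketch. It is not the project's Hive or Spark code. The tiny `SENTIMENT_DICTIONARY`, the sample tweets, and the function names `tweet_polarity` and `top_hashtags` are all hypothetical and exist only for illustration; the sketch also assumes each tweet is a JSON object with a `text` field, and it treats a tie between positive and negative words as neutral.

```python
import json
import re
from collections import Counter

# Hypothetical mini-dictionary: word -> polarity weight.
# A real run would load a full sentiment word list instead.
SENTIMENT_DICTIONARY = {
    "good": 1, "happy": 1, "recovered": 1,
    "bad": -1, "sad": -1, "worried": -1,
}


def tweet_polarity(text, dictionary=SENTIMENT_DICTIONARY):
    """Label a tweet by whichever sentiment dominates its dictionary words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(dictionary.get(word, 0) for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # no dictionary words, or an exact tie (assumption)


def top_hashtags(texts, n=3):
    """Count hashtags case-insensitively and return the n most frequent."""
    counts = Counter(
        tag.lower() for text in texts for tag in re.findall(r"#(\w+)", text)
    )
    return counts.most_common(n)


# Made-up sample records standing in for tweets collected as JSON.
raw_tweets = [
    '{"text": "Feeling happy, my friend recovered! #COVID19"}',
    '{"text": "So sad and worried about the news #covid19 #lockdown"}',
    '{"text": "Press briefing at noon #COVID19"}',
]

texts = [json.loads(line)["text"] for line in raw_tweets]
labels = [tweet_polarity(text) for text in texts]
trending = top_hashtags(texts, n=2)

for text, label in zip(texts, labels):
    print(f"{label:8s} {text}")
print(trending)
```

Summing word weights is one simple way to express the "more dominant sentiment wins" rule; in Hive the same idea would typically be a join between tokenized tweet words and a dictionary table followed by an aggregate per tweet.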