## Project title
Sentiment Analysis on Twitter
## Description
Creating a model for sentiment analysis, i.e. the process of determining whether a given piece of text is positive, negative, or neutral.
## Motivation
The aim of the project was to collect tweets, predict their positive or negative sentiment, and determine which model works best for tweets whose sentiment is unknown.
## Features
End users can visualize the number of tweets for a particular hashtag or keyword, and sentiment analysis is run on those tweets to show the impact of that hashtag. Typical use cases:
- See what people are saying about a business's brand on Twitter.
- Do market research on how people feel about competitors, market trends, product offerings, etc.
- Analyze the impact of marketing campaigns on Twitter users.
## API Reference
A Twitter developer account is required to obtain API credentials. The app used for this project:
https://developer.twitter.com/en/apps/16512759
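Below is a minimal sketch of authenticating with the Twitter API through tweepy. The credential strings are placeholders for the keys and tokens generated by your own developer app, and the final verification call is only there to confirm the credentials work.

```python
import tweepy

# Placeholder credentials from your Twitter developer app (not real values).
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
ACCESS_TOKEN = "your-access-token"
ACCESS_TOKEN_SECRET = "your-access-token-secret"

# OAuth 1.0a user authentication.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Confirm the credentials are valid before collecting tweets.
user = api.verify_credentials()
print("Authenticated as:", user.screen_name)
```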
## How to use?
While we haven’t built empathetic robots yet, we have begun using machine learning to identify human emotions expressed in social media data, a technology known as sentiment analysis.
Simply enter a keyword, and the Tweet Visualizer automatically pulls recent tweets (from the past week, though the time range is shorter for popular subjects).
You can then explore the many visualization options that the tool offers for tweets.
## Dataset Information
We use and compare several different methods for sentiment analysis on tweets (a binary classification problem). The training dataset is expected to be a CSV file of the form `tweet_id,sentiment,tweet`, where `tweet_id` is a unique integer identifying the tweet, `sentiment` is either 1 (positive) or 0 (negative), and `tweet` is the tweet text enclosed in `""`. Similarly, the test dataset is a CSV file of the form `tweet_id,tweet`. Please note that CSV headers are not expected and should be removed from the training and test datasets.
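For illustration, hypothetical training rows would look like the following (the IDs and tweet text are made up):

```
1001,1,"loving the new phone, battery lasts all day"
1002,0,"worst customer service i have ever seen"
```

and a hypothetical test row:

```
2001,"not sure how i feel about this update"
```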
## Requirements
There are some general library requirements for the project and some which are specific to individual methods. The general requirements are as follows (an example install command is given after the lists):
- tweepy
- numpy
- scikit-learn
- scipy
- nltk

The library requirements specific to some methods are:
- keras with TensorFlow backend for Logistic Regression, MLP, RNN (LSTM), and CNN.
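Assuming a standard Python environment, the packages above can be installed with pip roughly as follows (the names below are the usual PyPI package names; pin versions as needed for your setup):

```
pip install tweepy numpy scipy scikit-learn nltk keras tensorflow
```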
## Usage
## Preprocessing
Run `preprocess.py <raw-csv-path>` on both train and test data. This will generate a preprocessed version of the dataset.
Run `stats.py <preprocessed-csv-path>`, where `<preprocessed-csv-path>` is the path of the CSV generated by `preprocess.py`. This gives general statistical information about the dataset and will generate two pickle files, which are the frequency distributions of unigrams and bigrams in the training dataset.
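For example, assuming the raw files are named train.csv and test.csv (hypothetical names; substitute your own paths), the two steps would be invoked like this:

```
python preprocess.py train.csv
python preprocess.py test.csv
python stats.py <preprocessed-train-csv>
```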
After the above steps, you should have four files in total: `<preprocessed-train-csv>`, `<preprocessed-test-csv>`, `<freqdist>`, and `<freqdist-bi>`, which are the preprocessed train dataset, the preprocessed test dataset, the frequency distribution of unigrams, and the frequency distribution of bigrams respectively.
For all the methods that follow, change the values of `TRAIN_PROCESSED_FILE`, `TEST_PROCESSED_FILE`, `FREQ_DIST_FILE`, and `BI_FREQ_DIST_FILE` to your own paths in the respective files. Wherever applicable, the values of `USE_BIGRAMS` and `FEAT_TYPE` can be changed to obtain results using different types of features, as described in the report.
## Baseline
Run `baseline.py`. With `TRAIN = True`, it will show the accuracy results on the training dataset.
## Naive Bayes
Run `naivebayes.py`. With `TRAIN = True`, it will show the accuracy results on a 10% validation split.
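As a rough illustration of this kind of classifier (a minimal sketch only, not the exact features or code in naivebayes.py), the following trains a scikit-learn Naive Bayes model on bag-of-words features with a 10% validation split; the hypothetical input file follows the `tweet_id,sentiment,tweet` format described above:

```python
import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical path to a preprocessed training file (tweet_id,sentiment,tweet; no header).
texts, labels = [], []
with open("train-processed.csv", newline="", encoding="utf-8") as f:
    for tweet_id, sentiment, tweet in csv.reader(f):
        texts.append(tweet)
        labels.append(int(sentiment))

# Bag-of-words (unigram) features; the project's scripts may use different features.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Hold out 10% of the data for validation, as described above.
X_train, X_val, y_train, y_val = train_test_split(X, labels, test_size=0.1, random_state=42)

clf = MultinomialNB()
clf.fit(X_train, y_train)
print("Validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```

The same pipeline also illustrates the SVM step below: `sklearn.svm.LinearSVC` can be dropped in place of `MultinomialNB`, though svm.py may differ in its details.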
## SVM
Run `svm.py`. With `TRAIN = True`, it will show the accuracy results on a 10% validation split.
## Multi-Layer Perceptron
Run `neuralnet.py`. It will validate using 10% of the data and save the best model to `best_mlp_model.h5`.
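Once the model has been saved, it can be reloaded with Keras for later prediction. A minimal sketch, assuming the test features are built with the same feature extraction used in neuralnet.py (not shown here) and that the model has a single sigmoid output:

```python
import numpy as np
from keras.models import load_model

# Load the best model saved by neuralnet.py.
model = load_model("best_mlp_model.h5")

# Placeholder input with the model's expected feature width; replace with
# feature vectors produced the same way as during training.
X_test = np.zeros((1, model.input_shape[1]))

# Probability of the positive class (sentiment = 1), thresholded at 0.5.
probs = model.predict(X_test)
predictions = (probs > 0.5).astype(int)
print(predictions)
```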
## Information about other files
- `dataset/positive-words.txt`: list of positive words.
- `dataset/negative-words.txt`: list of negative words.
- `dataset/glove-seeds.txt`: GloVe word vectors from StanfordNLP which match our dataset, used for seeding word embeddings (see the loading sketch after this list).
- `Plots.ipynb`: IPython notebook used to generate the plots in the report.
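A small sketch of how dataset/glove-seeds.txt could be read, assuming it follows the standard GloVe text format (one word per line followed by its vector components):

```python
import numpy as np

def load_glove(path):
    """Read GloVe-style text vectors into a {word: vector} dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

glove = load_glove("dataset/glove-seeds.txt")
print(len(glove), "word vectors loaded")
```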
## Contribute
To contribute to the project, see the [contributing guidelines](https://code.swecha.org/social-media/sentiment-analysis).