Required Libraries

Always remember to import the libraries you need to work on your data. In this course we will work with four main libraries: scikit-learn, NumPy, SciPy, and pandas. An easy way to install all of them in one go is to get Anaconda: https://anaconda.org/anaconda/python. Installing Anaconda also gives you Jupyter Notebook, in which you can run all the commands below. An alternative, which I highly recommend, is to install Anaconda first and then run your code from an editor such as TextWrangler, Sublime Text, Notepad++, or Vim. Doing the latter will be highly beneficial to you after this course is over (but it does have a learning curve).

Example of how to install a library

In [81]:
pip3 install pandas ## This will install the pandas library. Do this on your terminal and not on jupyter. 
  File "<ipython-input-81-49e0a484917e>", line 1
    pip3 install pandas ## This will install the pandas library. Do this on your terminal and not on jupyter.
               ^
SyntaxError: invalid syntax

To install pip, follow the directions at this link: https://pip.pypa.io/en/stable/installing/
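If you do want to install packages from inside a notebook, prefixing the command with an exclamation mark hands the line to the shell instead of the Python interpreter. A minimal sketch, assuming pip3 is on your PATH:

!pip3 install pandas numpy scipy scikit-learn ## The leading ! runs this line in the shell, so it works inside Jupyter.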

Importing libraries

In [84]:
import pandas as pd
import numpy as np
import scipy
In [85]:
print("Hello world!") ## This is something I typically write before anything to check if the environment is running. 
Hello world!

Reading the data

In [87]:
df = pd.read_csv("/Users/boY/Desktop/code_me/text_mining/summer_2019/ManualSentimentClassifier/train.csv",encoding='ISO-8859-1') # there are different encodings, and for HW1 this particular one works well.
In [91]:
df.head(3)
Out[91]:
text class
0 It was clear right from the beginning that 9/... positive
1 The most hillarious and funny Brooks movie I ... positive
2 Along with Fernando Fragata João Mário Gril... positive
In [92]:
df.columns ## To print out the column names
Out[92]:
Index(['text', 'class'], dtype='object')
In [93]:
df.shape ## The dimensions of the data frame. The output should be read as (rows, columns).
Out[93]:
(2000, 2)
In [94]:
rows, columns = df.shape # Print both values out to check the output yourself. 
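For example, a quick sanity check of the two unpacked values (the numbers assume the same train.csv loaded above):

print(rows, columns) ## Should print: 2000 2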
In [95]:
df.describe()
Out[95]:
text class
count 2000 2000
unique 2000 2
top Despite of the success in comedy or drama the... negative
freq 1 1000
In [96]:
df['class'].unique() ## To know the unique labels. 
Out[96]:
array(['positive', 'negative'], dtype=object)
In [97]:
df['class'].value_counts() ## To know the distribution of the labels. 
Out[97]:
negative    1000
positive    1000
Name: class, dtype: int64
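If you would rather see the distribution as proportions than raw counts, value_counts accepts a normalize flag. A quick sketch:

df['class'].value_counts(normalize=True) ## Each label should come out to 0.5 in this balanced dataset.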

Convert labels into a machine-readable format

In [14]:
from sklearn import preprocessing ### Importing the preprocessing module to encode the labels in the target class. 
In [15]:
train_class_y = ['negative','positive']
In [16]:
le = preprocessing.LabelEncoder() ## Label encoder does the trick. 
In [17]:
le.fit(train_class_y) ## We are fitting the categories now. 
Out[17]:
LabelEncoder()
In [19]:
train_y = le.transform(df['class']) ## Here we transform our labels into 0s and 1s, i.e. binary values.
In [21]:
train_y ## These are our labels now. The output is an array of binary labels.  
Out[21]:
array([1, 1, 1, ..., 0, 0, 0])
In [22]:
le.transform(['positive','negative','positive']) ### Just to check: "le" is the object we created, and it transforms the data.
Out[22]:
array([1, 0, 1])
In [23]:
le.inverse_transform([0,1,1]) ### Doing an inverse of the transformation
Out[23]:
array(['negative', 'positive', 'positive'], 
      dtype='<U8')
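To see which label was mapped to which integer, the fitted encoder exposes a classes_ attribute; the position of each label in the array is its encoded value:

le.classes_ ## array(['negative', 'positive'], dtype='<U8'), so 'negative' encodes to 0 and 'positive' to 1.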

Working with the text data

In [98]:
train_x = df['text']
In [99]:
train_x.shape ## Just making sure that we have what we want. 
Out[99]:
(2000,)

Tokenizing using bag of words

In [100]:
from sklearn.feature_extraction.text import CountVectorizer ## CountVectorizer gives you the bag-of-words representation. 
In [101]:
count_vect = CountVectorizer()
In [104]:
X_train_counts = count_vect.fit_transform(train_x)
In [105]:
X_train_counts.shape
Out[105]:
(2000, 25736)
In [107]:
X_train_counts.toarray()
Out[107]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
In [108]:
np.count_nonzero(X_train_counts.toarray()) ## Counting the non-zero entries in the matrix. 
Out[108]:
272324
In [109]:
count_vect.get_feature_names()
Out[109]:
['00',
 '000',
 '000s',
 '00am',
 '01',
 '02',
 '03',
 '04',
 '05',
 '07',
 '08',
 '0ne',
 '10',
 '100',
 '1000',
 '10000000000000',
 '1000lb',
 '100x',
 '101',
 '102',
 '104',
 '108',
 '10lines',
 '10p',
 '10pm',
 '10th',
 ...]
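To make those counts concrete, here is a minimal bag-of-words sketch on a toy corpus of my own (not the homework data). Each row of the matrix is a document, each column a vocabulary word, and each entry a raw count:

from sklearn.feature_extraction.text import CountVectorizer

toy = ["the movie was good", "the movie was bad bad"]
cv = CountVectorizer()
toy_counts = cv.fit_transform(toy)
print(cv.get_feature_names()) ## ['bad', 'good', 'movie', 'the', 'was']
print(toy_counts.toarray())   ## [[0 1 1 1 1]
                              ##  [2 0 1 1 1]]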

TF-IDF

In [110]:
from sklearn.feature_extraction.text import TfidfVectorizer ## Importing the library that will help us do this. 
In [111]:
tf = TfidfVectorizer(min_df=1,stop_words='english',max_features=5000) ## Ask yourself: why min_df=1? We are using English stop words and keeping the 5000 most frequent features. 
####max_features=3000
In [112]:
train_x_tfidf = tf.fit_transform(train_x)
In [44]:
tf.get_feature_names() ## Be careful to check your feature names with tf and not with train_x_tfidf
Out[44]:
['000',
 '10',
 '100',
 '101',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '1930',
 '1932',
 '1933',
 '1939',
 '1943',
 '1950',
 '1950s',
 '1960s',
 ...]
In [113]:
train_x_tfidf_array = train_x_tfidf.toarray()
In [114]:
train_x_tfidf_array[0]
Out[114]:
array([ 0.,  0.,  0., ...,  0.,  0.,  0.])
In [115]:
tf.inverse_transform(train_x_tfidf_array[0]) ## Just to check which features are present in the first document. 
Out[115]:
[array(['11', '1973', 'add', 'air', 'alas', 'angry', 'anti', 'appealing',
        'appearing', 'audience', 'bad', 'beginning', 'best', 'big', 'bin',
        'bizarre', 'black', 'bomb', 'boys', 'bunch', 'calls', 'car',
        'certainly', 'cheesy', 'choreography', 'cia', 'cinema', 'classic',
        'claude', 'clear', 'collective', 'combined', 'come', 'consider',
        'cuts', 'danger', 'deaf', 'did', 'different', 'directly',
        'director', 'directors', 'don', 'double', 'easy', 'effort',
        'elected', 'end', 'ending', 'entire', 'episodes', 'ernest', 'event',
        'example', 'explain', 'extremely', 'faces', 'falling', 'family',
        'far', 'features', 'film', 'films', 'finally', 'forget', 'girl',
        'good', 'great', 'happened', 'happy', 'hard', 'haven', 'help',
        'henry', 'hope', 'hysterical', 'ii', 'imagine', 'impressive',
        'indian', 'inspire', 'instead', 'interesting', 'international',
        'just', 'ken', 'killed', 'known', 'leading', 'life', 'll', 'long',
        'love', 'luck', 'mainly', 'make', 'man', 'masterpiece', 'mentioned',
        'minute', 'movies', 'muslim', 'needless', 'new', 'open', 'opening',
        'order', 'parts', 'pearl', 'penn', 'people', 'phone', 'pictures',
        'portrayal', 'president', 'pretty', 'probably', 'prologue',
        'promise', 'question', 'react', 'real', 'really', 'recognition',
        'recognize', 'release', 'remembered', 'reporter', 'reward', 'right',
        'sad', 'say', 'screen', 'sean', 'segment', 'segments', 'september',
        'shares', 'shocked', 'silence', 'son', 'sound', 'starring',
        'starting', 'story', 'strange', 'suggests', 'sure', 'surprisingly',
        'takes', 'tale', 'talking', 'tastes', 'terrible', 'things',
        'thirty', 'tower', 'towers', 'tradition', 'tries', 'trying', 'twin',
        'unique', 'vietnam', 'war', 'watch', 'way', 'western', 'women',
        'work', 'world', 'years', 'yes', 'york'], 
       dtype='<U16')]
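To see why TF-IDF is often more informative than raw counts, here is a minimal sketch on a toy corpus of my own: words that occur in every document get downweighted relative to words that are distinctive to one document.

from sklearn.feature_extraction.text import TfidfVectorizer

toy = ["the movie was good", "the movie was bad"]
tf_toy = TfidfVectorizer()
toy_weights = tf_toy.fit_transform(toy)
print(tf_toy.get_feature_names())     ## ['bad', 'good', 'movie', 'the', 'was']
print(toy_weights.toarray().round(2)) ## 'good' and 'bad' receive higher weights than the words shared by both documents.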

Importing learning models

Multinomial Naive Bayes

In [116]:
from sklearn.naive_bayes import MultinomialNB
In [118]:
mnb = MultinomialNB(alpha=1.0) # Check what this alpha value is. You have already learnt most of the math to understand this.
In [119]:
mnb.fit(train_x_tfidf_array,train_y)
Out[119]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Preparing the test data

In [120]:
test_df = pd.read_csv("/Users/boY/Desktop/code_me/text_mining/summer_2019/ManualSentimentClassifier/test.csv",encoding='ISO-8859-1')
In [121]:
test_x_tfidf = tf.transform(test_df['text']) ## Where did we get tf from? Note that we call transform, not fit_transform, so the test data is mapped onto the vocabulary learned from the training data.
In [122]:
test_x_tfidf_array = test_x_tfidf.toarray()
In [123]:
test_y = le.transform(test_df['class']) ## Where did we get "le" from? 
In [124]:
test_y.shape
Out[124]:
(2000,)
In [125]:
test_x_tfidf_array.shape
Out[125]:
(2000, 5000)
In [126]:
predictions = mnb.predict(test_x_tfidf_array)
In [58]:
predictions.shape
Out[58]:
(2000,)
In [127]:
count = 0 ## Count how many predictions match the true labels.
for i in range(len(predictions)):
    if predictions[i] == test_y[i]:
        count = count + 1
In [128]:
count/2000 ## Accuracy: the fraction of the 2000 test reviews classified correctly.
Out[128]:
0.821
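The same number comes straight from the classifier: mnb.score computes accuracy for you, which also makes it easy to experiment with alpha (the additive, or Laplace, smoothing parameter). A hypothetical sweep; the exact numbers will vary:

mnb.score(test_x_tfidf_array, test_y) ## Should match the 0.821 computed above.

for a in [0.01, 0.1, 1.0, 10.0]: ## Refit with different smoothing strengths and compare test accuracy.
    m = MultinomialNB(alpha=a).fit(train_x_tfidf_array, train_y)
    print(a, m.score(test_x_tfidf_array, test_y))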

Logistic Regression

In [129]:
from sklearn.linear_model import LogisticRegression # load the library
In [131]:
log_reg = LogisticRegression(C=4.0)
In [132]:
log_reg.fit(train_x_tfidf_array,train_y)
Out[132]:
LogisticRegression(C=4.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [133]:
log_reg.score(train_x_tfidf_array,train_y) # running it on the train set itself. 
Out[133]:
0.99050000000000005
In [135]:
log_reg.score(test_x_tfidf_array,test_y) # running it on the test set. 
Out[135]:
0.84099999999999997
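The gap between 0.99 on the train set and 0.84 on the test set hints at overfitting. C in LogisticRegression is the inverse of the regularization strength, so smaller C means stronger regularization. A hypothetical sweep to see the effect:

for c in [0.1, 1.0, 4.0, 10.0]: ## Compare train vs. test accuracy as regularization weakens.
    model = LogisticRegression(C=c).fit(train_x_tfidf_array, train_y)
    print(c, model.score(train_x_tfidf_array, train_y), model.score(test_x_tfidf_array, test_y))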

SVM

In [136]:
from sklearn import svm
In [142]:
clf = svm.SVC(C=1.0,degree=1,kernel='linear') # degree is only used by the 'poly' kernel, so it has no effect here with a linear kernel.
In [143]:
clf.fit(train_x_tfidf_array,train_y)
Out[143]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=1, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
In [144]:
predicted = clf.predict(test_x_tfidf_array)
In [145]:
count = 0 ## The same counting loop as before, now for the SVM predictions.
for i in range(len(predicted)):
    if predicted[i] == test_y[i]:
        count = count + 1
count
Out[145]:
1662
In [147]:
count/2000
Out[147]:
0.831
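Instead of the manual counting loop, scikit-learn ships a helper that does the same division in one call. A one-line sketch:

from sklearn.metrics import accuracy_score
accuracy_score(test_y, predicted) ## Should match the 0.831 above.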

Random Forest

In [148]:
from sklearn.ensemble import RandomForestClassifier
In [149]:
forest = RandomForestClassifier(max_depth=10,n_estimators=100,min_samples_leaf=2)
#max_depth=10,n_estimators=100,min_samples_leaf=2
In [150]:
forest.fit(train_x_tfidf_array,train_y)
Out[150]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [151]:
forest.score(train_x_tfidf_array,train_y)
Out[151]:
0.92049999999999998
In [152]:
forest.score(test_x_tfidf_array,test_y)
Out[152]:
0.80200000000000005
In [153]:
forest_predictions = forest.predict(test_x_tfidf_array)

Confusion matrix

In [154]:
from sklearn.metrics import confusion_matrix
In [155]:
confusion_matrix(test_y, forest_predictions) ## We have done this in class. 
Out[155]:
array([[781, 219],
       [177, 823]])
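Read the matrix row by row: row 0 is the reviews whose true label is negative (781 classified correctly, 219 misclassified as positive), and row 1 is the true positives (177 misclassified as negative, 823 correct). For per-class precision and recall in one shot, a short sketch:

from sklearn.metrics import classification_report
print(classification_report(test_y, forest_predictions, target_names=le.classes_))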