Country-Attribute Pair Identification Using Supervised Learning
CS215 Project: Autumn 2014
Computer Science and Engineering, IIT Bombay
Team Members
Ghurye Sourabh Sunil - 130050001
Nikhil Vyas - 130050023
Utkarsh Mall - 130050037
Objective
Use statistical data about various countries as training data to extract the relations expressed in arbitrary sentences.
Assign a confidence value to each extracted country-value pair.
Instructions
Put the C++ code (cs215.cpp) in the working directory along with the knowledgebase folder.
Put the sentences.tsv demo file in the working directory.
Compile the code with C++11 support --> "g++ -std=c++0x cs215.cpp"
Confidence values are calculated using the Standard Normal Distribution Table, which must be present in the knowledgebase folder as normal.tsv.
Values with less than 10% confidence are rejected.
The output is written to the output file in the working directory.
Details about the code are given in the presentation.
Major Functions and Classes
init : Initialises the normal array from normal.tsv.
exist : Searches for keywords in the given vector of strings.
conff : Returns the confidence given the mean, standard deviation, value and whether keywords exist or not.
Country : Stores the attribute values, averages and standard deviations for each country.
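As an illustration of the keyword search that exist performs, here is a minimal sketch. The names toLower and containsKeyword are hypothetical, and the case-insensitive whole-word matching is an assumption; the actual code in cs215.cpp may compare differently.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: returns a lower-cased copy of s.
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Hypothetical stand-in for exist: true if any word of the sentence
// equals any keyword. Case-insensitive matching is an assumption.
bool containsKeyword(const std::vector<std::string>& words,
                     const std::vector<std::string>& keywords) {
    for (const auto& w : words) {
        const std::string lw = toLower(w);
        for (const auto& k : keywords) {
            if (lw == toLower(k)) return true;
        }
    }
    return false;
}
```

Here words would hold the tokens of one sentence and keywords the names associated with one attribute.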
Algorithm
We use our knowledge base to obtain a representative value (mean and standard deviation)
for each attribute of each country. We assume that the values given in the sentences
are recent. Since the knowledge base lists values in chronological order, we give
more weight to recent values than to older ones.
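A recency-weighted mean and standard deviation could be computed roughly as follows. This is only a sketch: the function name weightedStats is hypothetical, and the linearly increasing weights (1 for the oldest value, n for the newest) are an assumption, since the exact weighting scheme is not stated here.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Returns {weighted mean, weighted standard deviation} of values given in
// chronological order (oldest first).
// Assumption: the weight of the i-th value is i + 1, so newer values count more.
std::pair<double, double> weightedStats(const std::vector<double>& values) {
    if (values.empty()) return {0.0, 0.0};

    double weightSum = 0.0;
    double mean = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        const double w = static_cast<double>(i + 1);
        weightSum += w;
        mean += w * values[i];
    }
    mean /= weightSum;

    double variance = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        const double w = static_cast<double>(i + 1);
        const double d = values[i] - mean;
        variance += w * d * d;
    }
    variance /= weightSum;

    return {mean, std::sqrt(variance)};
}
```

With this weighting, a series that has been rising over the years gets a mean closer to its latest value than a plain average would give.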
Next we go through the sentences, extracting the numbers and the country names, which are
given at the end of each line. Since we do not yet know which attribute a number refers to,
we try to match the number against every attribute of every country given in the sentence.
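The number extraction step could look roughly like the sketch below. The name extractNumbers is hypothetical, and the handling of thousands separators is an assumption about how numbers appear in the sentences; tokens with units attached (such as "5%") are simply skipped here.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects every whitespace-separated token of the
// sentence that parses completely as a number.
std::vector<double> extractNumbers(const std::string& sentence) {
    std::vector<double> numbers;
    std::istringstream in(sentence);
    std::string token;
    while (in >> token) {
        // Assumption: commas are thousands separators, as in "1,250,000".
        token.erase(std::remove(token.begin(), token.end(), ','), token.end());
        if (token.empty()) continue;

        // Require a digit (after an optional minus sign) so that words
        // such as "nan" or "inf" are not accepted by strtod.
        const std::size_t start = (token[0] == '-') ? 1 : 0;
        if (start >= token.size() ||
            !std::isdigit(static_cast<unsigned char>(token[start]))) {
            continue;
        }

        char* end = nullptr;
        const double value = std::strtod(token.c_str(), &end);
        // Keep the token only if strtod consumed all of it.
        if (end != token.c_str() && *end == '\0') numbers.push_back(value);
    }
    return numbers;
}
```

Each number returned this way would then be compared against every attribute of the countries listed for that sentence.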
For each such match we compute a confidence value, assuming the attribute values are normally distributed.
The confidence value also depends on whether the attribute's keyword or related keywords are
found in the sentence: if they are found the confidence is higher, otherwise lower. The country-attribute
pair with the maximum confidence value is taken as the required pair.
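The matching step can be sketched as below. Several things here are assumptions and not taken from cs215.cpp: the names Candidate, confidence and bestCandidate are hypothetical, std::erfc is used in place of the lookup in normal.tsv, the confidence is taken to be the two-tailed probability of a deviation at least as large as the observed one, and a missing keyword is assumed to halve the confidence. Only the 10% rejection threshold comes from the instructions above.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one country-attribute pair from the knowledge base.
struct Candidate {
    std::string country;
    std::string attribute;
    double mean;
    double stddev;
    bool keywordFound;  // whether the attribute's keyword occurs in the sentence
};

// Two-tailed normal probability P(|Z| >= |z|) for the observed value,
// reduced when no keyword supports the attribute.
// Assumption: the penalty factor 0.5 is illustrative only.
double confidence(double value, double mean, double stddev, bool keywordFound) {
    double tail = 0.0;
    if (stddev > 0.0) {
        const double z = std::fabs(value - mean) / stddev;
        tail = std::erfc(z / std::sqrt(2.0));
    } else if (value == mean) {
        tail = 1.0;  // no spread: only an exact match is plausible
    }
    return keywordFound ? tail : 0.5 * tail;
}

// Returns the index of the candidate with the highest confidence, or -1
// if no candidate reaches the minimum confidence (10% by default).
int bestCandidate(double value, const std::vector<Candidate>& candidates,
                  double minConfidence = 0.10) {
    int best = -1;
    double bestConfidence = 0.0;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        const Candidate& c = candidates[i];
        const double conf = confidence(value, c.mean, c.stddev, c.keywordFound);
        if (conf >= minConfidence && (best == -1 || conf > bestConfidence)) {
            best = static_cast<int>(i);
            bestConfidence = conf;
        }
    }
    return best;
}
```

In this sketch a value two standard deviations from the mean gets a confidence of about 4.6% even with a keyword match, so it would fall below the 10% threshold and be rejected.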
When computing the final confidence we take every attribute's confidence into account, since
the best match is less certain when other attributes also match the value.