MasterScript.py calculates various features of communities. 
--------------------------------------------------
USAGE:

To use, run "python MasterScript.py EDGE_FILE COMM_FILE OUT_FILE [CLASS_NUM]"

EDGE_FILE is the name of the file containing the edges of the (undirected) network.  
In this file, every line must contain an edge, represented as two nodes separated by whitespace. E.g.:
node1	node2
node2	node3
node1	node4
...
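For reference, this format can be parsed with a few lines of Python. The sketch below is stdlib-only and illustrative (the function name load_edges is not part of MasterScript.py); it builds an adjacency mapping of node -> set of neighbors from such a file:

```python
from collections import defaultdict

def load_edges(path):
    """Parse an EDGE_FILE: one edge per line, two whitespace-separated nodes."""
    adj = defaultdict(set)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                u, v = parts[0], parts[1]
                adj[u].add(v)
                adj[v].add(u)  # the network is undirected
    return dict(adj)
```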

COMM_FILE is the name of the file containing the communities.
In this file, every line must contain the nodes of one community, separated by whitespace.  E.g.:
Comm1Node1	Comm1Node2	Comm1Node3	Comm1Node4 ...
Comm2Node1	Comm2Node2	Comm2Node3 ...
...
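A matching sketch for the community file (again stdlib-only and illustrative, not MasterScript.py's internal reader); each line becomes one list of node names:

```python
def load_communities(path):
    """Parse a COMM_FILE: one community per line, nodes separated by whitespace."""
    comms = []
    with open(path) as f:
        for line in f:
            nodes = line.split()
            if nodes:  # skip blank lines
                comms.append(nodes)
    return comms
```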

OUT_FILE is the name of the file where the features will be written,
in a format suitable for use with the LibSVM software package (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). In this format, each line corresponds to the features of one community, written as follows:
0:feature0	1:feature1	2:feature2	3:feature3 ...
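A hedged sketch of emitting one such line (the function name is illustrative; MasterScript.py's own writer may differ). The indices start at 0 here to match the example above; note that standard LibSVM tools expect 1-based, increasing indices, so adjust the offset if your LibSVM build rejects index 0:

```python
def format_libsvm_line(features, class_num=None):
    """Format one community's features as tab-separated index:value pairs,
    optionally prefixed by a class label (the CLASS_NUM argument)."""
    pairs = "\t".join(f"{i}:{x}" for i, x in enumerate(features))
    return f"{class_num}\t{pairs}" if class_num is not None else pairs
```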

CLASS_NUM is an optional argument.  
If specified, this string is prepended to each line of the output file.  When training a LibSVM model, each feature vector must belong to some class, specified in this manner.  To generate a file with multiple classes, create a separate COMM_FILE for each class, then run MasterScript.py once per file, specifying a different CLASS_NUM each time.  Once all the output files have been produced, simply concatenate them into one file.  If you don't want to specify the CLASS_NUM argument, you can include all the communities in the same file and run the script just once (as long as you separately keep track of which community came from which class); in that case, you'll need to insert the class labels yourself.  

See the end of this file for examples.  
------------------------------------------------------
FEATURES:
The output file contains the following features, in the order listed.  
For some features, a large set of values is calculated (e.g., each node has its own node betweenness value).
In such cases, we calculate all such values, sort them, and report different percentile values.
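As a sketch of that percentile reporting (nearest-rank indexing is an assumption here; the script's exact interpolation method may differ slightly):

```python
def percentiles(values):
    """Sort the values and report the 0/25/50/75/100th percentile entries,
    using nearest-rank indexing into the sorted list."""
    s = sorted(values)
    n = len(s)
    return [s[int(p * (n - 1) / 100 + 0.5)] for p in (0, 25, 50, 75, 100)]
```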

Number of nodes
Number of edges
Diameter
Edge Density
Conductance
Transitivity
Triangle Density
Shortest Path Length 0%
Shortest Path Length 25%
Shortest Path Length 50%
Shortest Path Length 75%
Shortest Path Length 100%
Edge Betweenness 0%
Edge Betweenness 25%
Edge Betweenness 50%
Edge Betweenness 75%
Edge Betweenness 100%
Node Betweenness 0%
Node Betweenness 25%
Node Betweenness 50%
Node Betweenness 75%
Node Betweenness 100%
Alpha 0%
Alpha 25%
Alpha 50%
Alpha 75%
Alpha 100%
Beta 0%
Beta 25%
Beta 50%
Beta 75%
Beta 100%
Treesum
Information Centrality 0%
Information Centrality 25%
Information Centrality 50%
Information Centrality 75%
Information Centrality 100%

Explanation of features:
Shortest path lengths are computed by considering all pairs of nodes in the community and calculating the distance between each pair.
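This can be sketched with breadth-first search over an adjacency mapping (dict of node -> set of neighbors; function name illustrative). Each unordered pair appears twice in the result, which leaves the reported percentiles unchanged:

```python
from collections import deque

def all_pair_distances(adj):
    """BFS from every node; return the list of hop distances over all
    connected ordered pairs (each unordered pair is counted twice)."""
    dists = []
    for src in adj:
        seen = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    q.append(v)
        dists.extend(d for node, d in seen.items() if node != src)
    return dists
```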

Edge betweenness is described here: http://networkx.lanl.gov/reference/generated/networkx.algorithms.centrality.edge_betweenness_centrality.html#networkx.algorithms.centrality.edge_betweenness_centrality

Node betweenness is described here: http://networkx.lanl.gov/reference/generated/networkx.algorithms.centrality.betweenness_centrality.html#networkx.algorithms.centrality.betweenness_centrality

Alpha and beta values count the neighbors that various nodes have inside the community.  For a node N outside the community that has at least one neighbor inside it, the alpha value of N is its number of neighbors inside the community.  For a node N inside the community, the beta value of N is its number of neighbors inside the community.
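A minimal sketch of these two definitions, assuming an adjacency mapping of node -> set of neighbors (function name illustrative):

```python
def alpha_beta(adj, community):
    """Return (alphas, betas): alpha values for outside nodes with at least
    one neighbor in the community, beta values for nodes in the community."""
    comm = set(community)
    alphas, betas = [], []
    for node, neighbors in adj.items():
        inside = len(neighbors & comm)  # neighbors inside the community
        if node in comm:
            betas.append(inside)
        elif inside > 0:
            alphas.append(inside)
    return alphas, betas
```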

Treesum is the number of spanning trees in the community.
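This quantity can be computed with Kirchhoff's matrix-tree theorem: the number of spanning trees equals any cofactor of the graph Laplacian. The stdlib sketch below (function name illustrative; MasterScript.py may compute it differently) drops the last row and column of the Laplacian and takes an exact determinant with fractions:

```python
from fractions import Fraction

def treesum(adj):
    """Count spanning trees via Kirchhoff's matrix-tree theorem:
    determinant of the Laplacian with one row/column deleted."""
    nodes = sorted(adj)
    n = len(nodes)
    if n <= 1:
        return 1
    idx = {v: i for i, v in enumerate(nodes)}
    m = n - 1  # Laplacian minor: drop the last row and column
    L = [[Fraction(0)] * m for _ in range(m)]
    for v in nodes:
        i = idx[v]
        if i < m:
            L[i][i] = Fraction(len(adj[v]))  # degree on the diagonal
        for w in adj[v]:
            j = idx[w]
            if i < m and j < m:
                L[i][j] -= 1  # -1 for each edge
    # Exact determinant by Gaussian elimination
    det = Fraction(1)
    for col in range(m):
        pivot = next((r for r in range(col, m) if L[r][col] != 0), None)
        if pivot is None:
            return 0  # disconnected: no spanning tree
        if pivot != col:
            L[col], L[pivot] = L[pivot], L[col]
            det = -det  # row swap flips the sign
        det *= L[col][col]
        for r in range(col + 1, m):
            factor = L[r][col] / L[col][col]
            for c in range(col, m):
                L[r][c] -= factor * L[col][c]
    return int(det)
```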

Information centrality is described here: http://networkx.lanl.gov/reference/generated/networkx.algorithms.centrality.current_flow_closeness_centrality.html
--------------------------------------------------
CLASSIFICATION:
To perform classification, we recommend using the LibSVM software (available: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ ).

LibSVM will output either a simple or probabilistic classification, which can be used to analyze the structural similarity between different classes.  For non-probabilistic classification, we recommend using the script easy.py included with LibSVM, as this takes care of all the classification steps for you (e.g., cross-validation, training, testing).

--------------------------------------------------
EXAMPLE 1:
Suppose you are interested in studying network "example," with its edges in file "example_links" (format described above).
 
You have applied three community detection algorithms to "example" and obtained three community files in the format described earlier, named "example_comm1", "example_comm2", and "example_comm3".  You are interested in seeing whether these algorithms produce outputs that are similar to one another or fundamentally different.

To do this, run MasterScript.py three times:
python MasterScript.py example_links example_comm1 example_features1 1
python MasterScript.py example_links example_comm2 example_features2 2
python MasterScript.py example_links example_comm3 example_features3 3

This produces three output files ("example_features1", "example_features2", "example_features3").  In file "example_features1", each row is prefaced with a 1, each row in "example_features2" is prefaced with a 2, and each row in "example_features3" is prefaced with a 3.  

Next, merge all of these files into a single file "example_features_all".

Because you are interested in learning whether the classes are similar or different, next split "example_features_all" into smaller training and test files, each containing representatives from all 3 classes.  For example, create "example_features_training", containing a random 90% of the vectors from "example_features_all", and put the remaining 10% into "example_features_test".  (You could also do cross-validation here.)  
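The split can be done with a short helper like the one below (names and the 90/10 default are illustrative; a fixed seed makes the split reproducible):

```python
import random

def split_lines(in_path, train_path, test_path, train_frac=0.9, seed=0):
    """Randomly split a merged feature file into training and test files."""
    with open(in_path) as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    cut = int(len(lines) * train_frac)
    with open(train_path, "w") as f:
        f.writelines(lines[:cut])
    with open(test_path, "w") as f:
        f.writelines(lines[cut:])
```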

If "example_features_all" is very large, you can sample elements from it (you might want to do this anyway to ensure class balance).  

Now use LibSVM to train a classifier on "example_features_training", and then run that classifier on "example_features_test".  Look at how it classified the elements in "example_features_test": do elements from class 1 tend to be correctly classified as class 1?  This would indicate that class 1 is structurally different from the other two classes.  Conversely, if the classifier often misclassifies class 1 elements as classes 2 or 3, this would indicate that, at least using the features considered, class 1 is similar to the other classes.

--------------------------------------------------
EXAMPLE 2:
Suppose you are again interested in studying network "example", and have the three feature files "example_features1", "example_features2", and "example_features3" described above.  Suppose you also have examples of real communities in file "example_comm_real", and want to see which algorithm produces output that most resembles the real communities.  Create another feature file, "example_features_real", by running MasterScript.py on "example_comm_real".  Even though only one class is represented here, the LibSVM format still requires that you label the feature vectors.  In this case the label for "example_comm_real" doesn't matter and won't affect the classification of the vectors.  (LibSVM will report an "accuracy," but it is meaningless in this example.)  So you could, for instance, run:
python MasterScript.py example_links example_comm_real example_features_real 1
even though you already used label 1.  

Now train a LibSVM classifier on a file containing representatives from "example_features1", "example_features2", and "example_features3", and run that classifier on "example_features_real".  Then observe how the elements in "example_features_real" were classified: this tells you which of the algorithms each particular real community most resembles.
