Asif Haque [resume]
Ph.D., Cornell University
Computer Science (minor: Operations Research)
Committee: Paul Ginsparg [wiki], Eric Friedman, David Williamson
Email: asif@cs.cornell.edu
Research

The broad aim of my research is to develop methodologies that facilitate understanding of the interaction between information systems and social systems. Currently I am exploring arXiv, a scholarly communication system, to develop techniques that would ultimately generalize to a variety of settings.

As a system that has been functional and growing for almost two decades, arXiv provides an ideal testbed for investigating scholarly behavior. Answering sociological queries requires a combination of author and citation network analysis, text analysis, and mining the wealth of log data such as downloads, referral logs, and blog trackbacks. Furthermore, there are temporal dimensions to the analysis, ranging from tasks as simple as normalizing across time to ones as computationally challenging as tracing evolution. While conventional machine learning is appropriate for many prediction tasks, the issue of scalability suggests augmenting it with distributed computing paradigms such as Map-Reduce programming, which are better suited to large-scale computation. [pagerank]

In our investigation [positional effects] of how the position of a paper in the daily arXiv listing affects its citation and readership, we thoroughly examined the reasons for and the consequences of authors jockeying for the top positions. Using machine learning methods, we correlated long-term citations with a variety of readership features and suggested a hybrid measure of popularity. Further investigation [additional positional effects] revealed interesting procrastination effects for the last few positions of the daily listing. Geographic bias was ruled out as a strong reason for the effects associated with positions.

Complementary to the metadata analysis is our attempt [phrases] to extract subtopical concepts, characterized by phrases, through a combination of text and network analysis, augmented by logs of search and readership data.
Tracking concepts can provide a useful temporal overview of linked corpora [click trends], and this method of extracting subtopics automatically has potential. A principled way of using n-grams for categorization and subdocument similarity is also being explored.

I am contributing to an ethnographic study that compares and contrasts the co-author networks of different fields of science. The quantitative aspect of the study [mesoscopic analysis] involves clustering co-author networks, characterizing subnetworks of clusters, and identifying node roles in the network.

An author name disambiguation algorithm [disambiguation], using co-authorship and self-citation, has been devised. We have measured the effectiveness of our algorithm on manually disambiguated samples from our data set, with an eye on the network structure.
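
To make the idea of disambiguating by co-authorship and self-citation concrete, here is a minimal illustrative sketch in Python. It is not the algorithm referred to above, whose details are not given here. It assumes a hypothetical input format (a dict mapping paper ids to author-name sets and cited-paper sets) and a simple rule: two papers bearing the same name string are attributed to the same person if they share another co-author or if one cites the other.

```python
from collections import defaultdict
from itertools import combinations


def disambiguate(papers, name):
    """Cluster the papers that list `name` into presumed individual authors.

    `papers` (hypothetical format) maps paper_id -> {"authors": set of
    name strings, "cites": set of paper_ids}. Two papers are linked when
    they share a co-author other than `name`, or when one cites the other
    (treated as a self-citation, since both carry `name`).
    """
    ids = [pid for pid, rec in papers.items() if name in rec["authors"]]
    parent = {pid: pid for pid in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        shared = (papers[a]["authors"] & papers[b]["authors"]) - {name}
        self_cite = b in papers[a]["cites"] or a in papers[b]["cites"]
        if shared or self_cite:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for pid in ids:
        clusters[find(pid)].append(pid)
    return sorted(sorted(group) for group in clusters.values())


# Toy data, invented purely for illustration.
papers = {
    "p1": {"authors": {"J. Smith", "A. Lee"}, "cites": set()},
    "p2": {"authors": {"J. Smith", "A. Lee", "B. Kim"}, "cites": set()},
    "p3": {"authors": {"J. Smith", "C. Wu"}, "cites": {"p1"}},
    "p4": {"authors": {"J. Smith", "D. Roy"}, "cites": set()},
    "p5": {"authors": {"E. Fox"}, "cites": {"p4"}},
}
clusters = disambiguate(papers, "J. Smith")
print(clusters)
```

In this toy input, p1 and p2 are linked through the shared co-author, p3 joins them through a self-citation, and p4 remains a separate identity. A real system would also need to handle name variants and would be evaluated against manually disambiguated samples, as described above.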
Publications