NSF Grant Funds Development of New Tools for Social Scientists

The National Science Foundation (NSF) has awarded an interdisciplinary team of Cornell researchers $2 million over two years to develop tools that will support work in computational social science. Researchers will develop, test, and refine these tools by applying them to a problem of broad theoretical and practical interest: the diffusion of innovations.

"The long-term impact of this research promises to make high-performance computing and data storage systems practical tools for all social scientists much as they have become a mainstay of work in the physical sciences and engineering," said Michael Macy, sociology department chair and the project's principal investigator. "Cornell is the hub of interdisciplinary activity in the area of computational social science; our work will provide new tools for researchers seeking to use Web data to better understand social life."

In the study, "Very Large Semi-Structured Datasets for Social Science Research," Cornell scientists have partnered to build cybertools: tools that will enable social scientists to leverage the power of high-performance computing, the Internet, and Web services. Cybertools will allow for the manipulation, search, and processing of vast quantities of semi-structured data, such as Web pages, blog postings, chat room logs, XML databases, and computer files, and will provide access to the content, relational structure, and evolution of these data.

"Physical scientists have been using high-performance computing to do simulations and modeling for decades," said Dan Huttenlocher, professor of computer science. "The computational tools that can offer an alternative to expensive and time-consuming data collection and analysis have not yet been available in the social sciences."

For this study, researchers will use content from the Internet Archive, a repository of 40 billion Web pages that has the potential to open up new frontiers in social science research on the collective behavior of individuals. Founded by Brewster Kahle in 1996, the archive now resides in the Presidio in San Francisco. Large portions of the data are being moved to the Cornell Theory Center as part of a $1.8 million grant funded in 2004 by the NSF's Directorate for Computer and Information Science and Engineering. That grant supports the development of an information access and analysis system that will meet the data-intensive needs of research projects at Cornell University.

Analyzing data collected from the Internet Archive presents researchers with several challenges:

  • Content and structure are implicit rather than explicit, unlike in traditional tabular databases with specified fields and relations. Automated tools to handle the lack of structure are not yet available.
  • The data result from the independent actions of many agents (both individuals and organizations) rather than being gathered centrally for a particular purpose. The paradigm thus shifts from data collection and analysis to data mining and information discovery.
  • The sheer scale of billions of items poses both computational and conceptual challenges. Manually coding such vast quantities of data is beyond human capabilities, so new tools such as semi-supervised machine learning are required.
  • Even for data gathered from publicly available sources such as the Web, privacy is a substantial concern. Combining, analyzing, and mining data can easily reveal information that was not apparent in the original sources. Thus privacy-preserving data mining and discovery techniques must be employed.

"The Internet Archive is larger in scale and more heterogeneous in content than any other social science dataset we know of," said Macy. "In its current form, this dataset exists in the form of a vast archive that can be accessed only via individual Web page URLs. Our research will transform the dataset into an essential and accessible resource for social science research. This transformation will involve significant computer science research. The resulting resource will be built on the Internet Archive data. It will provide the ability for social scientists to work with and manipulate the data in order to run experiments."

The new tools will be developed by an interdisciplinary team of researchers from various Cornell departments and the Faculty of Computing and Information Science (CIS). The team includes Macy and David Strang (Sociology), Huttenlocher (Johnson Graduate School of Management, CIS, CS), Jon Kleinberg and William Arms (CS, CIS), and Geri Gay (Communication and CIS). The results of their work will make it possible for social scientists to identify trends, such as the spread of business practices. They will also improve understanding of network structures and their effect on the diffusion of innovation, which includes the spread of new technologies, fads and fashions, norms, opinions, and urban legends.

"Sociology, anthropology, geography, economics, organizational behavior, population ecology, and communication studies all have long-standing traditions of research on the processes by which ideas and practices spread across the social landscape," said Macy. "Pioneers in this area know that we need a centralized solution for archiving and access, and the data for this centralized solution already exists: the Internet Archive. Our research will exploit this data source's potential and provide the opportunity for social scientists to observe very large-scale social interactions that leave a digital record. We have assembled a team of leading specialists in machine learning, language processing, data mining, privacy, and Web analysis, alongside social scientists with expertise in the study of diffusion and social influence. The tools under development will provide and facilitate access by social scientists around the world, without the prohibitive need to start each study from scratch."