A National Library for Undergraduate Science, Mathematics, Engineering, and Technology Education: Needs, Options, and Feasibility

Technical and Technological Considerations

William Arms
Corporation for National Research Initiatives


This is the author's manuscript of a paper given to the National Research Council workshop on August 7-8, 1997 on "A National Library for Undergraduate Science, Mathematics, Engineering, and Technology Education: Needs, Options, and Feasibility". The published version is available from the National Academy Press.


Digital Libraries and Undergraduate Education

Introduction

This is a discussion paper for the National Research Council workshop on August 7-8, 1997. The paper is arranged as a series of key topics that fall within the theme of the workshop. Although the paper emphasizes technical aspects of a digital library, it is impossible to introduce technical considerations without discussion of the overall goals and form of the library.

"Please can I use the Web. I don't do libraries."
Anonymous Cornell student, reported by Carl Lagoze.

The fundamental question for the workshop is how a national digital library can enhance undergraduate science education. My basic assumption is that there is little utility in taking existing education materials, designed for other media, and simply placing them on a computer network. The greatest benefits will be gained by modifying curricula and creating different forms of materials, in parallel with the deployment of the digital library.

Some Personal Examples

Each of us brings to this workshop pre-conceptions based on our own experiences. Here are two examples of my own.

During the 1980s, as part of the Andrew project at Carnegie Mellon University, we invested heavily in the creation of educational materials. They were delivered over the campus network, through a networked file system -- a campus digital library for education. The computing initiatives that grew out of the Andrew project have had an impressive impact on education at Carnegie Mellon. Our regular surveys of faculty showed more than half the faculty regularly using computing as an integral part of their courses, but the surveys also showed that most of this impact came from materials that were not developed explicitly for education. The surveys showed that the dominant educational uses were as follows:

  1. Professional computing tools. Many of the enhancements in education came from providing students with the same tools that the faculty use in their research and professional activities. These include application programs (e.g., statistical packages such as SAS, symbolic mathematics such as Maple and Mathematica, graphical programs such as AutoCAD or Quark), and mainstream computing applications (e.g., electronic mail, databases, and compilers). They also include data sets such as census data, NASA's images from space, and the genome database. Although some of these tools began as non-commercial materials, by the time that they became widely used in science education, they were of a scale and complexity that required a commercial framework of support.

  2. Communication. For many years, the dominant applications over the campus network were electronic mail and bulletin boards. In addition, from 1986, extensive reference materials were provided by the university libraries over the network. These materials were widely used by both faculty and students. They also appear to have helped stimulate the steady increase in the use of traditional library materials that occurred during the same period, though it must be admitted that the use of libraries by engineering and computer science students (and faculty) never reached the level that one would hope. As soon as the Mosaic browser was released, the World Wide Web was adopted by the Carnegie Mellon community as a very important source of communication, both for finding and for publishing information.

The second example comes from my time as a faculty member at the British Open University in the early 1970s. This was the first large-scale university organized completely around home-based learning. Although Britain has good public libraries, many students do not have easy access to a library. Therefore, we were forced to construct courses on the assumption that students had access to no materials other than those provided by the course team. The university provided each student with a set of educational materials. These materials included printed texts, reprints of articles, and home experimental kits. Television and radio were used to augment these materials; they were an important part of some courses, but less important in my areas of mathematics and computer science.

The academic achievements of the Open University have shown that good undergraduate education is possible without providing the students access to a library. However, the lack of a library places serious limitations on course design. In particular, the options for independent work are severely limited. As distance learning becomes more common, the workshop might ask how modern technology can help a home-based university, or any university, improve on the Open University's approach of thirty years ago.

Both these examples show the importance of creating services where the teaching faculty have a large measure of control over how the services are used in education.

Potential Benefits

To begin to answer the question of how a national digital library can enhance undergraduate science education, here is a list of the potential benefits that might be hoped for from such a library.

  1. Provide faculty and students with access to original scientific materials. Studying science from original papers, research reports, data sets, etc. is fundamentally different from learning based on distilled materials, such as textbooks. As the volume of scientific information that is crammed into undergraduate courses has grown, universities have moved from the ideal of a liberal education in which students explore a subject through reading original materials to heavily structured curricula. Recently we have seen a trend, at least in some universities, that is partially reversing this direction by encouraging students to carry out independent work, which requires easy access to the source materials of science. Independent work requires good libraries and a digital library has much to offer.

  2. Provide faculty with materials used in preparing courses. Preparation of a good course is extremely labor intensive. Faculty need ways to discover and evaluate educational materials and scientific source materials. They also need access to curricula, course notes, problem sets, etc. The better the services that are provided to faculty, the more they are able to build on the successes of others, and the less likely to use inappropriate materials or to re-create materials.

  3. Provide communication among faculty and students. Communication can be within a university or college, or across organizations. Many faculty, particularly in small colleges, are quite isolated. Networked services, such as bulletin boards and the World Wide Web, develop a community where they can cooperate in both education and research. In a similar manner, students can interact with others from around the world. There is continuing development in collaborative tools that allow faculty and students to distribute their work to others, including annotations and comments.

  4. Deliver specific educational materials. An increasing variety of educational materials are intrinsically digital. They include computer programs, data sets, various categories of multi-media items, etc. Computer networks and digital libraries provide a cost-effective way to store, retrieve, and deliver these materials.

One topic has been deliberately left out of this list, reflecting a personal bias. Because of a combination of technical and economic issues, my instinct is not to focus on using the digital library as a substitute for traditional textbooks. Computer networks have long proved to be an effective way to deliver course notes and other supplementary materials, but textbooks and courses built on textbooks are so closely tied to the strengths of printed volumes that they are difficult to migrate to digital libraries.

The Technology of Digital Libraries

Assumptions

The following are my basic assumptions about the proposed library.

  1. This is a digital library. Although materials will sometimes be printed by the user, and some materials may be available on CD-ROM, the focus is on materials that are created and stored in digital formats, and transmitted to the user over the Internet.

  2. It will be a virtual library. This will not be a conventional library in that it will not acquire and store all its materials. The digital library collections will be managed by many organizations, with materials stored on many different computers. Three models of delivery of information to faculty and students are possible: (a) directly from the originator of the materials (e.g., a publisher), (b) from a service center at the educational establishment (e.g., a library or media center), (c) from collections maintained by the national digital library.

  3. The library will contain both proprietary and public materials. Many of the best educational materials are created by companies or individuals who wish to be paid for their efforts. However, as the World Wide Web has shown, there is also an enormous quantity of high quality material that is made publicly available at no cost. In some areas of science, large amounts of scientific source material are available on-line with no restrictions on access.

  4. Faculty and students will be able to interact with the collections. In a traditional library, it is a serious misdemeanor to write on the books or otherwise alter the collections. In a digital library the collections can be dynamic. People can annotate the materials, or link them to others; some materials are programs that students can execute or interact with; others can carry out computations, simulations, searches, or other actions on behalf of the user.

A Possible Technical Framework

Today's remarkable growth in digital libraries results from the maturing of several technologies: personal computers, the Internet, the World Wide Web, and protocols for searching on-line databases. Major areas where technical barriers remain include: interoperability among disparate systems, user interfaces, authentication and security, archiving, real time and other non-static media, copyright management, payment for services, and searching vast amounts of information.

In each of these areas, there are adequate short-term solutions, supported by extensive research and development. Hardware costs and performance continue to improve rapidly. There are no fundamental, technical barriers to the development of digital libraries for scientific education.

A rough technical outline might be as follows:

  1. The digital library will be built on the Internet. Almost every university and college now has a good connection to the Internet. Faculty and students working at home can dial up to their university or connect through an Internet service provider. All protocols will be based on the TCP/IP suite.

  2. Users will have a standard personal computer (PC or Macintosh) running widely available software. For the foreseeable future, the user interface will be a Web browser, such as Netscape Navigator or Microsoft's Internet Explorer. The library will select a specific set of standard formats and protocols. The aim will be to follow the technical mainstream as it evolves with time, but the library will probably need to provide some additional software to handle special formats, authentication and payment, and identification of materials. These will be provided as applets, plug-ins, or other extensions that can be installed over the network.

  3. Materials in the digital library will be stored on a variety of servers. The collections will be managed by a variety of organizations including universities, publishers, and libraries. With a large scale library, where collections are maintained by many organizations, it is naive to believe that all the computers will be equally up-to-date or run the same protocols and formats. The library must accommodate the problems that are associated with heterogeneity. Today, many of the servers will be HTTP Web servers, but there will also be servers based on other technologies and protocols, such as relational databases (SQL) and Z39.50. Object oriented systems using IIOP may be the next important development. Interoperation among such systems is not easy but can be achieved by adopting suitable formats and protocols. (The Stanford University Infobus project has done good work in this area.)

  4. Materials in the digital library will be entered into a registry. The registry is a centrally managed list of materials that have been selected for the library. The registry contains information about each item, but not the item itself. The information includes an identifier, a digital signature, the location of the material, and perhaps indexing information and annotations. (CNRI has developed a registry for the US Copyright Office and is planning to deploy a modified version in other library applications.)

  5. There will be a central index to materials in all the collections. The indexing information will include cataloging and classification information, organized for distributed retrieval using modern methods of information retrieval. (Several good commercial systems are available today.)

  6. The library will permit annotation of materials. When an item is selected for the collections, an annotation is entered into the registry evaluating the material. Subsequently, users of the library can add annotations that comment on the effectiveness of the materials. (A fascinating approach to annotation is the Multi-Valent Document protocol developed by the Berkeley Digital Library Initiative. This is still a research project that has not yet resulted in any products.)

  7. Key parts of the library will be replicated. The registry and the indexing information will be replicated at several locations for performance and reliability.

Technically, all these components already exist, at least in preliminary form. Assembling and integrating them is, however, a significant undertaking. One challenge for the workshop is to set a framework that balances long term ambition against short term implementation difficulties.
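
As a rough illustration of how the registry, annotation, and replication items in the outline above might fit together, here is a minimal sketch in Python. It is not a description of the CNRI registry or of any existing system. The class names, the field names, and the identifier format used in the usage note are all hypothetical, and replication is reduced to taking an in-memory copy.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RegistryEntry:
    """Information about one item; the item itself is stored elsewhere."""
    identifier: str   # unique name for the item (hypothetical format)
    location: str     # where the material can be fetched, e.g. a URL
    signature: str    # digest or signature recorded at registration
    annotations: List[str] = field(default_factory=list)


class Registry:
    """A centrally managed list of materials selected for the library."""

    def __init__(self) -> None:
        self._entries: Dict[str, RegistryEntry] = {}

    def register(self, entry: RegistryEntry, evaluation: str) -> None:
        """Add an item together with the evaluation made at selection."""
        if entry.identifier in self._entries:
            raise ValueError(f"duplicate identifier: {entry.identifier}")
        entry.annotations.append(f"selection: {evaluation}")
        self._entries[entry.identifier] = entry

    def annotate(self, identifier: str, comment: str) -> None:
        """Record a later comment from a user of the library."""
        self._entries[identifier].annotations.append(f"user: {comment}")

    def resolve(self, identifier: str) -> str:
        """Return the current location of the material."""
        return self._entries[identifier].location

    def annotations(self, identifier: str) -> List[str]:
        """Return a copy of the annotations recorded for an item."""
        return list(self._entries[identifier].annotations)

    def replicate(self) -> "Registry":
        """Return an independent copy, standing in for a remote replica."""
        return copy.deepcopy(self)
```

In use, a selector would call register() once for each item, users would call annotate() afterwards, and a replica obtained from replicate() would answer resolve() requests without contacting the primary. A real deployment would also need a way to keep replicas up to date, which this sketch leaves out.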

Student Access

Student access is a problem. Students will use the digital library routinely only if it is convenient. Although student ownership of computers is increasing steadily, it is far from universal. The capabilities of their computers vary considerably and network access is still patchy.

Currently universities follow several different approaches to providing student access; none is ideal. Most universities provide some computers in student labs or computing clusters, connected to a campus network. This forces the students to go to the computer, thus wasting some of the potential of a digital library. Supply and demand are always a problem at peak times.

Another approach is for students to own their own laptop computers and to connect to the campus network on an ad hoc basis. This is increasingly common with law school students, whose computing needs are very simple, but in science education there are problems of cost, access to software, and hardware limitations. Many scientific applications require substantial computational power or network bandwidth. A variant of this approach, which is followed by some of the best undergraduate colleges, is to urge students to buy their own computers with their own money. The institution supports them by providing software, training, and network connections in the dormitories. This approach is usually supplemented by public computers in labs or clusters.

Each of these approaches is more convenient for a student or faculty member than being restricted to a traditional library. None is as convenient as owning a physical copy of each book. As a result, we are seeing a broad movement to provide digital copies of lightly used scientific information, such as journals, but limited enthusiasm for replacing core education materials with on-line materials, except for those education materials that are intrinsically computer applications.

Information Discovery and Guidance for Students and Instructors

The organization of materials in the library collections is central to its success. Faculty and students must be able to find relevant material quickly; they must have confidence in its accuracy and its suitability for their purpose.

Recently, the NSF sponsored a workshop in Santa Fe to discuss what should follow the current Digital Libraries Initiative. One clear theme emerged from the discussions. The Digital Libraries Initiative emphasized the creation of on-line library collections. Now, four years later, enormous amounts of material are on-line. Some of it is excellent, some is junk. Information overload is emerging as a fundamental issue in digital libraries.

The undergraduate science library faces this problem. Instructors need help in identifying materials and evaluating their potential for specific courses. Students need help in exploring beyond the required materials. Because of their inexperience, students are often unable to evaluate the quality of materials. Therefore, evaluation and systematic description of material is a vital part of the library. How best to do this is a research topic, but there are some basic approaches that can be used today. Here is a possible framework.

  1. All materials in the library will be selected by members of the library staff. Sometimes, selection will be at an item level, at others by groups of material. The method of selection and the selection criteria will be stored with all material, so that users will know why each item is in the library.

  2. Reviews and other annotations will be added to the materials. The library will systematically assemble reviews of materials and feedback of educational usage, from both faculty and students.

  3. External annotations will be welcomed. The library will encourage unsolicited annotations and recommendations from third parties. (As described below, some editorial control will probably be needed.)

  4. There will be a central index. Descriptive metadata about all materials will be consolidated in a central index to the library. It is anticipated that the process of creating this metadata will combine automatic indexing with selective human cataloguing and quality control.

  5. There may be other indexes to parts of the collection. Many of the individual collections that constitute the library will have their own indexes, catalogs, or finding aids.

This strategy does not expect the central library staff to be responsible for all aspects of information discovery. The Internet has shown us the power of private initiatives in organizing and presenting information in novel ways. The digital library needs to harness this utility and creativeness.
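
To make the idea of a central index over descriptive metadata concrete, here is a small sketch of the automatic-indexing half of that process. It builds a simple inverted index from words to item identifiers. This is only one of many possible designs and much cruder than the commercial retrieval systems mentioned earlier. The class name, the metadata field names, and the identifiers in the usage note are assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class CentralIndex:
    """Inverted index from words in the metadata to item identifiers."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, identifier: str, metadata: Dict[str, str]) -> None:
        """Index every word in every metadata field of one item."""
        for value in metadata.values():
            for word in tokenize(value):
                self._postings[word].add(identifier)

    def search(self, query: str) -> List[str]:
        """Return identifiers of items whose metadata has all query words."""
        words = tokenize(query)
        if not words:
            return []
        # .get() avoids creating empty postings for unknown words.
        matches = set.intersection(
            *(self._postings.get(word, set()) for word in words)
        )
        return sorted(matches)
```

For example, after adding one item titled "Introductory Linear Algebra" and another titled "Linear Circuits Lab", a search for "linear" would return both identifiers and a search for "linear algebra" only the first. Human cataloguing and quality control would then correct or enrich the metadata before it is indexed.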

Policy Questions

Economic and Licensing Issues

Some materials in the library will be openly available. Others will be commercial products. Most core educational materials are created commercially as business ventures. The budget for each new edition of a major textbook approaches a million dollars; publishers of research papers are large and profitable; software packages and multimedia materials are equally expensive to produce.

Copyright has been used as the mechanism by which materials are controlled. At present there is intensive debate about the form that copyright should take in digital libraries. One opinion (which I share) is that this is fundamentally an economic debate. Whatever legal framework develops will enable the owners of educational materials to control their use, set terms and conditions, and price them as the market will bear.

Materials are paid for in three different ways. The first is by the student directly, by purchasing books, photocopies, computer software, lab fees, etc. The second is by the educational establishment, through its library, computing, and media budgets. The third is by the producer of the materials, such as by creating Web sites.

  1. Controls on access to materials. We are currently seeing a change in the balance between these three methods of payment, particularly with the growth of open-access publication of scientific research and other resources over the Internet. Thus we can expect that large amounts of good material will not require payment, but the library must be built around a framework that permits control of access to materials if required by the owner. (Currently the tools to do this are rather limited, except where a university or college has installed a comprehensive authentication system, such as Kerberos. Because progress has been disappointingly slow, systems have to rely on crude authentication, such as IP address or ID and password.)

  2. Controls on accuracy. The principal reason that authors and publishers wish to control educational materials is the desire to make money. A secondary reason is the wish to control the content, in particular to ensure accurate representation of the ideas and concepts, with appropriate attribution. One approach to this issue is to register each item as it is added to the collection with a unique identifier and a digital signature, which can be used to verify that an item has not changed. (This technology is becoming widely available from several sources.)
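
The two kinds of control described above can be sketched briefly in Python. The sketch makes several simplifying assumptions. A plain SHA-256 digest stands in for a true digital signature; it detects that an item has changed but, unlike a signature made with a private key, does not prove who registered it. The access check is the crude IP-address test mentioned above, not a comprehensive authentication system. The function names and the network ranges in the usage note are hypothetical.

```python
import hashlib
import ipaddress
from typing import List


def compute_digest(content: bytes) -> str:
    """Return the SHA-256 digest of an item as a hexadecimal string."""
    return hashlib.sha256(content).hexdigest()


def is_unchanged(content: bytes, registered_digest: str) -> bool:
    """Check an item against the digest recorded when it was registered."""
    return compute_digest(content) == registered_digest


def address_is_permitted(client_address: str, networks: List[str]) -> bool:
    """Crude access control: is the client inside a licensed network?"""
    address = ipaddress.ip_address(client_address)
    return any(address in ipaddress.ip_network(net) for net in networks)
```

A server might call address_is_permitted("192.0.2.17", ["192.0.2.0/24"]) before delivering a licensed item, and a reader's software might call is_unchanged() on the bytes it receives, using the digest held in the registry. The address range shown is reserved for documentation and is not a real campus network.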

Good Science versus Bad Science

A tough policy decision is how much the library will be an arbiter of good science. The library must anticipate pressures from those whose political, economic, or religious agendas are antagonistic to good science and good education. With considerable reluctance, I suggest that, from the start, the library will need an editorial board of scientists committed to defending the library from these pressures. For example, unsolicited annotations are highly desirable, but the library must be prepared to exercise editorial control if necessary. The aim is to balance openness to new or controversial ideas against the need to weed out the cranks and the bigots.

Conclusion

To build a large-scale distributed library for undergraduate science education is technically difficult. It faces no fundamental barriers, but to do it well requires a skilled and motivated team. It is vitally important that this team be driven by the wish to build a practical, high-quality service for education.

Even more importantly, the creators of the library must focus on the underlying challenge: how to have a major impact on science education. The challenge is to create a framework that will allow the teaching faculty flexibility to use the library in ways that were not envisaged by its creators. In this manner, it can indeed become the premier focus of materials for undergraduate science education.


William Y. Arms
July 28, 1997
Revised: August 19, 1997

wya@cs.cornell.edu