CS5152 - Open-Source Software Engineering

Overview

Each student will work in a team on an established code base from an active open-source project, under the guidance of an industry mentor from that project.

Teams: Teams and projects will be decided before the semester begins. Unlike previous years, teams will be made up solely of Cornell students.

Kickoff Hackathon: The Kickoff Hackathon, sponsored by Facebook, will kick off the projects by putting students in face-to-face contact with their project mentors from industry. All students are required to attend. The Kickoff Hackathon will be the weekend of February 11-12.

Projects: Students will rank the available projects in order of interest, and a matching process will be run to determine which project each student is assigned to. Team rosters will be settled prior to the Kickoff Hackathon.

Grading: 75% of a student's grade will be determined by contribution to the code base. The industry mentor will provide the majority of this evaluation. The remaining 25% of a student's grade will be determined by class participation and deliverables. There will be no final exam, but there will be final deliverables, such as short papers and presentations, to be completed at the scheduled final time.

Weekly Meetings: Every week, students will video conference with their entire team, including their mentor, for 30 minutes, ideally during lab hours (some teams may need to meet outside of class due to external constraints). These meetings should quickly review what has been done, what problems people are hitting, and what should be accomplished in the next week. To keep meeting times flexible enough to accommodate others on the projects, students must be able to attend all lab times, at least until a meeting time has been decided. Every student also needs to meet briefly with the professor to explain what they are working on and to confirm that they are on track.

Lecture: W 11:15-12:05 in Olin 245
Lab: MF 11:15-12:05 in Olin 245
Units: 4

Application Process - CLOSED

By November 19th, students must e-mail the instructor a PDF containing their name, Net ID, major, degree program (e.g. B.S.), expected graduation date, and a short description (less than one page) of their qualifications and motivations for the class. This should include prior programming experience from classes, internships, or programming projects. Students newly admitted for Spring 2017 should send an application as soon as possible. You may provide a Cornell faculty member and/or a mentor from a past internship as a reference, but it is not necessary. Senior and M.Eng. Computer Science students will be given priority, and others will be admitted if space permits.

You will not be able to enroll in this class directly, since enrollment is contingent on being admitted. Please enroll in classes as if this class were unavailable, since it is easier to drop classes than to add them, and once you are admitted into CS 5152 your enrollment is guaranteed.

Instructor: Ross Tate
Office: 434 Gates Hall
Office Hours: by appointment

Projects

CFSSL

Mentors: Zi Lin and Kyle Isom

Team Size: 3-4

Summary: OCSP Service using CFSSL

Description: OCSP is an important component in the full Certificate Authority (CA) stack, combining SQL, HTTP, and cryptography. We are building a standard OCSP service that incorporates Certificate Transparency (CT). CT is one of the future industry standards that promotes public CA auditing. CT fights bad CAs that issue certificates in violation of baseline requirements, for example by issuing certificates with weak cryptography or by failing to perform acceptable authentication/verification. Nowadays CAs are required to run an OCSP service. The Certificate Transparency ecosystem includes an OCSP service that staples signed certificate timestamps (SCTs). Doing so requires a CA to submit every certificate it issues to a public log. A very informative design doc about CT and OCSP can be found at https://www.certificate-transparency.org/how-ct-works.

Skills: You will learn about public key infrastructure, cryptography engineering, and systems engineering with Go.

GeoMesa

Mentor: Jim Hughes

Team Size: 4-5

Summary: GeoMesa for Cassandra

Description: GeoMesa is an open-source, distributed, spatio-temporal database built on a number of distributed cloud data storage systems, including Accumulo, HBase, Cassandra, and Kafka. Leveraging a highly parallelized indexing strategy, GeoMesa aims to provide as much of the spatial querying and data manipulation to Accumulo as PostGIS does to Postgres.

Recently, there has been some activity with GeoMesa's HBase and Cassandra support. The goal of this project would be to explore porting distributed computation available in Accumulo's iterator stack to HBase's filters or co-processors or Cassandra's user-defined functions. As a concrete example, GeoMesa's Accumulo support can accelerate the creation of heatmaps and histograms.
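The idea of pushing a heatmap computation to where the data lives can be sketched in a few lines. This is only an illustration of the general pattern (each partition bins its own points, and only the small per-cell counts are merged), not GeoMesa's implementation; the function names, grid parameters, and sample points are all hypothetical.

```python
from collections import Counter


def bin_points(points, min_x, min_y, cell_size):
    """Count points per grid cell for one partition of the data.

    In a distributed store this step would run next to the data
    (hypothetically, inside an iterator, co-processor, or UDF).
    """
    counts = Counter()
    for x, y in points:
        col = int((x - min_x) // cell_size)
        row = int((y - min_y) // cell_size)
        counts[(col, row)] += 1
    return counts


def merge_counts(partials):
    """Combine per-partition cell counts into one heatmap on the client."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Two hypothetical partitions of point data.
partition_a = [(0.5, 0.5), (0.7, 0.2), (1.5, 0.5)]
partition_b = [(0.1, 0.9), (1.9, 1.1)]

heatmap = merge_counts(
    bin_points(p, 0.0, 0.0, 1.0) for p in (partition_a, partition_b)
)
```

Only the cell counts cross the network in this sketch, which is why server-side aggregation can be much cheaper than returning every matching feature.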

Skills: Basic Java would be expected; the code is in Scala, and that can be learned during the semester. Software development (building with Maven and using Git for version control) and an understanding of distributed computing (MapReduce paradigm and distributed filesystems/databases) are nice-to-haves; participants will have a better understanding of those techniques and technologies at the end of the semester.

GeoWave-Cassandra

Mentor: Rich Fecher

Team Size: 3-5

Summary: GeoWave for Cassandra

Description: GeoWave is a software library aimed at connecting geospatial software with distributed computing frameworks. GeoWave leverages the scalability of a distributed key-value store for effective storage, retrieval, and analysis of massive geospatial datasets. While the core toolkit is generally applicable to multi-dimensional use cases, GeoWave has focused on tailored extensions to support spatial types and operators, with or without temporal timestamps or time ranges. Additionally, it provides advanced features to leverage a distributed backend for visualization or analysis. The software is intended to be easily pluggable into any sorted key-value store, and its modular design is intended to enable feature extension into various geospatial toolkits.

Apache Cassandra is a popular open-source distributed data store. It is a hybrid of a key-value store and a column-oriented store, and it can benefit from GeoWave's ability to index keys within a multi-dimensional space, which is the underpinning of geospatial or spatio-temporal indexing. This effort will focus on creating a data store for Apache Cassandra similar to GeoWave's existing Accumulo, HBase, and BigTable data store extensions.
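One common way to index multi-dimensional data in a sorted key-value store is to map each point to a single integer along a space-filling curve. The sketch below uses Z-order (bit interleaving) purely as an illustration; it is an assumption for teaching purposes and not necessarily the curve or encoding GeoWave uses. All function names are hypothetical.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key.

    Bit i of x goes to position 2*i, bit i of y to position 2*i + 1.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def quantize(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Map a value in [lo, hi] to an integer cell in [0, 2**bits - 1]."""
    cells = (1 << bits) - 1
    return int((value - lo) / (hi - lo) * cells)


def z_key(lon: float, lat: float, bits: int = 16) -> int:
    """Build a sortable row key from a longitude/latitude pair."""
    return interleave_bits(
        quantize(lon, -180.0, 180.0, bits),
        quantize(lat, -90.0, 90.0, bits),
        bits,
    )
```

Because nearby points tend to share key prefixes, a bounding-box query can be decomposed into a handful of key-range scans over the sorted store.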

Skills: GeoWave is written in Java; it uses Maven for build and dependency management and Git for version control. Familiarity with these technologies will be helpful. In working on the project, students will gain an understanding of distributed computing and geographic information systems (GIS), as well as of the processes for contributing to open source.

GeoWave-Rest

Mentor: Michael Whitby

Team Size: 4-5

Summary: GeoWave REST Endpoint Refactor/UI Creation

Description: We are looking to develop a user interface that will allow people who may be uncomfortable using the command line to use the GeoWave framework. A significant precursor to this will be to dynamically create and expose REST endpoints for the robust CLI operations currently available in GeoWave. Authorization concerns will need to be addressed, and file uploading for ingest may need to be revised. In addition to the desired background for the overall GeoWave project, students should also have experience with HTML and JavaScript.

PredictionIO-Java

Mentor: Donald Szeto

Team Size: 7-8

Summary: Universal Recommender Engine for Java

Description: Apache PredictionIO is an open-source machine learning server that lets developers build and deploy predictive engines as web services in production in a fraction of the time. It comes with many ready-to-use engine templates; for example, the Universal Recommender is a very popular engine used by hundreds of companies around the world. The original Universal Recommender is written in Scala. While Scala is a great programming language, many users want us to provide a Java engine template instead.

In this project, we would like to re-create the Universal Recommender engine template in Java. An example of a Java-based engine can be found here.

Skills: You will learn Scala and some machine learning concepts for personalization by studying the existing Universal Recommender and the core PredictionIO codebase in depth. You will also use your Java skills to re-create and distribute an open-source recommender engine that will potentially be used by thousands of companies. You will also become familiar with Apache Spark, as PredictionIO is built on top of it.

PredictionIO-Koober

Mentor: James Ward

Team Size: 7-8

Summary: PredictionIO for Koober

Description: The Koober project is a data pipeline example that brings together many pieces of a modern system in the context of a taxi-like transportation service. The example includes a web user interface, data streaming platform, data analytics, and data persistence but currently lacks a machine learning component. The goal of this project will be to add PredictionIO to the example app to predict surge events based on past wait times.

Pyret

Mentors: Joe Gibbs Politz, Ben Lerner, and Shriram Krishnamurthi

Team Size: 3-4

Summary: Teacher Dashboard for Pyret

Description: code.pyret.org uses Google Drive files to store and share student programs. Currently, it has no special support for teachers to organize their assignments and classes, and no notion of a student belonging to a class. Teachers simply use Google Drive features and plugins if they are savvy Drive users, or have students submit "Publish" links of copies of their code. In addition, this means that on the development end, we have little idea which users belong to which classes, which makes feedback from different sources difficult to interpret.

code.pyret.org is interesting architecturally in that it eschews as much as possible in the way of a backend database. In this case as well, we don't want code.pyret.org to become a central repository for class information. So one constraint on the project is that it should store whatever information it needs entirely within the Google Drive accounts of its users. This could mean that a teacher creates a directory per-course, and points the dashboard at that directory, which is then filled with configuration files that code.pyret.org can use to help distribute assignments, for example.

At minimum, the teacher dashboard should support:

Creating a new course
Adding a list of student accounts to a course
Creating a new assignment within a course, containing potentially several starter files
Distributing the assignment to students
When student activity is logged by the IDE, it should track enough information that we can figure out which class that student is in (maybe not identifying the actual teacher or course, but using an opaque class id)
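As a rough sketch of what a per-course configuration file stored in a teacher's Drive directory might contain, the example below models a course, its student accounts, its assignments, and an opaque class id for logging. Every field name and the JSON layout are hypothetical; the project description does not specify a format, and the actual dashboard would be written in JavaScript rather than Python.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Assignment:
    name: str
    starter_files: list  # hypothetical: file names of starter programs


@dataclass
class Course:
    title: str
    students: list = field(default_factory=list)
    assignments: list = field(default_factory=list)
    # Opaque identifier: random, so it reveals nothing about the
    # teacher or course when it appears in activity logs.
    class_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_config(self) -> str:
        """Serialize the course to JSON, e.g. for a file in the course directory."""
        return json.dumps(asdict(self), indent=2)

    def log_tag(self) -> dict:
        """The only course information attached to logged student activity."""
        return {"class_id": self.class_id}


course = Course(title="Intro to Programming")
course.students.append("student@example.com")
course.assignments.append(Assignment("hw1", ["hw1-starter.arr"]))
config_text = course.to_config()
```

Keeping the configuration as a plain file in the teacher's own Drive account fits the constraint that code.pyret.org should not become a central repository for class information.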

Nice-to-haves are:

Associating student work with grades
Hooking up student work with another service for automatically grading it
Authoring and distributing feedback to students (perhaps integrating with services like GradeScope)
Previews or easy ways to see student progress – for example, the teacher might own all the files the students are working on, and the dashboard could easily track that.

Skills: You will learn about

Using the Google Drive APIs (code.pyret.org uses both https://developers.google.com/drive/v2/reference/ and https://developers.google.com/drive/v3/reference/)
Using JavaScript in the browser
Using JavaScript in NodeJS
Cross-origin security concerns when programming in a browser
Using Selenium or other automation tools to test the interface

You might learn about

Using the Facebook libraries React and Redux
The Pyret programming language and module system

CS 5152 (Spring 2017) - Open-Source Software Engineering