CS 5152 (Fall 2019) - Open-Source Software Engineering
Overview
CS-5152: Open-Source Software Engineering This class is about learning software engineering, especially as employed by the open-source community, through a hands-on experience with mentorship, guidance, and peers. Each student will work in a team on an established code base from an active open-source project using the guidance of an industry mentor from that project. This class is not about "open source" as an entity in and of itself, though; we might not cover aspects of open source like its history, philosophy, or legal complexities (such as licensing).
Teams Teams and projects will be decided before the semester begins. They are usually in the range of 4 to 8 students working together with one or two industry mentors.
Kickoff Hackathon (in Gates 114 and 122 from 10am to 5pm on September 14 and 15) The Kickoff Hackathon will kick off the projects by putting students in face-to-face contact with their project mentors from industry. All students are required to attend. The Kickoff Hackathon will be the weekend of September 14-15. The Kickoff Hackathon will not be an overnight endeavor; it will start after breakfast, include lunch, and end before dinner.
Projects Students will rank the available projects in order of interest, and a matching process will be run to determine which project a student is assigned to. So far, students have always been able to get on one of their top 3 choices. Team rosters will be settled prior to class enrollment.
Grading 75% of a student's grade will be determined by contribution to the code base. The industry mentor will provide the majority of this evaluation. The remaining 25% of a student's grade will be determined by class participation and deliverables. There will be no final exam, but there will be final deliverables such as short papers and presentations to be done at the scheduled final time.
Weekly Meetings Every week students will video conference with their entire team, including their mentor, for 30 minutes, ideally during lab hours (some teams may need to meet outside of class due to external constraints). These meetings should quickly review what has been done, what problems people are hitting, and what should be accomplished in the next week. To keep meeting times flexible enough to accommodate the constraints of others on the projects, students must be able to attend all lab times, at least until a meeting time has been decided. In addition, each week every student needs to meet briefly with the professor to explain what they are working on and to check that they are staying on track.
Lecture W 11:15-12:05 in TBD
Lab MF 11:15-12:05 in TBD
Units 4
Final 2pm-4:30pm on Monday, December 16th in Hollister 110
Instructor Ross Tate
Office 434 Gates Hall
Office Hours by appointment
Announcements
- No lecture on Sep 9.
- Lab on Sep 16 is optional.
- No lecture on Sep 25, and labs on Sep 23 and Sep 27 are optional.
Resources
- Git and GitHub Walkthrough for Common Open-Source Workflows — e-mail me your GitHub username to get access to the playground repo.
Application (Closed)
You must apply and be admitted into the class in order to enroll because slots are limited and the class requires an atypical amount of upfront commitment. The application process is now closed.
Projects
GeoMesa
Mentor: Jim Hughes
Team Size: 6
Description: LocationTech GeoMesa is an open source suite of tools that enables large-scale geospatial querying and analytics on distributed computing systems. GeoMesa provides spatio-temporal indexing on top of the Accumulo, HBase, Google Bigtable and Cassandra databases for massive storage of point, line, and polygon data. Through GeoServer, GeoMesa facilitates integration with a wide range of existing mapping clients over standard OGC (Open Geospatial Consortium) APIs and protocols such as WFS and WMS. GeoMesa supports Apache Spark for custom distributed geospatial analytics.
For each technology we work with, we will first learn what it does, how it is built, and how it is tested. As we perform the upgrade, we will manage dependencies, verify our work, and look for new features which can be implemented.
Skills: You will learn about:
- Building a project containing Java and Scala with Maven
- The open-source geospatial software ecosystem (JTS, GeoTools, GeoServer)
- Several Apache 'big-data' projects
- A little bit about how intellectual property is reviewed and managed by the Eclipse Foundation
- How the Java Virtual Machine manages the classpath
GeoWave Admin Console
Mentor: JP Prochazka
Team Size: 5
Description: GeoWave is a software library that connects the scalability of distributed computing frameworks and key-value stores with modern geospatial software to store, retrieve and analyze massive geospatial datasets. While the core toolkit is generally applicable to multi-dimensional use cases, GeoWave has focused on tailored extensions to support spatial types and operators, with or without temporal timestamps or time ranges. Additionally, it provides advanced features to leverage a distributed backend for visualization or analysis. The software is intended to be easily pluggable into any sorted key-value store, and its modular design is intended to enable feature extension into various geospatial toolkits.
The basis for this project is to design and build an administrative web console where a user can exercise GeoWave's services. The end goal is to lower the barrier to entry: a less-technical user should be able to go end-to-end with at least a basic GeoWave use case without touching a terminal or needing command-line access. It will take GeoWave from being purely headless to having a concrete, generally consumable front-end.
Students will learn full stack development, with a key-value store database, database services, an application service tier, and a web front-end. They will also learn about geospatial applications. Languages will include Java and JavaScript.
GeoWave-FoundationDB
Mentor: Rich Fecher
Team Size: 5
Description: GeoWave is a software library that connects the scalability of distributed computing frameworks and key-value stores with modern geospatial software to store, retrieve and analyze massive geospatial datasets. While the core toolkit is generally applicable to multi-dimensional use cases, GeoWave has focused on tailored extensions to support spatial types and operators, with or without temporal timestamps or time ranges. Additionally, it provides advanced features to leverage a distributed backend for visualization or analysis. The software is intended to be easily pluggable into any sorted key-value store, and its modular design is intended to enable feature extension into various geospatial toolkits.
FoundationDB was open sourced by Apple in April 2018. It is a distributed database designed to handle large volumes of structured data across clusters of commodity servers. It organizes data as an ordered key-value store and employs ACID transactions for all operations. FoundationDB can benefit from GeoWave's ability to index multi-dimensional datasets within ordered key-value stores, which is the underpinning of geospatial or spatio-temporal database support. This effort will focus on creating a data store extension within GeoWave for FoundationDB similar to GeoWave's existing data store extensions.
Students will learn distributed system design particularly related to data storage and retrieval concerns. They will also learn about geospatial applications. The language for this project will be Java.
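As a rough illustration of how multi-dimensional data can be indexed within an ordered key-value store, the Python sketch below maps a longitude/latitude pair to a single sortable key using a Z-order (Morton) curve. This is one common space-filling-curve technique, chosen here for simplicity; it is not taken from GeoWave's code, and the function names, the 16-bit resolution, and the key layout are all assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y (x in even positions, y in odd)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def quantize(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Map a value in [lo, hi] onto an integer grid cell in [0, 2**bits - 1]."""
    cells = (1 << bits) - 1
    clamped = min(max(value, lo), hi)
    return int((clamped - lo) / (hi - lo) * cells)


def z_order_key(lon: float, lat: float, bits: int = 16) -> bytes:
    """Hypothetical row key: big-endian bytes so byte order matches numeric order."""
    x = quantize(lon, -180.0, 180.0, bits)
    y = quantize(lat, -90.0, 90.0, bits)
    return interleave_bits(x, y, bits).to_bytes((2 * bits + 7) // 8, "big")


if __name__ == "__main__":
    # Points that are close on the map tend to share a long key prefix,
    # so a range scan over sorted keys touches nearby points together.
    for name, lon, lat in [("Ithaca", -76.50, 42.44), ("Syracuse", -76.15, 43.05)]:
        print(name, z_order_key(lon, lat).hex())
```

Because the keys are fixed-width and big-endian, sorting them as raw bytes is the same as sorting by the interleaved integer, which is the property an ordered key-value store relies on for range scans.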
Hummingbot
Mentor: Martin Kou
Team Size: 4
Summary: Contribute to Hummingbot, the open source algo trading software for crypto.
Description: Hummingbot is an open source project that helps you build and run high-frequency algorithmic trading bots. Previously, algorithmic trading strategies were available only to quantitative hedge funds and trading firms, but because cryptocurrency exchanges have open APIs and free data feeds, Hummingbot can make algo trading available to everyone.
In this project, you will extend Hummingbot to support "execution algorithms": trading strategies that minimize price slippage by transacting over time, rather than all at once. During the course of the project, you will learn how to write production-quality Python code, utilize Cython, interface with WebSocket-based network connections, and write good tests/documentation. In addition, students will gain familiarity with basic high-frequency trading strategies and trading-related terminology.
Skills: Python, data engineering, systems programming.
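To make the idea of an "execution algorithm" concrete, here is a minimal Python sketch of a time-weighted (TWAP-style) schedule, which splits one large order into equal slices placed at fixed intervals instead of transacting all at once. It is only an illustration of the general technique: `twap_schedule` and its parameters are hypothetical names, quantities are expressed in whole units of the smallest tradable amount to avoid rounding issues, and nothing here uses or reflects Hummingbot's actual API.

```python
from typing import List, Tuple


def twap_schedule(total_units: int, num_slices: int,
                  interval_seconds: int) -> List[Tuple[int, int]]:
    """Return (delay_in_seconds, units) pairs that spread an order over time.

    Any remainder from the division is handed out one unit at a time to the
    earliest slices, so the slice sizes always add up to total_units.
    """
    if total_units < 0 or num_slices <= 0 or interval_seconds < 0:
        raise ValueError("total_units and interval must be >= 0, num_slices > 0")
    base, remainder = divmod(total_units, num_slices)
    schedule = []
    for i in range(num_slices):
        units = base + (1 if i < remainder else 0)
        schedule.append((i * interval_seconds, units))
    return schedule


if __name__ == "__main__":
    # Split an order of 10 units into 4 child orders, one per minute.
    for delay, units in twap_schedule(10, 4, 60):
        print(f"t+{delay:>3}s: trade {units} units")
```

A real strategy would also have to react to fills, price limits, and exchange errors; the sketch only shows the scheduling step.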
MIT App Inventor
Mentors: Evan W. Patton and Susan Rati Lane
Team Size: 4-6
Summary: Component and performance improvements for MIT App Inventor
Overview: MIT App Inventor is a free, worldwide platform for anyone to build their own app. It has been used by over 8 million people across every country on Earth and is available in 13 different languages. While its primary focus is on computer science education and social good, the project is used by people to build their own Android (and soon iOS) apps. App Inventor is used as part of Mobile CSP, a CollegeBoard approved AP Computer Science Principles course, and Technovation, a worldwide competition to encourage girls to become entrepreneurs and solve social problems through mobile apps.
Description: For this project, students will start by working with the MIT App Inventor development team to address performance and bug issues in App Inventor. After becoming familiar with the code base, students will have the opportunity to implement new features for the App Inventor platform to address requirements for its educational mission. Some potential projects include adding new arrangement types such as tab bar and page view layouts, implementing organization tools in the project editor, and implementing a debugger tool. Students interested in improving the web interface for building apps should have proficiency in JavaScript, CSS, and HTML. Students interested in implementing features for the Android code should have proficiency in Java.
Repository: https://github.com/mit-cml/appinventor-sources
ParlAI
Mentor: Kurt Shuster
Team Size: 4
Summary: Support improvements to ParlAI, a platform developed by researchers at Facebook AI Research (FAIR). Possible work includes expanding the testing framework, adding more features to the annotation platform, adding support for more datasets, improving the logging system, reproducing key results from related papers, and supporting additional model types through the core modeling class.
Description: ParlAI (pronounced "par-lay") is a framework for dialogue AI research, implemented in Python. Its goal is to provide researchers:
- a unified framework for sharing, training and testing dialogue models
- many popular datasets available all in one place -- from open-domain chitchat to visual question answering
- a wide set of reference models -- from retrieval baselines to Transformers
- seamless integration of Amazon Mechanical Turk for data collection and human evaluation
- integration with Facebook Messenger to connect agents with humans in a chat interface
Many tasks are supported, including popular datasets such as SQuAD, bAbI tasks, MS MARCO, MCTest, WikiQA, WebQuestions, SimpleQuestions, WikiMovies, QACNN & QADailyMail, CBT, BookTest, bAbI Dialogue tasks, Ubuntu Dialogue, OpenSubtitles, Cornell Movie, VQA-COCO2014, VisDial and CLEVR. See here for the current complete task list. Included are examples of training neural models with PyTorch, with batch training on GPU or hogwild training on CPUs. Using Tensorflow or other frameworks instead is also straightforward. Our aim is for the number of tasks and agents that train on them to grow in a community-based way. ParlAI is described in the following paper: ParlAI: A Dialog Research Software Platform. See the news page for the latest additions & updates, and the website http://parl.ai for further docs.
Skills: Students will learn/gain experience in some or all of the following areas:
- Developing for a large, open-sourced research platform, including...
- Designing and implementing features to enhance the platform, whether from an efficiency, speed, or code legibility standpoint
- Designing and implementing unit/integration testing
- Reporting and fixing public issues, including providing adequate documentation, following up with filers, and testing fixes adequately.
- Writing and updating documentation to reflect changes to the platform or systems.
- Artificial Intelligence (AI) Research in Dialogue, including experience with PyTorch, a leading platform for AI development
- Experience with connecting to various external platforms, including Amazon's Mechanical Turk, Facebook Messenger, etc.
- Coding - mostly in Python (and some JavaScript) - held to a standard required for a large open-source project.
Pyret
Mentor: Joe Politz
Team Size: 4
Summary: A Sound Library for Pyret
Description: Pyret is a programming language designed for education. Its online programming environment code.pyret.org provides built-in support for image and animation-based media, which are a large component of several introductory curricula that use it. These features are supported by an implementation atop built-in browser APIs like canvas and the browser's event system, which are then exposed to Pyret through its foreign-function interface. This implementation strategy makes them usable and available on any machine with a stock web browser, through the designed-for-beginners view of the Pyret language (and this last step is sometimes nontrivial, see [1,2]).
A frequently-requested feature is the addition of sound and music primitives to the language. A sound library would be a terrific addition to the set of media available for students to use - there are existing curricula that explore sound that could be adapted to Pyret, student games could be accompanied by music and sound effects, and it provides an experience that can be especially impactful for visually-impaired students. Web standards have also reached the point where there are robust ways to reliably play sounds and music across browsers.
This project has several parts.
A minimum viable product would allow the Pyret team to explore different options for sound libraries atop basic infrastructure:
- Expose browser-level sound primitives to Pyret programs through its foreign-function interface
- Add support to code.pyret.org to play and explore sounds at the interactive REPL, in analogy to the ability to render images interactively
A full project would involve implementing and designing more:
- Design a student-facing library for creating sounds, reading sounds from files, and manipulating sounds. Since Pyret and its introductory curricula encourage a functional-first approach, the library should be written in that style (the existing image library may provide some inspiration)
- Add support for saving/downloading a student-created sound so students can export their work from the IDE.
- As a proof-of-concept for integrating with a curriculum, port and/or adapt assignments from the Media Computation curriculum to versions that use this student-facing sound library.
There are some other extensions:
- Augment the animation/interaction toolkit provided by Reactors to support background music
- Augment Reactors to support sound effects for games in response to events
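The functional-first, student-facing sound library described above could have roughly the following shape. The sketch is written in Python rather than Pyret, and every name in it (`tone`, `concat`, `overlay`, `scale`, `SAMPLE_RATE`) is hypothetical; it is meant only to show the style in which sounds are immutable values and every operation returns a new sound, not to propose the actual design.

```python
import math
from typing import Tuple

Sound = Tuple[float, ...]   # immutable sequence of samples in [-1.0, 1.0]
SAMPLE_RATE = 8000          # samples per second; an arbitrary choice for the sketch


def tone(frequency_hz: float, duration_s: float,
         sample_rate: int = SAMPLE_RATE) -> Sound:
    """Create a sine-wave sound of the given pitch and length."""
    count = int(duration_s * sample_rate)
    step = 2 * math.pi * frequency_hz / sample_rate
    return tuple(math.sin(step * i) for i in range(count))


def concat(first: Sound, second: Sound) -> Sound:
    """Play one sound after another."""
    return first + second


def overlay(first: Sound, second: Sound) -> Sound:
    """Mix two sounds; the shorter one is padded with silence."""
    length = max(len(first), len(second))
    a = first + (0.0,) * (length - len(first))
    b = second + (0.0,) * (length - len(second))
    return tuple((x + y) / 2 for x, y in zip(a, b))


def scale(sound: Sound, factor: float) -> Sound:
    """Change the volume without modifying the original sound."""
    return tuple(sample * factor for sample in sound)


if __name__ == "__main__":
    a4 = tone(440, 0.5)
    chord = overlay(a4, tone(660, 0.5))
    melody = concat(scale(a4, 0.5), chord)
    print(len(melody), "samples")
```

Since no function mutates its input, students can combine sounds the same way they combine images: by composing expressions.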
Wireshark
Mentor: Peter Wu
Team Size: 4-5
Summary: Add decryption support to the SSH dissector.
Description: Wireshark is an open-source network protocol analyzer. It is used in education to provide a visual and practical understanding of networking concepts, and in industry for network-related troubleshooting and to facilitate development of new products and standards.
As more and more network traffic is encrypted, decryption is required to give users a full understanding of application behavior. Wireshark is able to decrypt several protocols, including TLS, QUIC, and WireGuard. Decryption does require additional key material to be extracted from applications; see the previously linked protocol pages for examples.
The Secure Shell (SSH) protocol is commonly used for managing remote systems, ranging from a Raspberry Pi to a global fleet of servers. OpenSSH is the most popular implementation, while Dropbear is a smaller implementation that is found in some routers.
Wireshark has basic support for dissection of SSH protocol messages. However, most of the interesting details (commands, input/output, file transfers) are present in the encrypted fields. Tasks:
- Study the SSH specifications (see the references at SSH) with a focus on understanding message flow, the key exchange, and decryption.
- Build a mechanism to extract session secrets from OpenSSH, Dropbear, or your favorite SSH application into a text file.
- Modify the SSH dissector (epan/dissectors/packet-ssh.c) to support decryption using these secrets.
- Add support for dissection of the decrypted message structures.
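As a starting point for studying the message flow, the Python sketch below frames and parses a single unencrypted SSH binary packet following the layout in RFC 4253, section 6: a 4-byte `packet_length`, a 1-byte `padding_length`, the payload (whose first byte is the message number), and the padding. It only covers the plaintext phase before keys are in effect, ignores the MAC, and is unrelated to Wireshark's C dissector; the helper names are hypothetical, and zero bytes are used for padding purely to keep the example deterministic (the RFC calls for random padding).

```python
import struct
from typing import Tuple

SSH_MSG_KEXINIT = 20   # message numbers defined in RFC 4253
SSH_MSG_NEWKEYS = 21


def build_plaintext_packet(payload: bytes, block_size: int = 8) -> bytes:
    """Frame a payload: total length is a multiple of block_size, padding >= 4."""
    padding_length = block_size - ((5 + len(payload)) % block_size)
    if padding_length < 4:
        padding_length += block_size
    packet_length = 1 + len(payload) + padding_length
    header = struct.pack(">IB", packet_length, padding_length)
    return header + payload + bytes(padding_length)  # zero padding: sketch only


def parse_plaintext_packet(data: bytes) -> Tuple[int, bytes, bytes]:
    """Return (message_number, message_body, remaining_bytes) for one packet."""
    if len(data) < 5:
        raise ValueError("need at least 5 header bytes")
    packet_length, padding_length = struct.unpack(">IB", data[:5])
    end = 4 + packet_length            # packet_length excludes its own 4 bytes
    payload_length = packet_length - padding_length - 1
    if payload_length < 1 or len(data) < end:
        raise ValueError("truncated or malformed packet")
    payload = data[5:5 + payload_length]
    return payload[0], payload[1:], data[end:]


if __name__ == "__main__":
    packet = build_plaintext_packet(bytes([SSH_MSG_NEWKEYS]))
    number, body, rest = parse_plaintext_packet(packet)
    print(number, body, rest)
```

Once encryption is in effect, the same framing applies to the decrypted bytes, which is why the session secrets mentioned in the tasks above are needed before the payload can be dissected.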
If time permits, one could add even more functionality:
- Investigate the feasibility of implementing "Follow Stream" for individual SSH channels, similar to "Follow TCP". This could be used to see the command input/output from an interactive session. This also affects the GUI.
- Work with other projects (libssh?) to upstream changes that make it easier to dump the secrets.
- Make it possible to embed these secrets in a pcapng capture file using the Decryption Secrets Block (example for TLS).
- Add support for various key exchange algorithms and ciphers supported in SSH.
Skills: You will learn about:
- Practical cryptography engineering.
- Development of a cross-platform application in C. Ideally on Linux or macOS, but Windows also works.
- (Optional) GUI development in Qt (C++).
- Reading internet standards efficiently.
- Extending or developing new protocol dissectors in Wireshark.
- Development workflow using Git and Gerrit Code Review.
- Continuous integration testing on multiple platforms (Travis CI, Buildbot, AppVeyor, Gitlab) and operating systems.
- (Optional) Use of tools that support development, including a debugger (GDB) and sanitizers (ASAN, UBSAN).