Goal: Understanding barriers and opportunities in scalable distributed data filtering.
Problem statement: We are a team creating a scalable photo analysis system for a smart highway.
- Input: A high-rate stream of photos captured from the outside world. Assume that each camera waits for motion and then takes photos continuously, one per second, pausing only when the scene becomes motionless again.
- Output: A single high-quality photo for each (vehicle, driver) pair observed on the highway, plus records of where the pair first entered the highway, where it left, and its interval-by-interval average speeds. (Such a system raises huge privacy, data-security, and data-use questions, and could be used in positive or negative ways. We’re interested only in positive uses, with strong privacy and security!)
- Constraints: Our system will cost money to operate: we want to use the fewest computers and the least storage and other equipment consistent with solving the problem. Some applications down the road may also be speed-sensitive, so we additionally want the quickest possible classification of the photos.
- Fault Tolerance: We are worried about failures, so we want the entire solution to tolerate one computer crash, where what it means to “tolerate” a crash can be interpreted somewhat flexibly. Don’t blindly assume that a fault-tolerant system must replicate every bit that it captures; on the other hand, if some person drives recklessly for 30 minutes and causes serious injuries, but there turn out to be absolutely no images at all, that would violate our goals.
- Scale: A high-volume highway might average perhaps 25 cameras per mile, situated at known locations, and might be hundreds of miles long. Such a highway could carry perhaps 25,000 unique vehicles per mile per day, and vehicles might average perhaps 25 miles on the highway, although some would drive much less and some much farther. Some vehicles might be driven by different people at different times of day. (A rough data-rate sketch based on these numbers appears after the definition below.)
Definition: a high-quality photo is one that isn’t degraded by motion blur, poor focus, reflections, debris, etc. Given a set of photos of the same (vehicle, driver) pair, our colleagues who do machine learning would have ideas for how to recognize similarities and differences, how to pick the best one, and so on.
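To make the Scale numbers concrete, here is a minimal back-of-the-envelope sketch in Python. The per-photo size and the fraction of the day each camera actually sees motion are not given in the problem statement; they are labeled as assumptions in the code.

```python
# Back-of-the-envelope data rates for one mile of highway.
# Values marked ASSUMED are illustrative guesses, not givens.

CAMERAS_PER_MILE = 25               # given: ~25 cameras per mile
PHOTOS_PER_SEC_PER_CAMERA = 1       # given: one photo/second while motion lasts
MOTION_DUTY_CYCLE = 0.5             # ASSUMED: motion in view half of each day
PHOTO_BYTES = 2_000_000             # ASSUMED: ~2 MB per compressed photo
VEHICLES_PER_MILE_PER_DAY = 25_000  # given

SECONDS_PER_DAY = 24 * 3600

photos_per_day = (CAMERAS_PER_MILE * PHOTOS_PER_SEC_PER_CAMERA
                  * MOTION_DUTY_CYCLE * SECONDS_PER_DAY)
bytes_per_day = photos_per_day * PHOTO_BYTES

print(f"photos/day/mile:   {photos_per_day:,.0f}")           # ~1,080,000
print(f"raw data/day/mile: {bytes_per_day / 1e12:.2f} TB")   # ~2.16 TB
print(f"photos per vehicle on this mile: "
      f"{photos_per_day / VEHICLES_PER_MILE_PER_DAY:.1f}")   # ~43.2
```

Even with these rough assumptions the implication is clear: retaining every raw photo costs terabytes per mile per day, so aggressive filtering close to the cameras is nearly mandatory.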
Today’s in-class activity:
Come up with a concept for how a distributed image capture and filtering system might operate.
- Do a back-of-the-envelope calculation of the best performance you believe a modern computing system should be able to achieve (“cost in computer resources, per mile of highway”).
- Does your concept require special hardware? What are you assuming about the hardware (tasks, performance, etc.)?
- What will be the obstacles to actually building and operating such a system in “real systems”: real programming languages, real operating systems and file systems, existing image-processing software, etc.? Note: I am not assuming we know much about existing image-processing software; you can make a “best guess” about what “probably” exists! But keep in mind that many fancy image-processing tools are designed for offline use, in programs like Photoshop, and the options for online use, right in the data-capture path, might be much more limited. One way to guess: things my Canon camera can do probably already exist in a form that can run on the data path; things I do in Photoshop probably exist only in offline systems that run much more slowly. (A minimal sketch of one possible on-path filter follows this list.)
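To illustrate where that on-path/offline boundary might sit, here is a minimal sketch of a camera-node filter, assuming Python with OpenCV. The variance-of-the-Laplacian test is a standard, cheap blur heuristic (roughly the kind of check a camera already runs on its own data path); the threshold value, the handle_frame/camera_id names, and the spool-directory “collectors” are illustrative assumptions, not part of the problem statement.

```python
import time
from pathlib import Path

import cv2  # OpenCV; ASSUMED to be available on the camera node

SHARPNESS_THRESHOLD = 100.0  # ASSUMED: needs tuning per camera and lighting

# Two spool directories stand in for two distinct collector machines.
# Duplicating each *kept* photo is one way to tolerate a single crash
# without replicating every captured bit: blurry frames are discarded
# at the edge before they are ever stored or transmitted.
COLLECTORS = [Path("spool/collector-a"), Path("spool/collector-b")]

def sharpness(image_bgr) -> float:
    """Variance of the Laplacian: a cheap blur heuristic that can run
    on the data path. Low variance = few edges = blurred or defocused."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def handle_frame(camera_id: str, image_bgr) -> None:
    """Filter-then-forward: the on-path core of the pipeline sketch."""
    if sharpness(image_bgr) < SHARPNESS_THRESHOLD:
        return  # obviously bad frame: drop it at the edge
    ok, jpeg = cv2.imencode(".jpg", image_bgr)
    if not ok:
        return
    name = f"{camera_id}-{time.time_ns()}.jpg"
    for collector in COLLECTORS:
        collector.mkdir(parents=True, exist_ok=True)
        (collector / name).write_bytes(jpeg.tobytes())
```

The design point this makes concrete: the expensive work (grouping photos by (vehicle, driver) pair and picking the single best one) can happen later at the collectors, offline; the camera node only needs a fast, conservative “obviously bad” test.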