Lecture 28: Course Summary and Open Questions¶

CS4787/5777 — Principles of Large-Scale Machine Learning Systems¶

Course Summary and Open Questions¶

Scaling machine learning methods is increasingly important.

In this course, we addressed the high-level question: What principles underlie the methods that allow us to scale machine learning?

To answer this question, we used techniques from three broad areas: statistics, optimization, and systems.

We articulated three broad principles, one in each area.

  • Statistics Principle: Make it easier to process a large dataset by processing a small random subsample instead.

  • Optimization Principle: Write your learning task as an optimization problem, and solve it via fast general algorithms that update the model iteratively.

  • Systems Principle: Use algorithms that fit your hardware, and use hardware that fits your algorithms.

Now, let's look at some open questions in scalable ML that relate to these principles.

Open Problem: Energy Use of AI¶

  • According to a 2024 report from Lawrence Berkeley National Laboratory, data centers consumed about 4.4% of total U.S. electricity in 2023 and are expected to consume approximately 6.7 to 12% of total U.S. electricity by 2028.

  • From that report:

Historically, data center electricity use increased substantially from 2000–2005, roughly doubling during that period. During the early and mid-2010s, a shift from on-premise data centers to colocation or cloud facilities helped enable efficiency improvements that allowed data center electricity use to remain nearly constant at a time when the data center industry grew significantly, with a large expansion of data center services. The efficiency strategies that allowed the industry to avoid increased energy needs during this period included improved cooling and power management, increased server utilization rates, increased computational efficiencies, and reduced server idle power.

While many of these efficiency strategies continue to provide significant energy efficiency improvements in data center design and operation, the expansion of data center services into areas that require new types of hardware has ended the era of generally flat data center energy use. Most notably, the rapid growth in accelerated servers has caused current total data center energy demand to more than double between 2017 and 2023, and continued growth in the use of accelerated servers for AI services could cause further substantial increases by the end of this decade. The current and possible near-future surge in energy demand highlights the need for future research to understand the early-stage, rapidly changing AI segment of the data center industry and identify new efficiency strategies to minimize the resource impacts of this growing and increasingly significant sector in our economy.

  • Personal conversations with chatbots only represent a small part of this total
    • It's not all just a matter of consumer overconsumption
  • Puts strain on the grid
    • There are opportunities for improved efficiency here, e.g. running our workloads when more power is available. Problem: GPUs are expensive and we want them running all the time... (a small scheduling sketch appears below)
  • Environmental impact
    • Mostly due to carbon emissions from the power
    • But there are issues here at all levels of the supply chain

We should think about the power grid as part of the "hardware" when we design algorithms that fit our hardware and hardware to fit our algorithms.
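
To make the scheduling idea from the list above concrete, here is a minimal sketch of a carbon-aware job launcher. It is a sketch under stated assumptions, not a production scheduler: the carbon-intensity callable is a stand-in for a real query to the grid operator or a carbon-intensity service, and the threshold and polling interval are made-up numbers.

```python
import time

def run_when_grid_is_clean(train_job, carbon_intensity, threshold=200.0,
                           poll_seconds=600):
    """Delay a training job until grid carbon intensity (gCO2/kWh) drops
    below a threshold. `carbon_intensity` is a caller-supplied callable;
    a real deployment would query the grid operator or a carbon-intensity
    service. The tension noted above applies: idle GPUs are expensive,
    so a real scheduler would also cap how long a job may wait."""
    while carbon_intensity() > threshold:
        time.sleep(poll_seconds)
    return train_job()

# Toy usage with made-up readings: the "grid" gets cleaner after a few polls.
readings = iter([480.0, 350.0, 150.0])
result = run_when_grid_is_clean(lambda: "trained!", lambda: next(readings),
                                threshold=200.0, poll_seconds=0)
print(result)  # trained!
```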

Open Problem: Is Scaling Really All We Need For AI?¶

The dominant trend over the past five years has been to train bigger and bigger models!

  • e.g. GPT-3 is a language model that has 175 billion parameters
  • We can use these large models for zero-shot learning, and this usually outperforms non-transfer-learning approaches
  • Performance of these models seems to improve further with size, following empirical scaling laws (a small fitting sketch appears after this list)
  • Is scaling up the size of modern transformers where we should expect to see the most gains? Should we devote most of our resources to this?
    • Companies seem to be investing billions of dollars, but most acknowledge this is a bet, not a sure thing
  • We may be reaching the limits of scaling on language data, since an AI language model can only be as "smart" as a distribution over token strings
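
To make the scaling-laws bullet concrete, here is a minimal sketch of fitting a power law of the form loss ≈ a * N^(-b) to (model size, loss) pairs by least squares in log-log space. The data points are synthetic placeholders, not published measurements, and real scaling-law fits (e.g. Chinchilla-style laws in both parameters and tokens) are more elaborate.

```python
import numpy as np

# Synthetic (parameter count, validation loss) pairs -- placeholders, not real runs.
N = np.array([1e7, 1e8, 1e9, 1e10])
loss = np.array([4.2, 3.5, 2.9, 2.4])

# Fit loss ~ a * N^(-b) by linear least squares in log-log space:
# log(loss) = log(a) - b * log(N).
slope, intercept = np.polyfit(np.log(N), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope
print(f"fitted law: loss ~ {a:.2f} * N^(-{b:.3f})")

# Extrapolate, with the usual caveat that scaling laws can break down.
print("predicted loss at 1e11 params:", a * 1e11 ** (-b))
```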

Supposing that we can't get much benefit from hyperscaling beyond present systems, what will the next big thing be?

  • symbolic reasoning?
  • integrating search?
  • relying more on tool use?
  • making better use of RAG?
  • getting multimodality to do more work?

Open Problem: Reproducibility and Debugging of Machine Learning Systems¶

  • Most of the algorithms we discussed in class are randomized, and random algorithms are hard to reproduce.
    • Even when we don't use explicitly randomized methods, floating point imprecision can still make results difficult to reproduce exactly.
      • For hardware efficiency, compilers often reorder floating point operations (this is sometimes called "fast math" mode); since floating-point arithmetic is not associative, the reordering can introduce slight differences in the output of an ML system (see the small example after this list).
      • As a result, even running the same learning algorithm on the same data on different ML frameworks can result in different learned models!
  • Reproducibility is also made more challenging when hyperparameter optimization is used.
    • Unless you have the random seed, it's impossible to reproduce someone else's random search.
    • Hyperparameter optimization provides lots of opportunity for (possibly unintentional) cheating, where the test set is used improperly.
  • ML models are difficult to debug because they often learn around bugs.
  • Reproducibility is especially hard when hardware is nondeterministic!
    • Might get different results if I switch GPUs or switch to a different vendor
    • Might even get different results on the same hardware at different times
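
Two of these points are easy to demonstrate directly: floating-point addition is not associative, so reordering operations changes results, and a randomized procedure such as a random hyperparameter search is only reproducible if the seed is recorded and reused. A minimal sketch (the search below is a toy stand-in, not a full hyperparameter optimizer):

```python
import numpy as np

# 1. Floating-point addition is not associative, so the reorderings a
#    "fast math" compiler performs can change an ML system's output.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0

# 2. Randomized algorithms are only reproducible if the seed is fixed.
def random_search(seed, num_trials=5):
    rng = np.random.default_rng(seed)
    # Sample log-uniform learning rates in [1e-5, 1e-1].
    return 10.0 ** rng.uniform(-5, -1, size=num_trials)

print(random_search(seed=0))   # same seed -> identical trials
print(random_search(seed=0))
print(random_search(seed=1))   # different seed -> different trials
```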

Open Problem: More Scalable Distributed Machine Learning¶

  • Distributed machine learning has a fundamental tradeoff with the batch size.
    • Larger batch size good for systems because there's more parallelism.
    • Smaller batch size good for statistics because we can make more "progress" per gradient sample. (For the same reason that SGD is generally better than gradient descent.)
  • Communication among workers is expensive in distributed learning.
    • We need provably robust ways of compressing this communication to use fewer bits (a sketch of one simple scheme appears after this list).
  • The datacenters of the future will likely have many heterogeneous workers available.
    • How can we best distribute a learning workload across heterogeneous workers?
  • When running many workers in parallel, performance starts to be bound by stragglers: workers that take longer to finish than their counterparts. How can we deal with this while still retaining performance guarantees?
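
One simple instance of compressed communication is top-k gradient sparsification: each worker transmits only its k largest-magnitude gradient entries. This is a minimal numpy sketch with made-up sizes; practical systems typically combine it with error feedback (accumulating what was dropped) to preserve convergence guarantees.

```python
import numpy as np

def topk_compress(grad, k):
    """Keep the k largest-magnitude entries; return (indices, values),
    which is what a worker would actually transmit."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def topk_decompress(idx, vals, dim):
    """Rebuild a sparse gradient of the original dimension."""
    out = np.zeros(dim, dtype=vals.dtype)
    out[idx] = vals
    return out

rng = np.random.default_rng(0)
g = rng.standard_normal(10_000).astype(np.float32)   # a made-up gradient
idx, vals = topk_compress(g, k=100)                  # send ~1% of the entries
g_hat = topk_decompress(idx, vals, g.size)
print("relative error:", np.linalg.norm(g - g_hat) / np.linalg.norm(g))
```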

Open Problem: Robustness to Adversarial Examples¶

  • It's easy to construct examples that fool a deep neural network (a sketch of one standard attack appears at the end of this section).
  • How can we make our scalable ML methods provably robust to these types of attacks?
  • How can we do this while retaining scalability? I.e. how can we be robust without extreme computational overhead?

This is part of a general concern for AI safety.
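
The best-known construction is the Fast Gradient Sign Method (FGSM) of Goodfellow et al., which perturbs the input in the direction of the sign of the loss gradient. Here is a minimal PyTorch sketch; the model and batch below are untrained stand-ins just to show the call, not a real experiment.

```python
import torch

def fgsm_attack(model, loss_fn, x, y, eps):
    """FGSM: take one signed-gradient step of size eps on the *input*."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

# Stand-in model and data (untrained, random) just to demonstrate the call.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x = torch.rand(8, 1, 28, 28)
y = torch.randint(0, 10, (8,))
x_adv = fgsm_attack(model, torch.nn.CrossEntropyLoss(), x, y, eps=0.1)
print((x_adv - x).abs().max())   # perturbation is bounded by eps
```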

Open Problem: Training Data Extraction and Copyright¶

  • Current models let you extract large pieces of their training data.
    • Study: Meta AI model can reproduce almost half of Harry Potter book
  • Much of the methodology of large-scale AI raises copyright concerns! Some companies have even settled lawsuits over it.
    • "Anthropic AI, a leading company in the generative AI space, has agreed to pay $1.5 billion to settle a copyright infringement lawsuit brought by a group of authors."
  • Part of the problem is that scaling makes it difficult to license all the texts (or images, videos, etc.) that we might want to train on, simply because there are so many of them. Texts are also hard to deduplicate, which can lead to memorization (a small sketch of why appears below).
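
To see why deduplication at scale is hard, here is a minimal sketch of exact deduplication by hashing normalized text. It cheaply removes copies that are identical after normalization, but a document that differs by a single character gets a completely different hash, which is why near-duplicate methods (e.g. MinHash over shingles) are used in practice. The example documents are made up.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace before hashing."""
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep one copy of each exactly-matching (normalized) document."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

docs = [
    "The boy who lived.",
    "The  boy who LIVED.",    # caught: identical after normalization
    "The boy, who lived.",    # missed: one character off, different hash
]
print(len(exact_dedup(docs)))  # 2, not 1 -- near-duplicates slip through
```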

Thank you!¶

See you on Sunday for the final exam!
