Recall: hyperparameter optimization is expensive!
E.g., Strubell et al. (2019), "Energy and Policy Considerations for Deep Learning in NLP": https://arxiv.org/pdf/1906.02243
How can we reduce the cost of hyperparameter optimization?
Background: Power Laws¶
Many empirical trends follow a power-law decay:
$$f(x) = A x^{-\gamma}.$$
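On log–log axes a power law is a straight line, which is why the plots below use pyplot.loglog: taking logarithms of both sides gives
$$\log f(x) = \log A - \gamma \log x,$$
so the slope of the line is $-\gamma$.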
import torch
from matplotlib import pyplot
import torchvision
# this is a big dataset tokenized with the llama3 tokenizer
tokens = torch.load('../../../Research/tokenizer/train/0.llama3.pt',weights_only=True)
counts = torch.bincount(tokens)
counts = counts.sort(descending=True).values;
pyplot.plot(counts)
pyplot.ylabel('number of occurrences of token in dataset');
pyplot.xlabel('token rank order');
pyplot.loglog(counts)
pyplot.ylabel('number of occurrences of token in dataset');
pyplot.xlabel('token rank order');
pyplot.loglog(counts, label='empirical frequency')
pyplot.semilogy(2e7 * (1+torch.arange(128000)).float()**(-1.0), label='power law')
pyplot.ylabel('number of occurrences of token in dataset');
pyplot.xlabel('token rank order');
pyplot.legend();
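The overlay above uses an exponent of $-1.0$, which is Zipf's law for token frequencies. As a quick sanity check, here is a sketch (not from the original notebook; the variable names ranks, slope, etc. are illustrative) that estimates the exponent with a least-squares line fit in log–log space:
# Sketch (illustrative): estimate the power-law exponent by fitting a line
# to (log rank, log count) with least squares.
ranks = (1 + torch.arange(len(counts))).float()
mask = counts > 0                      # ignore token ids that never occur
logx, logy = ranks[mask].log(), counts[mask].float().log()
A = torch.stack([logx, torch.ones_like(logx)], dim=1)
slope, intercept = torch.linalg.lstsq(A, logy.unsqueeze(1)).solution.squeeze()
print(f'fitted exponent: {-slope.item():.2f}')   # Zipf's law predicts roughly 1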
X = train_dataset = torchvision.datasets.MNIST(
    root = './data',
    train = True,
    transform = torchvision.transforms.ToTensor(),
    download = True).data.view(-1,28*28).float()
svdX = torch.linalg.svdvals(X)
X.shape
torch.Size([60000, 784])
pyplot.loglog(svdX, label='empirical')
pyplot.semilogy(5e5 * (1+torch.arange(28*28)).float()**(-0.8), label='power law')
pyplot.ylim((1e2,1e6));
pyplot.title('Singular Values of MNIST Dataset');
pyplot.ylabel('singular value');
pyplot.xlabel('rank order');
pyplot.legend();
n = 1024
Z = torch.randn(n,n)
svdZ = torch.linalg.svdvals(Z)
pyplot.loglog(svdZ)
pyplot.plot(svdZ)
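By contrast, the singular values of an i.i.d. Gaussian matrix show no power-law decay: rescaled by $\sqrt{n}$, they are known to follow the quarter-circle (Marchenko–Pastur) distribution on $[0,2]$. A sketch of that comparison (the histogram and theoretical overlay are my addition, not from the original notebook):
import math
# Sketch: compare the rescaled Gaussian singular values to the quarter-circle
# density sqrt(4 - s^2) / pi on [0, 2] -- a bulk of values, not a power-law tail.
s = svdZ / n**0.5
pyplot.hist(s.numpy(), bins=50, density=True, label='empirical')
grid = torch.linspace(0, 2, 200)
pyplot.plot(grid, (4 - grid**2).sqrt() / math.pi, label='quarter-circle density')
pyplot.xlabel('singular value / sqrt(n)')
pyplot.legend();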
Power laws are ubiquitous in machine learning and in science.¶
Where else have you seen power laws?
"Scaling Laws for Neural Language Models"¶
Kaplan et al., 2020¶
Claim: the Pareto frontier of loss as a function of dataset size, minimizing over model size, scales like a power law.
The abstract is very informative:
We study empirical scaling laws for language model performance on the cross-entropy loss.
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget.
A highly impactful later scaling law: Chinchilla¶
Suppose we have a transformer model with $N$ parameters, and we train it for one epoch on a dataset with $D$ tokens. Then the loss after training (with "standard" hyperparameters, AdamW, cosine learning rate schedule, etc.) will follow the loss scaling law
$$L(N,D) = \frac{406.4}{N^{0.34}} + \frac{410.7}{D^{0.28}} + 1.69.$$

Data suggests that "results are independent of the dataset as long as one does not train for more than one epoch."
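As a quick sketch (the function name and the example $(N,D)$ pair below are mine, chosen to be roughly Chinchilla-scale), the fitted loss surface is easy to evaluate directly:
# Sketch: the fitted Chinchilla loss surface as a plain Python function.
def chinchilla_loss(N, D):
    return 406.4 / N**0.34 + 410.7 / D**0.28 + 1.69

# e.g. a 70B-parameter model trained on 1.4T tokens
chinchilla_loss(70e9, 1.4e12)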
Say that $\operatorname{FLOPs}(N,D) = 6ND = C$. (Why?) Given a fixed compute budget $C$, what is the optimal allocation between $N$ and $D$?
0.34/(0.34 + 0.28)
0.5483870967741935
0.28/(0.34 + 0.28)
0.45161290322580644
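One way to see what these quotients mean (a derivation sketch; the notation $A$, $B$, $E$, $\alpha = 0.34$, $\beta = 0.28$ is introduced here): write $L(N,D) = A N^{-\alpha} + B D^{-\beta} + E$ and substitute the constraint $D = C/(6N)$. Setting the derivative with respect to $N$ to zero gives
$$\alpha A N^{-\alpha - 1} = \beta B \left(\tfrac{6}{C}\right)^{\beta} N^{\beta - 1} \quad\Longrightarrow\quad N_{\mathrm{opt}} \propto C^{\frac{\beta}{\alpha + \beta}} \approx C^{0.45}, \qquad D_{\mathrm{opt}} = \frac{C}{6 N_{\mathrm{opt}}} \propto C^{\frac{\alpha}{\alpha + \beta}} \approx C^{0.55}.$$
So the first number above ($\approx 0.55$) is the compute exponent of the optimal $D$ and the second ($\approx 0.45$) is that of the optimal $N$: under this fit, parameters and tokens should both grow roughly like $\sqrt{C}$.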
Important note: Scaling laws that minimize loss given a training FLOPs budget do not take inference cost into account!¶
If we also took inference cost into account, would the optimal allocation shift toward a relatively larger $N$? A relatively larger $D$? Neither?