Introduction

This code computes a word-topic matrix using the spectral approach described in an ICML 2013 paper (Arora et al., 2013). The main algorithm has two phases:

  1. In the first phase, we compute "anchor words" associated with specific topics.

  2. In the second phase, we compute the word-topic intensities by solving a sequence of constrained QPs.

Anchor word algorithms

Consider the non-negative factorization of a matrix A as A ≈ WR, where W and R are elementwise non-negative. This task becomes simple if W can be permuted into the form $W = \begin{bmatrix} I \\ W_2 \end{bmatrix}$. In this case, R corresponds to a subset of the rows of the original matrix A, and these rows can be recovered by doing QR with column pivoting on $A^T$.
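
As a small synthetic sanity check (not part of the pipeline), we can build such a factorization and verify that column-pivoted QR on $A^T$ picks out the anchor rows:

# Synthetic separable factorization: W = [I; W2] with the rows of W2 on the
# simplex, so every non-anchor row of A is a convex combination of the k
# anchor rows (which equal the rows of R).
k, m = 3, 10
R = rand(k, 8)
W = [eye(k); rand(m-k, k)]
W = W ./ sum(W, 2)          # make each row a set of convex weights
A = W*R
F = qrpfact(A')             # pivoted QR on A^T
println(sort(F[:p][1:k]))   # the first k pivots should be the anchor rows 1:k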

Dense QRP

The choose_anchors_qrp routine uses a dense pivoted QR factorization to select anchor words. The anchor words (permutation indices) and the diagonal of R are returned as outputs.

Julia provides a natural interface to the pivoted QR factorization with the qrpfact routine. Note that we never need to form the orthogonal factor explicitly; we only care about the permutation (and to a lesser extent the triangular factor). For moderate size problems, this will probably be the fastest option, since it is capable of taking advantage of level 3 BLAS operations; see Quintana-Orti, Sun, and Bischof.

function choose_anchors_qrp(A)
  F = qrpfact(A)          # dense column-pivoted QR; the orthogonal factor is never formed
  p = F[:p]               # pivot order: candidate anchor indices
  r = abs(diag(F[:R]))    # |R_jj| serves as a score for each selected column
  p,r
end

Partial QRP

If we only care about k topics, where k is significantly less than the number of words, then a full pivoted QR factorization is overkill. The choose_anchors_partial routine instead does a partial pivoted QR with deferred updating; this mostly involves keeping track of the column norms implicitly. Note that this algorithm works well with large vocabularies (large, sparse A) as long as the number of topics remains modest.

This is not the most numerically stable implementation possible. A less quick-and-dirty version would do the orthogonalization more carefully, using at least modified Gram-Schmidt and perhaps Householder transformations. Nonetheless, it's fine for the current purpose.

function choose_anchors_partial(A, k)
  cnorms2 = sum(A.^2, 1)          # squared column norms (downdated as we go)
  m = size(A,1)
  Q = zeros(Float32, (m,k))
  p = zeros(Integer, k)
  r = zeros(Float32, k)
  for j = 1:k
    p[j] = indmax(cnorms2)        # pivot: column with largest remaining norm
    Q[:,j] = A[:,p[j]] - Q[:,1:j-1]*(Q[:,1:j-1]'*A[:,p[j]])  # orthogonalize (classical Gram-Schmidt)
    r[j] = norm(Q[:,j])
    Q[:,j] = Q[:,j]/r[j]
    cnorms2 = cnorms2 - (Q[:,j]'*A).^2   # downdate the squared column norms
  end
  p, r
end

Simplex-constrained QPs

After we find the anchor words, the remaining expensive computation involves solving linearly constrained quadratic programs of the form
$\min_x \{ \|T^T x - b\|^2 : x_i \geq 0, \; \sum_i x_i = 1 \}$
where $b^T$ is a row of A and T is the subset of the rows of A corresponding to the anchor words. The constraints force the vector x to be interpretable as a probability distribution over the topics/anchor words.

We provide two different algorithms for solving this quadratic program: exponentiated gradients and a primal active-set algorithm.

Exponentiated gradients

The exponentiated gradient algorithm was popularized in a 1996 paper of Kivinen and Warmuth, where it was proposed as a good choice for online learning problems. The basic idea is to replace the additive update of gradient descent with a multiplicative update. The algorithm is straightforward to implement, and it requires no work to incorporate the non-negativity constraints. However, convergence is modest, and it is not entirely obvious how to choose the learning rate η.
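
Concretely, if $g = 2(T T^T x - T b)$ is the gradient of the objective above, the exponentiated gradient update with rate $\eta$ is
$x_i \leftarrow \frac{x_i \exp(-\eta g_i)}{\sum_j x_j \exp(-\eta g_j)},$
which keeps every iterate strictly positive and on the simplex.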

The simplex_nnls_eg routine solves the constrained quadratic program using the exponentiated gradient approach. The routine effectively works with the normal equations, so we pass AtA ($= T T^T$) and Atb ($= T b$) as arguments. A starting value may be provided as a third argument; the default starting point is the uniform distribution. The learning rate and stopping tolerance in the sketch below are fixed heuristically.

function simplex_nnls_eg(AtA, Atb, x=[], maxiter=500)
  # Exponentiated gradient on the simplex.  The learning rate and tolerance
  # below are ad hoc choices and may need tuning in practice.
  if isempty(x)
    x = ones(eltype(Atb), size(Atb))
    x = x/sum(x)
  end
  eta = 50.0
  tol = 1e-7
  for i = 1:maxiter
    r = AtA*x-Atb                        # gradient of the objective is 2r
    if (2*(r.-minimum(r))'*x)[1] < tol   # same error measure checked in compute_A
      return x, i
    end
    x = x .* exp(-2*eta*r)               # multiplicative update
    x = x/sum(x)                         # renormalize onto the simplex
  end
  x, maxiter
end

Active set algorithm

Exponentiated gradients are fast for modest accuracy and scale to large problems. However, the learning rate is not so obvious (at least to me!), and the convergence is modest. In contrast, primal active-set algorithms provide high accuracy solutions, but quickly become too expensive for large problems. It's still worth having a high-accuracy method around for testing, though, and it is possible to make this approach much faster by treating the linear solves involved in the direction computation more intelligently (or by using a quasi-Newton step instead of a Newton step).

NB: The tolerance and the multiplier test are both a bit informal and hacky. Also, one could speed up the direction computation by doing update/downdate operations on the Cholesky factor rather than refactoring at every step.

function simplex_nnls_as{T<:Real}(AtA :: Array{T,2}, 
                                  Atb :: Vector{T}, 
                                  x :: Vector{T} = zeros(eltype(Atb), 0))

  # Set up starting point and tolerance
  if isempty(x)
    x = ones(T, size(Atb))
  end
  x = max(x, 0)
  x = x / sum(x)
  tol = 100*eps(T)
  maxiter = 1000

  p = zeros(T, size(x))
  is_active = (x .<= 0)
  P = find(!is_active)

  # Pre-allocate storage
  (m,n) = size(AtA)
  r = zeros(T, size(Atb))

  # Main loop
  for i = 1:maxiter

    # Equality-constrained Newton direction
    r = AtA*x-Atb
    nactive = prod(size(P))
    CF = cholfact(AtA[P,P])
    w = CF\ones(T, (nactive,))
    mu = dot(w,r[P])/sum(w)
    rr = r.-mu
    fill!(p,0.)
    p[P] = CF\rr[P]
    
    if norm(p, Inf) < tol

      # Find constraint that should not be active
      m = 0
      rrmin = 0.
      for j = 1:n
        if is_active[j] && rr[j] < rrmin
          rrmin = rr[j]
          m = j
        end
      end

      # Release constraint or return      
      if rrmin < 0.
        is_active[m] = false
        P = find(!is_active)
      else
        return x, i
      end

    else

      # Determine step length
      alpha = 1.
      m = 0
      for j = 1:n
        if p[j] > 0. && x[j] < alpha*p[j]
          m = j
          alpha = x[j]/p[j]
        end
      end
      
      # Take a Newton step and adjust P
      if alpha == 1.
        x[:] = x-p
      else
        x[:] = x-alpha*p
        is_active[m] = true
        P = find(!is_active)
      end
      x[:] = max(x, 0.)
      x[:] = x/sum(x)

    end

  end  
  println("Did not converge")
  x, maxiter
end

Simplex projection

The exponentiated gradient algorithm and the primal active-set algorithm both have the property that they never consider violating the inequality constraints. An alternative is a projected gradient or related method. While we might not do projected gradients here, it's at least something to consider. The only tricky part is efficient computation of the projection of a point onto the simplex; we provide a proj_simplex routine, based on a paper of Chen and Ye, that accomplishes this.

Simplex projection can also be useful for choosing a starting point for the optimization.

function proj_simplex(y)
  (n,) = size(y)
  ys = sort(y,1)                   # sorted entries, ascending
  zs = flipud(cumsum(flipud(ys)))  # suffix sums: zs[i] = ys[i] + ... + ys[n]
  for i = n-1:-1:1
    t = (zs[i+1]-1)/(n-i)          # threshold if only the n-i largest entries stay positive
    if t >= ys[i]
      return max(y.-t,0)
    end
  end
  t = (zs[1]-1)/n                  # all entries remain positive
  max(y.-t,0)
end

Compute intensities

The row-normalized matrix $\bar{Q}$ (Qn in the code) can be interpreted as
$\bar{Q}_{ij} = P(\mathrm{word}_2 = j \mid \mathrm{word}_1 = i)$.
We suppose that any row of $\bar{Q}$ can be well approximated as a convex combination of the anchor word rows $\bar{Q}(p,:)$ with weights summing to one, i.e.
$\bar{Q}(i,:) \approx C(i,:)\, \bar{Q}(p,:)$,
where the coefficients $C(i,:)$ belong to the simplex. We can find the optimal $C(i,:)$ for each row i (in a least squares sense) by solving constrained quadratic programs using one of the simplex_nnls routines.

We interpret each coefficient Cik as the conditional probability of the topic k given word i, but what we want is the conditional probability of word i given topic k. To compute this by Bayes rule, we need the overall probability of word i, which we store as s.
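
In matrix terms, the Bayes rule step at the end of compute_A is
$A_{ik} = P(\mathrm{word} = i \mid \mathrm{topic} = k) = \frac{C_{ik}\, s_i}{\sum_j C_{jk}\, s_j},$
i.e., scale row i of C by $s_i$ and then normalize each column to sum to one.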

function compute_A(Qn, s, p)
  Tt = Qn[p,:]
  AtA = convert(Array{Float32,2}, full(Tt*Tt'))
  AtB = convert(Array{Float32,2}, full(Tt*(Qn')))
  (nt,nw) = size(Tt)
  C = zeros(Float32, (nw,nt))
  maxerr1 = 0.0
  maxerr2 = 0.0
  alliter = 0
  for i = 1:nw
    Atb = reshape(AtB[:,i], (nt,))

    # Version 1: Exponentiated gradient
    #(ci, maxiter) = simplex_nnls_eg(AtA,Atb)
    #alliter = alliter + maxiter

    # Version 2: Warm-started active-set iteration
    ci = proj_simplex(AtA\Atb)
    (ci, maxiter) = simplex_nnls_as(AtA, Atb, ci)

    alliter = alliter + maxiter
    C[i,:] = ci' .* s[i]

    # Check normalization error
    maxerr1 = max(maxerr1, abs(sum(ci)-1))
    
    # Check error measure used in EG convergence
    r = AtA*ci-Atb
    phi = 2*(r.-minimum(r))'*ci
    maxerr2 = max(maxerr2, phi[1])

  end
  println("Max error ", maxerr1, " ", maxerr2)
  println("Total iterations: ", alliter)
  sc = reshape(sum(C,1),nt)
  scale(C,1./sc)
end

The main event

The overall topic mining algorithm is:

  1. Row normalize Q and compute word probabilities (row sums)
  2. Select anchor words for each topic
  3. Compute an intensity matrix associated with those words
  4. Find the most important words for each topic

As input, we need the fully normalized word co-occurrence matrix Q. We output the anchor word selection p (and "score" vector r), the intensity matrix A, and the top word selection TW.

function mine_topics(Q, ntopic=100, nword=20)

  println("-- Compute row scaling")
  tic()
  s = sum(Q,2)
  s = s[:,1]
  Qn = scale(1./s, Q)
  toc()

  println("-- Timing anchor words with partial fact")
  tic(); (p,r) = choose_anchors_partial(Qn', ntopic); toc()

  println("-- Compute intensities")
  tic(); A = compute_A(Qn, s, p); toc()

  println("-- Find top words per topic")
  tic()
  TW = zeros(Integer, (nword,ntopic))
  for t = 1:ntopic
    tp = sortperm(-A[:,t])
    TW[:,t] = tp[1:nword]
  end
  toc()
  p, r, A, TW

end

Preprocessing code

The raw data from the UCI sets (or bag-of-words data in general) is a word-document frequency count matrix D, which we store in coordinate form as parallel vectors docs, words, and Ddw (document index, word index, and count for each nonzero entry). We want to compute from this the word co-occurrence matrix; we may also want to trim unimportant (low-frequency) words from the vocabulary.
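
As a concrete (made-up) illustration, a tiny 2-document, 3-word count matrix stored in this coordinate form might look as follows; the same triplets can be assembled into a sparse matrix with sparse, as compute_Q does later.

docs  = [1, 1, 2]                # document index of each nonzero entry
words = [1, 3, 2]                # word index of each nonzero entry
Ddw   = Float32[2, 1, 4]         # count for each (doc, word) pair
Dwd   = sparse(words, docs, Ddw) # 3x2 word-by-document matrix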

Basic word-document statistics

# Total count (weight) of word occurrences in each document
function doc_lengths(docs, Ddw)
  counts = zeros(Float32, (maximum(docs),))
  for k = 1:prod(size(docs))
    counts[docs[k]] = counts[docs[k]] + Ddw[k]
  end
  counts
end

# Total count (weight) of occurrences of each word across the corpus
function word_freqs(words, Ddw)
  counts = zeros(Float32, (maximum(words),))
  for k = 1:prod(size(words))
    counts[words[k]] = counts[words[k]] + Ddw[k]
  end
  counts
end

# Inverse document frequency: log(ndocs) - log(number of docs containing the word)
function word_idfs(words, docs)
  counts = zeros(Float32, (maximum(words),))
  for k = 1:prod(size(words))
    counts[words[k]] = counts[words[k]] + 1
  end
  log(maximum(docs)) .- log(counts)
end

Compacting the document matrix

If words or documents are pruned from a data set, the remaining term-document matrix may have empty rows or columns that we want to discard.

function compact_doc_matrix(vocab, docs, words, Ddw)

  # Re-index documents
  nd = doc_lengths(docs, words)
  active = find(nd .> 0)
  map_doc_id = zeros(eltype(docs), size(nd))
  map_doc_id[active] = 1:prod(size(active))
  docs = map_doc_id[docs]

  # Re-index terms
  tf = word_freqs(words, Ddw)
  active = find(tf .> 0)
  map_word_id = zeros(eltype(words), size(tf))
  map_word_id[active] = 1:prod(size(active))
  words = map_word_id[words]

  # Discard unused words from the vocabulary
  vocab = vocab[active]

  vocab, docs, words, Ddw
end

function trim_vocab(vocab, docs, words, Ddw; 
                    min_tf=0, min_tfidf=0)

  # Mark words that are important enough (tf or tf-idf)
  tf  = word_freqs(words, Ddw)
  idf = word_idfs(words, docs)
  keep = ((tf .> min_tf) & (tf .* idf .> min_tfidf))
  I = find(keep[words])  
  println("Trim: nwords=", prod(size(vocab)), " -> nwords=", size(find(keep),1))
  println("Trim: nnz=", prod(size(Ddw)), " -> nnz=", prod(size(I)))
  docs, words, Ddw = docs[I], words[I], Ddw[I]

  # Compact the term-doc matrix and return
  compact_doc_matrix(vocab, docs, words, Ddw)
end

Apply tf-idf weighting to term-document counts

function apply_tfidf(docs, words, Ddw)
  idf = word_idfs(words, docs)
  Ddw .* idf[words]
end

Computing the word co-occurrence matrix

The construction of Q is based on the description in the supplementary material from Arora 2013.

function compute_Q(docs, words, Ddw)
  println("-- Forming Q")
  tic()
  nd = doc_lengths(docs, Ddw)

  # Compute per-entry contributions and sum across documents to get diag(Hhat)
  scaling = 1./(nd .* (nd.-1))
  Hdw = Ddw .* scaling[docs]
  diag_Hhat = zeros(Float32, (maximum(words),))
  for i = 1:prod(size(Hdw))
    diag_Hhat[words[i]] = diag_Hhat[words[i]] + Hdw[i]
  end

  # Compute Htilde and assemble Q = Htilde*Htilde' - diag(Hhat)
  scaling = 1./sqrt(nd .* (nd.-1))
  Htdw = Ddw .* scaling[docs]
  Ht = sparse(words, docs, Htdw)
  Q = Ht*Ht' - spdiagm(diag_Hhat,0)
  
  Q = Q/( sum(Q)[1] )
  toc()
  Q
end

I/O code

Loading from the UCI data

The bag-of-words data sets at the UCI Machine Learning Repository are a potentially useful testing ground for this code. We provide a loader that reads the vocabulary (trivial) and produces the normalized word co-occurrence matrix in compressed sparse column form. Notice that the docword file should be left gzipped (as it is on the web page).

using GZip

function load_uci(basename; min_tf=0, min_tfidf=0)
  vocab = "uci/vocab.$basename.txt"
  docword = "uci/docword.$basename.txt.gz"

  println("-- Reading files")
  tic()
  words = readcsv(vocab)
  f = gzopen(docword, "r")
  d = int(readline(f))      # header: number of documents
  w = int(readline(f))      # header: vocabulary size
  nnz = int(readline(f))    # header: number of nonzero (doc, word) entries
  idocs = zeros(Int, (nnz,))
  iwords = zeros(Int, (nnz,))
  V = zeros(Float32, (nnz,))
  for i = 1:nnz
    entries = map(int, split(readline(f), (' ')))
    idocs[i] = entries[1]
    iwords[i] = entries[2]
    V[i] = entries[3]
  end
  close(f)
  toc()

  # Uncomment if you want to work with IDF-scaled term counts
  #V = apply_tfidf(idocs, iwords, V)

  if min_tf > 0 || min_tfidf > 0
    println("-- Trimming vocabulary")
    tic()
    (words, idocs, iwords, V) = trim_vocab(words, idocs, iwords, V; 
                                           min_tf=min_tf, min_tfidf=min_tfidf)
    toc()
  end

  words, compute_Q(idocs, iwords, V)
end

Writing topic summaries

Topic summaries are reported as an anchor word followed by a list of keywords in descending order of importance.

function write_topics(fname, words, p, r, A, TW)
  f = open(fname, "w")
  (nkeyword, ntopic) = size(TW)
  for t = 1:ntopic
    pt = p[t]
    @printf(f, "%s (%1.3e):\n", words[pt], r[t])
    for k = 1:nkeyword
      kw = TW[k,t]
      @printf(f, "\t%-15s (%1.3e)\n", words[kw], A[kw,t])
    end
    @printf(f, "\n")
  end
  close(f)
end

Driver for UCI data sets

For convenience, we provide a driver for running the mining algorithm and writing topics for data sets from the UCI repository.

function driver_uci(basename; min_tf=0, min_tfidf=0, ntopic=100)
  (words,Q) = load_uci(basename; min_tf=min_tf, min_tfidf=min_tfidf)
  (p, r, A, TW) = mine_topics(Q, ntopic)
  write_topics("topics_$basename.txt", words, p, r, A, TW)
end
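
For example, assuming the KOS blog data set from the UCI bag-of-words collection has been downloaded into the uci/ directory under the file names expected by load_uci, a run might look like the following (the trimming and topic-count settings are purely illustrative):

driver_uci("kos"; min_tf=5, ntopic=50)   # writes topics_kos.txt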