Claude helps to create an R package

Parametrized Random Walk Diffusion Kernel Clustering

Reinhold Koch

2024-09-04

Introduction

Based on arxiv.org/pdf/2210.00310 I asked Claude, to help me create an R package that implements the described algorithms. The result is accessible in github.com/baselDataScience/prwdkc.

My first encounter with Claude!

The Algorithm in the Paper

I said: “write me a R function that implements the following algorithm:”

Claude’s first attempt:

graph_clustering <- function(W, k, nu, td) {
  N <- nrow(W)
  
  # Step 1: Compute the parametrized random walk operator P(ν)
  D <- diag(rowSums(W))
  P_nu <- solve(D) %*% W %*% diag(nu)
  
  # Step 2: Compute the parametrized random walk diffusion kernel K(td,ν)
  K_td_nu <- expm::expm(td * (P_nu - diag(N)))
  
  # Step 3: Apply k-means clustering to the rows of K(td,ν)
  kmeans_result <- kmeans(K_td_nu, centers = k)
  
  # Step 4: Obtain the k-partition Vtd based on the clustering result
  Vtd <- kmeans_result$cluster
  
  # Step 5: Return the partition
  return(Vtd)
}

My Impression

  • Claude always answers very politely (later): “I apologize for the confusion. You’re right, and I appreciate your correction.”
  • Claude can produce R code
  • Claude knows some matrix functions
  • Claude knows about package expm
  • but
    • why should K_td_nu be an exponential matrix??

“please add input checks”

  # Input checks
  if (!is.matrix(W) || !isSymmetric(W) || any(W < 0)) {
    stop("W must be a symmetric matrix with non-negative entries")
  }
  
  if (!is.numeric(k) || length(k) != 1 || k <= 0 || k != round(k)) {
    stop("k must be a positive integer")
  }
  
  if (!is.numeric(nu) || length(nu) != nrow(W) || any(nu <= 0)) {
    stop("nu must be a positive numeric vector with length equal to the number of rows in W")
  }
  
  if (!is.numeric(td) || length(td) != 1 || td <= 0) {
    stop("td must be a positive number")
  }

W needs not be symmetric though …

Interaction with Claude

  • certainly helpful to get going
  • deals reasonably well with mathematical notations like subscripts or exponents
  • but can be strenuous to nudge him in the right direction

back to clustering

spirals - what can prwdkc do?

data(spirals, package='kernlab')
W <- Matrix::sparseMatrix(rep(1:(N <- nrow(spirals)), k <- floor(log(N))),
  FNN::get.knn(spirals, k=k)$nn.index, x=1)
plot(spirals, col=prwdkc::prwdkc(W, 2, 1, 2), pch=16)

spirals - with kernlab

not too great - see for comparison spectral clustering:

plot(spirals, col=kernlab::specc(spirals, 2), pch=16)

other considerations

  • prwdkc does do better with other test cases and with examples from the Machine Learning Repository https://archive.ics.uci.edu/datasets
  • sometimes no embedding in Rn is given, only a graph with adjacency matrix - for instance links between web pages
  • first(?) algorithm able to deal with directed graphs,
    i.e. non-symmetric adjacency matrices

what is the idea behind prwdkc?

“diffusion” along a graph:

where to start “diffusion”?

  • heuristics so far,
    trying different start parameter nu in prwdkc()

when to stop “diffusion”?

  • implemented via the optstop() function
  • dependent on the eigenvalues of the adjacency matrix