Welcome to Language Agnostic, the blog of Inaimathi! It's now built on top of the Clojure framework http-kit. I'm reasonably confident it's no longer the least performant blog on the internet.

Enjoy the various programming-themed writings on offer. The latest post is available below, and the archive link is directly above this text.


Arxivore

Sun Apr 14, 2019

Just a quick update regarding our Arxiv-indexing project, which seems to be proceeding apace.

First off, I didn't need to contact them regarding getting dumps of their data. It turns out they have a Bulk Data Access FAQ. If we were planning on only indexing and correlating metadata, that's available through the standard APIs. If we want to download PDFs, or LaTeX files of papers in order to have a better indexing strategy, it turns out that there's something called a "Requester Pays Bucket" in S3, and Arxiv has such a bucket.

However, given that the CS Cabal is a loose coalition of Comp Sci, math and engineering nerds who have more time than money available for this particular project, we decided to instead respect the Arxiv robots specification and just crawl their repos very, very slowly. We're only out to index all comp-sci papers ever, which seems like it should only be about 200k papers total. At 15 second delays, that's the sort of thing we can do in a month or so of dedicated scraping effort. This is totally not an awful idea at all, so we're going with it.

In order to do a credible job of this, we need to both scrape the search-results interface for historic papers, and the CS rss feed for new papers on an ongoing basis. enlive helps with this, obviously, because there's no option to expose the Arxiv-direct stuff in data formats other than html/xml. Ok, so now that we know what we need to do, here's the deal.1

The Deal

(ns arxivore.core
  (:require [clojure.xml :as xml]
            [clojure.string :as str]
            [clojure.java.io :as io]
            [clojure.edn :as edn]

            [environ.core :as environ]
            [org.httpkit.client :as http]
            [net.cgrand.enlive-html :as html]))

Module imports; nothing to see here. I'm using environ, httpkit, enlive and some native utilities.

(defn env
  [key & {:keys [default]}]
  (if-let [val (get environ/env key)]
    val
    (or default
        (throw
         (Exception.
          (str "Could not find environment variable "
               (str/replace
                (str/upper-case (name key))
                #"-" "_")))))))

(def +paper-directory+
  (env :arxivore-papers
       :default (str (System/getProperty "user.home") "/arxivore-papers/")))

Because I want actual people to actually be able to use this thing on actual machines, we need to be able to point it at a directory. Not all the places it might run will have a home directory, so we want to be able to take an alternative via environment variable. env lets us do that. The +paper-directory+ constant is set to either the value of the ARXIVORE_PAPERS environment variable, or ~/arxivore-papers if that variable is not present.
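As a quick illustration (a hypothetical REPL session, assuming ARXIVORE_PAPERS is unset):

```clojure
;; With ARXIVORE_PAPERS unset, the :default is returned:
(env :arxivore-papers :default "/tmp/papers/")
;; => "/tmp/papers/"

;; With no :default either, we get the descriptive exception:
(env :arxivore-papers)
;; => throws Exception: "Could not find environment variable ARXIVORE_PAPERS"
```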

(defn get! [url]
  (Thread/sleep 15000)
  (:body @(http/get url)))

(defn get-resource! [url]
  (html/html-resource (java.io.StringReader. (get! url))))

We want our GET requests to be slow. So this implementation of get! waits for 15 seconds before doing anything. get-resource! is a utility function to get an enlive resource instead of a raw body string. This'll be useful for HTML pages we want to slice up.
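For example (a usage sketch, not part of the project; it assumes network access, and remember that each call blocks for at least 15 seconds):

```clojure
;; Pull the <title> element out of a live page via an enlive resource:
(-> (get-resource! "https://arxiv.org/list/cs/recent")
    (html/select [:title])
    first :content first)
;; => something like "cs updates on arXiv.org"
```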

(defn paper-urls-in [url]
  (->> (xml/parse url)
       :content first :content
       (filter #(= (:tag %) :items))
       first :content first :content
       (map #(:rdf:resource (:attrs %)))))

This is the simplest piece of the URL-retrieving puzzle. It takes an arxiv RSS url, and returns the list of paper urls. We'll need those eventually, but first...
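For reference, the RDF-flavoured RSS feed that paper-urls-in walks looks roughly like this (abridged, and the paper ID is made up); the chain of :content/:items accessors above is just navigating down to the rdf:li elements:

```clojure
;; <rdf:RDF>
;;   <channel>
;;     ...
;;     <items>
;;       <rdf:Seq>
;;         <rdf:li rdf:resource="http://arxiv.org/abs/1904.00001"/>
;;         ...
;;       </rdf:Seq>
;;     </items>
;;   </channel>
;;   ...
;; </rdf:RDF>

(paper-urls-in "http://export.arxiv.org/rss/cs")
;; => ("http://arxiv.org/abs/1904.00001" ...)
```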

(defn -all-date-ranges []
  (let [current-year (Integer. (.format (java.text.SimpleDateFormat. "yyyy") (new java.util.Date)))
        dates
        (mapcat
         (fn [year] (map (fn [month] [year month]) [1 12]))
         (range 1991 (inc current-year)))]
    (map (fn [a b] [a b])
         dates (rest dates))))

That gives us all date ranges relevant to arxiv; they don't have any papers recorded as being published before 19912.
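Peeking at the first few entries makes the shape clear; since we only emit January and December for each year, the ranges come in roughly year-sized chunks:

```clojure
(take 3 (-all-date-ranges))
;; => ([[1991 1] [1991 12]] [[1991 12] [1992 1]] [[1992 1] [1992 12]])
```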

(defn -format-query [[[y1 m1] [y2 m2]] & {:keys [start]}]
  (let [fmt (format "https://arxiv.org/search/advanced?advanced=1&terms-0-operator=AND&terms-0-term=&terms-0-field=title&classification-computer_science=y&classification-physics_archives=all&classification-include_cross_list=include&date-year=&date-filter_by=date_range&date-from_date=%d-%02d&date-to_date=%d-%02d&date-date_type=submitted_date&abstracts=show&size=200&order=-announced_date_first"
                    y1 m1 y2 m2)]
    (if start
      (str fmt "&start=" start)
      fmt)))

That provides an interface to the arxiv search system. If you give it a date range from -all-date-ranges, and optionally a start parameter, it'll return a URL that queries arxiv for CS papers in that date range. The start parameter is what we'll need in order to support pagination.
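For instance (URL abbreviated):

```clojure
(-format-query [[1991 1] [1991 12]])
;; => "https://arxiv.org/search/advanced?...&date-from_date=1991-01&date-to_date=1991-12&..."

(-format-query [[1991 1] [1991 12]] :start 200)
;; => the same URL with "&start=200" tacked on, which asks for the
;;    second page of results
```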

(defn -urls-from-single-page [resource]
  (map #(-> % :content first :attrs :href)
       (html/select
        resource
        [:li.arxiv-result :p.list-title])))

(defn -urls-from-date-range [date-range]
  (let [resource (get-resource! (-format-query date-range))
        title (-> (html/select resource [:h1]) first :content first)]
    (if-let [match (re-find #"Showing (\d+)[–-](\d+) of ([\d,]+)" title)]
      (let [[from to of] (map #(edn/read-string (str/replace % #"," "")) (rest match))]
        (if (and to of (> of to))
          (concat
           (-urls-from-single-page resource)
           (mapcat
            #(-urls-from-single-page
              (get-resource!
               (-format-query
                date-range :start %)))
            (range to of to)))
          (-urls-from-single-page resource)))
      (-urls-from-single-page resource))))

(defn historic-paper-urls []
  (mapcat -urls-from-date-range (-all-date-ranges)))

Getting a series of URLs from a single search page is pretty easy; we select each li element with the CSS class arxiv-result, grab the p.list-title inside it, and pull the href out of the first link we find there. Getting a series of URLs from a date range is a bit more complicated. If we get a single-page response, we just apply -urls-from-single-page. If we get an empty response, there won't be any arxiv-result elements and we're just fine. If we get a multi-page result, shit gets a bit more complicated. Specifically, we need to go through each page of the result and apply -urls-from-single-page to each one.
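To make the pagination arithmetic concrete: suppose the first page's h1 reads "Showing 1–200 of 750 results" (a made-up example). Then to is 200 and of is 750, and the extra start offsets we page through are:

```clojure
(range 200 750 200)
;; => (200 400 600)
```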

Ok, that's how we go about expropriating URLs from the arxiv system. We also need to manipulate them a bit.

(defn pdf-path [paper-url]
  (let [id (last (str/split paper-url #"/"))]
    (str +paper-directory+ id ".pdf")))

(defn pdf-url [paper-url]
  (str/replace paper-url #"/abs/" "/pdf/"))

(defn got-pdf? [paper-url]
  (.exists (io/as-file (pdf-path paper-url))))

The PDF file we get out of a given paper URL will be downloaded to the location specified by pdf-path. By default, we get abs URLs out of each arxiv interface. Those point to HTML abstract pages rather than PDFs, but transforming them into PDF links isn't difficult (although not every paper has a PDF on file). got-pdf? just checks whether a local file already exists at the location specified by pdf-path.
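Concretely (the paper ID here is made up, and the exact path depends on +paper-directory+):

```clojure
(pdf-url "https://arxiv.org/abs/1904.00001")
;; => "https://arxiv.org/pdf/1904.00001"

(pdf-path "https://arxiv.org/abs/1904.00001")
;; => e.g. "/home/you/arxivore-papers/1904.00001.pdf"
```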

(defn grab-pdf! [paper-url]
  (let [path (pdf-path paper-url)]
    (io/make-parents path)
    (with-open [out (io/output-stream (io/as-file path))]
      (io/copy (get! (pdf-url paper-url)) out))))

(defn grab-urls! []
  (let [path (str +paper-directory+ "urls.txt")]
    (io/make-parents path)
    (doseq [url (mapcat #(do (println (str "Getting " % "..."))
                             (-urls-from-date-range %))
                        (-all-date-ranges))]
      (spit path (str url \newline) :append true))))

grabbing a pdf! involves taking the paper URL, and copying the result of a get! call into the location specified by pdf-path. We also call make-parents just to make sure that the target path exists on disk. grabbing the urls! involves calling historic-paper-urls3, and spitting each result into a separate line in the urls.txt file in our +paper-directory+.

(defn nom! []
  (let [historics (atom (drop-while
                         got-pdf?
                         (str/split-lines
                          (slurp (str +paper-directory+ "urls.txt")))))]
    (while true
      (let [hs (take 20 @historics)]
        (swap! historics #(drop 20 %))
        (doseq [url (set (concat hs (paper-urls-in "http://export.arxiv.org/rss/cs")))]
          (if (not (got-pdf? url))
            (do (println "Grabbing <" url ">...")
                (grab-pdf! url))
            (println "Found duplicate '" url "'...")))))))

Okay, finally, putting it all together...

We slurp up all the paper URLs from our local file, then we start interspersing 20 historic papers with a call to the latest CS RSS feed. If we see a URL that we don't have the corresponding file to yet, we grab it, otherwise we just print a warning and continue on our merry way.

Because the basic get! primitive sleeps for 15 seconds before doing anything, nom! doesn't need to explicitly rate-limit itself. It's not exactly thread-safe, because running multiple nom!ming threads will exceed the arxiv rate limit, and probably get your IP banned temporarily. So, I mean, in case you were going to ignore the "don't use this utility" warning in the README, extra-special don't do the multi-threaded thing.

Next Week

I'm probably taking a week or two off blogging about the Cabal's activities. We're breaking until May, both to give people a chance to RSVP to the call for PAIP readers, and to avoid having a mostly empty room at the Toda house. Also, this coming week is going to see me giving a short lightning talk at Clojure North, and I'm probably going to spend most of my (still very scarce) free time preparing for that.

Wish me luck; I'll let you know how it goes.

  1. Oh man, this takes me back. It feels like it's been absolutely fucking FOREVER since I've done an almost-literate-programming piece.
  2. Which really means that the Cabal will need to go digging elsewhere for older CS papers to feed into our classification monstrosity once we get to that point.
  3. In a roundabout way, granted, because we need to get some printlns into the mix rather than doing the whole thing silently.


Creative Commons License

all articles at langnostic are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License

Reprint, rehost and distribute freely (even for profit), but attribute the work and allow your readers the same freedoms. Here's a license widget you can use.

The menu background image is Jewel Wash, taken from Dan Zen's flickr stream and released under a CC-BY license