Software! Math! Data! The blog of R. Sean Bowman
April 18 2016

I wanted a link checker for my new static site, and I couldn’t find one that used metalsmith or Gulp. The closest thing I found was some code on Stack Overflow, but even it had some problems. So, using that answer as a template, I wrote my own very simple but functional link checker. But is it even worth adding Gulp to the toolbelt? Spoiler: no. Gulp is crap; make does everything I need and more, and is easier and better. I stopped using Gulp.

As far as a link checker for Gulp goes, my requirements are these:

  1. Check all internal links to make sure they point to valid pages,
  2. check all external links as well, and
  3. report on bad links, their error codes, what pages they were found in, and so forth.

Simplecrawler and Gulp integration

It turns out there is a nice web crawler, simplecrawler, which can be configured to do pretty much exactly what we want. We first start up a server using browser sync, then start crawling. We catch errors along the way, count the total number of links, and at the end, print a short report. Here’s the code:

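// gulp and simplecrawler are used below, so they need to be required too.
const gulp = require("gulp"),
  gutil = require("gulp-util"),
  crawler = require("simplecrawler"),
  browser_sync = require("browser-sync").create();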

gulp.task("links", (done) => {
  let n_checked = 0, n_errors = 0;

  function report_link(link) {
    gutil.log("Bad link: " + link.url + " from " + link.referrer +
              " got HTTP " + link.stateData.code);
    n_errors += 1;
  }

  gulp.doneCallback = function (err) {
    browser_sync.exit();
    gutil.log("Checked " + n_checked + " links");
    process.exit((err || n_errors > 0) ? 1 : 0);
  };

  function crawl_links() {
    const c = crawler.crawl("http://localhost:3000");
    c.filterByDomain = false;

    c.addFetchCondition(function(parsedUrl, queueItem) {
      return (queueItem.host == c.host && !parsedUrl.path.match(/\.pdf$/i) &&
              !parsedUrl.path.match(/\.js$/i));
    });

    c.on("fetchcomplete", () => { n_checked += 1; })
      .on("fetcherror", report_link)
      .on("fetch404", report_link)
      .on("fetch410", report_link)
      .on("complete", () => {
        done();
      });
  }

  browser_sync.init({
    server: {
      baseDir: "site/",
    },
    open: false,
    port: 3000
  }, crawl_links);
});

Note that there are a couple of issues to work around: first, I could not get gulp to exit reliably without setting its doneCallback method, which does not seem to be well documented. But, using this callback, the process exits with an appropriate error code after giving a short message about its results.

Second, simplecrawler by default does not check links to external domains. We must set filterByDomain = false in order to have it follow external links, but then it will go on to crawl as much of the web as possible (!). To stop that, we use the crawler’s addFetchCondition method so that a link is queued only if the page it was found on is served from localhost. External links still get checked once, but links found on external pages are never followed. (At the same time, we exclude checking certain files like PDFs and JavaScript files.)
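To make the logic of that fetch condition easier to see, here is a minimal sketch of the same predicate as a plain function, separate from simplecrawler. The name shouldFetch and its parameters are mine, not part of any library; it only mirrors the condition in the code above.

```javascript
// Hypothetical helper, not part of simplecrawler: decide whether a
// discovered link should be fetched.
//   linkPath     - path of the link that was discovered
//   referrerHost - host of the page the link was found on
//   siteHost     - host of the site being checked (localhost in this post)
function shouldFetch(linkPath, referrerHost, siteHost) {
  const skipped = /\.(pdf|js)$/i; // file types excluded from checking
  return referrerHost === siteHost && !skipped.test(linkPath);
}

// A link found on one of our own pages is fetched, wherever it points.
console.log(shouldFetch("/about/", "localhost", "localhost")); // true

// PDFs and JavaScript files are skipped, regardless of case.
console.log(shouldFetch("/paper.PDF", "localhost", "localhost")); // false

// A link found on an external page is never followed, so the crawl
// stops one hop away from the site.
console.log(shouldFetch("/page2", "example.com", "localhost")); // false
```

The point of testing the referrer rather than the target is that every external link gets one request, and nothing beyond it does.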

Gulp: friend or foe?

So this is a decent link checker, and I’m happy with how it runs, its output, and so on. What about Gulp itself? Well… I’m not so happy. First of all, I could have used linkchecker, which does the same thing with no need for coding on my part. It works faster, too, at least using http-server.

But there are bigger issues. Gulp’s model is that file contents flow through pipes; these “files” are actually virtual files provided by vinyl. Such files can either be node Streams or Buffers, and it’s hard to tell which they’ll be beforehand. Some plugins can handle one but not the other type. There are plugins to convert streams to buffers (and probably vice versa, I don’t know). Working with "virtual" files has the advantage that you don’t need to go to the file system, which can be slow.
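As a rough illustration of the Buffer-versus-Stream problem, here is a sketch of the kind of normalizing step a plugin ends up needing. It uses only Node’s standard library; toBuffer is a hypothetical helper of mine, not something Gulp or vinyl provides.

```javascript
const { Readable } = require("stream");

// Hypothetical helper (not part of Gulp or vinyl): accept contents that are
// either a Buffer or a readable stream, and always resolve to a Buffer.
async function toBuffer(contents) {
  if (Buffer.isBuffer(contents)) {
    return contents; // already in memory, nothing to do
  }
  const chunks = [];
  for await (const chunk of contents) {
    chunks.push(Buffer.from(chunk)); // drain the stream chunk by chunk
  }
  return Buffer.concat(chunks);
}

// Both calls end up with the same bytes, whichever form the input took.
toBuffer(Buffer.from("hello")).then((b) => console.log(b.toString()));
toBuffer(Readable.from(["hel", "lo"])).then((b) => console.log(b.toString()));
```

Note that draining a stream into a Buffer gives up whatever memory advantage streaming had, which is part of why the two modes don’t mix well.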

However, there are a lot of reasonable things you might want to do that Gulp makes difficult. Having incremental builds, building only those things that have changed, is one of them. There are plugins to help, of course, but… then you have more plugins, more complexity, and a bigger gulpfile.
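For comparison, the rule make applies to get incremental builds is tiny. Here is a sketch of it in Node, assuming modification times are a good enough signal; isStale and the example paths are hypothetical, chosen only for illustration.

```javascript
const fs = require("fs");

// Hypothetical helper: a target needs rebuilding if it does not exist, or if
// any of its sources was modified more recently than the target itself.
function isStale(target, sources) {
  if (!fs.existsSync(target)) {
    return true; // never built
  }
  const targetTime = fs.statSync(target).mtimeMs;
  return sources.some((src) => fs.statSync(src).mtimeMs > targetTime);
}
```

A build script could call something like isStale("site/index.html", ["src/index.md"]) and skip the work when it returns false. With make you get this check for free just by declaring the dependency.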

Second, suppose you’re able to create a Gulp pipeline which will send you only those files you want to operate on. Often it’s faster to operate on many files at once. Tools like GNU parallel make this a relatively trivial feat. But Gulp’s model doesn’t work that way: parallel expects files on the file system, while Gulp expects streams. Another example is transcoding video. FFmpeg supports transcoding streaming video, but the truth is that it’s much harder than just giving it files on the command line. You have to tell ffmpeg exactly what type of input to expect, and for a mere mortal like me that’s tough. (You might also have a bunch of different files encoded differently; then you have to figure out the correct ffmpeg incantation for each. This is difficult, to say the least.)

After using Gulp for a bit, I realized that I can do everything I need to do more easily, quickly, and efficiently with a system like make or redo. In fact, Gulp gets in the way. Now, you might say that Gulp 4 (the upcoming version) will fix some of these problems. Maybe. But I still think that "code over configuration," that is, eschewing declarative dependency management, is the wrong way to go, at least for my needs. That’s why I don’t think I’ll be using Gulp any more.

Addendum

Having written and reread this section a few times, I think I’m being too generous to Gulp, Grunt, and whatever other gross sounds node build tools are named after. They’re crap. It’s manifest from (really funny) posts like this that people are unfamiliar with the existing solutions to these problems. The same thing is evident from the massive adoption and then complete abandonment of these tools in a short time frame, and the amusing promises made about “Gulp 4.” I went in with a slightly open mind, and I’m more convinced than ever that JavaScript developers need to learn another language, read some books, and get more fresh air.

Happy endings

Thankfully, I’ve found tools to do everything I was doing with my Gulpfile in a makefile (or a bunch of redo files). Some of these tools are even node based, so hey, node still has a notch in that column. But it’s always wise to remember that make has served us well for decades, and we should be reluctant to give it up for the new flavor of the day build tool.
