
Node.js for PHP Programmers #4: Streams

When you work a lot on web applications, you understand that it's all about a series of bytes traveling from one computer to another. You can do amazing things without ever understanding how this flow works. But when you want to go really fast, when you want to unleash the maximum power hidden deep inside your server, then you have to talk the way computers do.

Which brings me to streams. This feature is largely overlooked in PHP, and ubiquitous in Node. I will explain how Node streams differ from I/O manipulation in PHP, but first I have to make a confession.

I'm Not A Geek

Just because I've always felt at ease with computers, my friends have always taken me for some kind of geek. Sure, I can use the OFFSET() function in Excel to amaze accountants, but that doesn't put me in the same league as cosplayers or government database hackers. It makes no difference to my friends, though, and so they always ask me for help when they struggle with computers.

About ten years ago, a couple of friends left for a year in New Zealand. Before leaving, they asked me if I could set up a photo blog so they could keep in touch with their friends (this was before Flickr and Picasa, let alone Facebook and Instagram). So I downloaded an open-source photo album application (written in PHP, by the way) and uploaded it to a server I was renting for myself. They took lots of pictures, posted them regularly, and we were all very jealous. Then they came back, and we all forgot about the photo blog.

A few days ago, the same couple of friends reminded me of the photo blog, and they wanted to get the pictures back. Or rather, as we’re now living in a different era, they asked me if I could transfer the pictures to Flickr. Why did they think I could do that? Do they really think I’m a geek?

Wasting Time

After thinking about it, the transfer doesn't sound too hard to do. I only need to upload a PHP script to the server that can browse the filesystem and make a POST HTTP request to the Flickr API. How hard could that be?

<?php
$filenames = scandir($path);
foreach ($filenames as $filename) {
  // get the image content
  $image = file_get_contents($path . '/' . $filename);
  // open an HTTP request to Flickr
  $fp = fsockopen('api.flickr.com', 80, $errno, $errstr, 30);
  // send the request headers
  $out =  "POST /services/upload/ HTTP/1.1\r\n";
  $out .= "Host: api.flickr.com\r\n";
  $out .= "Content-Disposition: attachment; filename=' . $filename' . \r\n";
  $out .= "Content-Type: application/octet-stream\r\n";
  $out .= "Content-Length: " . strlen($image) . "\r\n\r\n";
  fwrite($fp, $out);
  // send the image content in the body
  fwrite($fp, $image);
  fclose($fp);
  echo "Sent file " . $filename . "\n";
}
echo "Finished!\n";

You may wonder: why use fsockopen() instead of Guzzle, Buzz, or even ZendHttpClient? Because using them wouldn't change the result of this script: it's too slow. There are tons of images in the directory, and the execution never ends.

The problem is that PHP does one thing at a time, and that Input/Output operations are blocking. In other words, when you execute a PHP I/O function, the PHP process waits until the I/O completes, and that can take a very long time. Here is what really happens in the central loop of the previous script:

<?php
$image = file_get_contents($path . '/' . $filename);
// wait until the file is loaded into memory
$fp = fsockopen('api.flickr.com', 80, $errno, $errstr, 30);
// wait until the DNS is resolved and the flickr server acknowledges the connection
// ...
fwrite($fp, $image);
// wait until the body is sent to flickr and the flickr server acknowledges the reception
fclose($fp);

The PHP process wastes a lot of time waiting. Even if you have a very fast CPU, file and network I/O make the script too slow to be really usable on a large number of files.

Tip: Of course, Flickr has an authentication system that grants a token to be added to each API call. It has been removed from this example to keep your attention focused on the I/O.

Streams To The Rescue

To exchange data between systems, the script uses files and requests, but these concepts are too high level to be truly efficient in heavy usage scenarios. There is another reality, at a lower level. It may be daunting at first, but once you've discovered it, you can never go back. Come on, take the red pill, and let me introduce you to streams.

A stream represents a flow of bytes between two containers. In the previous PHP script, the data flows first from the disk to memory (file_get_contents()), then from memory to the distant server (fwrite()). Wouldn't it be more efficient to start flowing bytes from memory to the distant server before the initial disk flow is finished? Using streams, it's possible: a script can send an image to Flickr while it is still reading it from the filesystem.

PHP offers a Stream API with low-level I/O functions. Here is how to rewrite the Flickr upload script using streams:

<?php
$filenames = scandir($path);
foreach ($filenames as $filename) {
  // open an HTTP request to Flickr (returns a stream resource)
  $httpStream = fsockopen('api.flickr.com', 80, $errno, $errstr, 30);
  // send the request headers
  $out =  "POST /services/upload/ HTTP/1.1\r\n";
  $out .= "Host: api.flickr.com\r\n";
  $out .= "Content-Disposition: attachment; filename=" . $filename . "\r\n";
  $out .= "Content-Type: application/octet-stream\r\n";
  $out .= "Content-Length: " . filesize($path . '/' . $filename) . "\r\n\r\n";
  fwrite($httpStream, $out);
  // open a file stream on the local image
  $fileStream = fopen($path . '/' . $filename, 'r');
  // read from the file and write to the HTTP request
  while (!feof($fileStream)) { // while the fileStream is not finished
    // read 1024 bytes from the file and write these bytes on the Flickr http stream
    fwrite($httpStream, fread($fileStream, 1024));
  }
  fclose($fileStream);
  fclose($httpStream);
  echo "Sent file " . $filename . "\n";
}
echo "Finished!\n";

File reads are now less blocking: PHP waits only for chunks of 1024 bytes to be read from the disk before sending them over HTTP. Consequently, this second upload script is a bit faster than the first one.

But dealing with streams in PHP is painful. The API is purely functional, not object-oriented. There are tons of functions with opaque names, several ways to do a simple operation, and patchy documentation. There must be a better way to do this.

Code-switching

Node.js comes with a native asynchronous stream API. In fact, most I/O operations in Node result in a stream by default. HTTP requests are network I/O, file reads are disk I/O, so Node naturally treats both as streams.

This means that streaming data from one source to another is really a breeze. The following script is the equivalent of the previous PHP script using Node.js:

var fs   = require('fs');
var http = require('http');

var filenames = fs.readdirSync(path);
filenames.forEach(function(filename) {
  var postOptions = {
    host: 'api.flickr.com',
    port: '80',
    path: '/services/upload/',
    method: 'POST',
    headers: {
      'Content-Disposition': 'attachment; filename=' + filename,
      'Content-Type': 'application/octet-stream'
    }
  };
  // open an HTTP request to Flickr (returns a stream)
  var httpStream = http.request(postOptions, function(res) {
    // dispose of the response status and body
  });
  // open a file stream on the local image
  var fileStream = fs.createReadStream(path + filename);
  // read from the file and write to the HTTP request
  fileStream.pipe(httpStream);
  fileStream.on('end', function() { 
    console.log('Sent file ' + filename); 
  });
});
console.log('Finished!');

As you can see, HTTP is a first-class citizen in Node. If you compare the postOptions object with the $out string of the PHP example, where each new header of the HTTP request was appended to a string with \r\n as a separator, the difference is striking. The http.request() API encourages you to check the HTTP response, while the PHP functions just write to a distant resource without worrying about possible errors in the process. Node builds upon the HTTP protocol and encourages its usage.
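
For instance, the response callback is the natural place to inspect the status code. Here is a minimal sketch of what "dispose of the response status and body" could look like; the status check and error message are illustrative additions, not part of the original script:

var httpStream = http.request(postOptions, function(res) {
  // check the status code instead of throwing the response away
  if (res.statusCode !== 200) {
    console.error('Flickr replied with HTTP ' + res.statusCode);
  }
  // consume the response body so the 'end' event can fire
  res.on('data', function(chunk) { /* inspect or discard the chunk */ });
  res.on('end', function() { /* the response is complete */ });
});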

Also, Node lets you "pipe" two streams, just like you can pipe two commands in Linux. Here, the output of the fileStream becomes the input of the httpStream, in chunks of 64kB by default. All in one simple method call:

fileStream.pipe(httpStream);

Tip: Node uses a chunked transfer encoding by default on HTTP requests, so there is no need to specify the ‘Content-Length’ header.
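
Pipes also compose: pipe() returns its destination stream, so you can chain several of them, just like shell pipes. As a side illustration (not part of the upload script, and with made-up file names), here is how you could gzip a file on the fly using zlib, a core Node module:

var fs   = require('fs');
var zlib = require('zlib');

// read photo.jpg, compress it on the fly, and write photo.jpg.gz,
// all in constant memory thanks to chained pipes
fs.createReadStream('photo.jpg')
  .pipe(zlib.createGzip())
  .pipe(fs.createWriteStream('photo.jpg.gz'));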

The Node.js version is faster to execute, but it is not really faster to read and write than the PHP version. Let’s find an even better tool for the job.

There Is An NPM Package For That

HTTP client requests with Node can be somewhat verbose. I recommend using request, an npm package that facilitates HTTP requests. This package is so generic that chances are you'll use it in all your Node projects.

With the request package, the code to initialize a POST request is simply request.post(url), so the Flickr upload script reduces to:

var fs      = require('fs');
var request = require('request');
var apiUrl  = 'http://api.flickr.com/services/upload/';

var filenames = fs.readdirSync(path);
filenames.forEach(function(filename) {
  // open a file stream on the local image
  var fileStream = fs.createReadStream(path + filename);
  // read from the file and write to the HTTP request
  fileStream.pipe(request.post(apiUrl));
  fileStream.on('end', function() { 
    console.log('Sent file ' + filename); 
  });
});
console.log('Finished!');

If you take out console messages and temporary variables, the code becomes extremely concise:

var fs      = require('fs');
var request = require('request');
var apiUrl  = 'http://api.flickr.com/services/upload/';
fs.readdirSync(path).forEach(function(filename) {
  fs.createReadStream(path + filename).pipe(request.post(apiUrl));
});

Compare that to the first PHP script. Stunning, isn’t it?

The Need For Speed

Instead of reading one file after the other, why not use the power of Node to do it asynchronously? This kind of asynchronous iteration requires a meeting point to make sure all the operations are complete. You could try redeveloping this logic from scratch, but someone already did it better (as is often the case with Node.js), in an npm package called async. async.forEach(array, iterator, callback) applies an iterator function to each item in an array in parallel, and calls the final callback once every iteration has finished. Here is the Flickr upload script with parallel file reads:

var fs      = require('fs');
var request = require('request');
var async   = require('async');
var apiUrl  = 'http://api.flickr.com/services/upload/';

var filenames = fs.readdirSync(path);
async.forEach(filenames, function(filename, callback) {
  // open a file stream on the local image
  var fileStream = fs.createReadStream(path + filename);
  // read from the file and write to the HTTP request
  fileStream.pipe(request.post(apiUrl));
  fileStream.on('end', callback);
}, function(err) {
  console.log(err ? err.message : 'Finished!');
});

But is this script really faster? The answer is yes and no. It is faster because you don't need to wait for the end of one file transfer to start the next. It is not faster because asking a disk to read from several files at the same time can be more expensive than reading them sequentially: unless you have a RAID 10 array, a NAS, or an SSD, your hard drive has a single read/write head. The boost of the parallel HTTP streams probably outweighs the slowdown of the parallel disk reads. But the real problem is that the script opens a lot of simultaneous HTTP connections to Flickr, and Flickr will probably kick you out for that. So this is a bad enhancement.
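
A possible middle ground, if you want parallelism without flooding Flickr, is to cap the number of simultaneous uploads. The async package offers forEachLimit() for that; the limit of 5 below is an arbitrary value chosen for illustration:

var fs      = require('fs');
var request = require('request');
var async   = require('async');
var apiUrl  = 'http://api.flickr.com/services/upload/';

// upload at most 5 files at a time instead of all at once
async.forEachLimit(fs.readdirSync(path), 5, function(filename, callback) {
  var fileStream = fs.createReadStream(path + filename);
  fileStream.pipe(request.post(apiUrl));
  fileStream.on('end', callback);
}, function(err) {
  console.log(err ? err.message : 'Finished!');
});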

Use asynchronous streams wisely, and always check the benefit they offer. Sometimes they can be counter-productive.

Conclusion

Streams are a geek feature in PHP. They are for everybody in Node.js, because the stream API is much easier to use and sits closer to the core principles of Node. Even if you don't look for high-performance I/O, you should use Node streams as much as possible: they bring to the front what used to be hidden behind a curtain of abstraction, without adding any significant complexity to programming.

As for the Flickr upload, I ended up zipping all the photos together on the server, transferring the archive to my desktop over FTP, and then bulk-uploading the photos to Flickr using their Desktop Uploadr. No PHP, no Node.js. No need to be a geek to talk to computers these days.