June 2, 2008

Exporting and Transferring Critical Data

Choosing a File Transfer Technology

There are dozens of approaches to the actual process of transferring software data between systems. Most of these are dependent on the operating systems in use, but there are also important speed and security considerations for each approach. Usually when transferring data across the web, security (specifically interceptability) will be of the utmost importance, so we’ll concentrate on transfer methods that are fairly secure, such as:

  • SCP (Session Control Protocol)
  • SFTP (FTP over SSH)
  • Encrypted files by FTP
  • Encrypted files by MIME Email

Software and Operating System Considerations

Both SCP and SFTP are only supported natively on UNIX systems. Web-based transfers rely on the destination system running a webserver.

Network / Firewall Considerations

The first thing you may need to consider in creating a sustainable process is what’s feasible given network conditions. If you’re exchanging data between two unfirewalled servers in the same datacentre, this won’t be an issue – but this is a rare case. SSH and it’s subsets (SFTP, SCP) use only a single port: TCP 22, so you shouldn’t expect any problems there. FTP however (c. 1971!) is an oblique, firewall-hating protocol. If only one system is to be firewalled, there is a (fairly) straightforward solution – FTP passive mode. This puts the open port requirements back to the client host rather than the server.

The remaining protocols, HTTPS (TCP 443), and Email (SMTP TCP 25) are fairly firewall-friendly and shouldnt cause any problems. However, due to the large-scale uptake of HTTP and SMTP for ‘general use’ on the internet, you may find yourself dealing with a never ending set of test cases, due to the many proxying and anti-spam measures out there.

Choosing a Data Interchange Method

In the case that you are transferring one-dimensional data (such as a series of JPEG images to be printed, with no data beyond their EXIF tags), then you won’t need to prepare your data. For something a bit more complex, such as customer records to be imported into a third party database, you’ll want to choose a robust interchange method.

Plain Text Data

In many contexts, it may be ¬≠tempting to use a CSV format for data interchange, simply because it’s human readable, and can be loaded into spreadsheet software like Excel by managers or supervisors who may want to see the data in “raw form” from time to time. If you have a matching data sink and source, such as Excel both generating and reading your end data, then this approach can be sufficient.

In reality, CSV can be a nightmare to implement. When it comes to exporting, the first hurdle is often that while most plain text encoding will “escape” fields with a backslash, CSV escapes using the double quote character.

Mixed / Binary Data

For more complex (non-ASCII, or binary) data, your life can be made easier using a markup language such as XML or a specific subset relating to the type of data you’re transferring.

Other Options

If exporting or mirroring data with a flexible form, your database layer may be able to provide a resillient export format saving you a lot of extra development. For instance, the MySQL utility mysqldump supports XML output with the -x option.