I’m regularly handling very large files containing millions of chemical structures, and whilst BBEdit is my usual tool for editing text files, in practice it becomes rather cumbersome for really large files (> 2 GB). For these cases I’ve compiled a useful list of UNIX commands that make life easier. Whilst I use them when dealing with large chemical structure files, they are equally useful when dealing with any large text or data files.

If you start up the Terminal application, which is in the Utilities folder inside the Applications folder, you should see a window containing a shell prompt.

The shell prompt (or command line) is where one types commands; its appearance may differ slightly depending on customisation.

File and Path Names

One thing to remember is that UNIX is less accommodating than Mac OS X when it comes to file names and the paths to files. When you see the full path to a file it will be shown as a list of directory names starting from the top of the tree right the way down to the level where your file sits. It will look something like this (the user and folder names below are just placeholders):
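    /Users/yourname/Documents/Projects/chembl.sdf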

Directory names can contain spaces, but you’ll often find that people working in UNIX prefer to keep both file and directory names free of spaces, since this simplifies working with them on the command line. Directory or file names containing a space must be specified on the command line either by putting quotes around them or by putting a \ (backslash) before the space, for example (using a made-up folder name):
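    cd "/Users/yourname/My Structures"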

or
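    cd /Users/yourname/My\ Structures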

If typing out the full path seems rather daunting, remember that if you drag a file from the Finder and drop it on a Terminal window the path to the file will be automatically generated.

Examining Large files

The other thing to remember is that extensive help is available, so if you can’t remember the command line arguments, “man” can provide the answers. Simply type man followed by the command, for example:
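    man wc
    # press q to quit the pager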

In addition, to quickly view the man page for any command, just right-click on the command in the Terminal and choose Open man Page from the context menu. A new window will pop up displaying the manual for that command.

So the command wc (word count) can be used to count words, lines, characters, or bytes depending on the option used. For example, for a file containing SMILES strings the following command tells us how many lines there are (and hence the number of molecules); in this case the 4 GB file contains 66,783,025 molecules.
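    # verybigfile.smi is a placeholder name for your own SMILES file
    wc -l verybigfile.smi
    #  66783025 verybigfile.smi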

If instead of counting lines we count words using the -w option, we get a different answer. This is because SMILES strings can contain salts, counterions etc.
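    wc -w verybigfile.smi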

All the structures from ChEMBL are available for download as a 3.5 GB sdf file. Attempting to open such a file in a desktop text editor is not recommended; however, using a few UNIX commands we can interrogate the file to get useful information. For example, the following tells us how many lines there are in the file.
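    # chembl.sdf is a placeholder name for the downloaded ChEMBL sdf file
    wc -l chembl.sdf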

However, in this case, since each molecule record in an sdf file has multiple lines, we don’t know how many structures there are. We can, however, look at a small portion of the file using the head command.

So if we type 
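    # show the first 100 lines of the file
    head -n 100 chembl.sdf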

We can now see that each molecule record is separated by $$$$ and we can use this to identify how many molecules are in the file using grep and counting the number of occurrences.

So if we type this in the terminal window.
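    # -c counts the matching lines rather than printing them
    grep -c '^\$\$\$\$' chembl.sdf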

In a similar manner we can use tail to look at the end of the file.
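    tail chembl.sdf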

Sometimes you just want to test a new tool on a subset of a large file; in this case we can use head and pipe the results into a new file.
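    # take the first 100,000 lines of a large SMILES file (the input name is a placeholder)
    head -n 100000 verybigfile.smi > test.txt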

Dealing with Errors in very large files

Most command line applications expect the input file to have UNIX-style line endings. Whilst most modern applications use UNIX-style endings, if you are using an old file it may have the old-style Mac line endings or DOS line endings. There is a very useful tool to convert to UNIX line endings; dos2unix can be installed using Homebrew.
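    brew install dos2unix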

And then used with the following command.
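    # converts the file in place; the file name is a placeholder
    dos2unix bigfile.smi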

Sometimes a file contains blank lines that need removing; this is easy to do using grep.
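    # -v '^$' keeps every line that is not empty; file names are placeholders
    grep -v '^$' input.smi > no_blanks.smi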

Sometimes files contain non-printable ASCII characters and it can be hard to track these down. Whilst you can use BBEdit to zap these characters, for larger files the command line might be a better option.

You can use this Perl command to strip these characters by piping your file through it:
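    # strips characters outside the printable ASCII range (tabs and newlines are kept); file names are placeholders
    cat input.smi | perl -pe 's/[^[:print:]\t\n]//g' > cleaned.smi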

If that does not work, the best option is often to simply split a file into smaller files.

We can split the file test.txt created earlier
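    # split test.txt into pieces of 25,000 lines each; the pieces are named xaa, xab, xac, ...
    split -l 25000 test.txt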

You can then concatenate the pieces back together and use wc to check that the result has the correct number of lines.
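    cat xa* > recombined.txt
    wc -l test.txt recombined.txt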

Editing Files

cat is also the Swiss Army knife of file manipulation, especially when combined with tr.

To convert a tab-delimited file into a comma-separated one:
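    # tr translates every tab into a comma; file names are placeholders
    cat data.tsv | tr '\t' ',' > data.csv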

For some toolkits the SMILES file containing the structures needs to have UNIX line endings. If your input file does not have UNIX line endings you can either use a text editor like BBEdit to change it to UNIX format or run this command in the Terminal.
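    # classic Mac files use carriage returns, so translate them into newlines (file names are placeholders)
    cat input.smi | tr '\r' '\n' > input_unix.smi
    # for DOS/Windows files, simply delete the carriage returns instead
    cat input.smi | tr -d '\r' > input_unix.smi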

Monitoring output files

The command tail has two special command line options, -f and -F (follow), that allow a file to be monitored. Instead of just displaying the last few lines and exiting, tail displays the lines and then monitors the file. As new lines are added to the file by another process, tail updates the display. This is very useful if you are monitoring a log file to check for errors when processing a very large file.
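    # keeps printing new lines as they are written to the log (Ctrl-C to stop); the file name is a placeholder
    tail -f output.log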

Checking for duplicate structures

One of the issues with combining multiple data sets is that there is always the risk of duplicate structures. In order to check for this you need a unique identifier for each molecular structure; I tend to use the InChIKey.

The IUPAC International Chemical Identifier (InChI) is a textual identifier for chemical substances, designed to provide a standard way to encode molecular information and to facilitate searching for such information in databases and on the web. The condensed, 27-character InChIKey is a hashed version of the full InChI (using the SHA-256 algorithm).

First use OpenBabel to generate the InChIKey file.
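    # chembl.sdf and chembl_inchikeys.txt are placeholder names; requires Open Babel to be installed
    obabel chembl.sdf -oinchikey -O chembl_inchikeys.txt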

We can check the file using head
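    head chembl_inchikeys.txt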

We now need to sort the file
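    sort chembl_inchikeys.txt > chembl_inchikeys_sorted.txt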

Now we can use uniq to identify duplicate structures; it is important to note that uniq only filters out adjacent matching lines, which is why we need to sort first.
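    # -d prints one copy of each line that occurs more than once
    uniq -d chembl_inchikeys_sorted.txt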

Alternative options for uniq are
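    -c   prefix each output line with a count of the number of times it occurred
    -d   only output lines that are repeated
    -u   only output lines that are not repeated
    -i   ignore case when comparing lines
(see man uniq for the full list of options)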

Extracting data from delimited files

Sometimes you have very large files containing multiple columns of data, but what you really want is a single column. The file is too big to open in a spreadsheet application, but you can extract just the column you want using a simple command. First use head to identify the column of data you want.
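    # data.txt is a placeholder for a tab-delimited file; print just the header line
    head -n 1 data.txt
    # or number the columns to make counting them easier
    head -n 1 data.txt | tr '\t' '\n' | nl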

The column we want is SMILES, which is column 9. We can use cut to extract just this column (replace f9 with the column you want).
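    # -f9 selects the ninth tab-separated field; piping through head keeps the output short
    cut -f9 data.txt | head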

If you want to save the output to a file
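    cut -f9 data.txt > smiles.smi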

Dealing with a large number of files

A suggestion from a reader: sometimes rather than one large file, download sites provide the data as a large number of individual files. We can keep track of the number of files using this simple command. In my testing this works fine, but it will not count hidden files, and will miscount files that contain spaces/line breaks etc. in their file names.
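    # counts one entry per line of ls output; hidden files are not listed by default
    ls | wc -l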

xsv is a command line program for indexing, slicing, analyzing, splitting and joining CSV files. Commands should be simple, fast and composable (a couple of examples are shown after the command list below):

Available commands

cat – Concatenate CSV files by row or by column.
count – Count the rows in a CSV file. (Instantaneous with an index.)
fixlengths – Force a CSV file to have same-length records by either padding or truncating them.
flatten – A flattened view of CSV records. Useful for viewing one record at a time. e.g., xsv slice -i 5 data.csv | xsv flatten.
fmt – Reformat CSV data with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.)
frequency – Build frequency tables of each column in CSV data. (Uses parallelism to go faster if an index is present.)
headers – Show the headers of CSV data. Or show the intersection of all headers between many CSV files.
index – Create an index for a CSV file. This is very quick and provides constant time indexing into the CSV file.
input – Read CSV data with exotic quoting/escaping rules.
join – Inner, outer and cross joins. Uses a simple hash index to make it fast.
partition – Partition CSV data based on a column value.
sample – Randomly draw rows from CSV data using reservoir sampling (i.e., use memory proportional to the size of the sample).
reverse – Reverse order of rows in CSV data.
search – Run a regex over CSV data. Applies the regex to each field individually and shows only matching rows.
select – Select or re-order columns from CSV data.
slice – Slice rows from any part of a CSV file. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice).
sort – Sort CSV data.
split – Split one CSV file into many CSV files of N chunks.
stats – Show basic types and statistics of each column in the CSV file. (i.e., mean, standard deviation, median, range, etc.)
table – Show aligned output of any CSV data using elastic tabstops.
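
For example, to show the column headers of a CSV file and then pull out a single column (the file and column names here are just placeholders):

    xsv headers data.csv
    xsv select SMILES data.csv | head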

Dividing large files

Sometimes it is useful to divide very large files into more manageable chunks. For SMILES files where we have one record per line we can simply divide based on lines using split.

SPLIT(1)                  BSD General Commands Manual                 SPLIT(1)

NAME
     split -- split a file into pieces

SYNOPSIS
     split [-a suffix_length] [-b byte_count[k|m]] [-l line_count]
           [-p pattern] [file [name]]
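
For a SMILES file this might look something like the following (the file name, chunk size and prefix are placeholders):

    # one million lines per chunk, output files named chembl_aa, chembl_ab, ...
    split -l 1000000 chembl.smi chembl_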

Dividing sdf files is more problematic since we need each division to be at the end of a record, defined by “$$$$”. I’ve spent a fair amount of time searching for a high-performance tool that will work for very, very large files; many people suggest using awk.

AWK (awk) is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. Like sed and grep, it’s a filter, and is a standard feature of most Unix-like operating systems.

I’ve never used awk, but with much cutting and pasting from the invaluable Stack Overflow this kind of script seems to work.
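A sketch along these lines splits at the “$$$$” record separator; the chunk size, input file name and output file names below are just placeholders, and this is an illustrative reconstruction rather than the exact script.

    # illustrative sketch: writes records to chunk_000.sdf, chunk_001.sdf, ...
    # starting a new file after every 10,000 molecules; assumes UNIX line endings
    awk -v recs=10000 '
        BEGIN { out = sprintf("chunk_%03d.sdf", part) }   # first output file
        { print > out }                                   # copy every line to the current chunk
        /^\$\$\$\$/ {                                     # end of a molecule record
            n++
            if (n % recs == 0) {                          # start a new chunk every recs records
                close(out)
                out = sprintf("chunk_%03d.sdf", ++part)
            }
        }' chembl.sdf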

There are a couple of caveats: this script only works with the version of awk shipped with Big Sur (you should be able to install gawk using Homebrew and use that on older systems), and it requires the file to have UNIX line endings. The resulting file names are not ideal, and if there are any awk experts out there who could tidy it up I’d be delighted to hear from you.

If anyone has any additional suggestions please feel free to submit them.

Last updated 18 Feb 2023
