C++ String Tokenizer



.:: NOTICE ::.

The String Tokenizer library has been deprecated. In its place is the String Toolkit Library (StrTk), an advanced and highly efficient C++ string processing library that features various tokenization, splitting, parsing, serialization, formatting, conversion and numerous other string processing and transformation routines.



Description

String tokenization is the problem of breaking up a string into tokens which are separated by delimiters. Both tokens and delimiters are themselves strings. Commonly used string structures that require string tokenization are Comma Separated Values (CSV), written text and basically any other data grouping format where differing units of data are separated by some kind of delimiter.

The Problem

Assume you have data units representing the fruit in a basket:

  • Apple
  • Peach
  • Orange
  • Banana

If you decided to package these data units and either send them over a socket or store them in a file, a simple approach would be to concatenate all data units into one large string. An example would be as follows:

ApplePeachOrangeBanana

The problem with this kind of formatting is that, although producing the data string is simple, reading it back in and breaking it up into its independent data units is not: determining where a particular unit begins and where it ends becomes rather difficult.

The Solution

A solution to this problem is to place a character or series of characters that is known never to occur within the data units between them, and use it as a marker to separate the data units. Assuming our delimiter is a ':', the previous example may look something like this:

Apple:Peach:Orange:Banana
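
As a rough illustration of the idea above, the following is a minimal sketch of packaging the data units with a delimiter and recovering them again. The helper names joinUnits and splitUnits are hypothetical and are not part of StringTokenizer; the sketch also assumes the delimiter never occurs inside a data unit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: concatenate the units, placing the delimiter between them.
std::string joinUnits(const std::vector<std::string>& units,
                      const std::string& delimiter)
{
   std::string result;
   for (std::size_t i = 0; i < units.size(); ++i)
   {
      if (i != 0) result += delimiter;
      result += units[i];
   }
   return result;
}

// Hypothetical helper: recover the units by scanning for the delimiter.
std::vector<std::string> splitUnits(const std::string& data,
                                    const std::string& delimiter)
{
   assert(!delimiter.empty()); // an empty delimiter would never advance the scan
   std::vector<std::string> units;
   std::size_t begin = 0;
   std::size_t end   = data.find(delimiter);
   while (end != std::string::npos)
   {
      units.push_back(data.substr(begin, end - begin));
      begin = end + delimiter.size();
      end   = data.find(delimiter, begin);
   }
   units.push_back(data.substr(begin)); // the last unit has no trailing delimiter
   return units;
}
```

Joining Apple, Peach, Orange and Banana with ":" gives the string shown above, and splitting that string on ":" returns the four original units.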

The StringTokenizer

The StringTokenizer class can be used to break up strings of data that were built from tokens separated by delimiters. Delimiters may be strings as well as single chars. The StringTokenizer behaves somewhat like a stack, in that you can't access an arbitrary token within the data, only the next token at the head of the data. It can report how many tokens remain in the data, which supports looping patterns for extracting the tokens, and it can convert tokens into other basic built-in types such as int and double.
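
To make the behaviour described above concrete, here is a small stand-in written only for illustration. SketchTokenizer is a hypothetical class and is NOT the library's implementation; its method names simply mirror those listed in the usage section of this page, and the helper drain is likewise an assumption showing one possible looping pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <deque>
#include <string>
#include <vector>

// SketchTokenizer: hypothetical stand-in, not the StringTokenizer library class.
class SketchTokenizer
{
public:
   SketchTokenizer(const std::string& data, const std::string& delimiter)
   {
      assert(!delimiter.empty());
      std::size_t begin = 0;
      while (begin <= data.size())
      {
         std::size_t end = data.find(delimiter, begin);
         if (end == std::string::npos) end = data.size();
         // Empty tokens (leading, trailing or consecutive delimiters) are skipped.
         if (end > begin) tokens_.push_back(data.substr(begin, end - begin));
         begin = end + delimiter.size();
      }
   }

   std::size_t countTokens()   const { return tokens_.size();   }
   bool        hasMoreTokens() const { return !tokens_.empty(); }

   // Returns the token at the head of the data and removes it.
   std::string nextToken()
   {
      assert(hasMoreTokens());
      std::string token = tokens_.front();
      tokens_.pop_front();
      return token;
   }

   // Converts the next token with std::atoi (no error checking in this sketch).
   int nextIntToken() { return std::atoi(nextToken().c_str()); }

private:
   std::deque<std::string> tokens_;
};

// Looping pattern: keep taking tokens until none remain.
std::vector<std::string> drain(SketchTokenizer& tokenizer)
{
   std::vector<std::string> out;
   while (tokenizer.hasMoreTokens())
   {
      out.push_back(tokenizer.nextToken());
   }
   return out;
}
```

With the data "abc:def:ghi:jkl" and the delimiter ":", the sketch reports four tokens, and each call to nextToken reduces the remaining count by one.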

Usage Of StringTokenizer

  • Instantiating StringTokenizer

    std::string data = "abc:def:ghi:jkl";
    StringTokenizer strtok(data,":");
  • Obtaining Number Of Tokens (Remaining)

    unsigned int tokenCount = strtok.countTokens();
  • Obtaining The Next Token

    std::string token = strtok.nextToken();

    It should be noted that every time a token is obtained from the StringTokenizer, it is actually removed from the data store (similar to a pop call on a stack).

  • Checking To See If There Are Still More Tokens

    
    if(strtok.hasMoreTokens())
    {
       std::cout << "Still has more tokens!" << std::endl;
    }
    else
    {
       std::cout << "No more tokens left." << std::endl;
    }
    
    

  • Obtaining The Next Token As An int

    int token = strtok.nextIntToken();
  • Obtaining The Next Token As A double

    double token = strtok.nextFloatToken();
  • Obtaining The Remaining Data As One String

    std::string remainingData = strtok.remaining();
  • Filter A Token

    In some situations the tokens may be formatted with unnecessary strings of characters, such as in database files where tuples are stored and particular fields in the tuples are defined to be a particular size. When the data placed into such a field is shorter than the required size, it is padded by repeatedly adding a character such as a SPACE until the data reaches the required size. StringTokenizer allows you to define the padding pattern and have it filtered out before the token is passed back.

    
    std::string data = "abc  :def  :ghi  :jkl  :";
    StringTokenizer strtok(data,":");
    std::string filteredToken = strtok.filterToken(" ");
    
    

    It should be noted that the filtering occurs over the entire token, meaning that if the filter pattern occurs within the actual token itself, it will also be filtered out. In various instances this may lead to undesired behavior.
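
The caveat about filtering over the entire token can be seen in a small sketch. Both helpers below, removePattern and trimRight, are hypothetical and only imitate the behaviour described above; they are not the library's filterToken. The second one shows an alternative that is assumed here, namely stripping the padding from the end of the token only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: erase every occurrence of `pattern` found in a single
// left-to-right pass over `token`, including occurrences inside the token.
std::string removePattern(std::string token, const std::string& pattern)
{
   assert(!pattern.empty());
   std::size_t pos = token.find(pattern);
   while (pos != std::string::npos)
   {
      token.erase(pos, pattern.size());
      pos = token.find(pattern, pos);
   }
   return token;
}

// Hypothetical alternative: strip the padding character from the right-hand
// end only, leaving any occurrence inside the token untouched.
std::string trimRight(const std::string& token, char pad)
{
   std::size_t last = token.find_last_not_of(pad);
   return (last == std::string::npos) ? std::string()
                                      : token.substr(0, last + 1);
}
```

For the padded token "New York  ", removing every SPACE yields "NewYork", whereas trimming only the right-hand end yields "New York".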

A Simple Example

In this example we will assume there exists a database that contains tuples relating to information about people. The fields for the tuple will be:

  • First Name
  • Surname
  • Year Of Birth
  • Height (meters)

Given the above definition, and assuming we construct our data tuple in the same order as listed above and separate each field with a '#' symbol, we can give some examples of possible tuples that may be found in this database.

  • John#Doe#1970#1.53
  • Jane#Doe#1980#1.78
  • Bob#Cob#1900#2.34

Tokenization of a data string in the format above may look something like the following using StringTokenizer:


struct Person
{
   std::string  firstName;
   std::string  surname;
   unsigned int yearOfBirth;
   double       height;
};

std::string data      = "John#Doe#1970#1.53";
std::string delimiter = "#";
StringTokenizer strtok(data,delimiter);

// A valid tuple has exactly four fields.
if(strtok.countTokens() != 4)
{
   std::cout << "!-Error-! Expected exactly 4 tokens!" << std::endl;
}
else
{
   // Tokens are consumed in the order in which the fields were written.
   Person person;
   person.firstName   = strtok.nextToken();
   person.surname     = strtok.nextToken();
   person.yearOfBirth = strtok.nextIntToken();
   person.height      = strtok.nextFloatToken();
}

Note About StringTokenizer

StringTokenizer is an unambiguous parser, meaning it will attempt to alter the data string passed to it in such a way that further processing done by StringTokenizer results in only deterministic and rational behaviour. These alterations consist of eliminating consecutive delimiters within the string and delimiters at the beginning and end of the string.

It is advised that values which may later become tokens in a string always have a default or null value. For strings, a value such as "N/A" or "~" is adequate. Leaving a value empty produces consecutive delimiters in the string, which are eliminated at processing time. The remaining tokens then shift to the left, meaning that, for example, token7 becomes token6.
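
The left shift described above can be reproduced with a short sketch. The helpers splitDroppingEmpty and buildRecord are hypothetical: the first only imitates the elimination of leading, trailing and consecutive delimiters for a single-character delimiter, and the second assumes a placeholder such as "N/A" is written in place of empty fields.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split on a single-character delimiter and drop empty
// tokens, imitating the elimination of redundant delimiters.
std::vector<std::string> splitDroppingEmpty(const std::string& data,
                                            char delimiter)
{
   std::vector<std::string> tokens;
   std::string current;
   for (std::size_t i = 0; i < data.size(); ++i)
   {
      if (data[i] != delimiter)
      {
         current += data[i];
         continue;
      }
      if (!current.empty()) tokens.push_back(current);
      current.clear();
   }
   if (!current.empty()) tokens.push_back(current);
   return tokens;
}

// Hypothetical helper: build a record, writing the placeholder for every
// empty field so that no consecutive delimiters are produced.
std::string buildRecord(const std::vector<std::string>& fields,
                        char delimiter,
                        const std::string& placeholder)
{
   assert(!placeholder.empty());
   std::string record;
   for (std::size_t i = 0; i < fields.size(); ++i)
   {
      if (i != 0) record += delimiter;
      record += fields[i].empty() ? placeholder : fields[i];
   }
   return record;
}
```

In this sketch the record "John##1970#1.53" yields only three tokens, with 1970 moving into the surname position, while a record built with the placeholder, "John#N/A#1970#1.53", keeps all four fields in place.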

Update 10-05-2003 - A far more advanced and optimal implementation of a string tokenizer in C++ can be found in the String Toolkit Library (StrTk). The new implementation supports generic tokenizing for arrays of any type, iterations and split functions.


StringTokenizer License

Free use of the StringTokenizer library is permitted under the guidelines and in accordance with the most current version of the "Common Public License."


Compatibility

The StringTokenizer C++ implementation is compatible with the following C++ compilers:

  • GNU Compiler Collection (3.3.1-x+)
  • Intel® C++ Compiler (8.x+)
  • Borland C++ Builder (5,6)
  • Borland C++ BuilderX

Download




© Arash Partow. All Rights Reserved.