Strings

Working with strings in C. String manipulation functions from the `string.h` library.


String Tokenization with strtok in C

String tokenization is the process of breaking down a string into smaller, meaningful parts, called tokens, based on predefined delimiters. The C standard library provides the strtok function to accomplish this task.

What is strtok?

strtok is a C standard library function that splits a string into tokens using specified delimiter characters. It modifies the original string in place by overwriting the delimiter that ends each token with a null character (\0). strtok keeps its position in hidden static state between calls, so it is not re-entrant: it is not guaranteed to be thread-safe, and interleaving the tokenization of two strings (for example, in nested loops) produces incorrect results.

How strtok Works

The strtok function has the following prototype:

char *strtok(char *str, const char *delim);
  • str: A pointer to the string you want to tokenize. On the first call, this should point to the string. On subsequent calls to continue tokenizing the *same* string, this should be NULL.
  • delim: A pointer to a string containing the delimiters. Any character in this string can act as a delimiter.
  • Return Value: strtok returns a pointer to the beginning of the next token. If there are no more tokens (i.e., the end of the string has been reached), it returns NULL.

Here's a step-by-step explanation:

  1. First Call: The first time you call strtok with a particular string (str), it skips any leading delimiter characters (delim) and finds the start of the first token. If the string is empty or contains nothing but delimiters, it returns NULL.
  2. Finding the Delimiter: strtok then scans forward for the next character that appears in delim. If one is found, it is overwritten with a null character (\0), and strtok remembers the position just after it. A pointer to the start of the token is returned. This is the first token.
  3. Subsequent Calls: To continue tokenizing the same string, you call strtok again, but this time you pass NULL as the first argument (str). This tells strtok to continue from the position it saved in the previous call.
  4. Iteration: strtok again skips any leading delimiters, finds the end of the next token, overwrites the delimiter that follows it with \0, and returns a pointer to the beginning of the new token. The delimiter set may differ from one call to the next.
  5. End of String: If the scan reaches the end of the string without finding another delimiter, the remaining text is returned as the final token. The call after that finds no token and returns NULL.

Example Code

Here's a C code example demonstrating how to use strtok:

#include <stdio.h>
#include <string.h>

int main(void) {
  char str[] = "This,is,a,sample,string.";
  char *token;

  // Use strtok to tokenize the string
  token = strtok(str, ","); // First call:  'str' points to the string, ',' is the delimiter

  // Loop through the tokens
  while (token != NULL) {
    printf("Token: %s\n", token);
    token = strtok(NULL, ","); // Subsequent calls: 'str' is NULL, ',' is still the delimiter
  }

  // strtok has modified 'str' in place. Its contents are now:
  // "This\0is\0a\0sample\0string."
  printf("\nOriginal String (modified): %s\n", str); // Prints only "This": %s stops at the first '\0'.

  return 0;
}

Explanation of the Code:

  • We include the necessary header files: stdio.h for input/output and string.h for string functions.
  • We declare a character array str containing the string to be tokenized.
  • We declare a character pointer token to store the address of each token.
  • The first call to strtok uses the original string str and specifies the delimiter ",".
  • The while loop continues as long as strtok returns a non-NULL pointer (i.e., a token is found).
  • Inside the loop, we print the current token.
  • The subsequent calls to strtok use NULL as the first argument, indicating that we want to continue tokenizing the same string. The delimiter remains ",".
  • After tokenization is complete, the original string `str` has been modified: each comma has been replaced by a null terminator. Printing str after the while loop therefore outputs only the first token, because printf's %s conversion stops at the first null terminator it encounters.

Important Considerations

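  • Modification of the Original String: strtok modifies the string it tokenizes, so the argument must be a writable buffer, never a string literal. If you need to preserve the original string, make a copy of it before calling strtok. You can use strcpy into a sufficiently large buffer, or strdup (POSIX, and part of standard C as of C23), in which case the copy must later be released with free.
  • Not Re-entrant: As mentioned earlier, strtok is not re-entrant because it stores its position in internal static state. It is unsafe when several threads tokenize at the same time, and it breaks when one tokenization is started while another is still in progress (for example, in nested loops or in a function called from inside a tokenizing loop). Consider using `strtok_r` instead; it is specified by POSIX and is re-entrant because the caller supplies the state pointer.
  • Consecutive Delimiters: strtok treats a run of consecutive delimiters as a single delimiter, and it also skips leading and trailing delimiters. It never returns empty tokens, so it is a poor fit for formats in which empty fields are significant.
  • Empty Strings: If the string to be tokenized is empty or contains only delimiters, the first call to strtok returns NULL.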

Alternatives to strtok

Because of the limitations of strtok, especially its non-reentrant nature, consider using alternative approaches for string tokenization, such as:

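  • strtok_r (POSIX): A re-entrant version of strtok. It stores its position in a caller-supplied char * instead of hidden static state, which makes it safe for threaded and nested use.
  • Manual Parsing: Implement your own tokenization logic using functions like strchr (find character in string), strcspn (measure the run of characters up to the next delimiter), and strncpy or memcpy (copy a portion of a string). This gives you more control and avoids the issues associated with strtok, such as the loss of empty fields and the modification of the input. Note that strncpy does not add a null terminator when the source is at least as long as the count, so you must terminate the copy yourself.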
  • Third-party Libraries: Some libraries provide more robust and feature-rich string manipulation functions.

Choosing the right approach depends on your specific needs and the complexity of your tokenization requirements.