Answering Rust Strings, UTF-8, Variable Encoding, Clone On Write (COW), String Trait methods, and why Strings can't be indexed

Explore Strings in general and how Rust protects us from invalid string operations.

An In-Depth Introduction to Strings in Rust and Their Distinctions

Strings are a complex kind of collection, especially when they are Unicode-encoded. Unlike most primitive types, the set of possible string values has no fixed limit. Compare this with a few simpler types:

  1. An 8-bit signed integer can represent values from -128 to +127, offering a finite range. It supports arithmetic and comparison operations.

  2. A Boolean type can be either true or false, offering only two possible values. It is commonly used in conditionals.

  3. A 64-bit unsigned integer can represent values from 0 to 0xffffffffffffffff, but it's still a finite range.

In contrast, a Unicode-encoded string can contain various characters, including spaces (which also take up memory), special characters, escape sequences, numerical characters, case-sensitive characters, characters from foreign languages, scientific symbols like π, and even emojis like 😂. Due to this complexity, handling strings is more intricate than handling the aforementioned types. Most programming languages, such as Python, Swift, Julia, and Rust, provide abstractions to facilitate string manipulation.

Strings offer more flexibility than other types. The source code of a program is itself text, which the compiler's front end parses according to the language's grammar. If passwords were represented as u64 or other integer types, they could be brute-forced far more easily. Representing passwords as strings with a mix of character types makes breaking them significantly harder, assuming the combination is strong. That same flexibility is why a string shouldn't be used for every purpose: it is effectively untyped, and the number of possible values is enormous. The code below shows why a string is such a complex type:

    use std::env::args;
    let environment_var = args().nth(1).expect("At least one argument is expected");
    match environment_var.as_str() {
        "1" | "One" | "ONE" | "one" => println!("The number is One"),
        "10" | "Ten" | "TEN" | "ten" => println!("The number is ten"),
        _ => println!("Catch all pattern"),
    }

Without a wildcard (catch-all) pattern, it would be impossible to cover every possible string. If the type is String, Rust only knows that it's a string, not which specific value it holds. It's the programmer's responsibility to decide the desired behavior. Later, we will see how ADTs like enums can reduce the possibilities and runtime overhead.
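As a rough sketch of that idea, the snippet below parses the text into an enum once, at the boundary, so the rest of the program works with a closed set of values. The Number enum and its error type are hypothetical names chosen for illustration, and the sketch accepts any letter casing, which is slightly more lenient than the match above.

```rust
use std::str::FromStr;

// Hypothetical enum for illustration: only two values are possible.
#[derive(Debug, PartialEq)]
enum Number {
    One,
    Ten,
}

impl FromStr for Number {
    type Err = String;

    // Every accepted spelling is mapped to a variant once, at the boundary.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "1" | "one" => Ok(Number::One),
            "10" | "ten" => Ok(Number::Ten),
            other => Err(format!("unsupported input: {other}")),
        }
    }
}

fn main() {
    // After parsing, the match lists every case explicitly; no `_` arm is needed.
    match "TEN".parse::<Number>() {
        Ok(Number::One) => println!("The number is One"),
        Ok(Number::Ten) => println!("The number is ten"),
        Err(e) => println!("{e}"),
    }
}
```

Once the value is a Number, the compiler checks that every variant is handled, which a plain String can never offer.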

Imagine writing a programming language whose source uses multi-byte scripts such as Tamil or Urdu. It would be awkward to work with unless the tooling let us operate on whole characters, as a fixed-width encoding like UTF-32 does, rather than on individual bytes.

Interpretation of bytes as Text

Each type has restricted operations and space constraints. Whether it's strings, integers, floats, booleans, or any other type, they all make sense at the type level or abstract, high level. At the CPU level, they are all composed of zeros and ones—nothing more. C/C++ doesn't provide many guarantees about the interpretation of data due to implicit conversions, lack of checking, and unrestricted pointers by default.

Type-level abstraction exists in all programming languages after Fortran (as far as I know), whether it's known at compile time or runtime. Bytes are interpreted in a way that makes sense at a high level and provides type safety before being executed by CPUs.

Before the prevalence of 8-bit architectures, ASCII was a 7-bit character encoding representing characters in the range of 0 to 0x7F (127). This encoding covered special characters, alphabetic characters (both upper and lower case), keyboard keys, and control characters. An ASCII superset is an 8-bit encoding that includes a total of 256 characters, ranging from 0 to 0xFF. However, the need arose to represent characters from all languages in a unified way without causing compatibility issues.

Unicode is a standard that assigns a unique code point to the characters of every written language. UTF-8, the encoding Rust uses for strings, represents each code point as a sequence of one to four bytes, making it a variable-width encoding. This contrasts with UTF-32, which uses a fixed four bytes for every code point. Standardization is essential to achieve portability and avoid compatibility issues across different systems. Programming languages like Swift, Julia, and Rust have used Unicode from the start, while popular languages like Python and Java have also adopted the Unicode character set. Standardization makes interpreting characters across different languages and platforms seamless.

Text directionality isn't always left-to-right. Some languages, such as Arabic, Urdu, and Hebrew, are read from right to left. Rust stores such text in logical (reading) order, just like English, and leaves the right-to-left layout to whatever renders the text. Calling next() on chars() therefore returns the first character in reading order, which for these languages is the one displayed at the far right.

    let string = String::from("שלום עולם");

    for char in string.chars() {
        println!("{}", char);
    }

("Hello World" in Hebrew.) Executing the above code prints one character per line in reading order, starting with ש, which is the rightmost letter of the first word as displayed.
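To make the ordering concrete, here is a small sketch that looks at both ends of the iterator. The helper first_and_last is a hypothetical name, and the word is written with escapes so that an editor's right-to-left rendering can't hide the storage order.

```rust
// Returns the first and last code points in storage (logical) order.
fn first_and_last(s: &str) -> Option<(char, char)> {
    let first = s.chars().next()?;
    let last = s.chars().next_back()?;
    Some((first, last))
}

fn main() {
    // "shalom" spelled in reading order: shin, lamed, vav, final mem.
    let word = "\u{05E9}\u{05DC}\u{05D5}\u{05DD}";
    println!("{:?}", first_and_last(word));
}
```

The first item is shin (U+05E9), the letter a reader starts with, even though a terminal or editor draws it at the right edge.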

How many bytes UTF-8 uses depends on the range the code point falls in:

  • The range from 0 to 0x7F requires one byte per character. This range covers the ASCII characters.

  • The range from 0x80 (128) to 0x7FF (2047) requires two bytes per character.

  • The range from 0x800 (2048) to 0xFFFF (65535) requires three bytes per character.

  • The range from 0x10000 (65536) to 0x10FFFF (1114111) requires four bytes per character.

Several code points can also be combined to form a single visible character, which then takes even more bytes. It's no wonder that indexing Unicode text is so complex.
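The table above can be written down directly and checked against the standard library. This is only a sketch: utf8_width is a hypothetical helper, and it additionally rejects the surrogate range (0xD800 to 0xDFFF), which UTF-8 does not encode.

```rust
// Number of bytes UTF-8 needs for a code point, following the ranges above.
// Returns None for values that are not Unicode scalar values.
fn utf8_width(code_point: u32) -> Option<usize> {
    match code_point {
        0xD800..=0xDFFF => None, // surrogates are not encodable
        0x0000..=0x007F => Some(1),
        0x0080..=0x07FF => Some(2),
        0x0800..=0xFFFF => Some(3),
        0x1_0000..=0x10_FFFF => Some(4),
        _ => None,
    }
}

fn main() {
    // One sample character from each range: 'A', 'ÿ', Tamil 'த', and '😂'.
    for c in ['A', '\u{ff}', '\u{0BA4}', '\u{1F602}'] {
        // char::len_utf8 from the standard library agrees with the table.
        assert_eq!(utf8_width(c as u32), Some(c.len_utf8()));
        println!("U+{:04X} takes {} byte(s)", c as u32, c.len_utf8());
    }
}
```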

Rust provides multiple ways to represent textual information, depending on use cases. Before delving into the String type, let's explore other textual types in Rust.

The char type stores a single Unicode scalar value, which can be an ASCII or a non-ASCII character, but only one at a time. A char always takes up 32 bits (4 bytes). A char can be allocated:

  1. In the stack, where they can be modified to different scalar codes if annotated with mut.

  2. Statically through static, making it read-only. A static mut char can be mutated, but only inside an unsafe block.

  3. On the heap using Box, explicitly.

The char type has four associated constants:

  1. char::MAX and char::MIN - The highest ('\u{10FFFF}') and lowest ('\0') valid Unicode scalar values.

  2. char::REPLACEMENT_CHARACTER - U+FFFD (�), used to stand in for invalid or undecodable input.

  3. char::UNICODE_VERSION - The version of Unicode that the char and str methods are based on.

Escape sequences can be used inside a char literal, as long as the literal denotes exactly one character. An unsigned 8-bit value can be converted to a char using char::from; values 0 to 127 map to ASCII, and 128 to 255 map to the code points U+0080 to U+00FF. This conversion needs neither try_from nor an Option or Result return type because it's infallible: every u8 value is a valid Unicode code point.

    // Different ways to construct a char
    let char = char::from(95u8);
    let char: char = 'j';
    let char: char = '\u{ff}';
    let char: char = char::from_u32(2048).unwrap();

    println!(
        "{} {} {:?}",
        char::MAX,
        char::REPLACEMENT_CHARACTER,
        char::UNICODE_VERSION
    );
    for i in 0..=u8::MAX {
        println!("{i} => {}", char::from(i));
    }

A character literal with the prefix b, called a byte literal, gives the u8 value of a single ASCII character. Only ASCII characters are allowed directly; other byte values need an escape such as b'\xFF'. This is the reverse of mapping u8 bytes to characters with char::from.

    let u8_byte_from_char: u8  = b'A';
    let u8_byte_from_char2 : u8 = b' ';
    println!("{} {}",u8_byte_from_char,u8_byte_from_char2);

The char type:

  • Is not subject to move semantics; ownership rules only come into play when it is placed behind an owning pointer type.

  • Has a fixed size (4 bytes).

  • Is a Copy type.

  • Takes 8 bytes when referenced (&char) on a 64-bit architecture.

String Slice

A string slice (str) contains a sequence of always-valid UTF-8 bytes in contiguous memory.

The str type is a primitive defined in the core of Rust, whereas String is provided by the standard library.

A string literal has type &'static str, a reference to immutable data, so we can't mutate it. str does have methods that take &mut self, but a &mut str can only be obtained from mutable storage such as a String, never from a literal.

    let str_slice: &str = "Rustaceans";
    let str_slice: &&str = &"Borrowed of Borrowed string literal";
    let string_slice: &&&str = &&"Borrowed of borrowed of borrowed string literal";
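A minimal sketch of in-place mutation through a &mut str, assuming the data lives in a String. The function name shout is made up for this example; make_ascii_uppercase and as_mut_str are standard library methods.

```rust
// Upper-cases ASCII letters in place through a &mut str.
fn shout(text: &mut str) {
    text.make_ascii_uppercase();
}

fn main() {
    let mut owned = String::from("rustaceans");
    // A String can hand out a &mut str; a string literal cannot.
    shout(owned.as_mut_str());
    println!("{owned}");
}
```

Methods like this never change the byte length, which is why they are safe to offer on str itself; anything that grows or shrinks the text needs a String.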

String slices, like any slice reference in Rust, consist of a pointer to the start of the data, which may live anywhere in memory, and a length that records how many bytes (or elements) the slice covers. Slices are bounds-checked at runtime, i.e., we can't access elements beyond their length.

String slices are,

  • Not owned

  • Usually seen as borrowed form (&str)

  • Always represent valid UTF-8

The concat! macro builds a static string slice from literals. Literals include floats, integers, booleans, char, and string literals.

    let str: &'static str = concat!("Hello ", 89, ' ', 90.8, ' ', true, ' ', 'a');
    println!("{}", str);

b"..." - A byte string literal converts a sequence of ASCII characters to the corresponding u8 bytes, yielding a reference to a fixed-size array (&[u8; N]). The bytes can be converted back to a string as long as they still form valid UTF-8.

    let u8_bytes_from_slice : &[u8; 5] = b"hello";
    //Slices of type &[T] have to_vec method to convert
    //to Vec<T>.
    println!("{}", String::from_utf8(u8_bytes_from_slice.to_vec()).unwrap());

    let u8_bytes_from_slice: &[u8; 40] = b"Returns the reference to u8 static array";
    println!("{}", String::from_utf8(u8_bytes_from_slice.to_vec()).unwrap());

When to use &str:

  • When dealing with elements of fixed length, such as an array, it's more efficient to use a string literal instead of creating a String that incurs memory allocation.

  • If a function doesn't need to modify the contents, it's better to accept borrowed &str to perform operations that require an immutable reference.

  • When an immutable string type is sufficient
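As an illustration of the second point, a function that only reads text can take &str and stay usable from every kind of caller. The function count_vowels is a hypothetical example, not something from the standard library.

```rust
// Borrowing &str lets the caller pass literals, Strings, or sub-slices.
fn count_vowels(text: &str) -> usize {
    text.chars()
        .filter(|c| matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u'))
        .count()
}

fn main() {
    let owned = String::from("Rustaceans");
    println!("{}", count_vowels("literal")); // &'static str
    println!("{}", count_vowels(&owned)); // &String coerces to &str
    println!("{}", count_vowels(&owned[..4])); // sub-slice of the String
}
```

Had the parameter been String, every call would need an owned value, forcing callers to allocate or give up ownership for no benefit.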

String type

A String in Rust is a mutable, growable, always-valid UTF-8, heap-allocated type. It is represented as a Vec<u8>, which means it is not a collection of characters, nor a null-terminated array as in C, but a contiguous sequence of u8 bytes. On a 64-bit architecture, the size of the String type is 24 bytes, comprising:

  1. ptr - a raw pointer (*mut u8) pointing to heap memory.

  2. len (usize) - the length of the string in bytes, not in terms of character count.

  3. capacity (usize) - the number of bytes currently allocated, used to decide when to reallocate.
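The three fields can be observed from safe code. The sketch below uses a hypothetical helper named measure; the size assertions reflect how current compilers lay these types out rather than a documented guarantee.

```rust
use std::mem::size_of;

// Returns (length in bytes, capacity in bytes, number of chars).
fn measure(s: &String) -> (usize, usize, usize) {
    (s.len(), s.capacity(), s.chars().count())
}

fn main() {
    let word = size_of::<usize>();
    assert_eq!(size_of::<&str>(), 2 * word); // pointer + length
    assert_eq!(size_of::<String>(), 3 * word); // pointer + length + capacity

    let mut s = String::with_capacity(32);
    // The Tamil word from this article: 5 code points, 3 bytes each.
    s.push_str("\u{0BA4}\u{0BAE}\u{0BBF}\u{0BB4}\u{0BCD}");
    let (len, capacity, chars) = measure(&s);
    println!("len = {len} bytes, capacity = {capacity} bytes, chars = {chars}");
}
```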

A reference to a string is a slice, which is a view of actual data, not a copy. In Rust, expensive copies/clones are explicit.

Rust strings are encoded using variable-width encodings, consuming less space and exhibiting other characteristics of UTF-8.

Quotation marks have different meanings in Rust:

  • A single quote represents a char type and can be converted to a String.

  • A double quote represents a str slice and can be converted to a String.

  • A String is an object type. Not in the sense of an object in OOP, but it has a collection of inherent methods to manipulate strings. A string can be created in more than one way in Rust.

When to use String:

  • When a mutable string type is required that can grow or shrink.

  • &mut String can be used when mutating data, which may involve allocation and deallocation. Additionally, the methods defined on String can be used.

  • If the string needs to be mutable, the option should be to use String.

Rust provides a wide range of APIs for manipulating UTF-8 strings. While most languages provide similar APIs, Rust's guarantees and performance stand out. These APIs offer convenience and are designed to adhere to Rust's type system.

Owned types are moved, as copying them can be expensive compared to copy types. Memory owned by strings is automatically deallocated without manual intervention.

References behind &T and &mut T do not own the data they point to. In contrast, smart pointers like Box, Rc, and Arc own the data they point to, ensuring proper cleanup.

A String is:

  • Owned, meaning it owns the heap-allocated data it points to and is responsible for its cleanup, hence, it participates in ownership rules.

  • The length of its contents isn't known at compile time, although the String value itself has a fixed size.

  • A reference to a String itself (&String) occupies 8 bytes on a 64-bit architecture.

  • Always valid UTF-8, with exceptions possible using unsafe code.

Why Rust String types can't be indexed using [usize]

If a string is encoded using UTF-8 or UTF-16, as is the norm today, the language should not allow indexing with a single value as we normally would for other collection types. Indexing is a constant-time operation for collections laid out contiguously in memory where every element takes the same number of bytes, which allows efficient random access. Indexing character-wise in UTF-8 or UTF-16 is not constant-time, because some characters require more bytes than others. Moreover, a single Unicode scalar value does not necessarily represent a complete user-perceived character. It may be only one part of a grapheme cluster, which is a sequence of one or more Unicode scalar values that together form what a reader sees as one character, as defined by the Unicode standard. Emojis and languages like Urdu, Hindi, or Tamil can't be represented using single bytes; they need more than one byte per character in memory.

Python and Julia's strings allow indexing, but not mutation through indexing, as this could corrupt the string. Indexing in these languages is constant-time: Julia indexes by byte offset, while Python indexes by code point and stores each string in a fixed-width form internally. This is where the real difficulty arises. Variable-width encodings like UTF-8 and UTF-16 encode some characters in one byte (like ASCII in UTF-8) and others in two, three, or four bytes, and a user-perceived character may span several code points. Indexing with a single integer might therefore not yield the desired character.

Python's documentation explains this as:

There is no separate “character” type; indexing a string produces strings of length 1.

It prints values at the specified index, without indicating whether it's an actual character in Unicode.

Consider the Python code below:

# 'Have a nice day' in Tamil
string = "இந்த நாள் நல்லதாக அமைய வாழ்த்துக்கள் "
print(string[2],
      string[6],
      string[8],
      string[9],
      string[12])

The output consists of isolated combining marks such as ் and ா, which don't make any sense without the characters they attach to. The same is true for other languages and for emojis made of multiple code points.

  • Python doesn't provide any awareness in this context. Even slice indexes have the same behavior, but at least Python strings are immutable.

  • Julia throws an error if the byte index is not valid, which is better than printing whatever is at that byte index.

  • Rust doesn't allow indexing with a single integer at all but provides range-based slicing with safety guarantees.

  • Swift takes a different approach: a dedicated String.Index type is used to index the string, and the result is a Character. Like Rust, Swift doesn't accept plain integer values as string indexes. Rust has no dedicated string index type; instead, methods like char_indices yield byte offsets that can be used safely in a range index.
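For comparison, here is a sketch of how the same "give me the n-th code point" request can be expressed in Rust with char_indices. The helper nth_char_slice is a hypothetical name; it walks the string from the start, so it is linear in n rather than constant-time.

```rust
// Returns the n-th code point of `s` as a sub-slice, or None if out of range.
fn nth_char_slice(s: &str, n: usize) -> Option<&str> {
    // char_indices yields (byte offset, char) pairs on valid boundaries.
    let (start, ch) = s.char_indices().nth(n)?;
    s.get(start..start + ch.len_utf8())
}

fn main() {
    // 'a', 'é' (2 bytes) and '😂' (4 bytes)
    let text = "a\u{e9}\u{1F602}";
    for n in 0..4 {
        println!("{n}: {:?}", nth_char_slice(text, n));
    }
}
```

Like the Python example, this still returns single code points, so a combining mark would come back on its own; the difference is that the cost and the possibility of failure are visible in the code.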

From the Rust Book:

Prevents misunderstandings early in the development process.

If the text is encoded with a fixed width, indexing is guaranteed to return an actual character in constant time. Examples are 1-byte ASCII and UTF-32, which pads every character to 4 bytes even if it needs only 1, trading space for constant-time indexing. Rust strings use a variable-width encoding, so indexing by byte may not land on a whole Unicode character, as some characters are represented using a few bytes while others need more. In exchange, UTF-8 usually takes up less space than UTF-32, and less than UTF-16 for ASCII-heavy text. Another reason for using UTF-8 is that it is standards-compliant and backward-compatible with ASCII.
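The space trade-off can be measured with standard library iterators. In this sketch, encoded_sizes is a hypothetical helper that counts the bytes each encoding would need, ignoring any byte-order mark.

```rust
// Bytes needed to store `s` in UTF-8, UTF-16 and UTF-32.
fn encoded_sizes(s: &str) -> (usize, usize, usize) {
    let utf8 = s.len(); // str is already UTF-8
    let utf16 = s.encode_utf16().count() * 2; // 2 bytes per UTF-16 code unit
    let utf32 = s.chars().count() * 4; // 4 bytes per code point
    (utf8, utf16, utf32)
}

fn main() {
    // ASCII text, a Tamil letter, and an emoji.
    for text in ["Rust", "\u{0BA4}", "\u{1F602}"] {
        println!("{text}: {:?}", encoded_sizes(text));
    }
}
```

ASCII text is smallest in UTF-8, while a script like Tamil takes three bytes per code point in UTF-8 against two in UTF-16, so the space advantage depends on the text.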

Refer to this discussion and here for more information.

From Wikipedia:

UTF-8 results in fewer internationalization issues than any alternative text encoding, and it has been implemented in all modern operating systems, including Microsoft Windows, and standards such as JSON, where, as is increasingly the case, it is the only allowed form of Unicode.

UTF-8 is the dominant encoding for the World Wide Web (and internet technologies), accounting for 97.9% of all web pages, over 99.0% of the top 10,000 pages, and up to 100.0% for many languages, as of 2023.[9] Virtually all countries and languages have 95.0% or more use of UTF-8 encodings on the web.

So, how can we index strings in Rust? It can be done using the range syntax operator or methods provided by the language.

    //English alphabets are ASCII
    let ascii_str = "English";
    let utf_str = "தமிழ்";
    //Printing elements
    ascii_str.char_indices()
        .for_each(|(index, char)| println!("Index {index} : {char}"));
    println!("\n");
    //Accessing individual character
    println!(
        "First Word: {}\n Second Word: {} \n Third Word: {}",
        //one-byte offset like [0]
        &ascii_str[..1],
        &ascii_str[1..2],
        &ascii_str[2..3]
    );
    println!(
        " First Word:  {}\n Second Word: {}\n Third Word:  {}",
        //Multi-byte offsets
        &utf_str[..3],
        &utf_str[3..=8],
        &utf_str[9..]
    );
    println!("{}", ascii_str.chars().nth(1).unwrap());

String slices can be indexed using range operators or equivalent methods, which provide a view into the actual data rather than making a copy, as is the case with defaults in languages like Python or Julia. When indexing, we operate on bytes rather than character indexes. This is due to variable-length encoding, and Rust does not offer a distinct type for indexing other than integer values in range syntax.

It's easy to trigger a panic if the range values don't respect UTF-8 character boundaries. In the above code, the indexes for the non-ASCII characters are hardcoded after determining the byte offset of each character. If the chosen byte range does not align with character boundaries, a panic will occur. Although the characters may appear as simple distinct units to the naked eye, they are sequences of bytes to a computer. Using slice methods to access them is more ergonomic than manually working out byte offsets.
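One way to avoid the panic is to check the boundaries before slicing. The sketch below uses str::is_char_boundary; safe_slice is a hypothetical helper that behaves like the built-in get method with a start..end range.

```rust
// Slices `s` only if both offsets fall on character boundaries.
fn safe_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    // is_char_boundary also returns false for offsets past the end.
    if start <= end && s.is_char_boundary(start) && s.is_char_boundary(end) {
        Some(&s[start..end])
    } else {
        None
    }
}

fn main() {
    // Two Tamil letters, 3 bytes each.
    let text = "\u{0BA4}\u{0BAE}";
    println!("{:?}", safe_slice(text, 0, 3)); // a whole character
    println!("{:?}", safe_slice(text, 1, 3)); // starts mid-character
}
```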

For Unicode strings containing only ASCII characters, where a single byte represents an actual character, we can use iterator methods to access them without causing panics. Here's how:

    let ascii_characters = "Indexing ASCII Supersets\
    Using Single Byte Index\
    Won't Cause Panic";
    for index in 0..ascii_characters.len() {
        //Calling unwrap is fine here
        //since single byte index is valid for ASCII
        println!("{}", ascii_characters.chars().nth(index).unwrap());
    }

All ASCII text is valid UTF-8, but not vice versa. Converting from ASCII to Unicode always succeeds, whereas the reverse only works when every character lies in the ASCII range.

Index<usize> and IndexMut<usize> are not implemented for String. This is why we can't index strings using a single value, as is allowed for some collection types. However, Index<Range<usize>> and its variants are implemented, and they panic if the range does not fall on character boundaries.

Common ways to Construct a String

A String can be constructed in multiple ways, each returning a String instance with different use cases and trade-offs.

  • String::new() - An associated function on the String type that returns an empty string without allocating anything. Later mutating calls may allocate. The length and capacity are zero. If you know upfront how much storage is needed, you can use String::with_capacity(usize), another associated function on String. No reallocation happens as long as the contents don't exceed the capacity; once they do, the capacity grows (in practice it roughly doubles). String is a struct that wraps a Vec<u8>, so it supports some of the vector's methods while upholding the string guarantees.

  • String::from_utf8(Vec<u8>) - Construct from a vector of u8 bytes. This method moves the vector and either returns a String without additional allocation (reusing the original vector's buffer) or returns an error containing the vector if the bytes are not valid UTF-8.

  • String::from(&str) - Construct a String from a string slice. This is equivalent to calling to_string directly on str slices.

  • String::from(char) - Construct a string from a single character. This is equivalent to calling the to_string method directly on a char.

    let string: String = String::new();
    let string: String = String::with_capacity(100);

    let vec: Vec<u8> = vec![
        240, 159, 149, 181, 240, 159, 143, 188, 226, 128, 141, 226, 153, 128, 239, 184, 143,
    ];
    let string: String = String::from_utf8(vec).unwrap();
    let string: String = String::from('*');
    let string: String = String::from("Str Slice");

The type annotations are optional and are used here to show that each method returns an instance of the String type. Each let string shadows the previous binding of the same name.
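The example above unwraps the result of from_utf8. As a sketch of the failure path, the snippet below hands the original bytes back to the caller when they are not valid UTF-8, using FromUtf8Error::into_bytes. The wrapper bytes_to_string is a hypothetical helper.

```rust
// Tries to build a String; on failure, returns the original bytes untouched.
fn bytes_to_string(bytes: Vec<u8>) -> Result<String, Vec<u8>> {
    String::from_utf8(bytes).map_err(|err| err.into_bytes())
}

fn main() {
    // The ASCII codes for "Rust".
    println!("{:?}", bytes_to_string(vec![82, 117, 115, 116]));
    // 0xFF can never appear in valid UTF-8, so the bytes come back in Err.
    println!("{:?}", bytes_to_string(vec![82, 0xFF]));
}
```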

The method from is a trait method, not an instance or inherent method of String. The into method can be used if from is implemented for a particular type.

Calling to_owned() on borrowed slices returns an owned instance of that type.

  • &Path.to_owned() -> PathBuf

  • &str.to_owned() -> String

  • &OsStr.to_owned() -> OsString

  • &CStr.to_owned() -> CString (null-terminated C compatible type)

Finally, calling to_owned() on any type that implements the Clone trait will return independent copies. For copy types, this is the same as assigning to a new variable. For Clone types, this is equivalent to calling clone() on them.

//Copy type
let a =10;
let b = a;
//do the same as above
let c = a.to_owned();
let d = a.clone();

//Owned type
let w : String = String::from("Rust");

//Both methods do the same i.e calling to_owned on the Owned type itself
//returns the independent copies of that type thus 
//cause new allocation like a clone.
let x: String = w.to_owned();
let y : String  = w.clone();
println!("w {} x {} y {}",w,x,y);

//Moves the w into z
let z: String= w.into();

Converting other types, i.e., non-textual ones, to String can be achieved by calling the to_string() method on any type that implements the Display trait. This includes user-defined types if the Display trait is implemented for them. All primitive types (char, bool, integers, floats, str slice), and others, can be converted to a String instance. As a result, we don't need to write boilerplate code for converting to a String. However, it's important to note that this method involves memory allocation.

use std::sync::Arc;
use std::rc::Rc;
let from_integer : String = 15634.to_string();
let from_float : String = 56e-10.to_string();
let from_char : String = 'a'.to_string();
let from_str_slice : String = "Hello".to_string();
let from_bool : String = true.to_string();
let from_box : String = Box::new(45).to_string();
let from_arc : String = Arc::new("from_arc").to_string();
let from_rc : String = Rc::new(true).to_string();
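The same works for user-defined types once Display is implemented. In the sketch below, Point is a hypothetical type used only for illustration.

```rust
use std::fmt;

// Hypothetical user-defined type, used only for illustration.
struct Point {
    x: i32,
    y: i32,
}

impl fmt::Display for Point {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "({}, {})", self.x, self.y)
    }
}

fn main() {
    // to_string comes for free from the blanket ToString impl over Display.
    let text: String = Point { x: 3, y: -4 }.to_string();
    println!("{text}");
}
```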

  • String::from_iter() - Construct a String instance from an iterator whose element type is Box<str>, char, &str, String itself, and others.

let from_char: String  = String::from_iter(&['a','b', 'd',' ']);
let slice_of_iter:[&str;3] = ["One","Two","Three"];
let from_str : String = String::from_iter(slice_of_iter);

  • format!() - Similar to the println! macro, but it returns a formatted String instance instead of printing it to the terminal. The format! macro can be used to build a String from various types of references, and it supports the same formatting as the println! macro.

String (UTF-8) interpolation can be formatted differently using template patterns described in std::fmt, which provides its own mini-language akin to pattern matching. String interpolations are employed within macros like println!, print!, format!, format_args!, writeln!, and write!.

    let i32_ = 56;
    let bool_ = true;
    let str_slice = "format";
    let vec = vec![1, 4, 5, 6, 7];
    let string = format!(
        "Integer {:*^12b} ,Bool {} , Str {} and Vector {:?}",
        i32_, bool_, str_slice, vec
    );
    println!("{}", string);
    //Without this writeln! macros can't call
    //Write methods on String
    use std::fmt::Write;
    //Rough guess to avoid allocation when writing
    let mut string_ = String::with_capacity(55);
    //Only write to the string doesn't print anything to the terminal
    writeln!(
        string_,
        "Length of Above String is {0} and the bool is {bool_}",
        string.len()
    )
    .unwrap(); // writeln! returns a Result that should not be ignored
    println!("{string_}");

The format! macro doesn't take ownership of anything it uses. Once it has constructed a string, that string no longer depends on the values it was built from; it is a separate entity. The placeholders {} and {:?} are used when the type implements the Display and Debug traits, respectively.

  • Calling concat or join on a slice of strings (such as [&str]) returns a String.

    let slice_of_str = ["Strings ", "In ", "Rust "];
    let string_by_join : String = slice_of_str.join("\n");
    println!("{string_by_join}");
    let mut string_by_concat : String= slice_of_str.concat();
    //Equivalent to multiple push_str
    string_by_concat.extend(["Push ", "More ", "str ", "slice ", "in ", "one ", "go "]);
    println!("{}", string_by_concat);

  • Strings can be constructed by calling the collect method on an iterator only when the type of item produced by the iterator is char, &str, or String itself (among a few others). The String type implements FromIterator for these item types.
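A short sketch of collect with each of those item types; letters_upper is a hypothetical helper added to show a chain of iterator adapters ending in a String.

```rust
// Keeps only alphabetic characters and upper-cases them.
fn letters_upper(input: &str) -> String {
    input
        .chars()
        .filter(|c| c.is_alphabetic())
        .flat_map(|c| c.to_uppercase()) // to_uppercase may yield several chars
        .collect()
}

fn main() {
    // Items of type char
    let from_chars: String = "stressed".chars().rev().collect();
    // Items of type &str
    let from_slices: String = vec!["Str", "ings"].into_iter().collect();
    // Items of type String
    let from_strings: String = vec![String::from("a"), String::from("b")]
        .into_iter()
        .collect();
    println!("{from_chars} {from_slices} {from_strings} {}", letters_upper("r2-d2"));
}
```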

  • include_str! embeds a text file into the binary, and its result type is &'static str. The file must be present at compile time; once the binary is produced, the file is no longer needed because its contents are embedded. You can turn the result into a String by calling to_string, which lets you manipulate it with String methods without depending on the file system. This also means that whatever changes you make to the String are not written back to the file on disk.

    let mut string = include_str!(r#"D:\Sanjeevi\Chrome tabs.txt"#).to_string();
    //or
    //let mut string = String::from(include_str!(r#"D:\Sanjeevi\Chrome tabs.txt"#));
    string.push_str("Append it at the end of the file");
    println!("{}", string);

Rust String Methods:

Once the string instance is created in any of the above-mentioned ways, we can call any methods of String on it. Most methods of String internally call vector methods, as it's a wrapper around a vector of u8 bytes. The only difference is that these methods are checked for valid Unicode points. These methods are Unicode-compliant and respect Rust's type system principles such as ownership, borrowed data not outliving the referent, unique mutability, and accessibility by multiple parties, among others.

Some methods are specific to ASCII, and some are Unicode-aware. For example, is_numeric and is_alphabetic on a char return true or false according to the Unicode properties of the code point.

len - Returns the length as the total number of bytes. The length of the string doesn't necessarily equal the number of characters it contains; the two are equal only when the string contains ASCII-only characters. The len method is defined on the str slice. If you want to know the exact byte length of the string, you can call len() directly on the String object or on a slice of the whole string.

    let string = String::from("ਸਵਾਗਤ ਹੈ");
    println!("{}",string.len());
    println!("{}",&string[..].len());
    //Return 7
    println!("{}",&string[12..19].len());

The len method itself never panics. The range index used to take the sub-slice will panic, however, if either end of the range falls in the middle of a multi-byte character.

   //Counting length of Unicode scalars.
    let string = "🫠🤥☺️🥳😃🫣";
    println!("{}",string.chars().count());
    //To know the char boundary indices.
    let indexes = string.char_indices()
                      .map(|(index,_)|index)
                      .collect::<Vec<usize>>();
    println!("{:?}",indexes);

Even with the offsets from char_indices, slicing can panic if a range ends in the middle of a character, and a single index value can't be used at all. The following methods are a safer alternative to [..] and similar range syntax: they return an Option instead of panicking.

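     // These methods return None instead of panicking;
     // a panic can only come from calling unwrap or expect on None.
     println!("{:?}", string.get(4..8));
     println!("{:?}", string.get(..));
     println!("{:?}", string.get(22..));

     // get_mut needs mutable storage and yields Option<&mut str>, so String
     // methods are not available on the result. replace returns a new String
     // and leaves the original untouched.
     let mut owned = String::from(string);
     if let Some(s) = owned.get_mut(..) {
         println!("{}", s.replace("🤥", "🫣"));
     }
     println!("{}", owned);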

Adding a string can be done through the + operator or the add method. In this operation, the left operand takes ownership, while the right operand is a borrowed string, denoted as &str. The signature of the add method is fn add(self, other: &str) -> String;.

Both the + operator and the add method allow us to concatenate a String and a string slice. The + operator works without any import, but calling add (or add_assign) as a method requires importing the Add (or AddAssign) trait from std::ops, because trait methods can only be called when the trait is in scope.

Another arithmetic operator overloaded for String is the AddAssign<&str> operator. This operator doesn't return anything (()), similar to the behavior of the push_str method.

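    use std::ops::{Add, AddAssign};
    // An owned String to start from (added here so the snippet stands alone)
    let string = String::from("Hello ");
    // Both are equivalent
    let string_1 = string + "String ";
    let mut string_2 = string_1.add("String_1 ");

    // Both are equivalent
    string_2 += "World ";
    string_2.add_assign("World ");
    println!("{}", string_2);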
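The ownership side of + is easiest to see through a function signature. In this sketch, join_with_space is a hypothetical helper: the left operand is taken by value and the right one is only borrowed.

```rust
// `left` is consumed by the concatenation; `right` is only borrowed.
fn join_with_space(left: String, right: &str) -> String {
    left + " " + right
}

fn main() {
    let hello = String::from("Hello");
    let world = String::from("World");
    let joined = join_with_space(hello, &world);
    // `hello` was moved and can no longer be used; `world` is still valid.
    println!("{joined} / {world}");
}
```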

In Rust, the String type doesn't overload other arithmetic operators. In contrast, Python overloads the multiplication operator for strings to repeat the string on the left the number of times given on the right. In Rust, we can't implement the Mul trait for the String type ourselves due to the orphan rule. However, the same outcome can be achieved through the repeat() method or with iterator methods.

Three possible ways:

    let string = "Buenos dias ".repeat(4);
    println!("{}", string);

    println!("{}", repeat("Hello ", 4));

    fn repeat(word: &str, re: usize) -> String {
        let mut string_ = String::new();
        string_.push_str(word);
        // This last expression returns a String
        string_.repeat(re)
    }

    // Same behavior with iterators.
    let string = String::from("Repeat 5 times ");
    println!("{}", string.lines().cycle().take(5).collect::<String>());

Rust strings are represented as UTF-8, not UTF-32. As a result, character-wise operations don't always behave as expected. For example, pop() removes the last Unicode scalar value, which may be only part of what a reader sees as the last character (a combining mark, for instance). Due to variable-length encoding, performing character-wise operations correctly requires knowing which bytes make up a single user-perceived character, which is cumbersome without higher-level libraries. The same caution applies to any method that accepts a single usize byte index, such as insert or remove: these work naturally for ASCII-only strings but panic if the index is not on a character boundary.
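A small sketch of the pop() behavior, using an "é" written as a plain 'e' followed by a combining accent. The helper pop_last is a hypothetical name.

```rust
// Removes the last code point and returns it together with what is left.
fn pop_last(mut s: String) -> (Option<char>, String) {
    let popped = s.pop();
    (popped, s)
}

fn main() {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as one character.
    let decomposed = String::from("cafe\u{301}");
    let (popped, rest) = pop_last(decomposed);
    // Only the accent is removed; the 'e' stays behind.
    println!("{:?} {:?}", popped, rest);
}
```

A reader would expect the whole accented letter to disappear, which is exactly the gap between code points and grapheme clusters.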

Thanks to Rust's Pattern trait, String and str methods that search text can accept several argument types through the same method signature. Any type implementing the trait can serve as the pattern: a char, a &str, a &String, a slice or array of chars, or a closure that takes a char and returns a bool. Matching against whole patterns rather than single code units is particularly useful for Unicode text, and the abstraction conceals much of the low-level complexity of the Unicode standard.

A char pattern works for any single Unicode scalar value, not only ASCII, and avoids the overhead of a &str pattern when searching for one character. What a char cannot hold is a user-perceived character made of several code points, such as a letter followed by a combining accent, or many emojis. Use a &str pattern for those; Rust issues a compile error if a char literal contains more than one code point.

Two patterns are especially useful for Unicode text:

  1. &str - A UTF-8 encoded string slice.

  2. &String - The String itself, only borrowed. The pattern is used for matching, not for modifying the string.
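As an illustration, the sketch below passes several pattern types to ordinary search methods. The helper names count_char and split_words are invented for this example:

```rust
// Counts occurrences of a single scalar value using a `char` pattern.
fn count_char(text: &str, needle: char) -> usize {
    text.matches(needle).count()
}

// Splits on a closure pattern and drops the empty pieces.
fn split_words(text: &str) -> Vec<&str> {
    text.split(|c: char| c.is_whitespace() || c == ',')
        .filter(|piece| !piece.is_empty())
        .collect()
}

fn main() {
    let text = String::from("東京, 京都, 大阪");
    let needle = String::from("京都");

    // char pattern: works for non-ASCII scalar values too.
    println!("{}", count_char(&text, '京'));
    // &str pattern: the result is a byte index.
    println!("{:?}", text.find("大阪"));
    // &String pattern, borrowed only for matching.
    println!("{}", text.contains(&needle));
    // Slice-of-chars pattern: matches any of the listed chars.
    println!("{:?}", text.find(&[',', ' '][..]));
    // Closure pattern.
    println!("{:?}", split_words(&text));
}
```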

There are three ways to interpret the bytes of a string:

  1. As raw bytes, which are the UTF-8 encoding of the string. This is the inverse of the from_utf8 method. The bytes are displayed in decimal by default; use {:x?} to see them in hexadecimal.

    // as_bytes() prints as a plain list of numbers
    println!(
        "In Decimal {:?}\n\nEquivalent In Hex {:x?}",
        "Hello 日本語".as_bytes(),
        "Hello 日本語".as_bytes()
    );

  2. As chars. A char is not always the character you might expect but a single Unicode scalar value (code point).

  3. As grapheme clusters, which are closest to what readers perceive as characters. The standard library doesn't provide this, but crates such as unicode-segmentation and print-positions do.
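The first two views can be compared with the standard library alone. This is only a sketch with a made-up helper, byte_and_scalar_counts; counting grapheme clusters would need one of the crates named above:

```rust
// Returns (length in bytes, number of Unicode scalar values).
fn byte_and_scalar_counts(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

fn main() {
    // The last entry is 'e' followed by a combining accent: one visible
    // character, two scalar values, three bytes.
    for text in ["Hello", "日本語", "e\u{301}"] {
        let (bytes, scalars) = byte_and_scalar_counts(text);
        println!("{text:?}: {bytes} bytes, {scalars} chars");
        // Hexadecimal view of the raw UTF-8 bytes.
        println!("  {:x?}", text.as_bytes());
        // Each scalar value with its code point.
        for ch in text.chars() {
            println!("  U+{:04X}", ch as u32);
        }
    }
}
```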

String methods

  • push and insert - Append a char, or insert one at a byte index. Both take the char by value, not a reference to it.

  • push_str and insert_str - Append a string slice, or insert one at a byte index.

  • remove(index) - Removes and returns the char at the given byte index; panics if the index is out of bounds or not on a char boundary.

  • pop() - Removes the last char and returns it as an Option, which is None for an empty string.

  • extend - Equivalent to multiple push or push_str operations, fed by an iterator.

  • replace and replacen - Return a new String with matches of a pattern replaced. Methods ending in n, such as replacen and splitn, take a count that limits the operation to the first n matches or items.

  • Methods that start with r, like rmatches, perform the operation from right to left, i.e., in reverse order.

  • Methods that start with into take ownership of the data.

  • Methods that start with is return a bool, so they can be used anywhere a boolean is required.

  • Methods that start with as perform a cheap conversion that results in borrowed data.

  • fs::read_to_string() is a utility function in the standard library that reads the whole content of a file into a String. It wraps the open-and-read file operations to reduce boilerplate and is essentially built on the read_to_string method of the io::Read trait. It returns an error if the content is not valid UTF-8, so don't use String for arbitrary data such as binary streams; types like Vec<u8> are better suited to that.

fn main() {
    // "Have a nice day!" in Arabic.
    let mut string = "أتمنى لك نهارا سعيدا!".to_string();

    // chars() and char_indices() walk the text in logical (reading) order;
    // the terminal decides how right-to-left text is displayed.
    for (index, _ch) in string.char_indices() {
        // char_indices() yields byte offsets while nth() counts chars, so
        // for multi-byte text the later lookups return None. Don't unwrap.
        println!("{:?}", string.chars().nth(index));

        // if let prints only when there is a character and does nothing otherwise
        if let Some(ch) = string.chars().nth(index) {
            println!("Character is {ch}");
        }

        // match gives the choice between skipping a None (continue)
        // and stopping at the first None (break).
        match string.chars().nth(index) {
            Some(ch) => println!("Character is {ch}"),
            None => continue,
        }
    }
    // Returns the byte position as Some, or None if there is no match
    println!("{:?}", string.rfind("س"));
    // Collect all matches into another String
    println!("{}", string.matches('ا').collect::<String>());
    // Returns a new String with the first two matches replaced
    println!("{:?}", string.replacen('ا', "h", 2));

    // 27 is the byte index returned by rfind above;
    // an index that is not on a char boundary panics.
    string.remove(27);

    // pop() removes the last char in logical order ('!' here), which a
    // right-to-left script displays at the left edge.
    println!("{:?}", string.pop());
}

A String is not itself an iterator, so putting it directly in a for loop is an error, unlike vectors and other collections. Instead, we obtain iterators by calling methods such as chars() or bytes() on the string or its slices, which gives access to all the iterator methods. This way, we can perform a sequence of operations on the characters and collect the result into another collection or back into a String for further processing. A method such as count is not defined on String; it belongs to the Iterator trait, which is implemented by the types these methods return, for example the Chars iterator over Unicode scalar values. Iterator methods that take an iterator and return a new one are called adapters (or combinators), and they fall roughly into these groups, together with the methods that reduce an iterator to a final value:

  • Mapper - map, map_while, enumerate, flat_map, chain, cycle

  • Filter - filter, filter_map

  • Reducer - fold, count, sum, collect, collect_into (nightly-only at the time of writing)
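The three groups can be combined in a single chain. Below is a minimal sketch with invented helper names (shout_letters, count_vowels, utf8_len_via_fold), using only stable iterator methods:

```rust
// Filter, then map each char to its uppercase form, then collect into a String.
fn shout_letters(text: &str) -> String {
    text.chars()
        .filter(|c| c.is_alphabetic())
        // to_uppercase() returns an iterator, because one char may
        // uppercase to several chars.
        .flat_map(|c| c.to_uppercase())
        .collect()
}

// Filter followed by the `count` reducer.
fn count_vowels(text: &str) -> usize {
    text.chars().filter(|c| "aeiouAEIOU".contains(*c)).count()
}

// `fold` used as a reducer: sums the UTF-8 length of every char.
fn utf8_len_via_fold(text: &str) -> usize {
    text.chars().fold(0, |total, c| total + c.len_utf8())
}

fn main() {
    println!("{}", shout_letters("hello, world 42!"));
    println!("{}", count_vowels("hello world"));
    println!("{}", utf8_len_via_fold("日本語"));
}
```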

Methods that return iterators:

  • bytes() - Returns an iterator where each item is a u8 byte.

  • chars() - Returns an iterator over the chars of the string, split at char boundaries. Each item is a Unicode scalar value, which is not always a complete user-perceived character.

  • char_indices() - Similar to chars().enumerate(), but each char comes with the byte index at which it starts instead of its position in the sequence. This is handy for finding byte indices to use in range syntax. enumerate and char_indices yield the same numbers for single-byte text such as English letters and ASCII punctuation, but not for multi-byte characters.

    let single_byte_chars = String::from("Hello");
    let multi_byte_chars = String::from("עולם");

    // Pair each enumerate() position with the matching char_indices() byte offset
    for ((index, elem), (byte_index, ch)) in single_byte_chars
        .chars()
        .enumerate()
        .zip(single_byte_chars.char_indices())
    {
        println!("Index     : {index}, character: {elem}\nByte index: {byte_index}, character: {ch}");
    }
    println!();
    for ((index, elem), (byte_index, ch)) in multi_byte_chars
        .chars()
        .enumerate()
        .zip(multi_byte_chars.char_indices())
    {
        println!("Index     : {index}, character: {elem}\nByte index: {byte_index}, character: {ch}");
    }
  • match_indices() - Returns an iterator of tuples holding the starting byte index of each match and the matched slice.

  • matches() - Like match_indices, but without the index.

  • rmatch_indices() and rmatches() - Similar to the above methods, but in reverse order.
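A short sketch of these matching iterators follows; match_positions is a hypothetical helper name. Note that every index is a byte offset:

```rust
// Collects the starting byte index of every match.
fn match_positions(haystack: &str, needle: &str) -> Vec<usize> {
    haystack.match_indices(needle).map(|(index, _)| index).collect()
}

fn main() {
    // Each of these characters takes three bytes in UTF-8.
    let text = "日本語と日本";
    println!("{:?}", match_positions(text, "日本"));

    // Same matches, visited from the end of the string.
    let reversed: Vec<(usize, &str)> = text.rmatch_indices("日本").collect();
    println!("{:?}", reversed);

    // matches() with a function pattern keeps only the matched slices.
    let digits: Vec<&str> = "a1b22c3".matches(char::is_numeric).collect();
    println!("{:?}", digits);
}
```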

One advantage of collecting into a String is readability on screen, because String implements Display as well as Debug. Other collections, such as Vec, only implement Debug, whose output is useful for debugging but can be harder to read.

fn main() {
    // "Have a nice day" in three different languages, taken
    // from Google search, except the last one.

    // Hindi
    let string = "आपका दिन शुभ हो";
    println!(
        "Inside a Vector : {:?} \n\n With String:     {}\n",
        string.lines().collect::<Vec<_>>(),
        string.lines().collect::<String>()
    );
    // Malayalam
    let string = "ഒരു നല്ല ദിനം ആശംസിക്കുന്നു";
    println!(
        "Inside a Vector : {:?} \n\n With String:    {}\n",
        string.lines().collect::<Vec<_>>(),
        string.lines().collect::<String>()
    );
    // Tamil
    let string = "இந்த நாள் நல்லதாக அமையட்டும்";
    println!(
        "Inside a Vector : {:?} \n\n With String:    {}",
        string.lines().collect::<Vec<_>>(),
        string.lines().collect::<String>()
    );
}

If you observe the outputs of Vec<&str> in the terminal, they are not as readable as String. This holds true for emojis or any multi-byte characters.

A string slice can be converted to another type with the parse method, which is built on the FromStr trait. A type must implement FromStr to be parsed from a string slice. The method returns a Result, because not every string is a valid representation of the target type.
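To make this concrete, here is a minimal sketch of FromStr for a user-defined type. Point and ParsePointError are hypothetical names chosen for the example, and the "x,y" input format is an assumption:

```rust
use std::str::FromStr;

// Hypothetical type used only to illustrate FromStr.
#[derive(Debug, PartialEq)]
struct Point {
    x: i32,
    y: i32,
}

// Hypothetical error type: any malformed input maps to this value.
#[derive(Debug, PartialEq)]
struct ParsePointError;

impl FromStr for Point {
    type Err = ParsePointError;

    // Accepts input of the form "x,y", with optional surrounding spaces.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let (x, y) = s.trim().split_once(',').ok_or(ParsePointError)?;
        let x = x.trim().parse::<i32>().map_err(|_| ParsePointError)?;
        let y = y.trim().parse::<i32>().map_err(|_| ParsePointError)?;
        Ok(Point { x, y })
    }
}

fn main() {
    // parse() picks the FromStr implementation from the requested type.
    println!("{:?}", "3, 4".parse::<Point>());
    println!("{:?}", "3;4".parse::<Point>());
    // Built-in integer parsing does not skip whitespace on its own.
    println!("{:?}", " 42".parse::<i32>().is_err());
}
```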

Command-line arguments may or may not be valid UTF-8, and the iterator you choose decides how that is handled. std::env::args() provides an Iterator<Item = String>, which guarantees valid UTF-8 and panics if any argument is not valid Unicode.

trim removes leading and trailing whitespace as defined by Unicode, which includes spaces, tabs, newlines, carriage returns, and characters such as U+0085 (next line). trim_matches is more versatile, because it lets us specify which characters or pattern to strip from both ends. Keep in mind that a space is also a valid character in a string: parsing a slice that still contains whitespace or other stray characters returns an Err, and calling unwrap on it panics. Exercise this caution whenever strings gather input from users. For example,

    let string = "\n\t\r  \u{85} . \\ \" 67 \n\t\r   ";
    // trim() strips only the surrounding whitespace, so the dot, backslash,
    // and quote remain; parse() then returns Err and unwrap() panics.
    // println!("{}", string.trim().parse::<i32>().unwrap());
    println!(
        "{}",
        string
            .trim_matches(&['\n', '\t', '\r', ' ', '.', '\\', '\u{85}', '\"'] as &[char])
            .parse::<i32>()
            .unwrap()
    );

Rust's default string types, String and &str, are always UTF-8. That default isn't suitable for everything, though: file paths, environment strings, and strings exchanged with C may not be valid UTF-8. Rust offers different string types for these use cases, each with its own restrictions and capabilities.

For example, std::env::args_os() returns an iterator of OsString items. These are not necessarily UTF-8 encoded; they hold whatever the platform passed in. They lack most of the typical string methods but provide methods tailored to OS-related tasks, and iterating over them never panics on command-line arguments that are not valid UTF-8.

These specialized string types cannot use the String and str slice methods directly:

  • CString and &CStr: Used for interoperation with C, whose strings are null-terminated.

  • Filename-related methods such as file_name and extension: These are available on path types but not on String, str, or OsStr.

  • PathBuf and &Path: Designed for handling file paths across platforms in an abstract manner. They can contain non-UTF-8 data. File system methods are available on these types but not on String or CString.
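The sketch below touches each of these types once. The helper names describe and extension_of, and the sample path, are assumptions made for illustration:

```rust
use std::ffi::{CString, OsStr};
use std::path::Path;

// Converts an OS string to text only if it happens to be valid UTF-8.
fn describe(arg: &OsStr) -> String {
    match arg.to_str() {
        Some(text) => format!("valid UTF-8: {text}"),
        None => format!("not UTF-8, lossy view: {}", arg.to_string_lossy()),
    }
}

// A filename-related method that exists on Path but not on str.
fn extension_of(path: &Path) -> Option<&str> {
    path.extension().and_then(OsStr::to_str)
}

fn main() {
    // args_os() never panics on arguments that are not valid UTF-8.
    for arg in std::env::args_os().skip(1) {
        println!("{}", describe(&arg));
    }

    let path = Path::new("notes/report.tar.gz");
    println!("{:?}", extension_of(path));
    // Paths have no Display impl; display() gives a lossy, printable view.
    println!("{}", path.display());

    // CString appends the terminating NUL and rejects interior NUL bytes.
    let c_text = CString::new("hello").expect("no interior NUL byte");
    println!("{}", c_text.as_bytes_with_nul().len());
    println!("{}", CString::new("he\0llo").is_err());
}
```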

Raw strings are handy for regular expression patterns, because they can contain quotes, backslashes, and other sequences that have a special meaning in regex syntax without any escaping.

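    let raw_string = r" \t \n \r \\ \ \u{85}";

    // No escape sequence is processed; the text is printed as written
    println!("{}", raw_string);

    // With # delimiters, the literal can also contain double quotes
    let raw_string_with_pound = r#"shift\[\./,, ,~ #"""\t \n"#;
    println!("{}", raw_string_with_pound);

    // Add more # characters when the text itself contains a quote followed by #
    let raw_string_with_pound = r###""\r# abcd#" \a \b "##"###;
    println!("{}", raw_string_with_pound);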

This list of methods is not exhaustive; look for more in the String documentation. All of these methods run on a single thread. To parallelize string processing, consider crates like rayon or the polars string namespace (docs.rs/polars/latest/polars/prelude/string..).

Where Allocation Can and Can't Be Avoided

Reducing allocation overhead for strings matters. For instance, calling to_uppercase on a string slice returns a new String instead of modifying the original. The reason is Unicode characters like ß, whose uppercase form is the two-character sequence SS; for other characters the number of bytes changes as well. Because the converted text may not fit in the original buffer, the method allocates a new String. This applies even when the string contains only ASCII characters, where converting a to A needs exactly the same number of bytes and could be done in place: to_uppercase has to work for any Unicode text. The standard library contains more methods that return a new String instance.
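A quick way to see why the result cannot always reuse the original buffer is to compare character counts before and after the conversion. The helper name below is invented for this sketch:

```rust
// Returns the number of chars before and after uppercasing.
fn char_counts_before_after_upper(text: &str) -> (usize, usize) {
    (text.chars().count(), text.to_uppercase().chars().count())
}

fn main() {
    // "straße", with ß written as an escape (U+00DF).
    let word = "stra\u{df}e";
    // to_uppercase() leaves `word` untouched and returns a new String.
    let upper = word.to_uppercase();
    println!("{word} -> {upper}");
    println!("{:?}", char_counts_before_after_upper(word));
    // ASCII-only text keeps its length, but still gets a fresh allocation.
    println!("{:?}", char_counts_before_after_upper("hello"));
}
```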

  • Cow: Clone on write is a technique that avoids unnecessary allocation by borrowing data for as long as possible and cloning it only when it has to be modified. A Cow<str> can hold either a borrowed &str or an owned String. Borrowed data is passed around as a view into where it is stored, without copying; if the data is already owned, no further cloning or allocation takes place. String::from_utf8_lossy(&[u8]), for example, returns a Cow<'_, str>. If the bytes are well-formed UTF-8, no new String is created and allocation is avoided. If they are not, the ill-formed sequences are replaced with char::REPLACEMENT_CHARACTER and a new String is created.

  • The + operator, or the add method, takes ownership of the left operand and reuses its buffer for the result instead of allocating a new one; the buffer is reallocated only if its capacity is too small for the appended content.

  • Instead of directly cloning strings, which leads to allocation, consider cloning either:

    1. Rc<String> type in single-threaded scenarios.

    2. Arc<String> type in multi-threaded scenarios, depending on usage.

  • HashMap's get and related methods accept a borrowed &str when the key type is String. This avoids the allocation of creating a new String just to look something up.

  • Avoid using allocation operations inside loops unless there is a valid reason to do so.
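Two of the points above, Cow and borrowed HashMap lookups, can be sketched as follows. The functions lossy_is_borrowed and spaces_to_underscores are hypothetical helpers written for this example:

```rust
use std::borrow::Cow;
use std::collections::HashMap;

// True when from_utf8_lossy could borrow the input instead of allocating.
fn lossy_is_borrowed(bytes: &[u8]) -> bool {
    matches!(String::from_utf8_lossy(bytes), Cow::Borrowed(_))
}

// Hypothetical helper: allocates only when a space actually has to be replaced.
fn spaces_to_underscores(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        Cow::Owned(input.replace(' ', "_"))
    } else {
        Cow::Borrowed(input)
    }
}

fn main() {
    // Valid UTF-8 is borrowed; the invalid byte 0xFF forces an owned copy.
    println!("{}", lossy_is_borrowed(b"hello"));
    println!("{}", lossy_is_borrowed(b"hello\xFF"));
    println!("{}", String::from_utf8_lossy(b"hello\xFF"));

    println!("{}", spaces_to_underscores("no_spaces"));
    println!("{}", spaces_to_underscores("some spaces here"));

    // The key type is String, yet the lookup takes a plain &str.
    let mut ages: HashMap<String, u32> = HashMap::new();
    ages.insert(String::from("Ferris"), 8);
    println!("{:?}", ages.get("Ferris"));
}
```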

Clever Use of APIs

  • make_ascii_uppercase() or make_ascii_lowercase(): These methods convert only the ASCII letters and leave everything else untouched. They can modify the String in place, since an ASCII case conversion always maps one byte to one byte and the String neither grows nor shrinks.

  • clear(): This method empties the contents of a String, changing its length but not its capacity, thus maintaining the existing allocation.

  • with_capacity(): This method allocates a buffer of the specified capacity up front, so no further allocation happens until the content outgrows that capacity.

  • Cow Utils crate: This crate provides variants of methods such as replace and to_lowercase that return a Cow, borrowing the input when no change is needed and allocating an owned String only otherwise.

  • Smol Str crate: An immutable string type with cheap O(1) clones that stores short strings inline, without heap allocation.

  • Compact Str crate: This crate avoids heap allocation as long as the string doesn't exceed 24 bytes on 64-bit platforms.
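The standard-library entries in this list can be checked with a small experiment. The helpers upper_in_place and clear_keeps_capacity are invented names, and comparing buffer pointers is just one way to observe that no reallocation happened:

```rust
// Uppercases ASCII letters in place and reports whether the buffer stayed put.
fn upper_in_place(mut text: String) -> (String, bool) {
    let before = text.as_ptr();
    text.make_ascii_uppercase();
    let same_buffer = before == text.as_ptr();
    (text, same_buffer)
}

// clear() resets the length but keeps the allocated capacity.
fn clear_keeps_capacity(mut text: String) -> bool {
    let capacity = text.capacity();
    text.clear();
    text.is_empty() && text.capacity() == capacity
}

fn main() {
    // The non-ASCII letter ö (U+00F6) is left untouched.
    let (upper, same_buffer) = upper_in_place(String::from("hello w\u{f6}rld"));
    println!("{upper} (same buffer: {same_buffer})");

    // One allocation up front; these pushes fit inside it.
    let mut buffer = String::with_capacity(64);
    for word in ["alpha", "beta", "gamma"] {
        buffer.push_str(word);
    }
    println!("len {}, capacity at least 64: {}", buffer.len(), buffer.capacity() >= 64);
    println!("{}", clear_keeps_capacity(buffer));
}
```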

String Invariant:

Rust Strings and borrowed str slices are always guaranteed to be valid UTF-8. Every value, whether a single scalar as a char or a sequence as a String or &str, is checked for validity when it is created from raw data. If you need strings that are not UTF-8 for some reason, you can use Rust's other string types, such as OsString, CString, and PathBuf, and their borrowed counterparts OsStr, CStr, and Path.

Safe Rust doesn't allow us to modify the bytes of a String or string slice arbitrarily, whether through range indices or through methods. Strings can be mutated only through methods that keep the content valid UTF-8 and have well-defined behavior for every operation. This guarantee ensures predictable behavior when developing string-based applications.

Rust is safe by default. Unsafe blocks, which are explicit and always opt-in, are the only way to introduce undefined behavior and other memory errors in Rust.

  • If the bytes are known to be valid, we can skip the validity check.

String::from_utf8_unchecked() and str::from_utf8_unchecked() skip the check, removing the validation overhead. Both are unsafe functions.

  • If you are confident about the index, you can use:

    1. str.get_unchecked() to retrieve a subslice without bounds or char-boundary checking.

    2. str.get_unchecked_mut() for the same purpose but allowing mutation.
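A cautious sketch of the unchecked functions follows, with the invariant that justifies each unsafe block spelled out. The helper names ascii_digits and prefix are made up, and the assert in prefix exists only to keep the example sound; real code would rely on an invariant established earlier:

```rust
// Builds a string of ASCII digits without re-validating the bytes.
fn ascii_digits(count: usize) -> String {
    let bytes: Vec<u8> = (0..count).map(|i| b'0' + (i % 10) as u8).collect();
    // SAFETY: every byte is an ASCII digit, and ASCII is valid UTF-8.
    unsafe { String::from_utf8_unchecked(bytes) }
}

// Returns the first `end` bytes of `text` as a slice.
fn prefix(text: &str, end: usize) -> &str {
    // is_char_boundary() also returns false when `end` is past the end.
    assert!(text.is_char_boundary(end), "end must be a char boundary");
    // SAFETY: the assertion above guarantees that `end` is in bounds and
    // lies on a char boundary.
    unsafe { text.get_unchecked(..end) }
}

fn main() {
    println!("{}", ascii_digits(12));
    println!("{}", prefix("日本語", 3));
}
```

If the invariant cannot be stated as clearly as in the SAFETY comments, the checked counterparts are the better choice.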

A fallible operation returns either an Option or a Result. Calling unwrap on these types panics when the value is None or Err.

  • from_utf8(), which returns a Result because the bytes may not be valid UTF-8.

  • Type conversions such as parse from a string may fail.

  • char::from_u32(), which returns an Option, as not all u32 values are valid Unicode scalar values. Calling unwrap on a None here will panic.

  • writeln!, which returns a Result. When the target is a String, calling unwrap does not panic in practice, because appending formatted text to a String does not fail. With writeln! and println!, which format text, there is no risk of violating the invariant because they only produce UTF-8 encoded data. These macros don't allow arbitrary byte sequences to be inserted into a String, as is possible with the methods of the io::Write trait on a byte sink.
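Each fallible call in the list can be handled without unwrap. The following is a minimal sketch; scalar_or_replacement and format_sum are hypothetical helpers:

```rust
use std::fmt::Write;

// Falls back to U+FFFD when the number is not a Unicode scalar value.
fn scalar_or_replacement(code: u32) -> char {
    char::from_u32(code).unwrap_or(char::REPLACEMENT_CHARACTER)
}

// Formats into a String through the fmt::Write trait.
fn format_sum(a: i32, b: i32) -> String {
    let mut out = String::new();
    writeln!(out, "{a} + {b} = {}", a + b).expect("writing to a String should not fail");
    out
}

fn main() {
    // 0xFF can never appear in valid UTF-8.
    match String::from_utf8(vec![0x66, 0x6f, 0xff]) {
        Ok(text) => println!("{text}"),
        Err(error) => println!("invalid UTF-8: {error}"),
    }

    match "12a".parse::<u8>() {
        Ok(number) => println!("{number}"),
        Err(error) => println!("not a number: {error}"),
    }

    // 0xD800 is a surrogate, which is not a valid scalar value.
    println!("{:?}", scalar_or_replacement(0xD800));
    print!("{}", format_sum(1, 2));
}
```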

Rust is strict in many areas involving non-ASCII characters. Methods accepting one or more index values require them to lie on char boundaries; otherwise, they panic and halt the program immediately.
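When an index comes from untrusted input, the non-panicking get method is an alternative to range indexing. A small sketch, where checked_prefix is an invented helper name:

```rust
// Returns None instead of panicking when `end` is out of bounds
// or falls inside a multi-byte character.
fn checked_prefix(text: &str, end: usize) -> Option<&str> {
    text.get(..end)
}

fn main() {
    let text = "日本語";
    // Byte 3 is a char boundary, byte 4 is not, byte 100 is out of bounds.
    println!("{:?}", checked_prefix(text, 3));
    println!("{:?}", checked_prefix(text, 4));
    println!("{:?}", checked_prefix(text, 100));
    // By contrast, &text[..4] would panic at runtime.
}
```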

Traits Involved in Strings

A trait is a collection of methods that describes the behavior of a type; unlike a struct or an enum, a trait holds no state of its own. String is a struct, and its methods access its state through impl blocks. Numerous traits are implemented for the String type, and we'll consider a few of them here. Traits are similar to Swift protocols.

  • Display - This trait prints Unicode strings as readable characters on the screen, without the quotes and escape sequences that the Debug trait introduces. This is particularly useful: languages like Julia rely on terminal settings to display actual Unicode characters, whereas Rust relies on the Display and Debug traits. The char type and string slices also implement Display and Debug.

  • Add - This is one of the Operator Overloading traits that defines how types can be added together. It has one method defined as:

fn add(self, rhs: Rhs) -> Self::Output

In this, self represents the left operand of the + operator, and it's moved into the add function. For a copy type, this creates an independent copy, but for a move type like String, the add function takes ownership of the data. The rhs represents the right operand, which is &str for the String type. The Output is an associated type to the Add trait. For String, the Output is the String itself.

  • Clone - This trait is implemented for any type that makes sense to clone, creating an independent copy of the type. This may involve heap allocation for types like String, HashMap, and Vec, but for others like Arc and Rc, it only increments the reference count.

  • Deref and DerefMut - As the names suggest, these dereference a String to &str or &mut str. Thanks to these implementations, we can call immutable and mutable str methods directly on a String. Smart pointers like Arc, Rc, and Box wrapping a String dereference to the String, and from there to str, so the same methods remain available through them. The traits come into play through the * operator and automatic deref coercion, or by explicitly calling the deref and deref_mut methods.

  • Default - This trait provides the default value for String, which is the empty string, the same value String::new() creates.

  • Extend<char> - This pushes multiple characters without for loops and other low-level details. It accepts any iterator whose items are of type char.

  • Extend<&'a str> and its variants - These are analogous to calling push_str multiple times, accepting any iterator that produces items of type &str.

  • Extend<String> - This appends an iterator of Strings to the String. For instance, we can push command-line arguments into a String because std::env::args() produces an iterator of Strings.

let iter_string = std::env::args().skip(1).take(10).map(|mut string| {
    string.push(' ');
    string
});
let mut string = String::with_capacity(100);
string.extend(iter_string);
println!("{string}");
  • Ord - Strings are ordered lexicographically by their byte values, which matches Unicode code point order rather than the alphabetical rules of any particular language. We can use operators like <, >, <=, or the equivalent methods on Strings, and on chars as well.

  • Eq - This trait doesn't introduce any methods but marks == and != as a full equivalence relation for the type.

  • Hash - Without this trait, a HashMap can't accept String or &str as keys. String also implements the Eq trait above, fulfilling the requirements for a HashMap key. HashSet relies on the same traits, and string hashing can be used for purposes beyond hash-based data structures.

  • ToString - The to_string() method is defined by this trait. It is part of the prelude that is implicitly imported into every Rust module, which is why we can call it without an explicit import.

  • Write - The fmt::Write trait is used for writing valid UTF-8 encoded text. It's distinct from the io::Write trait, which writes raw bytes. Macros like writeln! and write! offer a higher-level way of writing than calling the fmt::Write methods directly. The {} placeholder inside these macros requires the Display trait, so it only prints UTF-8 text. Hence, printing Paths, OsStrings, CStrings, and their borrowed counterparts with {} results in a compile error. The error messages are user-friendly and often suggest how to rectify the issue.

  • Send - This trait guarantees that a String can be transferred between threads.
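Several of these traits can be seen working together in one short program. This is only a sketch, and the helper names shout, sorted, and len_on_other_thread are assumptions made for the example:

```rust
use std::collections::HashMap;
use std::rc::Rc;
use std::thread;

// Takes &str; Deref coercion lets callers pass &String or &Rc<String>.
fn shout(text: &str) -> String {
    text.to_uppercase()
}

// Ord: sorts by byte values, so uppercase ASCII comes before lowercase.
fn sorted(mut words: Vec<&str>) -> Vec<&str> {
    words.sort();
    words
}

// Send: the String is moved into another thread.
fn len_on_other_thread(text: String) -> usize {
    thread::spawn(move || text.len())
        .join()
        .expect("thread panicked")
}

fn main() {
    let owned = String::from("hello");
    let shared = Rc::new(String::from("hello"));
    println!("{} {}", shout(&owned), shout(&shared));

    // Default: the empty string.
    println!("{:?}", String::default());
    println!("{:?}", sorted(vec!["apple", "Zebra", "banana"]));

    // Hash + Eq: String as a HashMap key, looked up through a &str.
    let mut stock: HashMap<String, u32> = HashMap::new();
    stock.insert(owned.clone(), 1);
    println!("{:?}", stock.get("hello"));

    println!("{}", len_on_other_thread(owned));
}
```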

If the words in the cover image make anything not make sense, please let me know. I created the words randomly using the Google keyboard.
