Kalpana Kalpana (Editor)

String literal

A string literal or anonymous string is a type of literal in programming for the representation of a string value within the source code of a computer program. Most often in modern languages this is a quoted sequence of characters (formally "bracketed delimiters"), as in x = "foo", where "foo" is a string literal with value foo – the quotes are not part of the value, and one must use a method such as escape sequences to avoid the problem of delimiter collision and allow the delimiters themselves to be embedded in a string. However, there are numerous alternate notations for specifying string literals, particularly more complicated cases, and the exact notation depends on the individual programming language in question. Nevertheless, there are some general guidelines that most modern programming languages follow.

Bracketed delimiters

Most modern programming languages use bracketed delimiters (also called balanced delimiters) to specify string literals. Double quotations are the most common quoting delimiters used:

"Hi There!"

An empty string is literally written by a pair of quotes with no character at all in between:

""

Some languages either allow or mandate the use of single quotations instead of double quotations (the string must begin and end with the same kind of quotation mark and the type of quotation mark may give slightly different semantics):

'Hi There!'

Note that these quotation marks are unpaired (the same character is used as an opener and a closer), which is a hangover from the typewriter technology which was the precursor of the earliest computer input and output devices.

In terms of regular expressions, a basic quoted string literal is given as:

"[^"]*"

This means that a string literal is written as: a quote, followed by zero, one, or more non-quote characters, followed by a quote. In practice this is often complicated by escaping, other delimiters, and excluding newlines.
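As an illustration, the basic pattern can be tried directly with Python's re module. This is a minimal sketch: the helper name find_basic_literals is hypothetical, and the pattern is only the simplified one given above, with no support for escaping.

```python
import re

# The basic pattern from the text: a quote, any run of non-quote
# characters (possibly empty), then a closing quote.
BASIC_STRING = re.compile(r'"[^"]*"')


def find_basic_literals(source):
    """Return every substring of `source` that matches the basic pattern."""
    return BASIC_STRING.findall(source)


print(find_basic_literals('x = "foo"; y = ""; z = "Hi There!"'))
```

Because the pattern knows nothing about escaping, an input such as "a \" b" is cut short at the second quote, which is the complication the paragraph above mentions.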

Paired delimiters

A number of languages provide for paired delimiters, where the opening and closing delimiters are different. These also often allow nested strings, so delimiters can be embedded as long as they are paired, but embedding an unpaired closing delimiter still results in delimiter collision. Examples include PostScript, which uses parentheses, as in (The quick (brown fox)), and m4, which uses the backtick (`) as the starting delimiter and the apostrophe (') as the ending delimiter. Tcl allows both quotes (for interpolated strings) and braces (for raw strings), as in "The quick brown fox" or {The quick {brown fox}}; this derives from the single quotations in Unix shells and the use of braces in C for compound statements, since blocks of code are, in Tcl, syntactically the same thing as string literals – that the delimiters are paired is essential for making this feasible.
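The nesting rule can be sketched as a depth counter. The function below is a hypothetical helper, not how PostScript or Tcl are actually implemented; it ignores escape sequences entirely and assumes single-character delimiters.

```python
def read_paired(source, start=0, opener="(", closer=")"):
    """Read one paired-delimiter literal beginning at source[start].

    Returns (value, index_after_literal). Nested pairs stay in the value.
    Escape sequences are deliberately not handled in this sketch.
    """
    if source[start] != opener:
        raise ValueError("literal must begin with the opening delimiter")
    depth = 0
    for i in range(start, len(source)):
        ch = source[i]
        if ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth == 0:
                # The outermost pair has just closed.
                return source[start + 1:i], i + 1
    raise ValueError("unbalanced delimiters: literal never closes")


value, end = read_paired("(The quick (brown fox)) jumps")
print(value)
```

With opener="{" and closer="}", the input {}} is read as an empty literal followed by a stray closing brace, which shows why an unpaired closing delimiter still collides.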

While the Unicode character set includes paired (separate opening and closing) versions of both single and double quotations, used in text, mostly in languages other than English, these are rarely used in programming languages (because ASCII is preferred, and these are not included in ASCII):

“Hi There!” ‘Hi There!’ „Hi There!“ «Hi There!»

The paired double quotations can be used in Visual Basic .NET, but many other programming languages will not accept them. Unpaired marks are preferred for compatibility: many web browsers, text editors, and other tools will not correctly display Unicode paired quotes, and so even in languages where they are permitted, many projects forbid their use in source code.

Whitespace delimiters

String literals might be ended by newlines.

One example is MediaWiki template parameters.

{{Navbox |name=Nulls |title=[[wikt:Null|Nulls]] in [[computing]] }}

There might be special syntax for multi-line strings.

In YAML, string literals may be specified by the relative positioning of whitespace and indentation.

Declarative notation

In the original FORTRAN programming language (for example), string literals were written in so-called Hollerith notation, where a decimal count of the number of characters was followed by the letter H, and then the characters of the string:

35HAN EXAMPLE HOLLERITH STRING LITERAL

This declarative notation style is contrasted with bracketed delimiter quoting, because it does not require the use of balanced "bracketed" characters on either side of the string.

Advantages:

  • eliminates text searching (for the delimiter character) and therefore requires significantly less overhead
  • avoids the problem of delimiter collision
  • enables the inclusion of metacharacters that might otherwise be mistaken as commands
  • can be used for quite effective data compression of plain text strings
Drawbacks:

  • this type of notation is error-prone if used as manual entry by programmers
  • special care is needed in case of multi-byte encodings

This is however not a drawback when the prefix is generated by an algorithm, as is most likely the case.
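The count-prefixed scheme can be sketched in a few lines of Python. The function names to_hollerith and read_hollerith are hypothetical, and the count here is in characters rather than bytes, which is exactly the point where multi-byte encodings need care.

```python
DIGITS = "0123456789"


def to_hollerith(text):
    """Encode text as <count>H<characters>, counting characters, not bytes."""
    return f"{len(text)}H{text}"


def read_hollerith(source, start=0):
    """Read one count-prefixed literal at source[start].

    Returns (value, index_after_literal). No delimiter search is needed:
    the count says exactly how many characters to take.
    """
    i = start
    while i < len(source) and source[i] in DIGITS:
        i += 1
    if i == start or i >= len(source) or source[i] != "H":
        raise ValueError("expected a decimal count followed by H")
    count = int(source[start:i])
    begin = i + 1
    value = source[begin:begin + count]
    if len(value) != count:
        raise ValueError("literal is shorter than its declared count")
    return value, begin + count


print(to_hollerith("HELLO"))
print(read_hollerith('8Hsay "hi" rest'))
```

Since the reader never looks for a closing delimiter, quotes and other metacharacters inside the value need no special treatment.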

    Delimiter collision

    When using quoting, if one wishes to represent the delimiter itself in a string literal, one runs into the problem of delimiter collision. For example, if the delimiter is a double quote, one cannot simply represent a double quote itself by the literal """ as the second quote is interpreted as the end of the string literal, not as the value of the string, and similarly one cannot write "This is "in quotes", but invalid." as the middle quoted portion is instead interpreted as outside of quotes. There are various solutions, the most general-purpose of which is using escape sequences, such as "\"" or "This is \"in quotes\" and properly escaped.", but there are many other solutions.

    Note that paired quotes, such as braces in Tcl, allow nested strings, such as {foo {bar} zork}, but do not otherwise solve the problem of delimiter collision, since an unbalanced closing delimiter cannot simply be included, as in {}}.

    Doubling up

    A number of languages, including Pascal, BASIC, DCL, Smalltalk, SQL, and Fortran, avoid delimiter collision by doubling up on the quotation marks that are intended to be part of the string literal itself:

    'This Pascal string''contains two apostrophes'''
    "I said, ""Can you hear me?"""
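The doubling convention is easy to express as a pair of functions. The following Python sketch uses hypothetical names (quote_doubled, unquote_doubled) and assumes a single-character delimiter; it is not taken from any of the languages listed.

```python
def quote_doubled(value, delimiter="'"):
    """Wrap value in delimiters, doubling every delimiter inside it."""
    return delimiter + value.replace(delimiter, delimiter * 2) + delimiter


def unquote_doubled(literal, delimiter="'"):
    """Recover the value from a literal written with doubled delimiters."""
    if len(literal) < 2 or literal[0] != delimiter or literal[-1] != delimiter:
        raise ValueError("literal must start and end with the delimiter")
    body = literal[1:-1]
    # After removing every doubled pair, no lone delimiter may remain.
    if delimiter in body.replace(delimiter * 2, ""):
        raise ValueError("unescaped delimiter inside literal")
    return body.replace(delimiter * 2, delimiter)


print(quote_doubled("It's"))
print(unquote_doubled("'This Pascal string''contains two apostrophes'''"))
```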

    Dual quoting

    Some languages, such as Fortran, Modula-2, JavaScript, Python, and PHP allow more than one quoting delimiter; in the case of two possible delimiters, this is known as dual quoting. Typically, this consists of allowing the programmer to use either single quotations or double quotations interchangeably – each literal must use one or the other.

    This does not allow having a single literal with both delimiters in it, however. This can be worked around by using several literals and using string concatenation:

    'I said, "This is ' + "John's" + ' apple."'

    Note that Python has string literal concatenation, so consecutive string literals are concatenated even without an operator, so this can be reduced to:

    'I said, "This is '"John's"' apple."'
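A short runnable Python check of both workarounds, using a different sample sentence; the variable names are illustrative only.

```python
# Either delimiter may be used; the other kind can then appear unescaped.
with_double_inside = 'He said "fine"'
with_single_inside = "it's fine"

# A value containing both kinds, built from several literals joined with +.
joined = 'He said "it' + "'s fine" + '"'

# Adjacent literals are concatenated by Python itself, without an operator.
adjacent = 'He said "it' "'s fine" '"'

print(joined)
print(joined == adjacent)
```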

    D supports a few quoting delimiters, with such strings starting with q"[ and ending with ]" or similarly for other delimiter characters (any of () <> {} or []). D also supports here document-style strings via similar syntax.

    In some programming languages, such as sh and Perl, there are different delimiters that are treated differently, such as doing string interpolation or not, and thus care must be taken when choosing which delimiter to use; see different kinds of strings, below.

    Multiple quoting

    A further extension is the use of multiple quoting, which allows the author to choose which characters should specify the bounds of a string literal.

    For example, in Perl:

    qq^I said, "Can you hear me?"^
    qq@I said, "Can you hear me?"@
    qq(I said, "Can you hear me?")

    all produce the desired result. Although this notation is more flexible, few languages support it; other than Perl, Ruby (influenced by Perl) and C++11 also support these. In C++11, raw strings can have various delimiters, beginning with R"delimiter( and ending with )delimiter". The delimiter can be from zero to 16 characters long and may contain any member of the basic source character set except whitespace characters, parentheses, or backslash. A variant of multiple quoting is the use of here document-style strings.

    Lua (as of 5.1) provides a limited form of multiple quoting, particularly to allow nesting of long comments or embedded strings. Normally one uses [[ and ]] to delimit literal strings (initial newline stripped, otherwise raw), but the opening brackets can include any number of equal signs, and only closing brackets with the same number of signs close the string. For example:

    local ls = [=[
    This notation can be used for Windows paths:
    local path = [[C:\Windows\Fonts]]
    ]=]
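The matching rule for Lua's long brackets can be sketched in Python. The function read_long_string is a hypothetical helper written for illustration, not Lua's own lexer; it handles only a plain "\n" after the opener.

```python
import re

# An opening long bracket: [, zero or more = signs, then [.
_OPENER = re.compile(r"\[(=*)\[")


def read_long_string(source, start=0):
    """Read a Lua-style long-bracket literal at source[start].

    Returns (value, index_after_literal). Only a closing bracket with the
    same number of = signs as the opener ends the literal.
    """
    match = _OPENER.match(source, start)
    if match is None:
        raise ValueError("expected an opener such as [[ or [=[")
    closer = "]" + match.group(1) + "]"
    begin = match.end()
    # A newline directly after the opener is not part of the value.
    if source.startswith("\n", begin):
        begin += 1
    end = source.find(closer, begin)
    if end == -1:
        raise ValueError("no closing bracket of the same level")
    return source[begin:end], end + len(closer)


text = "[=[\nlocal path = [[C:\\Windows\\Fonts]]\n]=] tail"
value, next_index = read_long_string(text)
print(value)
```

The inner [[ and ]] pass through untouched because their level (zero equal signs) differs from the level of the enclosing literal.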

    Multiple quoting is particularly useful with regular expressions that contain usual delimiters such as quotes, as this avoids needing to escape them. An early example is sed, where in the substitution command s/regex/replacement/ the default slash / delimiters can be replaced by another character, as in s,regex,replacement,.

    Constructor functions

    Another option, which is rarely used in modern languages, is to use a function to construct a string, rather than representing it via a literal. This is generally not used in modern languages because the computation is done at run time, rather than at parse time.

    For example, early forms of BASIC did not include escape sequences or any other workarounds listed here, and thus one instead was required to use the CHR$ function, which returns a string containing the character corresponding to its argument. In ASCII the quotation mark has the value 34, so to represent a string with quotes on an ASCII system one would write

    "I said, " + CHR$(34) + "Can you hear me?" + CHR$(34)

    In C, a similar facility is available via sprintf and the %c "character" format specifier, though in the presence of other workarounds this is generally not used:

    char buffer[32];
    sprintf(buffer, "This is %cin quotes.%c", 34, 34);

    These constructor functions can also be used to represent nonprinting characters, though escape sequences are generally used instead. A similar technique can be used in C++ with the std::string stringification operator.
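The same idea carries over to Python, where chr plays the role of CHR$ and the % operator accepts a %c conversion. This is only an illustrative sketch; in practice Python code would use escape sequences or the other quote character instead.

```python
# 34 is the ASCII code of the double quotation mark.
QUOTE = chr(34)

# Building the value by concatenation, in the style of the BASIC approach.
message = "I said, " + QUOTE + "Can you hear me?" + QUOTE

# Building the same value with %c, in the style of the C approach.
formatted = "I said, %cCan you hear me?%c" % (34, 34)

# A nonprinting character can be produced the same way (7 is the ASCII bell).
bell = chr(7)

print(message)
print(message == formatted)
```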

    Escape sequences

    Escape sequences are a general technique for representing characters that are otherwise difficult to represent directly, including delimiters, nonprinting characters (such as backspaces), newlines, and whitespace characters (which are otherwise impossible to distinguish visually), and have a long history. They are accordingly widely used in string literals, and adding an escape sequence (either to a single character or throughout a string) is known as escaping.
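As a concrete illustration, Python is one of the languages that uses the backslash as its escape character. The snippet below is a small sketch with illustrative variable names; each two-character escape in the source denotes a single character in the value.

```python
quote = "\""        # the delimiter itself
backslash = "\\"    # the escape character itself
newline = "\n"      # a newline
backspace = "\b"    # a nonprinting character

# Each escape sequence above yields exactly one character.
print(len(quote), len(backslash), len(newline), len(backspace))

sentence = "This is \"in quotes\" and properly escaped."
print(sentence)
```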

    One character is chosen as a prefix to give encodings for characters that are difficult or impossible to include directly. Most commonly this is backslash; in addition to other characters, a key point is that backslash itself can be encoded as a double backslash \\ and for delimited strings the delimiter itself can be encoded by escaping, say by \" for ". A regular expression for such escaped strings can be given as follows, as found in the ANSI C specification:

    "(\\.|[^\\"])*"

    meaning "a quote; followed by zero or more of either an escaped character (backslash followed by something, possibly backslash or quote), or a non-escape, non-quote character; ending in a quote" – the only issue is distinguishing the terminating quote from a quote preceded by a backslash, which may itself be escaped. Note that multiple characters can follow the backslash, such as