
5.8 Strings

Claude Roux edited this page Feb 10, 2022 · 19 revisions

Strings


(trim0 (str) Removes all '0' characters at the end of the string)
(trim (str) Removes all 'space' characters at both ends of the string)
(trimleft (str) Removes all 'space' characters at the beginning of the string)
(trimright (str) Removes all 'space' characters at the end of the string)
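For instance, here is a quick sketch of how the trimming functions behave (results as expected in a LispE interpreter):

```lisp
; trim0 removes trailing zeros, the others remove spaces
(trim0 "3.1400")        ; yields "3.14"
(trim "  hello  ")      ; yields "hello"
(trimleft "  hello  ")  ; yields "hello  "
(trimright "  hello  ") ; yields "  hello"
```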

(vowelp (str) Checks if the string contains only vowels)
(consonantp (str) Checks if the string contains only consonants)
(lowerp (str) Checks if the string contains only lowercase characters)
(upperp (str) Checks if the string contains only uppercase characters)
(alphap (str) Checks if the string contains only alphabetic characters)
(punctuationp (str) Checks if the string contains only punctuation)
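A sketch of these predicates, assuming the usual LispE convention of returning true or nil:

```lisp
(vowelp "aeiou")   ; true
(consonantp "str") ; true
(upperp "ABC")     ; true
(alphap "abc123")  ; nil, since digits are not alphabetic
```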

(lower (str) Converts the string to lowercase)
(upper (str) Converts the string to uppercase)
(deaccentuate (str) Replaces accented letters with their non-accented form)
(replace (str fnd rep (index)) Replaces all occurrences of 'fnd' with 'rep', starting at 'index' (default value is 0))
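A short sketch of these transformations:

```lisp
(lower "HELLO")           ; yields "hello"
(upper "hello")           ; yields "HELLO"
(deaccentuate "été")      ; yields "ete"
(replace "a-b-c" "-" "+") ; yields "a+b+c"
```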

(format str e1 e2 ... e9) str must contain variables of the form %i,
where 1 <= i <= 9, which will be replaced with their corresponding element.
Ex. (format "It is a %1 %2" "nice" "car") ; yields It is a nice car


(convert_in_base value base (convert_from)) Converts a value into a different base, or back from a value expressed in a different base when 'convert_from' is true

Ex. 
(convert_in_base 36 2) ; yields 100100
(convert_in_base "100100" 2 true) ; yields 36, in this case the initial value is in base 2

(left (str nb) Returns the first 'nb' characters of the string)
(right (str nb) Returns the last 'nb' characters of the string)
(middle (str pos nb) Returns 'nb' characters starting at position 'pos')
(split (str fnd) Splits the string into sub-strings according to a given string)
(splite (str fnd) Same as 'split' but keeps the empty strings)
(tokenize (str) Splits the string into a list of tokens)
(tokenizee (str) Same as 'tokenize' but keeps the blanks)
(ngrams (str nb) Builds a list of ngrams of size 'nb')
(getstruct (str o c (pos)) Reads an embedded structure in a string that starts at opening character 'o' and stops at closing character 'c', from position 'pos')
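A sketch of the extraction and splitting functions; note how 'split' drops the empty string that 'splite' keeps:

```lisp
(left "Hello World" 5)  ; yields "Hello"
(right "Hello World" 5) ; yields "World"
(split "a,b,,c" ",")    ; yields ("a" "b" "c")
(splite "a,b,,c" ",")   ; yields ("a" "b" "" "c")
```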

(ord (str) Returns the Unicode codes of each character of 'str')
(chr (nb) Returns the Unicode character corresponding to the code 'nb')
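These two functions are inverses of one another over Unicode code points:

```lisp
(ord "AB") ; yields (65 66)
(chr 233)  ; yields "é" (U+00E9)
```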

(fill c nb Returns a string that contains the string 'c' repeated 'nb' times)
(padding str c nb Pads the string 'str' with the string 'c' up to 'nb' characters)
(editdistance str strbis Computes the edit distance between 'str' and 'strbis')
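A sketch of these helpers; 'editdistance' is the usual Levenshtein distance:

```lisp
(fill "ab" 3)                     ; yields "ababab"
(padding "7" "0" 3)               ; pads "7" with "0" up to 3 characters
(editdistance "kitten" "sitting") ; yields 3
```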

Note on: getstruct

getstruct can extract successive structures from a string; each structure starts with a specific character and ends with another, and the search begins at a specific position.

(getstruct str "{" "}")

The string can contain sub-structures of the same sort, which means that the method will only stop when all sub-structures have been consumed.

Hence, getstruct can return: {{a:b} {c:d}}.

This method can be called as many times as necessary to read all balanced structures. At each step, it returns a list, which contains the sub-string extracted so far, its initial position and its final position in the string. When the last structure has been read, the system returns nil.

Important: getstruct returns a string object. If this string matches a LispE structure such as a list or a dictionary, you can use json_parse to transform it into an actual LispE container.
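The extraction loop described above could be sketched as follows (the exact positions in the returned list depend on the input string):

```lisp
(setq s "first {{a:b} {c:d}} then {e:f}")
; getstruct returns a list: (extracted-substring initial-position final-position)
(setq res (getstruct s "{" "}"))
(while res
   (println (at res 0))
   ; restart the search from the final position of the previous match
   (setq res (getstruct s "{" "}" (at res 2)))
)
```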

JSON Instructions

(json (element) Returns the element as a JSON string)
(json_parse (str) Compiles a JSON string into LispE structures)
(json_read (filename) Reads and compiles a JSON file)
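A minimal round-trip sketch with these instructions:

```lisp
(setq d (json_parse "{\"name\":\"car\", \"price\":345}"))
; d is now a LispE dictionary
(json d) ; converts it back to a JSON string
```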

Rule Tokenization

LispE also provides another mechanism to handle tokenization. In this case, we use a set of pre-defined rules that can be modified.

; Tokenization with rules
(tokenizer_rules () Creates a 'rules' object for tokenization)
(tokenize_rules (rules str) Applies a 'rules' object to a string to tokenize it)
(get_tokenizer_rules (rules) Gets the underlying tokenizer rules)
(set_tokenizer_rules (rules lst) Stores a new set of rules) 

Here is an example:

; We need first a rule tokenizer object
(setq rules (tokenizer_rules))
; which we apply to a string
(tokenize_rules rules "The lady, who lives here, bought this picture in 2000 for $345.5")
; ("The" "lady" "," "who" "lives" "here" "," "bought" "this" "picture" "in" "2000" "for" "$" "345.5")

This underlying set of rules can be loaded and modified to change or enrich the tokenization process, thanks to tokenizer_rules.

(setq rules (tokenizer_rules))
(setq rules_list (get_tokenizer_rules rules))
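A sketch of the full round trip, assuming the rules are exposed as a list of "body=action" strings that can be edited before being stored back:

```lisp
(setq rules (tokenizer_rules))
(setq rules_list (get_tokenizer_rules rules))
; modify rules_list here, then store the new set back into the object
(set_tokenizer_rules rules rules_list)
```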

The rules are applied according to a simple algorithm. First, rules are automatically identified as:

  • character rules: the rule starts with a specific character
  • entity rules: the rule starts with an entity such as: %a, %d etc...
  • metarules: the rule pattern is associated with an id that is used in other rules.

IMPORTANT

  • The rules should always be ordered with character rules first, ending with entity rules.
  • The most specific rules should precede the most general ones.

Metarules

A metarule is composed of two parts:

  • c:expression, where 'c' is the metacharacter that is accessed through %c and 'expression' is a single rule body. For instance, %o could have been encoded as: "o:[≠ ∨ ∧ ÷ × 2 3 ¬]"

IMPORTANT:

These rules should be declared with one single operation. Their body will replace the call to %c in other rules (see the test on metas in the parse section). If you use a character that is already a meta-character (such as "a" or "d"), then that meta-character will be replaced with this new description.

However, its content might still use the standard declaration:

"1:{%a %d %p}": "%1 is a combination of alphabetical characters, digits and punctuation"

Formalism

A rule is composed of two parts: body=action.

N.B. The action is not used for the moment, but it is required. It can be a number or a '#'.

body uses the following instructions:

  • x is a character that should be recognized
  • #x-y is an interval between the ASCII characters x and y
  • (..) is a sequence of optional instructions
  • [..] is a disjunction of possible characters
  • {..} is a disjunction of meta-characters
  • x+ means that the instruction can be repeated at least once
  • x- means that the character should be recognized but not stored in the parsing string
  • %.~.. means that all characters will be recognized except for those in the list after the tilde
  • %x is a meta-character with the following possibilities:
  1. %. is any character
  2. %a is any alphabetical character (including unicode ones such as éè)
  3. %C is any uppercase character
  4. %c is any lowercase character
  5. %d is any digit
  6. %H is any hangul character
  7. %n is a non-breaking space
  8. %o is any operator
  9. %p is any punctuation
  10. %r is a carriage return, both \n and \r
  11. %s is a space (32) or a tab (09)
  12. %S is either a carriage return or a space (%s or %r)
  13. %? is any character, with the possibility of escaping characters with a '\' such as: \r \t \n or \"
  14. %nn you can create new metarules associated with any characters

IMPORTANT: do not add any spaces, as they would be considered characters to test.

; A meta-rule for hexadecimal characters
; It can be a digit, ABCDEF or abcdef
1:{%d #A-F #a-f}

; This is a basic rule to handle regular characters
!=0
(=0
)=0
[=0
]=0

; This rule detects a comment that starts with //
; any characters up to a carriage return
//%.~%r+=#

; These rules detect a number
; Hexadecimal starting with 0x
0x%1+(.%1+)([p P]([- +])%d+)=3

; regular number
%d+(.%d+)([e E]([- +])%d+)=3