


Strings

back

(trim0 (str) Removes '0' characters at the end of the string)
(trim (str) Removes all 'space' characters at both ends of the string)
(trimleft (str) Removes all 'space' characters on the left)
(trimright (str) Removes all 'space' characters on the right)
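
Ex. (expected results, based on the descriptions above)
(trim0 "3.1400") ; yields 3.14
(trim "  hello  ") ; yields hello
(trimleft "  hello  ") ; yields "hello  " (the trailing spaces are kept)
(trimright "  hello  ") ; yields "  hello"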

(vowelp (str) Checks if the string only contains vowels)
(consonantp (str) Checks if the string only contains consonants)
(lowerp (str) Checks if the string is only lowercase)
(upperp (str) Checks if the string is only uppercase)
(alphap (str) Checks if the string contains only alphabetic characters)
(digitp (str) Checks if the string contains only digits)
(punctuationp (str) Checks if the string contains only punctuation)
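
Ex. (expected results, based on the descriptions above; the boolean values true and nil are assumed)
(digitp "2023") ; yields true
(alphap "abc123") ; yields nil
(upperp "ABC") ; yields true
(vowelp "aeiou") ; yields true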

(lower (str) Converts the string to lowercase)
(upper (str) Converts the string to uppercase)
(deaccentuate (str) Replaces accented letters with their non-accented form)
(replace (str fnd rep (index)) Replaces all occurrences of 'fnd' with 'rep', starting at 'index' (default value is 0))
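
Ex. (expected results, based on the descriptions above)
(upper "hello") ; yields HELLO
(deaccentuate "été") ; yields ete
(replace "banana" "na" "NA") ; yields baNANA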

(format str e1 e2 ... e9) str must contain variables of the form %i,
where 1 <= i <= 9, which will be replaced with their corresponding element.
Ex. (format "It is a %1 %2" "nice" "car") ; yields It is a nice car


(convert_in_base value base (convert_from) Converts a value into a different base, or converts a string expressed in that base back into a value)

Ex. 
(convert_in_base 36 2) ; yields 100100
(convert_in_base "100100" 2 true) ; yields 36, in this case the initial value is in base 2

(left (str nb) Returns the first 'nb' characters on the left)
(right (str nb) Returns the last 'nb' characters on the right)
(middle (str pos nb) Returns 'nb' characters starting at position 'pos')
(split (str fnd) Splits the string into sub-strings according to a given string)
(splite (str fnd) Same as split but keeps the empty strings)
(segment (str (point)) Splits the string into a list of tokens)
(segment_e (str (point)) Same as 'segment' but keeps the blanks)
(ngrams (str nb) Builds a list of ngrams of size 'nb')
(getstruct (str o c (pos)) Reads an embedded structure in a string that starts at opening character 'o' and stops at closing character 'c', from position 'pos')
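
Ex. (expected results, based on the descriptions above; positions are assumed to be zero-based, as in the getstruct example below)
(left "LispE strings" 5) ; yields LispE
(right "LispE strings" 7) ; yields strings
(middle "LispE strings" 6 3) ; yields str
(split "a,b,c" ",") ; yields ("a" "b" "c")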

(ord (str) Returns the Unicode codes of each character of 'str')
(chr (nb) Returns the Unicode character corresponding to the code 'nb')

(fill c nb Returns a string that contains the string 'c' repeated 'nb' times)
(padding str c nb Pads the string 'str' with the string 'c' up to 'nb' characters)
(editdistance str strbis Computes the edit distance between 'str' and 'strbis')
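
Ex. (expected results, based on the descriptions above; the Unicode codes and the edit distance are standard values)
(ord "abc") ; yields (97 98 99)
(chr 97) ; yields a
(fill "ab" 3) ; yields ababab
(editdistance "kitten" "sitting") ; yields 3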

Note on segment

The last argument of segment, point, takes three possible values:

  • point = 0: the decimal separator is ".", it is also the default value
  • point = 1: the decimal separator is ","
  • point = 2: the decimal separator is potentially both
(segment "10.5 bottles of beer") ; is ("10.5" "bottles" "of" "beer")
(segment "10.5 bottles of beer" 0) ; is ("10.5" "bottles" "of" "beer")
(segment "10.5 bottles of beer" 1) ; is ("10" "." "5" "bottles" "of" "beer")
(segment "10,5 bottles of beer" 1) ; is ("10,5" "bottles" "of" "beer")

(segment "10,5 bottles and 10.5 bottles" 2) ; is ("10,5" "bottles" "and" "10.5" "bottles")

Note on: getstruct

getstruct can extract successive structures from a string: each structure starts with a specific character and ends with another, and extraction begins from a specific position.

(getstruct str "{" "}" 0)

The string can contain sub-structures of the same sort, which means that the method will only stop when all sub-structures have been consumed.

Hence, getstruct can return: {{a:b} {c:d}}.

This method can be called as many times as necessary to read all balanced structures. At each step, it returns a list, which contains the sub-string extracted so far, its initial position and its final position in the string. When the last structure has been read, the system returns nil.

Important: The first element, which is returned by getstruct is a string object. If this string matches some LispE structures such as lists or dictionaries, you can use json_parse to transform it into actual LispE containers.

(setq r "{[a [b c d]] [[e f] g h]}")

(getstruct r "[" "]"); yields ("[a [b c d]]" 1 12)
(getstruct r "[" "]" 12); yields ("[[e f] g h]" 13 24)
(getstruct r "[" "]" 24); yields nil

JSON Instructions

(json (element) Returns the element as a JSON string)
(json_parse (str) Compiles a JSON string into LispE structures)
(json_read (filename) Reads a JSON file)
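
Ex. (expected results, based on the descriptions above; the exact spacing of the generated JSON string may differ)
(json '(1 2 3)) ; yields [1,2,3]
(json_parse "[1,2,3]") ; yields (1 2 3)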

Rule Tokenization

LispE also provides another mechanism to handle tokenization. In this case, we use a set of pre-defined rules that can be modified.

; Tokenization with rules
(deflib tokenizer_main (), returns the main tokenizer of LispE, which is used to tokenize LispE code)
(deflib segmenter (keepblanks point), returns the segmenter that is used by the instruction segment)
(deflib tokenizer (), returns a copy of the main tokenizer)
(deflib tokenizer_display (rules), displays the rules as an indented automaton)
(deflib tokenize (rules str (types)), tokenizes a string using a specific tokenizer, can also return the type of each element)
(deflib get_tokenizer_rules (rules), returns a vector of all rules in memory)
(deflib set_tokenizer_rules (rules lst), replaces the rules in memory with a new set. Rules are then recompiled on the fly)
(deflib get_tokenizer_operators (rules), returns the set of operators with which "%o" is associated)
(deflib set_tokenizer_operators (rules a_set), modifies the set of operators with which "%o" is associated)

Basically, when you need to tokenize a string with a specific set of rules, you first need to access the rule controller with either tokenizer_main, segmenter or tokenizer. These methods return a handler to these different tokenizers.

Note that if you modify the handler returned by tokenizer_main, you can modify the actual rules that are used to tokenize LispE code. In the same way, if you use the handler returned by segmenter, you can modify the behavior of segment.

Here is an example:

; We need first a rule tokenizer handler
(setq rule_handler (tokenizer))
; which we apply to a string
(tokenize rule_handler "The lady, who lives here, bought this picture in 2000 for $345.5")
; ("The" "lady" "," "who" "lives" "here" "," "bought" "this" "picture" "in" "2000" "for" "$" "345.5")

This underlying set of rules can be loaded and modified to change or enrich the tokenization process, thanks to get_tokenizer_rules.

(setq rules_list (get_tokenizer_rules rule_handler))
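
As a sketch, a new rule could then be added and the list reinstalled as follows. The rule below is purely illustrative, and so is its position in the list: where it must go depends on the existing rules, since the most specific rules have to come first.

; hypothetical rule: a '$' followed by digits is kept as one token
(setq rules_list (cons "$%d+=0" rules_list))
(set_tokenizer_rules rule_handler rules_list)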

The rules are compiled into an automaton, which is used to tokenize a string. There are two sorts of rules:

  • character rules: the rule starts with a specific character
  • metarules: the rule pattern is associated with an id that is used in other rules.

IMPORTANT

The rules should always be ordered with the most specific rules ahead.

Metarules

A metarule is composed of two parts:

  • c:expression, where c is the metacharacter that is accessed through %c and expression is a single body rule. For instance, we could have encoded %o as: "o:[≠ ∨ ∧ ÷ × ² ³ ¬]"

IMPORTANT:

These rules should be declared with one single operation. Their body will replace the call to a %c in other rules (see the test on metas in the parse section). If you use a character that is already a meta-character (such as "a" or "d"), then the meta-character will be replaced with this new description...

However, its content might still use the standard declaration:

"1:{%a %d %p}": "%1 is a combination of alphabetical characters, digits and punctuations

Formalism

A rule is composed of two parts: body=action.

N.B. The action is not used for the moment, but is required. It can be a number or a '#'.

body uses the following instructions:

  • x is a character that should be recognized

  • #x is a comparison with the character whose code is x...

  • #x-y is a comparison against the range of characters between codes x and y; x and y should be ASCII characters...

  • %x is a meta-character with the following possibilities:

  1. ? is any character
  2. %a is any alphabetical character (including Unicode ones such as é or è)
  3. %C is any uppercase character
  4. %c is any lowercase character
  5. %d is any digit
  6. %e is an emoji
  7. %E is an emoji complement (cannot start a rule)
  8. %h is a Greek letter
  9. %H is any Hangul character
  10. %n is a non-breaking space
  11. %o is any operator
  12. %p is any punctuation mark
  13. %r is a carriage return, both \n and \r
  14. %s is a space (32) or a tab (09)
  15. %S is either a carriage return or a space (%s or %r)
  16. %nn: you can create new metarules associated with any OTHER characters...

Rule formalism:

  1. (..) is a sequence of optional instructions
  2. [..] is a sequence of characters in a disjunction
  3. {..} is a disjunction of meta-characters
  4. x* means that the instruction can be repeated zero or more times
  5. x+ means that the instruction can be repeated at least once
  6. x- means that the character should be recognized but not stored in the parsing string
  7. ~.. means that all characters will be recognized except for those in the list after the tilde.

IMPORTANT: do not add any spaces, as they would be considered a character to test... except in a disjunction: {%d %d} is the same as {%d%d}, which makes the expression more readable. Note that if you want to force a space as a potential target for your rule, you can use either %s or #32, since 32 is the actual ASCII code for space.

; A meta-rule for hexadecimal characters
; It can be a digit, ABCDEF or abcdef
1:{%d #A-F #a-f}

; This is a basic rule to handle regular characters
!=0

; We escape the following characters as they are used as operators in the rule formalism
%(=0
%)=0
%[=0
%]=0

; This rule detects a comment that starts with //
; any characters up to a carriage return
//?*%r=#

; These rules detect numbers
; Hexadecimal starting with 0x
0x%1+(.%1+)({p P}({%- %+})%d+)=3

; regular number, + and - are escaped
%d+(.%d+({e E}({%- %+})%d+))=3
Clone this wiki locally