From 4e4ebcc7434fe09913f1fbeb07247c3993d2fb13 Mon Sep 17 00:00:00 2001 From: Pierre Le Marre Date: Fri, 7 Jun 2024 14:11:21 +0200 Subject: [PATCH] ucd2haskell: Refactor This is a huge refactor of `ucd2haskell`, motivated by similar work in `ghc-internal`. This will prevent this tool from further bit-rotting. - Remove dependency on `streamly`. This package is overkilled and has an instable API. The version we use is not supported by recent GHCs and non-trivial migration seems to be required at each new version. Furthermore we currently process `String`s, so there is no much benefit. - Mimic the `Fold` type from `streamly` for basic features. Although not mandatory, this avoid changing all the logic. - Use `ByteString` parsers from `unicode-data-parser` [1]. These parsers are shared with the corresponding `ucd2haskell` tool in `base` (now `ghc-internal`). We now have a clear separation between parsers and generators. The Unicode files being very stable, this package should be very stable as well. - Move generators to independent modules. This speeds the compilation up and add more structure to the code base. - Remove many anti-patterns and share more code. The files *generated* by this tool remain identical, although I left some comments to further improve them. [1]: https://hackage.haskell.org/package/unicode-data-parser --- ucd2haskell/exe/Parser/Text.hs | 3245 ----------------- ucd2haskell/exe/UCD2Haskell.hs | 14 +- ucd2haskell/exe/UCD2Haskell/Common.hs | 195 + ucd2haskell/exe/UCD2Haskell/Generator.hs | 392 ++ ucd2haskell/exe/UCD2Haskell/Generator/Core.hs | 101 + .../exe/UCD2Haskell/Generator/Names.hs | 33 + .../exe/UCD2Haskell/Generator/Scripts.hs | 33 + .../exe/UCD2Haskell/Generator/Security.hs | 47 + ucd2haskell/exe/UCD2Haskell/Modules/Blocks.hs | 161 + .../exe/UCD2Haskell/Modules/CaseFoldings.hs | 67 + .../Modules/DerivedNumericValues.hs | 59 + .../exe/UCD2Haskell/Modules/Properties.hs | 95 + .../exe/UCD2Haskell/Modules/Scripts.hs | 227 ++ .../UCD2Haskell/Modules/ScriptsExtensions.hs | 172 + .../Modules/Security/Confusables.hs | 69 + .../Modules/Security/IdentifierStatus.hs | 50 + .../Modules/Security/IdentifierType.hs | 210 ++ .../Security/IntentionalConfusables.hs | 69 + .../exe/UCD2Haskell/Modules/SpecialCasings.hs | 133 + .../Modules/UnicodeData/CombiningClass.hs | 85 + .../Modules/UnicodeData/Composition.hs | 135 + .../Modules/UnicodeData/Decomposition.hs | 174 + .../Modules/UnicodeData/DerivedNames.hs | 249 ++ .../Modules/UnicodeData/GeneralCategory.hs | 224 ++ .../Modules/UnicodeData/NameAliases.hs | 159 + .../Modules/UnicodeData/SimpleCaseMappings.hs | 90 + ucd2haskell/ucd2haskell.cabal | 50 +- 27 files changed, 3278 insertions(+), 3260 deletions(-) delete mode 100644 ucd2haskell/exe/Parser/Text.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Common.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Generator.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Generator/Core.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Generator/Names.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Generator/Scripts.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Generator/Security.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/Blocks.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/CaseFoldings.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/DerivedNumericValues.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/Properties.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/Scripts.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/ScriptsExtensions.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/Security/Confusables.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/Security/IdentifierStatus.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/Security/IdentifierType.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/Security/IntentionalConfusables.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/SpecialCasings.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/CombiningClass.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/Composition.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/Decomposition.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/DerivedNames.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/GeneralCategory.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/NameAliases.hs create mode 100644 ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/SimpleCaseMappings.hs diff --git a/ucd2haskell/exe/Parser/Text.hs b/ucd2haskell/exe/Parser/Text.hs deleted file mode 100644 index d294b0b7..00000000 --- a/ucd2haskell/exe/Parser/Text.hs +++ /dev/null @@ -1,3245 +0,0 @@ -{-# LANGUAGE LambdaCase #-} - --- | --- Module : Parser.Text --- Copyright : (c) 2021 Pierre Le Marre --- (c) 2020 Composewell Technologies and Contributors --- (c) 2016-2017 Harendra Kumar --- (c) 2014-2015 Antonio Nikishaev --- License : Apache-2.0 --- Maintainer : streamly@composewell.com --- Stability : experimental - --- The original Unicode database parser was taken from --- https://github.com/composewell/unicode-transforms but was completely --- rewritten from scratch to parse from UCD text files instead of XML, only --- some types remain the same. That code in turn was originally taken from --- https://github.com/llelf/prose (Antonio Nikishaev) and heavily modified by --- Harendra Kumar. --- -module Parser.Text - ( genCoreModules - , genNamesModules - , genScriptsModules - , genSecurityModules - ) where - -import Control.Applicative (Alternative(..)) -import Control.Arrow ((&&&)) -import Control.Exception (assert, catch, IOException) -import Control.Monad (void) -import Control.Monad.IO.Class (MonadIO(liftIO)) -import Data.Bifunctor (Bifunctor(..)) -import Data.Bits (Bits(..)) -import Data.Char (chr, ord, isAlphaNum, isAscii, isSpace, toUpper) -import Data.Foldable (foldl') -import Data.Function (on, (&)) -import Data.Functor ((<&>)) -import Data.List - (dropWhileEnd, elemIndex, groupBy, intersperse, sort, sortBy, unfoldr) -import Data.Maybe (fromMaybe) -import Data.Ratio ((%)) -import Data.Word (Word8, Word32) -import GHC.Stack (HasCallStack) -import Numeric (showHex) -import Streamly.Data.Fold (Fold) -import Streamly.Prelude (IsStream, SerialT) -import System.FilePath (()) -import System.Directory (createDirectoryIfMissing) -import System.Environment (getEnv) -import System.IO.Unsafe (unsafePerformIO) - -import qualified Data.Map.Strict as Map -import qualified Data.Set as Set -import qualified GHC.Foreign as Foreign -import qualified GHC.IO.Encoding as Encoding -import qualified Streamly.Prelude as Stream -import qualified Streamly.Data.Fold as Fold -import qualified Streamly.Internal.Data.Fold as Fold -import qualified Streamly.Data.Unfold as Unfold -import qualified Streamly.FileSystem.Handle as Handle -import qualified System.IO as Sys -import qualified Streamly.Unicode.Stream as Unicode - -------------------------------------------------------------------------------- --- Types -------------------------------------------------------------------------------- - -data GeneralCategory = - Lu|Ll|Lt| --LC - Lm|Lo| --L - Mn|Mc|Me| --M - Nd|Nl|No| --N - Pc|Pd|Ps|Pe|Pi|Pf|Po| --P - Sm|Sc|Sk|So| --S - Zs|Zl|Zp| --Z - Cc|Cf|Cs|Co|Cn --C - deriving (Bounded, Enum, Eq, Show, Read) - -data DecompType = - DTCanonical | DTCompat | DTFont - | DTNoBreak | DTInitial | DTMedial | DTFinal - | DTIsolated | DTCircle | DTSuper | DTSub - | DTVertical | DTWide | DTNarrow - | DTSmall | DTSquare | DTFraction - deriving (Show, Eq) - -data Decomp = DCSelf | DC [Char] deriving (Show, Eq) - -data DType = Canonical | Kompat - -data DetailedChar = - DetailedChar - { _char :: Char - , _name :: String - -- ^ Only used to detect ranges. - -- The names we use are read from @DerivedName.txt@. - , _generalCategory :: GeneralCategory - , _combiningClass :: Int - , _decompositionType :: Maybe DecompType - , _decomposition :: Decomp - , _simpleUpperCaseMapping :: Maybe Char - , _simpleLowerCaseMapping :: Maybe Char - , _simpleTitleCaseMapping :: Maybe Char - } - deriving (Show) - -type CharRange = Either Char (Char, Char) -data CharRangeStream - = SingleChar !Char - | CharRange !Char !Char - | Stop - -------------------------------------------------------------------------------- --- Helpers -------------------------------------------------------------------------------- - -apacheLicense - :: Word -- ^ Copyright year - -> String -- ^ Module name - -> String -apacheLicense year modName = - unlines - [ "-- |" - , "-- Module : " <> modName - , "-- Copyright : (c) " - <> show year - <> " Composewell Technologies and Contributors" - , "-- License : Apache-2.0" - , "-- Maintainer : streamly@composewell.com" - , "-- Stability : experimental" - ] - -readCodePoint :: String -> Char -readCodePoint = chr . read . ("0x"++) - -readCodePointM :: String -> Maybe Char -readCodePointM "" = Nothing -readCodePointM u = Just (readCodePoint u) - -parseCodePointRange :: String -> CharRange -parseCodePointRange - = (\(c1, c2) -> maybe (Left c1) (Right . (c1,)) c2) - . bimap readCodePoint (readCodePointM . drop 2) - . span (/= '.') - -showPaddedHex :: Int -> String -showPaddedHex cp = - let hex = showHex cp mempty - padding = 4 - length hex - in replicate padding '0' <> hex - -showPaddedHeX :: Int -> String -showPaddedHeX = fmap toUpper . showPaddedHex - -showHexCodepoint :: Char -> String -showHexCodepoint = showPaddedHeX . ord - -generalCategoryConstructor :: GeneralCategory -> String -generalCategoryConstructor = \case - Lu -> "UppercaseLetter" - Ll -> "LowercaseLetter" - Lt -> "TitlecaseLetter" - Lm -> "ModifierLetter" - Lo -> "OtherLetter" - Mn -> "NonSpacingMark" - Mc -> "SpacingCombiningMark" - Me -> "EnclosingMark" - Nd -> "DecimalNumber" - Nl -> "LetterNumber" - No -> "OtherNumber" - Pc -> "ConnectorPunctuation" - Pd -> "DashPunctuation" - Ps -> "OpenPunctuation" - Pe -> "ClosePunctuation" - Pi -> "InitialQuote" - Pf -> "FinalQuote" - Po -> "OtherPunctuation" - Sm -> "MathSymbol" - Sc -> "CurrencySymbol" - Sk -> "ModifierSymbol" - So -> "OtherSymbol" - Zs -> "Space" - Zl -> "LineSeparator" - Zp -> "ParagraphSeparator" - Cc -> "Control" - Cf -> "Format" - Cs -> "Surrogate" - Co -> "PrivateUse" - Cn -> "NotAssigned" - -genBitmap :: HasCallStack => String -> [Int] -> String -genBitmap funcName ordList = mconcat - [ "{-# INLINE " <> funcName <> " #-}\n" - , funcName, " :: Char -> Bool\n" - , funcName, func - , " !(Ptr bitmap#) = ", bitmapLookup, "\n\n" - , bitmapLookup, " :: Ptr Word8\n" - , bitmapLookup, " = Ptr\n" - , " \"", bitMapToAddrLiteral bitmap "\"#\n" ] - where - rawBitmap = positionsToBitMap ordList - bitmapLookup = funcName <> "Bitmap" - (func, bitmap) = if length rawBitmap <= 0x40000 - -- Only planes 0-3 - then - ( mconcat - [ " = \\c -> let cp = ord c in cp >= 0x" - , showPaddedHeX (minimum ordList) - , " && cp <= 0x" - , showPaddedHeX (maximum ordList) - , " && lookupBit64 bitmap# cp\n" - , " where\n" ] - , rawBitmap ) - -- Planes 0-3 and 14 - else - let (planes0To3, plane14) = splitPlanes "genBitmap: cannot build" not rawBitmap - bound0 = pred (minimum ordList) - bound1 = length planes0To3 - bound2 = 0xE0000 + length plane14 - in ( mconcat - [ " c\n" - , if bound0 > 0 - then mconcat - [ " | cp < 0x" - , showPaddedHeX bound0 - , " = False\n" ] - else "" - , " | cp < 0x", showPaddedHeX bound1 - , " = lookupBit64 bitmap# cp\n" - , " | cp < 0xE0000 = False\n" - , " | cp < 0x", showPaddedHeX bound2 - , " = lookupBit64 bitmap# (cp - 0x" - , showPaddedHeX (0xE0000 - bound1) - , ")\n" - , " | otherwise = False\n" - , " where\n" - , " cp = ord c\n" ] - , planes0To3 <> plane14 ) - -positionsToBitMap :: [Int] -> [Bool] -positionsToBitMap = go 0 - - where - - go _ [] = [] - go i xxs@(x:xs) - | i < x = False : go (i + 1) xxs - | otherwise = True : go (i + 1) xs - -bitMapToAddrLiteral - :: [Bool] - -- ^ Values to encode - -> String - -- ^ String to append - -> String -bitMapToAddrLiteral bs = chunkAddrLiteral 4 0xff encode (unfoldr mkChunks bs) - - where - - mkChunks :: [a] -> Maybe ([a], [a]) - mkChunks = \case - [] -> Nothing - xs -> Just (splitAt 8 xs) - - encode :: [Bool] -> String -> String - encode chunk acc = '\\' : shows (toByte (padTo8 chunk)) acc - - padTo8 :: [Bool] -> [Bool] - padTo8 xs - | length xs >= 8 = xs - | otherwise = xs ++ replicate (8 - length xs) False - - toByte :: [Bool] -> Int - toByte xs = sum $ map (\i -> if xs !! i then 1 `shiftL` i else 0) [0..7] - -genEnumBitmap - :: forall a. (HasCallStack, Bounded a, Enum a, Eq a, Show a) - => String - -- ^ Function name - -> (a, String) - -- ^ Value for planes 15-16 - -> (a, String) - -- ^ Default value - -> [a] - -- ^ List of values to encode for planes 0 to 3 - -> [a] - -- ^ List of values to encode for plane 14 - -> String -genEnumBitmap funcName (defPUA, pPUA) (def, pDef) planes0To3 plane14 = mconcat - [ "{-# INLINE ", funcName, " #-}\n" - , funcName, " :: Char -> Int\n" - , funcName, func - , " !(Ptr bitmap#) = ", bitmapLookup, "\n\n" - , bitmapLookup, " :: Ptr Word8\n" - , bitmapLookup, " = Ptr\n" - , " \"", enumMapToAddrLiteral 4 0xff bitmap "\"#" - ] - where - bitmapLookup = funcName <> "Bitmap" - planes0To3' = dropWhileEnd (== def) planes0To3 - check = if length planes0To3 <= 0x40000 - then () - else error "genEnumBitmap: Cannot build" - (func, bitmap) = check `seq` if null plane14 && defPUA == def - -- Only planes 0-3 - then - ( mconcat - [ " = \\c -> let cp = ord c in if cp >= 0x" - , showPaddedHeX (length planes0To3') - , " then " - , pDef - , " else lookupIntN bitmap# cp\n" - , " where\n" ] - , planes0To3' ) - -- All the planes - else - let plane14' = dropWhileEnd (== def) plane14 - bound1 = length planes0To3' - bound2 = 0xE0000 + length plane14' - in ( mconcat - [ " c\n" - , " -- Planes 0-3\n" - , " | cp < 0x", showPaddedHeX bound1 - , " = lookupIntN bitmap# cp\n" - , " -- Planes 4-13: ", show def, "\n" - , " | cp < 0xE0000 = " <> pDef, "\n" - , " -- Plane 14\n" - , " | cp < 0x", showPaddedHeX bound2 - , " = lookupIntN bitmap# (cp - 0x" - , showPaddedHeX (0xE0000 - bound1) - , ")\n" - , if defPUA == def - then "" - else mconcat - [ " -- Plane 14: ", show def, "\n" - , " | cp < 0xF0000 = ", pDef, "\n" - , " -- Plane 15: ", show defPUA, "\n" - , " | cp < 0xFFFFE = ", pPUA, "\n" - , " -- Plane 15: ", show def, "\n" - , " | cp < 0x100000 = ", pDef, "\n" - , " -- Plane 16: ", show defPUA, "\n" - , " | cp < 0x10FFFE = ", pPUA, "\n" ] - , " -- Default: ", show def, "\n" - , " | otherwise = " <> pDef, "\n" - , " where\n" - , " cp = ord c\n" ] - , planes0To3' <> plane14' ) - -splitPlanes :: (HasCallStack) => String -> (a -> Bool) -> [a] -> ([a], [a]) -splitPlanes msg isDef xs = if all isDef planes4To13 && null planes15To16 - then (planes0To3, plane14) - else error msg - where - planes0To3 = dropWhileEnd isDef (take 0x40000 xs) - planes4To16 = drop 0x40000 xs - planes4To13 = take (0xE0000 - 0x40000) planes4To16 - planes14To16 = drop (0xE0000 - 0x40000) planes4To16 - plane14 = dropWhileEnd isDef (take 0x10000 planes14To16) - planes15To16 = drop 0x10000 planes14To16 - -{-| Encode a list of values as a byte map, using their 'Enum' instance. - -__Note:__ 'Enum' instance must respect the following: - -* @fromEnum minBound >= 0x00@ -* @fromEnum maxBound <= 0xff@ --} -enumMapToAddrLiteral - :: forall a. (Bounded a, Enum a, Show a) - => Word8 - -- ^ Indentation - -> Int - -- ^ Chunk size - -> [a] - -- ^ Values to encode - -> String - -- ^ String to append - -> String -enumMapToAddrLiteral indentation chunkSize = - chunkAddrLiteral indentation chunkSize addWord - - where - - addWord :: a -> String -> String - addWord x acc = '\\' : shows (toWord8 x) acc - - toWord8 :: a -> Word8 - toWord8 a = let w = fromEnum a in if 0 <= w && w <= 0xff - then fromIntegral w - else error $ "Cannot convert to Word8: " <> show a - -chunkAddrLiteral - :: forall a. Word8 - -- ^ Indentation - -> Int - -- ^ Chunk size - -> (a -> String -> String) - -- ^ Function to convert to 'Word8' and prepend to the accumulator - -> [a] - -- ^ Values to encode - -> String - -- ^ String to append - -> String -chunkAddrLiteral indentation chunkSize addWord xs cs - = fst - . foldr go (cs, id : repeat indent) - $ chunksOf chunkSize xs - - where - - indent = indent' indentation . ('\\' :) - indent' = \case - 0 -> \s -> '\\' : '\n' : s - i -> indent' (pred i) . (' ' :) - - go :: [a] -> (String, [String -> String]) -> (String, [String -> String]) - go as (acc, seps) = (foldr addWord (head seps acc) as, tail seps) - -chunksOf :: Int -> [a] -> [[a]] -chunksOf i = go - where - go = \case - [] -> [] - as -> b : go as' - where (b, as') = splitAt i as - --- Encode Word32 to [Word8] little endian -word32ToWord8s :: Word32 -> [Word8] -word32ToWord8s n = (\k -> fromIntegral ((n `shiftR` k) .&. 0xff)) <$> [0,8..24] - --- This bit of code is duplicated but this duplication allows us to reduce 2 --- dependencies on the executable. - -jamoLCount :: Int -jamoLCount = 19 - -jamoVCount :: Int -jamoVCount = 21 - -jamoTCount :: Int -jamoTCount = 28 - -hangulFirst :: Int -hangulFirst = 0xac00 - -hangulLast :: Int -hangulLast = hangulFirst + jamoLCount * jamoVCount * jamoTCount - 1 - -isHangul :: Char -> Bool -isHangul c = n >= hangulFirst && n <= hangulLast - where n = ord c - -------------------------------------------------------------------------------- --- Parsers -------------------------------------------------------------------------------- - --- Make a valid Haskell constructor (in CamelCase) from an identifier. -mkHaskellConstructor :: String -> String -mkHaskellConstructor = reverse . fst . foldl' convert (mempty, True) - where - - convert (acc, newWord) = \case - -- Skip the following and start a new word - ' ' -> (acc, True) - '-' -> (acc, True) - '_' -> (acc, True) - -- Letter or number - c -> if isAscii c && isAlphaNum c - then ( if newWord then toUpper c : acc else c : acc - , False) - else error ("Unsupported character: " <> show c) - -genBlocksModule - :: Monad m - => String - -> Fold m BlockLine String -genBlocksModule moduleName = done <$> Fold.foldl' step initial - where - - done (blocks, defs, ranges) = let ranges' = reverse ranges in unlines - [ apacheLicense 2022 moduleName - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(Block(..), BlockDefinition(..), block, blockDefinition)" - , "where" - , "" - , "import Data.Ix (Ix)" - , "import Data.Word (Word32)" - , "import GHC.Exts" - , "import Unicode.Internal.Bits (lookupWord32#)" - , "" - , "-- | Unicode [block](https://www.unicode.org/glossary/#block)." - , "--" - , "-- There is a total of " <> show (length blocks) <> " blocks." - , "--" - , "-- @since 0.3.1" - , "data Block" - , " = " <> mconcat (intersperse "\n | " (reverse blocks)) - , " deriving (Enum, Bounded, Eq, Ord, Ix, Show)" - , "" - , "-- | Block definition: range and name." - , "--" - , "-- @since 0.3.1" - , "data BlockDefinition = BlockDefinition" - , " { blockRange :: !(Int, Int) -- ^ Range" - , " , blockName :: !String -- ^ Name" - , " } deriving (Eq, Ord, Show)" - , "" - , "-- | Block definition" - , "--" - , "-- @since 0.3.1" - , "blockDefinition :: Block -> BlockDefinition" - , "blockDefinition b = case b of" - , mconcat (reverse defs) - , "-- | Character block, if defined." - , "--" - , "-- @since 0.3.1" - , "block :: Char -> Maybe Int" - , "block (C# c#) = getBlock 0# " <> shows (length ranges - 1) "#" - , " where" - , " -- [NOTE] Encoding" - , " -- A range is encoded as two LE Word32:" - , " -- • First one is the lower bound, where the higher 11 bits are the block" - , " -- index and the lower 21 bits are the codepoint." - , " -- • Second one is the upper bound, which correspond to the codepoint." - , "" - , " cp# = int2Word# (ord# c#)" - , "" - , " -- Binary search" - , " getBlock l# u# = if isTrue# (l# ># u#)" - , " then Nothing" - , " else" - , " let k# = l# +# uncheckedIShiftRL# (u# -# l#) 1#" - , " j# = k# `uncheckedIShiftL#` 1#" - , " cpL0# = getRawCodePoint# j#" - , " cpL# = cpL0# `and#` 0x1fffff## -- Mask for codepoint: [0..0x10fff]" - , " cpU# = getRawCodePoint# (j# +# 1#)" - , " in if isTrue# (cpU# `ltWord#` cp#)" - , " -- cp > upper bound" - , " then getBlock (k# +# 1#) u#" - , " -- check lower bound" - , " else if isTrue# (cp# `ltWord#` cpL#)" - , " -- cp < lower bound" - , " then getBlock l# (k# -# 1#)" - , " -- cp in block: get block index" - , " else let block# = cpL0# `uncheckedShiftRL#` 21#" - , " in Just (I# (word2Int# block#))" - , "" - , " getRawCodePoint# = lookupWord32# ranges#" - , "" - , " -- Encoded ranges" - , " !(Ptr ranges#) = rangesBitmap" - , "" - , "rangesBitmap :: Ptr Word32" - , "rangesBitmap = Ptr" - , " \"" <> enumMapToAddrLiteral 4 0xff (mkRanges ranges') "\"#" - ] - - initial :: ([String], [String], [(Int, Int)]) - initial = (mempty, mempty, mempty) - - step (blocks, defs, ranges) (blockName, blockRange) = - let blockID = mkHaskellConstructor blockName - in ( mkBlockConstructor blockID blockName blockRange : blocks - , mkBlockDef blockID blockName blockRange : defs - , blockRange : ranges ) - - mkBlockConstructor blockID blockName (l, u) = mconcat - [ blockID - , " -- ^ @U+" - , showPaddedHeX l - , "..U+" - , showPaddedHeX u - , "@: " - , blockName - , "." - ] - - mkBlockDef blockID blockName (l, u) = mconcat - [ " " - , blockID - , " -> BlockDefinition (0x" - , showPaddedHex l - , ", 0x" - , showPaddedHex u - , ") " - , show blockName - , "\n" - ] - - -- [NOTE] Encoding: a range is encoded as two LE Word32: - -- • First one is the lower bound, where the higher 11 bits are the block - -- index and the lower 21 bits are the codepoint. - -- • Second one is upper bound, which correspond to the codepoint. - mkRanges :: [(Int, Int)] -> [Word8] - mkRanges = foldMap (uncurry mkBlockRange) . zip [0..] - mkBlockRange :: Word32 -> (Int, Int) -> [Word8] - mkBlockRange idx (l, u) = encodeBound idx l <> encodeBound 0 u - - encodeBound :: Word32 -> Int -> [Word8] - encodeBound idx n = word32ToWord8s ((idx `shiftL` 21) .|. fromIntegral n) - -defaultScript :: String -defaultScript = "Unknown" - -genScriptsModule - :: Monad m - => String - -> PropertyValuesAliases - -> Fold m ScriptLine String -genScriptsModule moduleName aliases = - done <$> Fold.foldl' addRange mempty - where - - done ranges = - let scripts = Set.toList - (foldr addScript (Set.singleton defaultScript) ranges) - in unlines - [ apacheLicense 2022 moduleName - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(Script(..), script, scriptDefinition)" - , "where" - , "" - , "import Data.Char (ord)" - , "import Data.Int (Int32)" - , "import Data.Ix (Ix)" - , "import Data.Word (Word8)" - , "import GHC.Exts (Ptr(..))" - , "import Unicode.Internal.Bits (lookupIntN)" - , "" - , "-- | Unicode [script](https://www.unicode.org/reports/tr24/)." - , "--" - , "-- The constructors descriptions are the original Unicode values" - , "-- (short and long forms)." - , "--" - , "-- There is a total of " <> show (length scripts) <> " scripts." - , "--" - , "-- @since 0.1.0" - , "data Script" - , " = " <> mkScripts scripts - , " deriving (Enum, Bounded, Eq, Ord, Ix, Show)" - , "" - , "-- | Script definition: list of corresponding characters." - , "--" - , "-- @since 0.1.0" - , "scriptDefinition :: Script -> (Ptr Int32, Int)" - , "scriptDefinition b = case b of" - , mkScriptDefinitions ranges - , "-- | Script of a character." - , "--" - , "-- @since 0.1.0" - , if length scripts <= 0xff - then mkCharScripts scripts ranges - else error "Cannot encode scripts" - , "" - ] - - addRange :: [ScriptLine] -> ScriptLine -> [ScriptLine] - addRange acc l@(script, r) = case acc of - (script', r'):acc' -> if script == script' - then case combineRanges r r' of - Left r'' -> (script, r'') : acc - Right r'' -> (script, r'') : acc' - else l : acc - _ -> [l] - - combineRanges :: CharRange -> CharRange -> Either CharRange CharRange - combineRanges r = case r of - Left c1 -> \case - Left c2 -> if c1 == succ c2 - then Right (Right (c2, c1)) - else Left r - Right (c2, c3) -> if c1 == succ c3 - then Right (Right (c2, c1)) - else Left r - Right (c1, c2) -> \case - Left c3 -> if c1 == succ c3 - then Right (Right (c3, c2)) - else Left r - Right (c3, c4) -> if c1 == succ c4 - then Right (Right (c3, c2)) - else Left r - - addScript :: ScriptLine -> Set.Set String -> Set.Set String - addScript (script, _) = Set.insert script - - mkScripts :: [String] -> String - mkScripts - = mconcat - . intersperse "\n | " - . fmap (\script -> mconcat - [ mkHaskellConstructor script - , " -- ^ " - , case Map.lookup script aliasesMap of - Just as -> mkAliases as - Nothing -> error ("No abbreviation for script: " <> script) - , ": @" - , script - , "@" - ]) - - -- Map: script long form → short forms - aliasesMap :: Map.Map String [String] - aliasesMap = Map.foldrWithKey - (\abbr as -> Map.insert (head as) (abbr : tail as)) - mempty - aliases - mkAliases - = mconcat - . intersperse ", " - . fmap (\abbr -> mconcat ["@", abbr, "@"]) - - mkScriptDefinitions :: [ScriptLine] -> String - mkScriptDefinitions - = foldMap mkScriptDefinition - . groupBy ((==) `on` fst) - . reverse - . addUnknownRanges - - addUnknownRanges :: [ScriptLine] -> [ScriptLine] - addUnknownRanges ls = - let addUnknown (acc, expected) (c, _) = case mkMissingRange expected c of - Just r -> (,succ c) $ case acc of - r':acc' -> either (:acc) (:acc') (combineRanges r r') - _ -> [r] - Nothing -> (acc, succ expected) - addRest (acc@(r':acc'), expected) = - let r = Right (expected, maxBound) - in either (:acc) (:acc') (combineRanges r r') - addRest _ = error "impossible" - unknown = fmap (defaultScript,) . addRest $ foldl' - addUnknown - (mempty, '\0') - (sort (foldMap (rangeToCharScripts id) ls)) - in unknown <> ls - - mkMissingRange :: Char -> Char -> Maybe CharRange - mkMissingRange expected c - | c == expected = Nothing - | c == succ expected = Just (Left expected) - | otherwise = Just (Right (expected, pred c)) - - mkScriptDefinition :: [ScriptLine] -> String - mkScriptDefinition ranges = mconcat - [ " " - , mkHaskellConstructor (fst (head ranges)) - , " -> (Ptr \"" - , foldMap encodeRange ranges - , "\"#, " - , show (foldr (\r -> either (const (+1)) (const (+2)) (snd r)) 0 ranges :: Word) - , ")\n" - ] - - -- Encoding: - -- • A single char is encoded as an LE Int32. - -- • A range is encoded as two LE Int32 (first is lower bound, second is - -- upper bound), which correspond to the codepoints with the 32th bit set. - encodeRange :: ScriptLine -> String - encodeRange (_, r) = case r of - Left c -> encodeBytes (fromIntegral (ord c)) - Right (l, u) -> encodeBytes (setBit (fromIntegral (ord l)) 31) - <> encodeBytes (setBit (fromIntegral (ord u)) 31) - encodeBytes = foldr addByte "" . word32ToWord8s - addByte n acc = '\\' : shows n acc - - mkCharScripts :: [String] -> [ScriptLine] -> String - mkCharScripts scripts scriptsRanges = - let charScripts = sort (foldMap (rangeToCharScripts getScript) scriptsRanges) - charScripts' = reverse (fst (foldl' addMissing (mempty, '\0') charScripts)) - addMissing (acc, expected) x@(c, script) = if expected < c - then addMissing (def:acc, succ expected) x - else (script:acc, succ c) - def = getScript defaultScript - getScript s = fromMaybe (error "script not found") (elemIndex s scripts) - -- [TODO] simplify - (planes0To3, plane14) = splitPlanes "Cannot generate: genScriptsModule" (== def) charScripts' - in genEnumBitmap - "script" - (def, show (fromEnum def)) - (def, show (fromEnum def)) - planes0To3 - plane14 - - rangeToCharScripts :: (String -> b) -> ScriptLine -> [(Char, b)] - rangeToCharScripts f (script, r) = case r of - Left cp -> [(cp, f script)] - Right (l, u) -> (, f script) <$> [l..u] - -genScriptExtensionsModule - :: Monad m - => String - -> PropertyValuesAliases - -> ScriptExtensions - -> Fold m ScriptLine String -genScriptExtensionsModule moduleName aliases extensions = - done <$> Fold.foldl' processLine mempty - - where - - -- [NOTE] We rely on all the scripts having a short form - - -- Map: script → short form - scriptsAbbr :: Map.Map String String - scriptsAbbr = - Map.foldrWithKey (\abbr as -> Map.insert (head as) abbr) mempty aliases - getScriptAbbr :: String -> String - getScriptAbbr = fromMaybe (error "script not found") . (scriptsAbbr Map.!?) - - -- All possible values: extensions + scripts - extensionsSet :: Set.Set [String] - extensionsSet = Set.fromList (Map.elems extensions) - <> Set.map pure (Map.keysSet aliases) - extensionsList = sortBy - (compare `on` fmap mkScript) - (Set.toList extensionsSet) - - encodeExtensions :: [String] -> Int - encodeExtensions e = fromMaybe - (error ("extension not found: " <> show e)) - (elemIndex e extensionsList) - - encodedExtensions :: Map.Map [String] Int - encodedExtensions = - let l = length extensionsSet - in if length extensionsSet > 0xff - then error ("Too many script extensions: " <> show l) - else Map.fromSet encodeExtensions extensionsSet - - processLine - :: (Set.Set [String], Map.Map Char Int) -- used exts, encoded char exts - -> ScriptLine - -> (Set.Set [String], Map.Map Char Int) - processLine acc (script, range) = case range of - Left c -> addChar script c acc - Right (c1, c2) -> foldr (addChar script) acc [c1..c2] - - addChar - :: String -- script - -> Char -- processed char - -> (Set.Set [String], Map.Map Char Int) - -> (Set.Set [String], Map.Map Char Int) - addChar script c (extsAcc, charAcc) = case Map.lookup c extensions of - -- Char has explicit extensions - Just exts -> ( Set.insert exts extsAcc - , Map.insert c (encodedExtensions Map.! exts) charAcc) - -- Char has no explicit extensions: use its script - Nothing -> - let exts = [getScriptAbbr script] - in ( Set.insert exts extsAcc - , Map.insert c (encodedExtensions Map.! exts) charAcc) - - done (usedExts, exts) = unlines - [ apacheLicense 2022 moduleName - , "{-# LANGUAGE OverloadedLists #-}" - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(scriptExtensions, decodeScriptExtensions)" - , "where" - , "" - , "import Data.Char (ord)" - , "import Data.List.NonEmpty (NonEmpty)" - , "import Data.Word (Word8)" - , "import GHC.Exts (Ptr(..))" - , "import Unicode.Internal.Char.Scripts (Script(..))" - , "import Unicode.Internal.Bits (lookupIntN)" - , "" - , "-- | Useful to decode the output of 'scriptExtensions'." - , "decodeScriptExtensions :: Int -> NonEmpty Script" - , "decodeScriptExtensions = \\case" <> mkDecodeScriptExtensions usedExts - , " _ -> [" <> mkHaskellConstructor defaultScript <> "]" - , "" - , "-- | Script extensions of a character." - , "--" - , "-- @since 0.1.0" - , genEnumBitmap - "scriptExtensions" - (def, show (fromEnum def)) - (def, show (fromEnum def)) - planes0To3 - plane14 - ] - where - scriptExtensions = mkScriptExtensions exts - -- [TODO] simplify - (planes0To3, plane14) = splitPlanes - "Cannot generate: genScriptExtensionsModule" - (== def) - scriptExtensions - - mkDecodeScriptExtensions :: Set.Set [String] -> String - mkDecodeScriptExtensions - = mkDecodeScriptExtensions' - . Set.map (\exts -> (encodedExtensions Map.! exts, exts)) - mkDecodeScriptExtensions' = foldMap $ \(v, exts) -> mconcat - [ "\n " - , show v - , " -> [" - , mconcat (intersperse ", " (mkScript <$> exts)) - , "]" - ] - mkScript :: String -> String - mkScript = mkHaskellConstructor . head . (aliases Map.!) - - def :: Int - def = encodedExtensions Map.! [getScriptAbbr defaultScript] - - mkScriptExtensions - = reverse - . snd - . Map.foldlWithKey addCharExt ('\0', mempty) - addCharExt (expected, acc) c v = if expected < c - then addCharExt (succ expected, def : acc) c v - else (succ c, v : acc) - -------------------------------------------------------------------------------- --- Parsing UnicodeData.txt -------------------------------------------------------------------------------- - -genGeneralCategoryModule - :: Monad m - => String - -> Fold m DetailedChar String -genGeneralCategoryModule moduleName = - done <$> Fold.foldl' step initial - - where - - -- (categories planes 0-3, categories plane 14, expected char) - initial = - ( [] - , [] - , minBound - , CharBounds minBound minBound minBound minBound minBound minBound minBound) - - step :: ([GeneralCategory], [GeneralCategory], Char, CharBounds) - -> DetailedChar - -> ([GeneralCategory], [GeneralCategory], Char, CharBounds) - step acc@(acc1, acc2, p, bounds) a - -- Plane 0 to 3, missing char - -- Fill missing char entry with default category Cn - -- See: https://www.unicode.org/reports/tr44/#Default_Values_Table - | plane0To3 && p < c = step (Cn : acc1, acc2, succ p, bounds) - a - -- Plane 0 to 3, Regular entry - | plane0To3 = - ( generalCategory : acc1 - , acc2 - , succ c - , updateCharBounds bounds c generalCategory ) - -- Plane 4 to 13: no entry expected - | plane4To13 = error ("Unexpected char in plane 4-13: " <> show a) - -- Plane 15 to 16: skip if PUA - | plane15To16 = case generalCategory of - Co -> acc -- skip - _ -> error ("Unexpected char in plane 15-16: " <> show a) - -- Leap to plane 14 - | p < '\xE0000' = step (acc1, acc2, '\xE0000', bounds) a - -- Plane 14, missing char - | p < c = step (acc1, Cn : acc2, succ p, bounds) a - -- Plane 14, regular entry - | otherwise = - ( acc1 - , generalCategory : acc2 - , succ c - , updateCharBounds bounds c generalCategory ) - where - c = _char a - generalCategory = _generalCategory a - plane0To3 = c <= '\x3FFFF' - plane4To13 = c <= '\xDFFFF' - plane15To16 = c >= '\xF0000' - - done (acc1, acc2, _, CharBounds{..}) = unlines - [ apacheLicense 2020 moduleName - , "{-# OPTIONS_HADDOCK hide #-}" - , "{-# LANGUAGE PatternSynonyms #-}" - , "" - , "module " <> moduleName - , "( -- * Lookup functions" - , " generalCategory" - , ", generalCategoryPlanes0To3" - , "" - , " -- * General categories" - , foldMap mkGeneralCategoryPatternExport [minBound..maxBound] - , " -- * Characters bounds for predicates" - , init $ foldMap mkCharBoundPatternExport charBoundPatterns - , ") where" - , "" - , "import Data.Char (ord)" - , "import Data.Word (Word8)" - , "import GHC.Exts (Ptr(..))" - , "import Unicode.Internal.Bits (lookupIntN)" - , "" - , "--------------------------------------------------------------------------------" - , "-- General category patterns" - , "--------------------------------------------------------------------------------" - , foldMap mkGeneralCategoryPattern [minBound..maxBound] - , "--------------------------------------------------------------------------------" - , "-- Characters bounds for predicates" - , "--------------------------------------------------------------------------------" - , foldMap mkCharBoundPattern charBoundPatterns - , "--------------------------------------------------------------------------------" - , "-- Lookup functions" - , "--------------------------------------------------------------------------------" - , "" - , "-- | Return the general category of a code point in planes 0 to 3" - , "--" - , "-- The caller of this function must ensure its parameter is \\< @0x40000@." - , "{-# INLINE generalCategoryPlanes0To3 #-}" - , "generalCategoryPlanes0To3 :: Int -> Int" - , "generalCategoryPlanes0To3 = lookupIntN bitmap#" - , " where" - , " !(Ptr bitmap#) = generalCategoryBitmap" - , "" - , "-- | Return the general category of a character" - , genEnumBitmap - "generalCategory" - (Co, generalCategoryConstructor Co) - (Cn, generalCategoryConstructor Cn) - (reverse acc1) - (reverse acc2) - ] - where - mkExport p = mconcat [", pattern ", p, "\n"] - mkGeneralCategoryPatternExport = mkExport . generalCategoryConstructor - mkGeneralCategoryPattern gc = mconcat - [ "\n-- | General category ", show gc, "\n" - , "pattern ", generalCategoryConstructor gc, " :: Int\n" - , "pattern ", generalCategoryConstructor gc - , " = " - , show (fromEnum gc), "\n"] - mkCharBoundPatternExport = mkExport . fst - mkCharBoundPattern (p, c) = mconcat - [ "\n-- | Maximum codepoint satisfying @", 'i' : drop 4 p, "@\n" - , "pattern ", p, " :: Int\n" - , "pattern ", p, " = 0x", showHexCodepoint c, "\n"] - charBoundPatterns = - [ ("MaxIsLetter" , maxIsLetter ) - , ("MaxIsAlphaNum" , maxIsAlphaNum ) - , ("MaxIsLower" , maxIsLower ) - , ("MaxIsUpper" , maxIsUpper ) - , ("MaxIsNumber" , maxIsNumber ) - , ("MaxIsSpace" , maxIsSpace ) - , ("MaxIsSeparator", maxIsSeparator) ] - -data CharBounds = CharBounds - { maxIsLetter :: !Char - , maxIsAlphaNum :: !Char - , maxIsLower :: !Char - , maxIsUpper :: !Char - , maxIsNumber :: !Char - , maxIsSpace :: !Char - , maxIsSeparator :: !Char } - -updateCharBounds :: CharBounds -> Char -> GeneralCategory -> CharBounds -updateCharBounds acc@CharBounds{..} c = \case - Lu -> acc{ maxIsAlphaNum = max maxIsAlphaNum c - , maxIsLetter = max maxIsLetter c - , maxIsUpper = max maxIsUpper c } - Ll -> acc{ maxIsAlphaNum = max maxIsAlphaNum c - , maxIsLetter = max maxIsLetter c - , maxIsLower = max maxIsLower c } - Lt -> acc{ maxIsAlphaNum = max maxIsAlphaNum c - , maxIsLetter = max maxIsLetter c - , maxIsUpper = max maxIsUpper c } - Lm -> acc{maxIsAlphaNum=max maxIsAlphaNum c, maxIsLetter=max maxIsLetter c} - Lo -> acc{maxIsAlphaNum=max maxIsAlphaNum c, maxIsLetter=max maxIsLetter c} - Nd -> acc{maxIsAlphaNum=max maxIsAlphaNum c, maxIsNumber=max maxIsNumber c} - Nl -> acc{maxIsAlphaNum=max maxIsAlphaNum c, maxIsNumber=max maxIsNumber c} - No -> acc{maxIsAlphaNum=max maxIsAlphaNum c, maxIsNumber=max maxIsNumber c} - Zs -> acc{maxIsSeparator=max maxIsAlphaNum c, maxIsSpace=max maxIsSpace c} - Zl -> acc{maxIsSeparator=max maxIsAlphaNum c} - Zp -> acc{maxIsSeparator=max maxIsAlphaNum c} - _ -> acc - -readDecomp :: String -> (Maybe DecompType, Decomp) -readDecomp s = - if null wrds - then (Nothing, DCSelf) - else decmps wrds - - where - - decmps [] = error "Unreachable flow point" - decmps y@(x:xs) = - case dtmap x of - DTCanonical -> (,) (Just DTCanonical) (readCP y) - other -> (,) (Just other) (readCP xs) - - wrds = words s - - readCP ws = DC $ map readCodePoint ws - - dtmap "" = DTCompat - dtmap "" = DTCircle - dtmap "" = DTFinal - dtmap "" = DTFont - dtmap "" = DTFraction - dtmap "" = DTInitial - dtmap "" = DTIsolated - dtmap "" = DTMedial - dtmap "" = DTNarrow - dtmap "" = DTNoBreak - dtmap "" = DTSmall - dtmap "" = DTSquare - dtmap "" = DTSub - dtmap "" = DTSuper - dtmap "" = DTVertical - dtmap "" = DTWide - dtmap _ = DTCanonical - -filterNonHangul :: Monad m => Fold m DetailedChar a -> Fold m DetailedChar a -filterNonHangul = Fold.filter (not . isHangul . _char) - -filterDecomposableType :: - Monad m => DType -> Fold m DetailedChar a -> Fold m DetailedChar a -filterDecomposableType dtype = - Fold.filter ((/= DCSelf) . _decomposition) - . Fold.filter (predicate . _decompositionType) - - where - - predicate = - case dtype of - Canonical -> (== Just DTCanonical) - Kompat -> const True - -genDecomposableModule :: - Monad m => String -> DType -> Fold m DetailedChar String -genDecomposableModule moduleName dtype = - filterNonHangul - $ filterDecomposableType dtype $ done <$> Fold.foldl' step initial - - where - - initial = [] - - step st a = ord (_char a) : st - - done st = - unlines - [ apacheLicense 2020 moduleName - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(isDecomposable)" - , "where" - , "" - , "import Data.Char (ord)" - , "import Data.Word (Word8)" - , "import GHC.Exts (Ptr(..))" - , "import Unicode.Internal.Bits (lookupBit64)" - , "" - , genBitmap "isDecomposable" (reverse st) - ] - -genCombiningClassModule :: Monad m => String -> Fold m DetailedChar String -genCombiningClassModule moduleName = - Fold.filter (\dc -> _combiningClass dc /= 0) - $ done <$> Fold.foldl' step initial - - where - - initial = ([], []) - - step (st1, st2) a = (genCombiningClassDef a : st1, ord (_char a) : st2) - - done (st1, st2) = - unlines - [ apacheLicense 2020 moduleName - , "{-# LANGUAGE LambdaCase #-}" - , "{-# OPTIONS_HADDOCK hide #-}" - , "module " <> moduleName - , "(combiningClass, isCombining)" - , "where" - , "" - , "import Data.Char (ord)" - , "import Data.Word (Word8)" - , "import GHC.Exts (Ptr(..))" - , "import Unicode.Internal.Bits (lookupBit64)" - , "" - , "combiningClass :: Char -> Int" - , "combiningClass = \\case" - , unlines (reverse st1) - , " _ -> 0\n" - , "" - , genBitmap "isCombining" (reverse st2) - ] - - genCombiningClassDef dc = mconcat - [ " " - , show (_char dc) - , " -> " - , show (_combiningClass dc) - ] - -genDecomposeDefModule :: - Monad m - => String - -> [String] - -> [String] - -> DType - -> (Int -> Bool) - -> Fold m DetailedChar String -genDecomposeDefModule moduleName before after dtype predicate = - Fold.filter (predicate . ord . _char) - $ filterNonHangul - $ filterDecomposableType dtype $ done <$> Fold.foldl' step initial - - where - - decomposeChar c DCSelf = [c] - decomposeChar _c (DC ds) = ds - - genHeader = - [ apacheLicense 2020 moduleName - , "{-# LANGUAGE LambdaCase #-}" - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(decompose)" - , "where" - , "" - ] - genSign = - [ "{-# NOINLINE decompose #-}" - , "decompose :: Char -> [Char]" - , "decompose = \\case" - ] - initial = [] - - step st dc = genDecomposeDef dc : st - - done st = - let body = mconcat [genHeader, before, genSign, reverse st, after] - in unlines body - - genDecomposeDef dc = mconcat - [ " " - , show (_char dc) - , " -> " - , show (decomposeChar (_char dc) (_decomposition dc)) - ] - -genCompositionsModule :: - Monad m - => String - -> [Int] - -> [Int] - -> Fold m DetailedChar String -genCompositionsModule moduleName compExclu non0CC = - Fold.filter (not . flip elem compExclu . ord . _char) - $ filterNonHangul - $ Fold.filter (isDecompositionLen2 . _decomposition) - $ filterDecomposableType Canonical $ done <$> Fold.foldl' step initial - - where - - isDecompositionLen2 DCSelf = False - isDecompositionLen2 (DC ds) = length ds == 2 - - genComposePairDef name dc = - name - <> " " - <> show (head d01) - <> " " <> show (d01 !! 1) <> " = Just " <> show (_char dc) - - where - - d01 = decompPair dc - - decompPair dc = - case _decomposition dc of - DCSelf -> error "toCompFormat: DCSelf" - (DC ds) -> - if length ds == 2 - then ds - else error "toCompFormat: length /= 2" - - initial = ([], [], []) - - step (dec, sp, ss) dc = (dec1, sp1, ss1) - - where - - d01 = decompPair dc - d1Ord = ord $ d01 !! 1 - dec1 = genComposePairDef "compose" dc : dec - sp1 = - if d1Ord `notElem` non0CC - then genComposePairDef "composeStarters" dc : sp - else sp - ss1 = - if d1Ord `notElem` non0CC - then d1Ord : ss - else ss - - header = - [ apacheLicense 2020 moduleName - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(compose, composeStarters, isSecondStarter)" - , "where" - , "" - , "import Data.Char (ord)" - , "import Data.Word (Word8)" - , "import GHC.Exts (Ptr(..))" - , "import Unicode.Internal.Bits (lookupBit64)" - , "" - ] - - composePair decomps = - [ "{-# NOINLINE compose #-}" - , "compose :: Char -> Char -> Maybe Char" - , unlines decomps - , "compose _ _ = " <> "Nothing" <> "\n" - , "" - ] - - composeStarterPair starterPairs = - [ "composeStarters :: Char -> Char -> Maybe Char" - , unlines starterPairs - , "composeStarters _ _ = " <> "Nothing" <> "\n" - , "" - ] - - isSecondStarter secondStarters = - [genBitmap "isSecondStarter" secondStarters] - - done (dec, sp, ss) = - unlines - $ header - ++ composePair (reverse dec) - ++ composeStarterPair (reverse sp) - ++ isSecondStarter (Set.toList (Set.fromList ss)) - -genSimpleCaseMappingModule - :: Monad m - => String - -> String - -> (DetailedChar -> Maybe Char) - -> Fold m DetailedChar String -genSimpleCaseMappingModule moduleName funcName field = - done <$> Fold.foldl' step initial - - where - - genHeader = - [ apacheLicense 2020 moduleName - , "{-# LANGUAGE LambdaCase #-}" - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(" <> funcName <> ")" - , "where" - , "" - ] - genSign = - [ "{-# NOINLINE " <> funcName <> " #-}" - , funcName <> " :: Char -> Char" - , funcName <> " = \\case" - ] - initial = [] - - step ds dc = case mkEntry dc of - Nothing -> ds - Just d -> d : ds - - after = [" c -> c"] - - done st = - let body = mconcat [genHeader, genSign, reverse st, after] - in unlines body - - mkEntry dc = field dc <&> \c -> mconcat - [ " " - , show (_char dc) - , " -> " - , show c - ] - --- [NOTE] Case mapping encodes up to 3 code points on 21 bits each in an Int64. -genSpecialCaseMappingModule - :: Monad m - => String - -> String - -> SpecialCasings - -- ^ Special casings - -> (SpecialCasing -> String) - -- ^ Special case selector - -> (DetailedChar -> Maybe Char) - -- ^ Simple case selector - -> Fold m DetailedChar String -genSpecialCaseMappingModule moduleName funcName specialCasings special simple = - done <$> Fold.foldl' step initial - - where - - genHeader = - [ apacheLicense 2022 moduleName - , "{-# LANGUAGE LambdaCase #-}" - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(" <> funcName <> ")" - , "where" - , "" - , "import Data.Int (Int64)" - , "" - , "{-# NOINLINE " <> funcName <> " #-}" - , funcName <> " :: Char -> Int64" - , funcName <> " = \\case" - ] - initial = [] - - step xs dc = case mkEntry dc of - Nothing -> xs - Just x -> x : xs - - after = [" _ -> 0"] - - done st = - let body = mconcat [genHeader, reverse st, after] - in unlines body - - mkEntry dc = (mkSpecial dc <|> mkSimple dc) <&> \k -> mconcat - [ " " - , show (_char dc) - , " -> 0x" - , showHex k "" - ] - - mkSimple = fmap ord . simple - mkSpecial = fmap (encode . special) . (specialCasings Map.!?) . _char - encode :: String -> Int - encode - = foldr (\(k, c) -> (+) (ord c `shiftL` k)) 0 - . zip [0, 21, 42] - -- Check min 1 character, max 3 characters - . (\cs -> if null cs || length cs > 3 then error (show cs) else cs) - --- [NOTE] Case folding encodes up to 3 code points on 21 bits each in an Int64. -genCaseFolding - :: Monad m - => String - -> Fold m CaseFoldings String -genCaseFolding moduleName = - done <$> Fold.foldl' step initial - - where - - genHeader = - [ apacheLicense 2022 moduleName - , "{-# LANGUAGE LambdaCase #-}" - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(toCasefold)" - , "where" - , "" - , "import Data.Int (Int64)" - , "" - , "{-# NOINLINE toCasefold #-}" - , "toCasefold :: Char -> Int64" - , "toCasefold = \\case" - ] - initial = [] - - step xs cf = maybe xs (:xs) (mkEntry cf) - - after = [" _ -> 0"] - - done st = - let body = mconcat [genHeader, reverse st, after] - in unlines body - - mkEntry (c, cfs) - = (lookup FullCaseFolding cfs <|> lookup CommonCaseFolding cfs) - <&> \cf -> mconcat - [ " " - , show c - , " -> 0x" - , showHex (encode cf) "" - ] - - encode :: String -> Int - encode - = foldr (\(k, c) -> (+) (ord c `shiftL` k)) 0 - . zip [0, 21, 42] - -- Check min 1 character, max 3 characters - . (\cs -> if null cs || length cs > 3 then error (show cs) else cs) - -genCorePropertiesModule :: - Monad m => String -> (String -> Bool) -> Fold m (String, [Int]) String -genCorePropertiesModule moduleName isProp = - Fold.filter (\(name, _) -> isProp name) $ done <$> Fold.foldl' step initial - - where - - prop2FuncName x = "is" ++ x - - initial = ([], []) - - step (props, bitmaps) (name, bits) = - (name : props, genBitmap (prop2FuncName name) bits : bitmaps) - - done (props, bitmaps) = unlines $ header props ++ bitmaps - - header exports = - [ apacheLicense 2020 moduleName - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(" ++ unwords (intersperse "," (map prop2FuncName exports)) ++ ")" - , "where" - , "" - , "import Data.Char (ord)" - , "import Data.Word (Word8)" - , "import GHC.Exts (Ptr(..))" - , "import Unicode.Internal.Bits (lookupBit64)" - , "" - ] - -genNamesModule - :: Monad m - => String - -> Fold m CharName String -genNamesModule moduleName = - done <$> Fold.foldl' addCharName initial - - where - - initial :: (Char, [String], [[Int]], Int, Char, Char) - initial = ('\0', mempty, mempty, 0, '\0', '\0') - - addCharName - (expected, names, offsets, offset, maxPlanes03, maxPlane14) - entry@(CharName char name) = if expected < char - then if expected < '\xE0000' && char >= '\xE0000' - then addCharName - ( '\xE0000' - , names - , offsets - , offset - , maxPlanes03 - , maxPlane14 ) - entry - else addCharName - ( succ expected - , names - , encodeOffset 0 0 : offsets - , offset - , maxPlanes03 - , maxPlane14 ) - entry - else - let !(name', len, len', compressed) = encodeName name - in if (char < '\x40000' || char >= '\xE0000') && - offset <= 0xffffff && (len < hangul || compressed) - then - ( succ expected - , name' : names - , encodeOffset offset len : offsets - , offset + len' - , if char < '\x40000' - then max maxPlanes03 char - else maxPlanes03 - , max maxPlane14 char ) - else error (mconcat - [ "genNamesModule: Cannot encode '\\x" - , showHexCodepoint char - , "' “", name, "”" - , " (offset: 0x" - , showPaddedHeX offset - , ", length: 0x" - , showPaddedHeX len - , ")" ]) - - cjkCompat = 0xf0 - cjkUnified = 0xf1 - tangut = 0xf2 - hangul = 0x80 - - encodeName name - | take 28 name == "CJK COMPATIBILITY IDEOGRAPH-" = ("", cjkCompat, 0, True) - | take 22 name == "CJK UNIFIED IDEOGRAPH-" = ("", cjkUnified, 0, True) - | take 17 name == "TANGUT IDEOGRAPH-" = ("", tangut, 0, True) - | take 16 name == "HANGUL SYLLABLE " = - let !name' = drop 16 name; !len = length name' - in if len <= 12 - then (name', hangul + len, len, True) - else error ("genNamesModule: cannot encode Hangul: " <> show len) - | otherwise = let !len = length name in (name, len, len, False) - - encodeOffset offset len = encode32LE offset' mempty - where !offset' = len .|. (offset `shiftL` 8) - encode32LE v acc - = (v .&. 0xff) - : (v `shiftR` 8 .&. 0xff) - : (v `shiftR` 16 .&. 0xff) - : v `shiftR` 24 - : acc - - done (_, names, offsets, _, maxPlanes03, maxPlane14) = unlines - [ apacheLicense 2022 moduleName - , "{-# LANGUAGE PatternSynonyms #-}" - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , " ( name" - , " , pattern NoName" - , " , pattern CjkCompatibilityIdeograph" - , " , pattern CjkUnifiedIdeograph" - , " , pattern TangutIdeograph" - , " , pattern HangulSyllable" - , " ) where" - , "" - , "import Data.Int (Int32)" - , "import Foreign.C (CChar)" - , "import GHC.Exts" - , " ( Addr#, Char#, Int#, Ptr(..)," - , " ord#, (-#), (<#)," - , " uncheckedIShiftRL#, andI#," - , " plusAddr#, isTrue# )" - , "import Unicode.Internal.Bits.Names (lookupInt32#)" - , "" - , "-- | No name. Used to test length returned by 'name'." - , "--" - , "-- @since 0.3.0" - , "pattern NoName :: Int#" - , "pattern NoName = 0#" - , "" - , "-- | CJK compatibility ideograph. Used to test the length returned by 'name'." - , "--" - , "-- @since 0.3.0" - , "pattern CjkCompatibilityIdeograph :: Int#" - , "pattern CjkCompatibilityIdeograph = 0x" <> showHex cjkCompat "#" - , "" - , "-- | CJK unified ideograph. Used to test the length returned by 'name'." - , "--" - , "-- @since 0.3.0" - , "pattern CjkUnifiedIdeograph :: Int#" - , "pattern CjkUnifiedIdeograph = 0x" <> showHex cjkUnified "#" - , "" - , "-- | Tangut ideograph. Used to test the length returned by 'name'." - , "--" - , "-- @since 0.3.0" - , "pattern TangutIdeograph :: Int#" - , "pattern TangutIdeograph = 0x" <> showHex tangut "#" - , "" - , "-- | Hangul syllable. Used to test the length returned by 'name'." - , "--" - , "-- @since 0.3.0" - , "pattern HangulSyllable :: Int#" - , "pattern HangulSyllable = 0x" <> showHex hangul "#" - , "" - , "-- | Name of a character, if defined." - , "--" - , "-- The return value represents: (ASCII string, string length or special value)." - , "--" - , "-- Some characters require specific processing:" - , "--" - , "-- * If length = @'CjkCompatibilityIdeograph'@," - , "-- then the name is generated from the pattern “CJK COMPATIBILITY IDEOGRAPH-*”," - , "-- where * is the hexadecimal codepoint." - , "-- * If length = @'CjkUnifiedIdeograph'@," - , "-- then the name is generated from the pattern “CJK UNIFIED IDEOGRAPH-*”," - , "-- where * is the hexadecimal codepoint." - , "-- * If length = @'TangutIdeograph'@," - , "-- then the name is generated from the pattern “TANGUT IDEOGRAPH-*”," - , "-- where * is the hexadecimal codepoint." - , "-- * If length ≥ @'HangulSyllable'@," - , "-- then the name is generated by prepending “HANGUL SYLLABLE ”" - , "-- to the returned string." - , "--" - , "-- See an example of such implementation using 'String's in 'Unicode.Char.General.Names.name'." - , "--" - , "-- @since 0.1.0" - , "{-# INLINE name #-}" - , "name :: Char# -> (# Addr#, Int# #)" - , "name c#" - , " | isTrue# (cp# <# 0x" - <> showHexCodepoint (succ maxPlanes03) - <> "#) = getName cp#" - , " | isTrue# (cp# <# 0xE0000#) = (# \"\\0\"#, 0# #)" - , " | isTrue# (cp# <# 0x" - <> showHexCodepoint (succ maxPlane14) - <> "#) = getName (cp# -# 0x" - <> showPaddedHeX (0xE0000 - ord (succ maxPlanes03)) - <> "#)" - , " | otherwise = (# \"\\0\"#, 0# #)" - , "" - , " where" - , "" - , " -- [NOTE] Encoding" - , " -- • The names are ASCII. Each name is encoded as a raw bytes literal." - , " -- • The names are concatenated in names#." - , " -- There are exceptions (see function’s doc)." - , " -- • The name of a character, if defined, is referenced by an offset in names#." - , " -- • The offsets are stored in offsets#. A character entry is composed of:" - , " -- • a LE Word24 for the offset;" - , " -- • a Word8 for the length of the name or a special value." - , "" - , " !cp# = ord# c#" - , "" - , " {-# INLINE getName #-}" - , " getName k# =" - , " let !entry# = lookupInt32# offsets# k#" - , " !offset# = entry# `uncheckedIShiftRL#` 8#" - , " !name# = names# `plusAddr#` offset#" - , " !len# = entry# `andI#` 0xff#" - , " in (# name#, len# #)" - , "" - , " !(Ptr names#) = namesBitmap" - , " !(Ptr offsets#) = offsetsBitmap" - , "" - , "namesBitmap :: Ptr CChar" - , "namesBitmap = Ptr" - , " \"" - <> chunkAddrLiteral 4 0xff shows' (mconcat (reverse names)) "\"#" - , "" - , "offsetsBitmap :: Ptr Int32" - , "offsetsBitmap = Ptr" - , " \"" - <> enumMapToAddrLiteral 4 0xff (mconcat (reverse offsets)) "\"#" - ] - where - shows' = \case - '\0' -> \s -> '\\' : '0' : s - c -> (c :) -- Note: names are ASCII - -genAliasesModule - :: Monad m - => String - -> Fold m CharAliases String -genAliasesModule moduleName = - done <$> Fold.foldr ((:) . mkCharAliases) mempty - - where - - mkCharAliases :: CharAliases -> String - mkCharAliases (CharAliases char aliases) = mconcat - [ " '\\x" - , showHexCodepoint char - , "'# -> \"" - , mkCharAliasesLiteral char aliases - , "\"#" - ] - - mkCharAliasesLiteral :: Char -> Aliases -> String - mkCharAliasesLiteral char aliasesList = - enumMapToAddrLiteral 0 0xfff - (reverse index) - (mconcat (reverse ("\\0":aliases))) - where - (index, aliases, _) = Map.foldlWithKey' - (addAliasType char) - (mempty, mempty, Map.size aliasesMap) - aliasesMap - -- Group aliases by type - aliasesMap = foldr - (\(ty, a) -> Map.adjust (a:) ty) - (Map.fromSet (const []) (Set.fromList [minBound..maxBound])) - aliasesList - - -- [FIXME] [(Word8:AliasType,Word8:index of first alias)] [CString] - addAliasType - :: Char - -> ([Word8], [String], Int) -- (index, aliases, last alias index) - -> AliasType - -> [Alias] - -> ([Word8], [String], Int) - addAliasType char (index, aliasesAcc, lastAliasIndex) _ty = \case - [] -> - ( 0 : index - , aliasesAcc - , lastAliasIndex ) - aliases -> if lastAliasIndex < 0xff - then - ( fromIntegral lastAliasIndex : index - , encodedAliases - , lastAliasIndex' ) - else error . mconcat $ - [ "Cannot encode char ", show char - , "aliases. Offset: ", show lastAliasIndex, " >= 0xff" ] - where - (encodedAliases, lastAliasIndex') = - addEncodedAliases (aliasesAcc, lastAliasIndex) aliases - addEncodedAliases (as, offset) = \case - Alias alias : rest -> addEncodedAliases - ( mconcat ["\\", show len, alias] : as - , offset' ) - rest - where - len = length alias - offset' = offset + len + 1 - [] -> ("\\0" : as, offset + 1) - - done names = unlines - [ apacheLicense 2022 moduleName - , "{-# LANGUAGE DeriveGeneric, PatternSynonyms #-}" - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(NameAliasType(..), pattern MaxNameAliasType, nameAliases)" - , "where" - , "" - , "import Data.Ix (Ix)" - , "import GHC.Exts (Addr#, Char#, Int#)" - , "import GHC.Generics (Generic)" - , "" - , "-- | Type of name alias. See Unicode Standard 15.0.0, section 4.8." - , "--" - , "-- @since 0.1.0" - , "data NameAliasType" - , " = Correction" - , " -- ^ Corrections for serious problems in the character names." - , " | Control" - , " -- ^ ISO 6429 names for @C0@ and @C1@ control functions, and other" - , " -- commonly occurring names for control codes." - , " | Alternate" - , " -- ^ A few widely used alternate names for format characters." - , " | Figment" - , " -- ^ Several documented labels for @C1@ control code points which" - , " -- were never actually approved in any standard." - , " | Abbreviation" - , " -- ^ Commonly occurring abbreviations (or acronyms) for control codes," - , " -- format characters, spaces, and variation selectors." - , " deriving (Generic, Enum, Bounded, Eq, Ord, Ix, Show)" - , "" - , "-- $setup" - , "-- >>> import GHC.Exts (Int(..))" - , "" - , "-- |" - , "-- >>> I# MaxNameAliasType == fromEnum (maxBound :: NameAliasType)" - , "-- True" - , "pattern MaxNameAliasType :: Int#" - , "pattern MaxNameAliasType = " - <> show (fromEnum (maxBound :: AliasType)) <> "#" - , "" - , "-- | Detailed character names aliases." - , "-- The names are listed in the original order of the UCD." - , "--" - , "-- Encoding:" - , "--" - , "-- * If there is no alias, return @\"\\\\xff\"#@." - , "-- * For each type of alias, the aliases are encoded as list of (length, alias)." - , "-- The list terminates with @\\\\0@." - , "-- * The list are then concatenated in order of type of alias and" - , "-- terminates with @\\\\0@." - , "-- * The first " - <> show (fromEnum (maxBound :: AliasType) + 1) - <> " bytes represent each one the index of the first element of the" - , "-- corresponding list of aliases. When the list is empty, then the index is 0." - , "-- * Example: @\\\"\\\\5\\\\0\\\\13\\\\0\\\\0\\\\3XXX\\\\2YY\\\\0\\\\4ZZZZ\\\\0\\\\0\\\"#@" - , "-- represents: @[('Correction',[\\\"XXX\\\", \\\"YY\\\"]),('Alternate', [\\\"ZZZZ\\\"])]@." - , "--" - , "-- @since 0.1.0" - , "nameAliases :: Char# -> Addr#" - , "nameAliases = \\case" - , mconcat (intersperse "\n" names) - , " _ -> \"\\xff\"#" - ] - -genNumericValuesModule - :: Monad m - => String - -> Fold m CharNumericValue String -genNumericValuesModule moduleName = - done . foldMap mkNumericValue . sort <$> Fold.toListRev - - where - - mkNumericValue (char, value) = mconcat - [ "\n " - , show char - , " -> " - , either show show (bimap Just Just value) - ] - - done values = unlines - [ apacheLicense 2022 moduleName - , "{-# LANGUAGE LambdaCase #-}" - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(numericValue)" - , "where" - , "" - , "import Data.Ratio ((%))" - , "" - , "numericValue :: Char -> Maybe Rational" - , "numericValue = \\case" <> values - , " _ -> Nothing" - ] - -genIdentifierStatusModule - :: Monad m - => String - -> Fold m CharIdentifierStatus String -genIdentifierStatusModule moduleName = - done <$> Fold.lmap mkAllowed Fold.toList - - where - - mkAllowed = \case - CharIdentifierStatus c Allowed -> ord c - x -> error ("Unexpected " <> show x) - - done values = unlines - [ apacheLicense 2022 moduleName - , "{-# LANGUAGE LambdaCase #-}" - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(isAllowedInIdentifier)" - , "where" - , "" - , "import Data.Char (ord)" - , "import Data.Word (Word8)" - , "import GHC.Exts (Ptr(..))" - , "import Unicode.Internal.Bits (lookupBit64)" - , "" - , genBitmap "isAllowedInIdentifier" values - ] - -genIdentifierTypeModule - :: Monad m - => String - -> Fold m CharIdentifierTypes String -genIdentifierTypeModule moduleName = - done . mkIdentifiersTypes <$> Fold.foldl' addIdentifierType mempty - - where - - addIdentifierType - :: Map.Map Char [IdentifierType] - -> CharIdentifierTypes - -> Map.Map Char [IdentifierType] - addIdentifierType acc (CharIdentifierTypes c types) = - Map.insertWith (flip (<>)) c types acc - - mkIdentifiersTypes - :: Map.Map Char [IdentifierType] - -> (String, [Int]) - mkIdentifiersTypes types = - let encoding = Set.toList (Set.fromList (def : Map.elems types)) - in assert (length encoding < 0xff) - ( foldMap addEncoding (zip [0..] encoding) - , snd (Map.foldlWithKey' (addChar encoding) ('\0', mempty) types) ) - - -- Default value - def = [Not_Character] - - addEncoding :: (Int, [IdentifierType]) -> String - addEncoding (n, e) = mconcat - [ "\n " - , show n - , " -> " - , hackHaskellConstructor e ] - - addChar - :: [[IdentifierType]] - -> (Char, [Int]) - -> Char - -> [IdentifierType] - -> (Char, [Int]) - addChar encoding (expected, acc) c types = if expected < c - then - let acc' = encodeTypes encoding def : acc - in addChar encoding (succ expected, acc') c types - else (succ c, encodeTypes encoding types : acc) - - encodeTypes :: [[IdentifierType]] -> [IdentifierType] -> Int - encodeTypes encoding types - = assert (elemIndex def encoding == Just 0) - $ fromMaybe 0 (elemIndex types encoding) - - hackHaskellConstructor = filter (/= '_') . show - - done (encoding, identifiersTypes) = unlines - [ apacheLicense 2022 moduleName - , "{-# LANGUAGE LambdaCase, OverloadedLists #-}" - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(IdentifierType(..), identifierTypes, decodeIdentifierTypes)" - , "where" - , "" - , "import Data.Char (ord)" - , "import Data.List.NonEmpty (NonEmpty)" - , "import Data.Word (Word8)" - , "import GHC.Exts (Ptr(..))" - , "import Unicode.Internal.Bits (lookupIntN)" - , "" - , "-- | Identifier type" - , "--" - , "-- @since 0.1.0" - , "data IdentifierType" - , " = NotCharacter" - , " -- ^ Unassigned characters, private use characters, surrogates," - , " -- non-whitespace control characters." - , " | Deprecated" - , " -- ^ Characters with the Unicode property @Deprecated=Yes@." - , " | DefaultIgnorable" - , " -- ^ Characters with the Unicode property \ - \@Default_Ignorable_Code_Point=Yes@." - , " | NotNFKC" - , " -- ^ Characters that cannot occur in strings normalized to NFKC." - , " | NotXID" - , " -- ^ Characters that do not qualify as default Unicode identifiers;" - , " -- that is, they do not have the Unicode property XID_Continue=True." - , " | Exclusion" - , " -- ^ Characters with @Script_Extensions@ values containing a script" - , " -- in /Excluded Scripts/, and no script from /Limited Use Scripts/" - , " -- or /Recommended Scripts/, other than “Common” or “Inherited”." - , " | Obsolete" - , " -- ^ Characters that are no longer in modern use, or that are not" - , " -- commonly used in modern text." - , " | Technical" - , " -- ^ Specialized usage: technical, liturgical, etc." - , " | UncommonUse" - , " -- ^ Characters that are uncommon, or are limited in use, or" - , " -- whose usage is uncertain." - , " | LimitedUse" - , " -- ^ Characters from scripts that are in limited use." - , " | Inclusion" - , " -- ^ Exceptionally allowed characters." - , " | Recommended" - , " -- ^ Characters from scripts that are in widespread everyday common use." - , " deriving (Eq, Ord, Bounded, Enum, Show)" - , "" - , "-- | Useful to decode the output of 'identifierTypes'." - , "decodeIdentifierTypes :: Int -> NonEmpty IdentifierType" - , "decodeIdentifierTypes = \\case" <> encoding - , " _ -> " <> hackHaskellConstructor def - , "" - , "-- | Returns the 'IdentifierType's corresponding to a character." - , genEnumBitmap - "identifierTypes" - (def', show (fromEnum def')) - (def', show (fromEnum def')) - planes0To3 - plane14 - ] - where - def' = 0 - (planes0To3, plane14) = splitPlanes - "Cannot generate: genIdentifierTypeModule" - (== def') - (reverse identifiersTypes) - -genConfusablesModule - :: Monad m - => String - -> Fold m Confusable String -genConfusablesModule moduleName = - done . foldMap mkConfusable . sort <$> Fold.toList - - where - - mkConfusable :: Confusable -> String - mkConfusable (Confusable c s) = mconcat - [ "\n " - , show c - , " -> Just (Ptr \"" - , stringToAddrLiteral s - , "\\0\"#)" - ] - - -- Encode string as a null-terminated utf-8 - stringToAddrLiteral = foldMap toWord8 . encodeUtf8 - -- [HACK] Encode in utf-8, then decode in latin1 in order to get bytes - encodeUtf8 s - = unsafePerformIO - $ Foreign.withCString Encoding.utf8 s - (Foreign.peekCString Encoding.latin1) - toWord8 = ('\\' :) . show . ord - - done confusables = unlines - [ apacheLicense 2022 moduleName - , "{-# LANGUAGE LambdaCase #-}" - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(confusablePrototype)" - , "where" - , "" - , "import Foreign.C.String (CString)" - , "import GHC.Exts (Ptr(..))" - , "" - , "-- | Returns the /prototype/ of a character, if it is confusable." - , "--" - , "-- The resulting 'CString' is null-terminated and encoded in UTF-8." - , "--" - , "-- @since 0.1.0" - , "confusablePrototype :: Char -> Maybe CString" - , "confusablePrototype = \\case" <> confusables - , " _ -> Nothing" - ] - -genIntentionalConfusablesModule - :: Monad m - => String - -> Fold m IntentionalConfusable String -genIntentionalConfusablesModule moduleName = - done . Map.foldMapWithKey mkConfusable <$> Fold.foldl' addEntry mempty - - where - - addEntry - :: Map.Map Char (Set.Set Char) - -> IntentionalConfusable - -> Map.Map Char (Set.Set Char) - addEntry acc (IntentionalConfusable c1 c2) - = Map.insertWith (flip (<>)) c1 (Set.singleton c2) - . Map.insertWith (flip (<>)) c2 (Set.singleton c1) - $ acc - - mkConfusable :: Char -> Set.Set Char -> String - mkConfusable c cs = mconcat - [ "\n " - , show c - , " -> Just (Ptr \"" - , stringToAddrLiteral (Set.toList cs) - , "\\0\"#)" - ] - - -- Encode string as a null-terminated utf-8 - stringToAddrLiteral = foldMap toWord8 . encodeUtf8 - -- [HACK] Encode in utf-8, then decode in latin1 in order to get bytes - encodeUtf8 s - = unsafePerformIO - $ Foreign.withCString Encoding.utf8 s - (Foreign.peekCString Encoding.latin1) - toWord8 = ('\\' :) . show . ord - - done confusables = unlines - [ apacheLicense 2022 moduleName - , "{-# LANGUAGE LambdaCase #-}" - , "{-# OPTIONS_HADDOCK hide #-}" - , "" - , "module " <> moduleName - , "(intentionalConfusables)" - , "where" - , "" - , "import Foreign.C.String (CString)" - , "import GHC.Exts (Ptr(..))" - , "" - , "-- | Returns the /intentional/ confusables of a character, if any." - , "--" - , "-- The resulting 'CString' is null-terminated and encoded in UTF-8." - , "--" - , "-- @since 0.1.0" - , "intentionalConfusables :: Char -> Maybe CString" - , "intentionalConfusables = \\case" <> confusables - , " _ -> Nothing" - ] - -------------------------------------------------------------------------------- --- Parsing blocks file -------------------------------------------------------------------------------- - -type BlockLine = (String, (Int, Int)) - -parseBlockLine :: String -> Maybe BlockLine -parseBlockLine ln - | null ln = Nothing - | head ln == '#' = Nothing - | otherwise = Just (parseLine ln) - - where - - parseLine line = - let (rangeLn, line1) = span (/= ';') line - name = takeWhile (/= '#') (tail line1) - - in (trim' name, parseRange (trim rangeLn)) - - parseRange - = bimap parseCodePoint (parseCodePoint . drop 2) - . span (/= '.') - - parseCodePoint = read . ("0x" <>) - -parseBlockLines - :: (IsStream t, Monad m) - => t m String - -> t m BlockLine -parseBlockLines = Stream.mapMaybe parseBlockLine - -------------------------------------------------------------------------------- --- Parsing script file -------------------------------------------------------------------------------- - -type ScriptLine = (String, CharRange) - -parseScriptLine :: String -> Maybe ScriptLine -parseScriptLine ln - | null ln = Nothing - | head ln == '#' = Nothing - | otherwise = Just (parseLine ln) - - where - - parseLine line = - let (rangeLn, line1) = span (/= ';') line - script = takeWhile (/= '#') (tail line1) - - in (trim script, parseRange (trim rangeLn)) - - parseRange :: String -> CharRange - parseRange - = (\(c1, c2) -> maybe (Left c1) (Right . (c1,)) c2) - . bimap readCodePoint (readCodePointM . drop 2) - . span (/= '.') - -parseScriptLines - :: (IsStream t, Monad m) - => t m String - -> t m ScriptLine -parseScriptLines = Stream.mapMaybe parseScriptLine - -------------------------------------------------------------------------------- --- Parsing ScriptExtensions.txt -------------------------------------------------------------------------------- - -type ScriptExtensionsLine = (CharRangeStream, [String]) - -data CharScriptExtensions = CharScriptExtensions - { _scriptExtensionsChar :: !Char - , _scriptExtensionsScripts :: ![String] } -type ScriptExtensions = Map.Map Char [String] - -parseScriptExtensionsLine :: String -> Maybe ScriptExtensionsLine -parseScriptExtensionsLine = \case - "" -> Nothing -- empty line - '#':_ -> Nothing -- comment - line -> Just (parseLine line) - - where - - parseLine line = - let (rangeLn, line1) = span (/= ';') line - range = parseCodePointRange rangeLn - scripts = words (takeWhile (/= '#') (tail line1)) - in (either SingleChar (uncurry CharRange) range, scripts) - -parseScriptExtensionsLines - :: (IsStream t, Monad m) - => t m String - -> t m CharScriptExtensions -parseScriptExtensionsLines - = Stream.unfoldMany (Unfold.unfoldr mkScriptExtension) - . Stream.mapMaybe parseScriptExtensionsLine - - where - - mkScriptExtension - :: ScriptExtensionsLine - -> Maybe (CharScriptExtensions, ScriptExtensionsLine) - mkScriptExtension (step, scripts) = case step of - SingleChar c -> Just (CharScriptExtensions c scripts, (Stop, mempty)) - CharRange c1 c2 -> Just ( CharScriptExtensions c1 scripts - , if c1 < c2 - then (CharRange (succ c1) c2, scripts) - else (Stop, mempty) ) - Stop -> Nothing - -------------------------------------------------------------------------------- --- Parsing property files -------------------------------------------------------------------------------- - -trim :: String -> String -trim = takeWhile (not . isSpace) . dropWhile isSpace - -trim' :: String -> String -trim' = dropWhileEnd isSpace . dropWhile isSpace - -type PropertyLine = (String, [Int]) - -emptyPropertyLine :: PropertyLine -emptyPropertyLine = ("", []) - -combinePropertyLines :: PropertyLine -> PropertyLine -> PropertyLine -combinePropertyLines t1@(n1, o1) t2@(n2, o2) - | n1 == "" = t2 - | n2 == "" = t1 - | n1 == n2 = (n1, o1 ++ o2) - | otherwise = error $ "Cannot group " ++ n1 ++ " with " ++ n2 - -parsePropertyLine :: String -> PropertyLine -parsePropertyLine ln - | null ln = emptyPropertyLine - | head ln == '#' = emptyPropertyLine - | otherwise = parseLineJ ln - - where - - parseLineJ :: String -> (String, [Int]) - parseLineJ line = - let (rangeLn, line1) = span (/= ';') line - propLn = takeWhile (/= '#') (tail line1) - in (trim propLn, parseRange (trim rangeLn)) - - parseRange :: String -> [Int] - parseRange rng = - if '.' `elem` rng - then let low = read $ "0x" ++ takeWhile (/= '.') rng - high = - read $ "0x" ++ reverse (takeWhile (/= '.') (reverse rng)) - in [low .. high] - else [read $ "0x" ++ rng] - -isDivider :: String -> Bool -isDivider x = x == "# ================================================" - -parsePropertyLines :: (IsStream t, Monad m) => t m String -> t m PropertyLine -parsePropertyLines = - Stream.splitOn isDivider - $ Fold.lmap parsePropertyLine - $ Fold.foldl' combinePropertyLines emptyPropertyLine - -------------------------------------------------------------------------------- --- Parsing UnicodeData.txt -------------------------------------------------------------------------------- - -data PropertyValueAliasesLine = PropertyValueAliasesLine - { _prop :: !String - , _propValue :: !String - , _propValueAliases :: [String] } - -type PropertyValuesAliases = Map.Map String [String] -type PropertyValuesAliasesEntry = (String, PropertyValuesAliases) - -parsePropertyValueAliasesLine :: String -> Maybe PropertyValueAliasesLine -parsePropertyValueAliasesLine = \case - "" -> Nothing -- empty line - '#':_ -> Nothing -- comment - line -> case split line of - prop : value : as -> Just (PropertyValueAliasesLine prop value as) - _ -> error ("Unsupported line: " <> line) - - where - - split s = case s' of - "" -> [v'] - '#':_ -> [v'] - s'' -> v' : split (tail s'') - where (v, s') = span (\c -> c /= ';' && c /= '#') (dropWhile isSpace s) - v' = dropWhileEnd isSpace v - -parsePropertyValueAliasesLines - :: forall t m. (IsStream t, Monad m) - => t m String - -> t m PropertyValuesAliasesEntry -parsePropertyValueAliasesLines - = Stream.groupsBy ((==) `on` _prop) (Fold.foldr addEntry mempty) - . Stream.mapMaybe parsePropertyValueAliasesLine - - where - - addEntry (PropertyValueAliasesLine prop value aliases) (_, acc) = - ( prop - , Map.insert value aliases acc ) - -------------------------------------------------------------------------------- --- Parsing UnicodeData.txt -------------------------------------------------------------------------------- - --- | A range entry in @UnicodeData.txt@. -data UnicodeDataRange - = SingleCode !DetailedChar - -- ^ Regular entry for one code point - | FirstCode !String !DetailedChar - -- ^ A partial range for entry with a name as: @\@ - | CompleteRange !String !DetailedChar !DetailedChar - -- ^ A complete range, requiring 2 continuous entries with respective names: - -- - -- * @\@ - -- * @\@ - -{-| Parse UnicodeData.txt lines - -Parse ranges according to https://www.unicode.org/reports/tr44/#Code_Point_Ranges. - -__Note:__ this does /not/ fill missing char entries, -i.e. entries with no explicit entry nor within a range. --} -parseUnicodeDataLines :: forall t m. (IsStream t, Monad m) => t m String -> t m DetailedChar -parseUnicodeDataLines - = Stream.unfoldMany (Unfold.unfoldr unitToRange) - . Stream.foldMany ( Fold.lmap parseDetailedChar - $ Fold.mkFold_ step initial ) - - where - - step :: Maybe UnicodeDataRange - -> DetailedChar - -> Fold.Step (Maybe UnicodeDataRange) (Maybe UnicodeDataRange) - step Nothing dc = case span (/= ',') (_name dc) of - (range, ", First>") -> Fold.Partial (Just (FirstCode range dc)) - _ -> Fold.Done (Just (SingleCode dc)) - step (Just (FirstCode range1 dc1)) dc2 = case span (/= ',') (_name dc2) of - (range2, ", Last>") -> if range1 == range2 && _char dc1 < _char dc2 - then Fold.Done (Just (CompleteRange range1 dc1 dc2)) - else error $ "Cannot create range: incompatible ranges" <> show (dc1, dc2) - _ -> error $ "Cannot create range: missing entry corresponding to: " <> show range1 - step _ _ = error "impossible case" - - initial :: Fold.Step (Maybe UnicodeDataRange) (Maybe UnicodeDataRange) - initial = Fold.Partial Nothing - - unitToRange :: Maybe UnicodeDataRange -> Maybe (DetailedChar, Maybe UnicodeDataRange) - unitToRange = fmap $ \case - SingleCode dc -> (dc, Nothing) - FirstCode _ dc -> error $ "Incomplete range: " <> show dc - CompleteRange range dc1 dc2 -> if _char dc1 < _char dc2 - -- [NOTE] We do not need to create real names for ranges, - -- as the field `_name` is only used to detect ranges. - -- We use the file `DerivedName.txt` to get the proper names - -- (see `parseDerivedNameLine`). - then (dc1{_name="XXX"}, Just (CompleteRange range dc1{_char=succ (_char dc1)} dc2)) - else (dc2{_name="XXX"}, Nothing) - --- | Parse a single entry of @UnicodeData.txt@ -parseDetailedChar :: String -> DetailedChar -parseDetailedChar line = - DetailedChar - { _char = readCodePoint char - , _name = name - , _generalCategory = read gc - , _combiningClass = read combining - , _decompositionType = dctype - , _decomposition = dcval - , _simpleUpperCaseMapping = readCodePointM sUpper - , _simpleLowerCaseMapping = readCodePointM sLower - , _simpleTitleCaseMapping = readCodePointM sTitle - } - - where - - (char, line1) = span (/= ';') line - (name, line2) = span (/= ';') (tail line1) - (gc, line3) = span (/= ';') (tail line2) - (combining, line4) = span (/= ';') (tail line3) - (_bidi, line5) = span (/= ';') (tail line4) - (decomposition, line6) = span (/= ';') (tail line5) - (dctype, dcval) = readDecomp decomposition - (_decimal, line7) = span (/= ';') (tail line6) - (_digit, line8) = span (/= ';') (tail line7) - (_numeric, line9) = span (/= ';') (tail line8) - (_bidiM, line10) = span (/= ';') (tail line9) - (_uni1Name, line11) = span (/= ';') (tail line10) - (_iso, line12) = span (/= ';') (tail line11) - (sUpper, line13) = span (/= ';') (tail line12) - (sLower, line14) = span (/= ';') (tail line13) - sTitle = tail line14 - -------------------------------------------------------------------------------- --- Parse SpecialCasing.txt -------------------------------------------------------------------------------- - --- type SpecialCasings = Map.Map Char [SpecialCasing] -type SpecialCasings = Map.Map Char SpecialCasing - -data SpecialCasing = SpecialCasing - { _scChar :: Char - , _scLower :: String - , _scTitle :: String - , _scUpper :: String - -- , _scConditions :: [SpecialCasingCondition] - } - -parseSpecialCasingLines - :: forall m. (Monad m) - => SerialT m String - -> m SpecialCasings -parseSpecialCasingLines - = Stream.fold - ( Fold.mapMaybe parseSpecialCasing - $ Fold.foldl' combineSpecialCasings mempty - ) - - where - - -- combineSpecialCasings acc x = Map.insertWith (++) (_scChar x) [x] acc - combineSpecialCasings acc sc = Map.insert (_scChar sc) sc acc - -parseSpecialCasing :: String -> Maybe SpecialCasing -parseSpecialCasing line - | null line = Nothing - | head line == '#' = Nothing - -- Keep only entries without condititions - | null conditions = Just specialCasing - | otherwise = Nothing - - where - - (rawChar, line1) = span (/= ';') line - char = readCodePoint rawChar - (rawLower, line2) = span (/= ';') (tail line1) - lower = toChars rawLower - (rawTitle, line3) = span (/= ';') (tail line2) - title = toChars rawTitle - (rawUpper, line4) = span (/= ';') (tail line3) - upper = toChars rawUpper - (rawConditions, _line5) = span (/= ';') (tail line4) - (rawConditions', _comment) = span (/= '#') rawConditions - conditions = words (trim' rawConditions') - specialCasing = SpecialCasing - { _scChar = char - , _scLower = lower - , _scTitle = title - , _scUpper = upper - -- , _scConditions = conditions - } - - toChars = fmap readCodePoint . words - -------------------------------------------------------------------------------- --- Parsing CaseFolding.txt -------------------------------------------------------------------------------- - -data CaseFoldingType - = CommonCaseFolding - | FullCaseFolding - | SimpleCaseFolding - | SpecialCaseFolding - deriving (Eq, Ord) - -type CaseFoldings = (Char, [(CaseFoldingType, String)]) -type CaseFoldingLine = (Char, CaseFoldingType, String) - -parseCaseFoldingLines - :: forall t m. (IsStream t, Monad m) - => t m String - -> t m CaseFoldings -parseCaseFoldingLines - = Stream.groupsBy sameChar combine - . Stream.mapMaybe parseCaseFoldingLine - - where - - sameChar (c1, _, _) (c2, _, _) = c1 == c2 - - combine = Fold.foldr - (\(c, ty, cs) (_, xs) -> (c, (ty, cs):xs)) - ('\0', mempty) - -parseCaseFoldingLine :: String -> Maybe CaseFoldingLine -parseCaseFoldingLine line - | null line = Nothing - | head line == '#' = Nothing - | otherwise = Just (char, caseFoldingType, caseFolding) - - where - - (rawChar, line1) = span (/= ';') line - char = readCodePoint rawChar - (rawCaseFoldType, line2) = span (/= ';') (tail line1) - caseFoldingType = case trim rawCaseFoldType of - "C" -> CommonCaseFolding - "F" -> FullCaseFolding - "S" -> SimpleCaseFolding - "T" -> SpecialCaseFolding - ty -> error ("Unsupported case folding type: " <> ty) - (rawCaseFolding, _) = span (/= ';') (tail line2) - caseFolding = toChars rawCaseFolding - - toChars = fmap readCodePoint . words - -------------------------------------------------------------------------------- --- Parsing DerivedNumericValues.txt -------------------------------------------------------------------------------- - -type NumericValue = Either Int Rational -type CharNumericValue = (Char, NumericValue) -type CharRangeNumericValue = (Char, Char, NumericValue) -type DerivedNumericValuesLine = Either CharNumericValue CharRangeNumericValue - -parseDerivedNumericValuesLine :: String -> Maybe DerivedNumericValuesLine -parseDerivedNumericValuesLine line - | null line = Nothing - | head line == '#' = Nothing - | otherwise = - let (range , line1) = span (/= ';') line - (_field1, line2) = span (/= ';') (tail line1) - (_field2, line3) = span (/= ';') (tail line2) - value = takeWhile (/= '#') (tail line3) - value' = parseValue (trim' value) - in Just (bimap (,value') (mkRange value') (parseCodePointRange range)) - - where - - mkRange :: NumericValue -> (Char, Char) -> CharRangeNumericValue - mkRange value (c1, c2) = (c1, c2, value) - - parseValue :: String -> NumericValue - parseValue raw = - let (numerator, denominator) = span (/= '/') raw - in if null denominator - then Left (read numerator) - else Right (read numerator % read (tail denominator)) - -parseDerivedNumericValuesLines - :: (IsStream t, Monad m) - => t m String - -> t m CharNumericValue -parseDerivedNumericValuesLines - = Stream.unfoldMany (Unfold.unfoldr mkCharNumericValue) - . Stream.mapMaybe parseDerivedNumericValuesLine - - where - - mkCharNumericValue - :: DerivedNumericValuesLine - -> Maybe (CharNumericValue, DerivedNumericValuesLine) - mkCharNumericValue = \case - Left charValue -> Just (charValue, Right ((fst charValue), '\0', Left 0)) - Right (c1, c2, value) -> if c1 <= c2 - then Just ((c1, value), Right (succ c1, c2, value)) - else Nothing - -------------------------------------------------------------------------------- --- Parsing Aliases -------------------------------------------------------------------------------- - -data AliasType - = Correction - | Control - | Alternate - | Figment - | Abbreviation - deriving (Enum, Bounded, Eq, Ord, Read, Show) - -newtype Alias = Alias String -instance Show Alias where - show (Alias s) = "Ptr \"" <> s <> "\\0\"#" - -type Aliases = [(AliasType, Alias)] -data CharAliases = CharAliases - { _aChar :: !Char - , _aAliases :: !Aliases } -type AliasesLine = (Char, Alias, AliasType) - -parseAliasesLine :: String -> Maybe AliasesLine -parseAliasesLine line - | null line = Nothing - | head line == '#' = Nothing - | otherwise = - let (char , line1) = span (/= ';') line - (alias, line2) = span (/= ';') (tail line1) - type_ = tail line2 - in Just (readCodePoint char, Alias alias, readAliasType type_) - - where - - readAliasType :: String -> AliasType - readAliasType a = read (toUpper (head a) : tail a) - - -parseAliasesLines :: (IsStream t, Monad m) => t m String -> t m CharAliases -parseAliasesLines - = Stream.groupsBy compareChar - (Fold.foldr combineAliases (CharAliases '\0' mempty)) - . Stream.mapMaybe parseAliasesLine - - where - - compareChar :: AliasesLine -> AliasesLine -> Bool - compareChar (char1, _, _) (char2, _, _) = char1 == char2 - - combineAliases :: AliasesLine -> CharAliases -> CharAliases - combineAliases (char, alias, type_) (CharAliases _ as) = - CharAliases char ((type_, alias):as) - -------------------------------------------------------------------------------- --- Parsing Names -------------------------------------------------------------------------------- - -data CharName = CharName - { _nChar :: !Char - , _nName :: !String } - -type CharRangeName = (Char, Char, String) -type DerivedNameLine = Either CharName CharRangeName - -parseDerivedNameLine :: String -> Maybe DerivedNameLine -parseDerivedNameLine line - | null line = Nothing - | head line == '#' = Nothing - | otherwise = - let (range, line1) = span (/= ';') line - name = trim' (tail line1) - in Just (bimap (`CharName` name) (mkRange name) (parseCodePointRange range)) - where - - mkRange :: String -> (Char, Char) -> CharRangeName - mkRange name (c1, c2) = case elemIndex '*' name of - Nothing -> error - $ "Range name should contain “*”: " - <> show (showHexCodepoint c1, showHexCodepoint c2, name) - Just k -> if k == length name - 1 - then (c1, c2, init name) - else error - $ "Unexpected “*” before the end of range name: " - <> show (showHexCodepoint c1, showHexCodepoint c2, name) - -parseDerivedNameLines - :: forall t m. (IsStream t, Monad m) => - t m String -> - t m CharName -parseDerivedNameLines - = Stream.unfoldMany (Unfold.unfoldr mkCharsNames) - . Stream.mapMaybe parseDerivedNameLine - - where - - mkCharsNames :: DerivedNameLine -> Maybe (CharName, DerivedNameLine) - mkCharsNames = \case - Left named -> Just (named, Right ((_nChar named), '\0', mempty)) - Right (c1, c2, template) -> if c1 <= c2 - then Just ( mkName template c1 - , Right (succ c1, c2, template) ) - else Nothing - - mkName :: String -> Char -> CharName - mkName template char = CharName - { _nChar = char - , _nName = template <> showHexCodepoint char } - -------------------------------------------------------------------------------- --- Parsing Identifier_Status -------------------------------------------------------------------------------- - -data IdentifierStatus = Restricted | Allowed - deriving (Eq, Show, Read) - -data CharIdentifierStatus = CharIdentifierStatus - { _idStatusChar :: !Char - , _idStatus :: !IdentifierStatus } - deriving (Show) - -type IdentifierStatusLine = (Char, Char, IdentifierStatus) - -parseIdentifierStatusLine :: String -> Maybe IdentifierStatusLine -parseIdentifierStatusLine = \case - "" -> Nothing -- empty line - '#':_ -> Nothing -- comment - '\xFEFF':_ -> Nothing -- BOM - line -> - let (rawRange, line1) = span (/= ';') line - line2 = takeWhile (/= '#') (tail line1) - range = parseCodePointRange rawRange - status = read (trim' (tail line2)) - in Just (either (mkRange status . (id &&& id)) (mkRange status) range) - - where - - mkRange :: IdentifierStatus -> (Char, Char) -> IdentifierStatusLine - mkRange status (c1, c2) = (c1, c2, status) - -parseIdentifierStatusLines - :: forall t m. (IsStream t, Monad m) - => t m String - -> t m CharIdentifierStatus -parseIdentifierStatusLines - = Stream.unfoldMany (Unfold.unfoldr mkIdentifiersStatus) - . Stream.mapMaybe parseIdentifierStatusLine - - where - - mkIdentifiersStatus - :: IdentifierStatusLine - -> Maybe (CharIdentifierStatus, IdentifierStatusLine) - mkIdentifiersStatus (c1, c2, status) = if c1 <= c2 - then Just (CharIdentifierStatus c1 status, (succ c1, c2, status)) - else Nothing - -------------------------------------------------------------------------------- --- Parsing Identifier_Type -------------------------------------------------------------------------------- - -data IdentifierType - = Not_Character - | Deprecated - | Default_Ignorable - | Not_NFKC - | Not_XID - | Exclusion - | Obsolete - | Technical - | Uncommon_Use - | Limited_Use - | Inclusion - | Recommended - deriving (Eq, Ord, Bounded, Enum, Show, Read) - -data CharIdentifierTypes = CharIdentifierTypes - { _idTypesChar :: !Char - , _idTypes :: ![IdentifierType] } - deriving (Show) - -type IdentifierTypeLine = (Char, Char, [IdentifierType]) - -parseIdentifierTypeLine :: String -> Maybe IdentifierTypeLine -parseIdentifierTypeLine = \case - "" -> Nothing -- empty line - '#':_ -> Nothing -- comment - '\xFEFF':_ -> Nothing -- BOM - line -> - let (rawRange, line1) = span (/= ';') line - line2 = takeWhile (/= '#') (tail line1) - range = parseCodePointRange rawRange - types = read <$> words (trim' (tail line2)) - in Just (either (mkRange types . (id &&& id)) (mkRange types) range) - - where - - mkRange :: [IdentifierType] -> (Char, Char) -> IdentifierTypeLine - mkRange type_ (c1, c2) = (c1, c2, type_) - -parseIdentifierTypeLines - :: forall t m. (IsStream t, Monad m) - => t m String - -> t m CharIdentifierTypes -parseIdentifierTypeLines - = Stream.unfoldMany (Unfold.unfoldr mkIdentifiersStatus) - . Stream.mapMaybe parseIdentifierTypeLine - - where - - mkIdentifiersStatus - :: IdentifierTypeLine - -> Maybe (CharIdentifierTypes, IdentifierTypeLine) - mkIdentifiersStatus (c1, c2, types) = if c1 <= c2 - then Just (CharIdentifierTypes c1 types, (succ c1, c2, types)) - else Nothing - -------------------------------------------------------------------------------- --- Parsing Confusables -------------------------------------------------------------------------------- - -data Confusable = Confusable - { _confusableChar :: !Char - , _confusablePrototype :: !String } - deriving (Eq, Ord, Show) - -parseConfusablesLine :: String -> Maybe Confusable -parseConfusablesLine = \case - "" -> Nothing -- empty line - '#':_ -> Nothing -- comment - '\xFEFF':_ -> Nothing -- BOM - line -> - let (rawChar, line1) = span (/= ';') line - line2 = takeWhile (/= ';') (tail line1) - char = readCodePoint rawChar - prototype = readCodePoint <$> words (trim' (tail line2)) - in Just (Confusable char prototype) - -parseConfusablesLines - :: forall t m. (IsStream t, Monad m) - => t m String - -> t m Confusable -parseConfusablesLines - = Stream.mapMaybe parseConfusablesLine - -------------------------------------------------------------------------------- --- Parsing Intentional Confusables -------------------------------------------------------------------------------- - -data IntentionalConfusable = IntentionalConfusable - { _intantionConfusableChar :: !Char - , _intantionConfusablePrototype :: !Char } - deriving (Eq, Ord, Show) - -parseIntentionalConfusablesLine :: String -> Maybe IntentionalConfusable -parseIntentionalConfusablesLine = \case - "" -> Nothing -- empty line - '#':_ -> Nothing -- comment - '\xFEFF':_ -> Nothing -- BOM - line -> - let (rawChar, line1) = span (/= ';') line - line2 = takeWhile (/= '#') (tail line1) - char = readCodePoint rawChar - prototype = readCodePoint (trim' (tail line2)) - in Just (IntentionalConfusable char prototype) - -parseIntentionalConfusablesLines - :: forall t m. (IsStream t, Monad m) - => t m String - -> t m IntentionalConfusable -parseIntentionalConfusablesLines - = Stream.mapMaybe parseIntentionalConfusablesLine - -------------------------------------------------------------------------------- --- Generation -------------------------------------------------------------------------------- - -readLinesFromFile :: String -> SerialT IO String -readLinesFromFile file = - withFile file Sys.ReadMode - $ \h -> - Stream.unfold Handle.read h & Unicode.decodeUtf8 - & unicodeLines Fold.toList - - where - - unicodeLines = Stream.splitOnSuffix (== '\n') - - withFile file_ mode = - Stream.bracket (liftIO $ Sys.openFile file_ mode) (liftIO . Sys.hClose) - - -moduleToFileName :: String -> String -moduleToFileName = map (\x -> if x == '.' then '/' else x) - -dirFromFileName :: String -> String -dirFromFileName = reverse . dropWhile (/= '/') . reverse - --- ModuleRecipe is a tuple of the module name and a function that generates the --- module using the module name -type ModuleRecipe a = (String, String -> Fold IO a String) - --- GeneratorRecipe is a list of ModuleRecipe -type GeneratorRecipe a = [ModuleRecipe a] - -fileEmitter :: String -> String -> ModuleRecipe a -> Fold IO a () -fileEmitter file outdir (modName, fldGen) = Fold.rmapM action $ fldGen modName - - where - - pretext version = mconcat - [ "-- autogenerated from https://www.unicode.org/Public/" - , version - , "/ucd/" - , file - ,"\n" - ] - outfile = outdir <> moduleToFileName modName <> ".hs" - outfiledir = dirFromFileName outfile - action c = do - version <- - catch - (getEnv "UNICODE_VERSION") - (\(_ :: IOException) -> return "") - createDirectoryIfMissing True outfiledir - writeFile outfile (pretext version ++ c) - -runGenerator :: - String - -> String - -> (SerialT IO String -> SerialT IO a) - -> String - -> GeneratorRecipe a - -> IO () -runGenerator indir file transformLines outdir recipes = - readLinesFromFile (indir file) & transformLines & Stream.fold combinedFld - - where - - generatedFolds = map (fileEmitter file outdir) recipes - combinedFld = void $ Fold.distribute generatedFolds - -genCoreModules :: String -> String -> [String] -> IO () -genCoreModules indir outdir props = do - - compExclu <- - readLinesFromFile (indir "DerivedNormalizationProps.txt") - & parsePropertyLines - & Stream.find (\(name, _) -> name == "Full_Composition_Exclusion") - & fmap (snd . fromMaybe ("", [])) - - non0CC <- - readLinesFromFile (indir "extracted/DerivedCombiningClass.txt") - & parsePropertyLines - & Stream.filter (\(name, _) -> name /= "0") - & Stream.map snd - & Stream.fold (Fold.foldl' (++) []) - - specialCasings <- - readLinesFromFile (indir "SpecialCasing.txt") - & parseSpecialCasingLines - - runGenerator - indir - "Blocks.txt" - parseBlockLines - outdir - [ blocks ] - - runGenerator - indir - "UnicodeData.txt" - parseUnicodeDataLines - outdir - [ compositions compExclu non0CC - , combiningClass - , decomposable - , decomposableK - , decompositions - , decompositionsK2 - , decompositionsK - , generalCategory - , simpleUpperCaseMapping - , simpleLowerCaseMapping - , simpleTitleCaseMapping - , specialUpperCaseMapping specialCasings - , specialLowerCaseMapping specialCasings - , specialTitleCaseMapping specialCasings - ] - - runGenerator - indir - "PropList.txt" - parsePropertyLines - outdir - [ propList ] - - runGenerator - indir - "DerivedCoreProperties.txt" - parsePropertyLines - outdir - [ derivedCoreProperties ] - - runGenerator - indir - "extracted/DerivedNumericValues.txt" - parseDerivedNumericValuesLines - outdir - [ derivedNumericValues ] - - runGenerator - indir - "CaseFolding.txt" - parseCaseFoldingLines - outdir - [ caseFolding ] - - where - - blocks = - ( "Unicode.Internal.Char.Blocks" - , genBlocksModule) - - propList = - ("Unicode.Internal.Char.PropList" - , (`genCorePropertiesModule` (`elem` props))) - - derivedCoreProperties = - ("Unicode.Internal.Char.DerivedCoreProperties" - , (`genCorePropertiesModule` (`elem` props))) - - compositions exc non0 = - ( "Unicode.Internal.Char.UnicodeData.Compositions" - , \m -> genCompositionsModule m exc non0) - - combiningClass = - ( "Unicode.Internal.Char.UnicodeData.CombiningClass" - , genCombiningClassModule) - - decomposable = - ( "Unicode.Internal.Char.UnicodeData.Decomposable" - , (`genDecomposableModule` Canonical)) - - decomposableK = - ( "Unicode.Internal.Char.UnicodeData.DecomposableK" - , (`genDecomposableModule` Kompat)) - - decompositions = - let post = [" c -> [c]"] - in ( "Unicode.Internal.Char.UnicodeData.Decompositions" - , \m -> genDecomposeDefModule m [] post Canonical (const True)) - - decompositionsK2 = - let post = [" c -> [c]"] - in ( "Unicode.Internal.Char.UnicodeData.DecompositionsK2" - , \m -> genDecomposeDefModule m [] post Kompat (>= 60000)) - - decompositionsK = - let pre = ["import qualified " <> fst decompositionsK2 <> " as DK2", ""] - post = [" c -> DK2.decompose c"] - in ( "Unicode.Internal.Char.UnicodeData.DecompositionsK" - , \m -> genDecomposeDefModule m pre post Kompat (< 60000)) - - generalCategory = - ( "Unicode.Internal.Char.UnicodeData.GeneralCategory" - , genGeneralCategoryModule) - - simpleUpperCaseMapping = - ( "Unicode.Internal.Char.UnicodeData.SimpleUpperCaseMapping" - , \m -> genSimpleCaseMappingModule m "toSimpleUpperCase" _simpleUpperCaseMapping) - - simpleLowerCaseMapping = - ( "Unicode.Internal.Char.UnicodeData.SimpleLowerCaseMapping" - , \m -> genSimpleCaseMappingModule m "toSimpleLowerCase" _simpleLowerCaseMapping) - - simpleTitleCaseMapping = - ( "Unicode.Internal.Char.UnicodeData.SimpleTitleCaseMapping" - , \m -> genSimpleCaseMappingModule m "toSimpleTitleCase" _simpleTitleCaseMapping) - - specialUpperCaseMapping sc = - ( "Unicode.Internal.Char.SpecialCasing.UpperCaseMapping" - , \m -> genSpecialCaseMappingModule m - "toSpecialUpperCase" - sc - _scUpper - _simpleUpperCaseMapping ) - - specialLowerCaseMapping sc = - ( "Unicode.Internal.Char.SpecialCasing.LowerCaseMapping" - , \m -> genSpecialCaseMappingModule m - "toSpecialLowerCase" - sc - _scLower - _simpleLowerCaseMapping ) - - specialTitleCaseMapping sc = - ( "Unicode.Internal.Char.SpecialCasing.TitleCaseMapping" - , \m -> genSpecialCaseMappingModule m - "toSpecialTitleCase" - sc - _scTitle - _simpleTitleCaseMapping ) - - caseFolding = - ( "Unicode.Internal.Char.CaseFolding" - , genCaseFolding ) - - derivedNumericValues = - ( "Unicode.Internal.Char.DerivedNumericValues" - , genNumericValuesModule ) - -genNamesModules :: String -> String -> IO () -genNamesModules indir outdir = do - runGenerator - indir - "extracted/DerivedName.txt" - parseDerivedNameLines - outdir - [names] - - runGenerator - indir - "NameAliases.txt" - parseAliasesLines - outdir - [aliases] - - where - - names = - ( "Unicode.Internal.Char.UnicodeData.DerivedName" - , genNamesModule ) - aliases = - ( "Unicode.Internal.Char.UnicodeData.NameAliases" - , genAliasesModule ) - -genScriptsModules :: String -> String -> IO () -genScriptsModules indir outdir = do - scriptAliases <- - readLinesFromFile (indir "PropertyValueAliases.txt") - & parsePropertyValueAliasesLines - & Stream.lookup "sc" - & fmap (fromMaybe mempty) - - extensions <- - readLinesFromFile (indir "ScriptExtensions.txt") - & parseScriptExtensionsLines - & Stream.foldr - (\(CharScriptExtensions c s) -> Map.insertWith (<>) c s) - mempty - - runGenerator - indir - "Scripts.txt" - parseScriptLines - outdir - [ scripts scriptAliases - , scriptExtensions scriptAliases extensions ] - - where - - scripts scriptAliases = - ( "Unicode.Internal.Char.Scripts" - , \m -> genScriptsModule m scriptAliases ) - - scriptExtensions scriptAliases extensions = - ( "Unicode.Internal.Char.ScriptExtensions" - , \m -> genScriptExtensionsModule m scriptAliases extensions ) - -genSecurityModules :: String -> String -> IO () -genSecurityModules indir outdir = do - runGenerator - indir - "IdentifierStatus.txt" - parseIdentifierStatusLines - outdir - [isAllowedInIdentifier] - - runGenerator - indir - "IdentifierType.txt" - parseIdentifierTypeLines - outdir - [identifierTypes] - - runGenerator - indir - "confusables.txt" - parseConfusablesLines - outdir - [confusables] - - runGenerator - indir - "intentional.txt" - parseIntentionalConfusablesLines - outdir - [intentional] - - where - - isAllowedInIdentifier = - ( "Unicode.Internal.Char.Security.IdentifierStatus" - , genIdentifierStatusModule ) - - identifierTypes = - ( "Unicode.Internal.Char.Security.IdentifierType" - , genIdentifierTypeModule ) - - confusables = - ( "Unicode.Internal.Char.Security.Confusables" - , genConfusablesModule ) - - intentional = - ( "Unicode.Internal.Char.Security.IntentionalConfusables" - , genIntentionalConfusablesModule ) diff --git a/ucd2haskell/exe/UCD2Haskell.hs b/ucd2haskell/exe/UCD2Haskell.hs index e931afaf..b7ca9a21 100644 --- a/ucd2haskell/exe/UCD2Haskell.hs +++ b/ucd2haskell/exe/UCD2Haskell.hs @@ -8,10 +8,14 @@ module Main where import GHC.Generics (Generic) -import Parser.Text (genCoreModules, genNamesModules, genScriptsModules, genSecurityModules) import System.FilePath (()) import WithCli (HasArguments(..), withCli) +import qualified UCD2Haskell.Generator.Core as Core +import qualified UCD2Haskell.Generator.Names as Names +import qualified UCD2Haskell.Generator.Scripts as Scripts +import qualified UCD2Haskell.Generator.Security as Security + data CLIOptions = CLIOptions { input :: FilePath @@ -29,10 +33,10 @@ data CLIOptions = cliClient :: CLIOptions -> IO () cliClient opts - = genCoreModules (input opts "ucd") (output_core opts) (core_prop opts) - *> genNamesModules (input opts "ucd") (output_names opts) - *> genScriptsModules (input opts "ucd") (output_scripts opts) - *> genSecurityModules (input opts "security") (output_security opts) + = Core.generateModules (input opts "ucd") (output_core opts) (core_prop opts) + *> Names.generateModules (input opts "ucd") (output_names opts) + *> Scripts.generateModules (input opts "ucd") (output_scripts opts) + *> Security.generateModules (input opts "security") (output_security opts) main :: IO () main = withCli cliClient diff --git a/ucd2haskell/exe/UCD2Haskell/Common.hs b/ucd2haskell/exe/UCD2Haskell/Common.hs new file mode 100644 index 00000000..36017f3c --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Common.hs @@ -0,0 +1,195 @@ +{-# LANGUAGE ExistentialQuantification #-} +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +-- +module UCD2Haskell.Common + ( -- * Fold + Fold(..) + , distribute + , filterFold + , rmapFold + , runFold + + -- * Formatting + , showB + , showPaddedHex + , showPaddedHexB + , showPaddedHeX + , showPaddedHeXB + , showHexCodepoint + , showHexCodepointB + , showHexCodepointBS + + -- * Hangul + , jamoLCount + , jamoVCount + , jamoTCount + , hangulFirst + , hangulLast + , isHangul + , isHangulRange + , filterNonHangul + + -- * Miscellaneous + , allRange + , mkHaskellConstructor + ) where + +import Data.Foldable (Foldable(..)) +import Numeric (showHex) +import Data.Char (toUpper, ord, isAlphaNum) +import qualified Data.ByteString.Builder as BB +import qualified Unicode.CharacterDatabase.Parser.UnicodeData as UD +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Data.ByteString.Short as BS +import qualified Data.ByteString.Char8 as B8 + +-------------------------------------------------------------------------------- +-- Fold that mimimc Streamly’s one +-------------------------------------------------------------------------------- + +data Fold a b = forall s. Fold + { _step :: s -> a -> s + , _initial :: s + , _final :: s -> b } + +data Pair a b = Pair !a !b + +teeWith :: (a -> b -> c) -> Fold x a -> Fold x b -> Fold x c +teeWith f (Fold stepL initialL finalL) (Fold stepR initialR finalR) = + Fold step initial final + where + step (Pair sL sR) x = Pair (stepL sL x) (stepR sR x) + initial = Pair initialL initialR + final (Pair sL sR) = f (finalL sL) (finalR sR) + +distribute :: [Fold a b] -> Fold a [b] +distribute = foldr (teeWith (:)) (Fold const () (const [])) + +{-# INLINE filterFold #-} +filterFold :: (a -> Bool) -> Fold a b -> Fold a b +filterFold p (Fold step initial done) = Fold step' initial done + where + step' s a = if p a then step s a else s + +{-# INLINE rmapFold #-} +rmapFold :: (b -> c) -> Fold a b -> Fold a c +rmapFold f (Fold step initial final) = Fold step initial (f . final) + +runFold :: Fold a b -> [a] -> b +runFold (Fold step initial final) = final . foldl' step initial + +-------------------------------------------------------------------------------- +-- Formatting +-------------------------------------------------------------------------------- + +-- | /Warning:/ the use of 'BB.string7' make it unsafe if applied to non-ASCII. +showB :: (Show a) => a -> BB.Builder +showB = BB.string7 . show + +showPaddedHex :: Int -> String +showPaddedHex cp = + let hex = showHex cp mempty + padding = 4 - length hex + in replicate padding '0' <> hex + +showPaddedHexB :: Int -> BB.Builder +showPaddedHexB = BB.string7 . showPaddedHex + +showPaddedHeX :: Int -> String +showPaddedHeX = fmap toUpper . showPaddedHex + +showPaddedHeXB :: Int -> BB.Builder +showPaddedHeXB = BB.string7 . showPaddedHeX + +showHexCodepoint :: Char -> String +showHexCodepoint = showPaddedHeX . ord + +showHexCodepointB :: Char -> BB.Builder +showHexCodepointB = BB.string7 . showHexCodepoint + +showHexCodepointBS :: Char -> BS.ShortByteString +showHexCodepointBS = BS.toShort . B8.pack . showPaddedHeX . ord + +-------------------------------------------------------------------------------- +-- Hangul +-------------------------------------------------------------------------------- + +-- This bit of code is duplicated but this duplication allows us to reduce 2 +-- dependencies on the executable. + +jamoLCount :: Int +jamoLCount = 19 + +jamoVCount :: Int +jamoVCount = 21 + +jamoTCount :: Int +jamoTCount = 28 + +hangulFirst :: Int +hangulFirst = 0xac00 + +hangulLast :: Int +hangulLast = hangulFirst + jamoLCount * jamoVCount * jamoTCount - 1 + +isHangul :: Char -> Bool +isHangul c = n >= hangulFirst && n <= hangulLast + where n = ord c + +isHangulRange :: U.CodePointRange -> Bool +isHangulRange = \case + U.SingleChar c -> isHangul c + U.CharRange start end -> + ord start >= hangulFirst && ord end <= hangulLast + +filterNonHangul :: Fold UD.Entry a -> Fold UD.Entry a +filterNonHangul = filterFold (not . isHangulRange . UD.range) + +-------------------------------------------------------------------------------- +-- Miscellaneous +-------------------------------------------------------------------------------- + +allRange :: (Char -> Bool) -> U.CodePointRange -> Bool +allRange predicate = \case + U.SingleChar c -> predicate c + U.CharRange start end -> all predicate [start..end] + +-- -- Make a valid Haskell constructor (in CamelCase) from an identifier. +-- mkHaskellConstructor :: String -> String +-- mkHaskellConstructor = reverse . fst . foldl' convert (mempty, True) +-- where + +-- convert (acc, newWord) = \case +-- -- Skip the following and start a new word +-- ' ' -> (acc, True) +-- '-' -> (acc, True) +-- '_' -> (acc, True) +-- -- Letter or number +-- c -> if isAscii c && isAlphaNum c +-- then ( if newWord then toUpper c : acc else c : acc +-- , False) +-- else error ("Unsupported character: " <> show c) + +-- Make a valid Haskell constructor (in CamelCase) from an identifier. +mkHaskellConstructor :: BS.ShortByteString -> BB.Builder +mkHaskellConstructor = fst . BS.foldl' convert (mempty, True) + where + + convert (acc, newWord) = \case + -- Skip the following and start a new word + 0x20 -> (acc, True) -- Space + 0x2d -> (acc, True) -- Hyphen + 0x5f -> (acc, True) -- Underscore + -- Letter or number + c -> if isAlphaNum (word82Char c) + then ( acc <> BB.word8 if newWord then toUpper' c else c + , False ) + else error ("Unsupported character: " <> show (word82Char c)) + word82Char = toEnum . fromIntegral + char2Word8 = fromIntegral . fromEnum + toUpper' = char2Word8 . toUpper . word82Char diff --git a/ucd2haskell/exe/UCD2Haskell/Generator.hs b/ucd2haskell/exe/UCD2Haskell/Generator.hs new file mode 100644 index 00000000..c83d20d1 --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Generator.hs @@ -0,0 +1,392 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Generator + ( -- * Recipe + FileRecipe(..) + -- * Generator + , runGenerator + -- * Bitmap + , genBitmap + , genEnumBitmap + , bitMapToAddrLiteral + , enumMapToAddrLiteral + , chunkAddrLiteral + , word32ToWord8s + , splitPlanes + -- * Helpers + , unlinesBB + , unwordsBB + , apacheLicense + ) where +import qualified Data.ByteString.Builder as BB +import UCD2Haskell.Common (Fold, showPaddedHeXB, showB, distribute, runFold, rmapFold) +import qualified Data.List as L +import Data.Bits (Bits(..)) +import Data.Word (Word8, Word32) +import GHC.Stack (HasCallStack) +import qualified Data.ByteString as B +import Data.Functor ((<&>)) +import Control.Exception (IOException, catch) +import System.Directory (createDirectoryIfMissing) +import System.Environment (getEnv) +import qualified Data.ByteString.Lazy as BL +import System.FilePath ((<.>), ()) + +-------------------------------------------------------------------------------- +-- Recipe +-------------------------------------------------------------------------------- + +data FileRecipe a + = ModuleRecipe + -- ^ A recipe to create a Haskell module file. + { moduleName :: String + -- ^ Module name + , generateModule :: BB.Builder -> Fold a BB.Builder } + -- ^ Function that generate the module, given the module name. + -- May be useful someday + -- TestOutputRecipe + -- -- ^ A recipe to create a test output file. + -- String + -- -- ^ Test name + -- (Fold a BB.Builder) + -- -- ^ Test output generator + +-- ModuleRecipe is a tuple of the module name and a function that generates the +-- module using the module name +type ModuleRecipe a = (String, BB.Builder -> Fold a BB.Builder) + +type GeneratorRecipe a = [FileRecipe a] + +-------------------------------------------------------------------------------- +-- Generator +-------------------------------------------------------------------------------- + +moduleToFileName :: String -> String +moduleToFileName = map (\x -> if x == '.' then '/' else x) + +dirFromFileName :: String -> String +dirFromFileName = reverse . dropWhile (/= '/') . reverse + +moduleFileEmitter :: FilePath -> FilePath -> ModuleRecipe a -> Fold a (IO ()) +moduleFileEmitter file outdir (modName, fldGen) = + rmapFold action $ fldGen (BB.string7 modName) + + where + + pretext version = mconcat + [ "-- autogenerated from https://www.unicode.org/Public/" + , BB.string7 version + , "/ucd/" + , BB.string7 file + ,"\n" + ] + outfile = outdir moduleToFileName modName <.> "hs" + outfiledir = dirFromFileName outfile + action c = do + version <- + catch + (getEnv "UNICODE_VERSION") + (\(_ :: IOException) -> return "") + createDirectoryIfMissing True outfiledir + B.writeFile outfile (BL.toStrict (BB.toLazyByteString (pretext version <> c))) + +runGenerator :: + FilePath + -> FilePath + -> (B.ByteString -> [a]) + -> FilePath + -> GeneratorRecipe a + -> IO () +runGenerator indir file transformLines outdir recipes = do + raw <- B.readFile (indir file) + sequence_ (runFold combinedFld (transformLines raw)) + + where + + generatedFolds = recipes <&> \case + ModuleRecipe name f -> moduleFileEmitter file outdir (name, f) + combinedFld = distribute generatedFolds + +-------------------------------------------------------------------------------- +-- Header +-------------------------------------------------------------------------------- + +apacheLicense + :: Word -- ^ Copyright year + -> BB.Builder -- ^ Module name + -> BB.Builder +apacheLicense year modName = + unlinesBB + [ "-- |" + , "-- Module : " <> modName + , "-- Copyright : (c) " + <> BB.wordDec year + <> " Composewell Technologies and Contributors" + , "-- License : Apache-2.0" + , "-- Maintainer : streamly@composewell.com" + , "-- Stability : experimental" + ] + +-------------------------------------------------------------------------------- +-- Bitmaps +-------------------------------------------------------------------------------- + +genBitmap :: HasCallStack => BB.Builder -> [Int] -> BB.Builder +genBitmap funcName ordList = mconcat + [ "{-# INLINE " <> funcName <> " #-}\n" + , funcName, " :: Char -> Bool\n" + , funcName, func + , " !(Ptr bitmap#) = ", bitmapLookup, "\n\n" + , bitmapLookup, " :: Ptr Word8\n" + , bitmapLookup, " = Ptr\n" + , " \"", bitMapToAddrLiteral bitmap "\"#\n" ] + where + rawBitmap = positionsToBitMap ordList + bitmapLookup = funcName <> "Bitmap" + (func, bitmap) = if length rawBitmap <= 0x40000 + -- Only planes 0-3 + then + ( mconcat + [ " = \\c -> let cp = ord c in cp >= 0x" + , showPaddedHeXB (minimum ordList) + , " && cp <= 0x" + , showPaddedHeXB (maximum ordList) + , " && lookupBit64 bitmap# cp\n" + , " where\n" ] + , rawBitmap ) + -- Planes 0-3 and 14 + else + let (planes0To3, plane14) = splitPlanes "genBitmap: cannot build" not rawBitmap + bound0 = pred (minimum ordList) + bound1 = length planes0To3 + bound2 = 0xE0000 + length plane14 + in ( mconcat + [ " c\n" + , if bound0 > 0 + then mconcat + [ " | cp < 0x" + , showPaddedHeXB bound0 + , " = False\n" ] + else "" + , " | cp < 0x", showPaddedHeXB bound1 + , " = lookupBit64 bitmap# cp\n" + , " | cp < 0xE0000 = False\n" + , " | cp < 0x", showPaddedHeXB bound2 + , " = lookupBit64 bitmap# (cp - 0x" + , showPaddedHeXB (0xE0000 - bound1) + , ")\n" + , " | otherwise = False\n" + , " where\n" + , " cp = ord c\n" ] + , planes0To3 <> plane14 ) + +positionsToBitMap :: [Int] -> [Bool] +positionsToBitMap = go 0 + + where + + go _ [] = [] + go i xxs@(x:xs) + | i < x = False : go (i + 1) xxs + | otherwise = True : go (i + 1) xs + +bitMapToAddrLiteral :: + -- | Values to encode + [Bool] -> + -- | String to append + BB.Builder -> + BB.Builder +bitMapToAddrLiteral bs = chunkAddrLiteral 4 0xff encode (L.unfoldr mkChunks bs) + + where + + mkChunks :: [a] -> Maybe ([a], [a]) + mkChunks [] = Nothing + mkChunks xs = Just $ splitAt 8 xs + + encode :: [Bool] -> BB.Builder -> BB.Builder + encode chunk acc = BB.char7 '\\' <> BB.intDec (toByte (padTo8 chunk)) <> acc + + padTo8 :: [Bool] -> [Bool] + padTo8 xs + | length xs >= 8 = xs + | otherwise = xs <> replicate (8 - length xs) False + + toByte :: [Bool] -> Int + toByte xs = sum $ map (\i -> if xs !! i then 1 `shiftL` i else 0) [0..7] + +splitPlanes :: (HasCallStack) => String -> (a -> Bool) -> [a] -> ([a], [a]) +splitPlanes msg isDef xs = if all isDef planes4To13 && null planes15To16 + then (planes0To3, plane14) + else error msg + where + planes0To3 = L.dropWhileEnd isDef (take 0x40000 xs) + planes4To16 = drop 0x40000 xs + planes4To13 = take (0xE0000 - 0x40000) planes4To16 + planes14To16 = drop (0xE0000 - 0x40000) planes4To16 + plane14 = L.dropWhileEnd isDef (take 0x10000 planes14To16) + planes15To16 = drop 0x10000 planes14To16 + +genEnumBitmap + :: forall a. (HasCallStack, Bounded a, Enum a, Eq a, Show a) + => BB.Builder + -- ^ Function name + -> (a, BB.Builder) + -- ^ Value for planes 15-16 + -> (a, BB.Builder) + -- ^ Default value + -> [a] + -- ^ List of values to encode for planes 0 to 3 + -> [a] + -- ^ List of values to encode for plane 14 + -> BB.Builder +genEnumBitmap funcName (defPUA, pPUA) (def, pDef) planes0To3 plane14 = + mconcat + [ "{-# INLINE ", funcName, " #-}\n" + , funcName, " :: Char -> Int\n" + , funcName, func + , " !(Ptr bitmap#) = ", bitmapLookup, "\n\n" + , bitmapLookup, " :: Ptr Word8\n" + , bitmapLookup, " = Ptr\n" + , " \"", enumMapToAddrLiteral 4 0xff bitmap "\"#" + ] + where + bitmapLookup = funcName <> "Bitmap" + planes0To3' = L.dropWhileEnd (== def) planes0To3 + check = if length planes0To3 <= 0x40000 + then () + else error "genEnumBitmap: Cannot build" + (func, bitmap) = check `seq` if null plane14 && defPUA == def + -- Only planes 0-3 + then + ( mconcat + [ " = \\c -> let cp = ord c in if cp >= 0x" + , showPaddedHeXB (length planes0To3') + , " then " + , pDef + , " else lookupIntN bitmap# cp\n" + , " where\n" ] + , planes0To3' ) + -- All the planes + else + let plane14' = L.dropWhileEnd (== def) plane14 + bound1 = length planes0To3' + bound2 = 0xE0000 + length plane14' + in ( mconcat + [ " c\n" + , " -- Planes 0-3\n" + , " | cp < 0x", showPaddedHeXB bound1 + , " = lookupIntN bitmap# cp\n" + , " -- Planes 4-13: ", showB def, "\n" + , " | cp < 0xE0000 = " <> pDef, "\n" + , " -- Plane 14\n" + , " | cp < 0x", showPaddedHeXB bound2 + , " = lookupIntN bitmap# (cp - 0x" + , showPaddedHeXB (0xE0000 - bound1) + , ")\n" + , if defPUA == def + then "" + else mconcat + [ " -- Plane 14: ", showB def, "\n" + , " | cp < 0xF0000 = ", pDef, "\n" + , " -- Plane 15: ", showB defPUA, "\n" + , " | cp < 0xFFFFE = ", pPUA, "\n" + , " -- Plane 15: ", showB def, "\n" + , " | cp < 0x100000 = ", pDef, "\n" + , " -- Plane 16: ", showB defPUA, "\n" + , " | cp < 0x10FFFE = ", pPUA, "\n" ] + , " -- Default: ", showB def, "\n" + , " | otherwise = " <> pDef, "\n" + , " where\n" + , " cp = ord c\n" ] + , planes0To3' <> plane14' ) + +{-| Encode a list of values as a byte map, using their 'Enum' instance. + +__Note:__ 'Enum' instance must respect the following: + +* @fromEnum minBound >= 0x00@ +* @fromEnum maxBound <= 0xff@ +-} +enumMapToAddrLiteral + :: forall a. (Bounded a, Enum a, Show a) + => Word8 + -- ^ Indentation + -> Int + -- ^ Chunk size + -> [a] + -- ^ Values to encode + -> BB.Builder + -- ^ String to append + -> BB.Builder +enumMapToAddrLiteral indentation chunkSize = + chunkAddrLiteral indentation chunkSize addWord + + where + + addWord :: a -> BB.Builder -> BB.Builder + addWord x acc = BB.char7 '\\' <> BB.word8Dec (toWord8 x) <> acc + + toWord8 :: a -> Word8 + toWord8 a = let w = fromEnum a in if 0 <= w && w <= 0xff + then fromIntegral w + else error $ "Cannot convert to Word8: " <> show a + +chunkAddrLiteral + :: forall a. Word8 + -- ^ Indentation + -> Int + -- ^ Chunk size + -> (a -> BB.Builder -> BB.Builder) + -- ^ Function to convert to 'Word8' and prepend to the accumulator + -> [a] + -- ^ Values to encode + -> BB.Builder + -- ^ String to append + -> BB.Builder +chunkAddrLiteral indentation chunkSize addWord xs cs + = fst + . foldr go (cs, NoIndent) + $ chunksOf chunkSize xs + + where + + indent = indent' indentation . (BB.char7 '\\' <>) + indent' = \case + 0 -> (BB.shortByteString "\\\n" <>) + i -> indent' (pred i) . (BB.char7 ' ' <>) + + go :: [a] -> (BB.Builder, Indent) -> (BB.Builder, Indent) + go as (acc, seps) = (foldr addWord (f acc) as, Indent) + where + f = case seps of + NoIndent -> id + Indent -> indent + +data Indent = NoIndent | Indent + +chunksOf :: Int -> [a] -> [[a]] +chunksOf i = go + where + go = \case + [] -> [] + as -> b : go as' + where (b, as') = splitAt i as + +-- Encode Word32 to [Word8] little endian +word32ToWord8s :: Word32 -> [Word8] +word32ToWord8s n = (\k -> fromIntegral ((n `shiftR` k) .&. 0xff)) <$> [0,8..24] + +-------------------------------------------------------------------------------- +-- Helpers +-------------------------------------------------------------------------------- + +unlinesBB :: [BB.Builder] -> BB.Builder +unlinesBB = (<> "\n") . mconcat . L.intersperse "\n" + +unwordsBB :: [BB.Builder] -> BB.Builder +unwordsBB = mconcat . L.intersperse " " diff --git a/ucd2haskell/exe/UCD2Haskell/Generator/Core.hs b/ucd2haskell/exe/UCD2Haskell/Generator/Core.hs new file mode 100644 index 00000000..169da32f --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Generator/Core.hs @@ -0,0 +1,101 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Generator.Core + ( generateModules + ) where + +import qualified Data.ByteString as B +import qualified Data.Set as Set +import Data.String (IsString(..)) +import System.FilePath (()) +import qualified Unicode.CharacterDatabase.Parser.CaseFolding as CF +import qualified Unicode.CharacterDatabase.Parser.Extracted.DerivedNumericValues as N +import qualified Unicode.CharacterDatabase.Parser.Properties.Multiple as Props +import qualified Unicode.CharacterDatabase.Parser.Properties.Single as Prop +import qualified Unicode.CharacterDatabase.Parser.UnicodeData as UD + +import qualified UCD2Haskell.Modules.Blocks as Blocks +import qualified UCD2Haskell.Modules.CaseFoldings as CaseFoldings +import qualified UCD2Haskell.Modules.DerivedNumericValues as DerivedNumericValues +import qualified UCD2Haskell.Modules.Properties as Properties +import qualified UCD2Haskell.Modules.SpecialCasings as SpecialCasings +import qualified UCD2Haskell.Modules.UnicodeData.CombiningClass as CombiningClass +import qualified UCD2Haskell.Modules.UnicodeData.Composition as Composition +import qualified UCD2Haskell.Modules.UnicodeData.Decomposition as Decomposition +import qualified UCD2Haskell.Modules.UnicodeData.GeneralCategory as GeneralCategory +import qualified UCD2Haskell.Modules.UnicodeData.SimpleCaseMappings as SimpleCaseMappings +import UCD2Haskell.Generator (runGenerator) + +generateModules :: FilePath -> FilePath -> [String] -> IO () +generateModules indir outdir props = do + + fullCompositionExclusion <- Composition.parseFullCompositionExclusion + <$> B.readFile (indir "DerivedNormalizationProps.txt") + + combiningChars <- CombiningClass.parseCombining + <$> B.readFile (indir "extracted" "DerivedCombiningClass.txt") + + specialCasings <- SpecialCasings.parse + <$> B.readFile (indir "SpecialCasing.txt") + + runGenerator + indir + "Blocks.txt" + Prop.parse + outdir + [ Blocks.recipe ] + + runGenerator + indir + "UnicodeData.txt" + UD.parse + outdir + [ Composition.recipe fullCompositionExclusion combiningChars + , CombiningClass.recipe + , Decomposition.decomposable + , Decomposition.decomposableK + , Decomposition.decompositions + , Decomposition.decompositionsK2 + , Decomposition.decompositionsK + , GeneralCategory.recipe + , SimpleCaseMappings.upperRecipe + , SimpleCaseMappings.lowerRecipe + , SimpleCaseMappings.titleRecipe + , SpecialCasings.upperRecipe specialCasings + , SpecialCasings.lowerRecipe specialCasings + , SpecialCasings.titleRecipe specialCasings + ] + + let propsSet = Set.fromList (fromString <$> props) + + runGenerator + indir + "PropList.txt" + Props.parse + outdir + [ Properties.propList propsSet ] + + runGenerator + indir + "DerivedCoreProperties.txt" + Props.parse + outdir + [ Properties.derivedCoreProperties propsSet ] + + runGenerator + indir + "extracted/DerivedNumericValues.txt" + N.parse + outdir + [ DerivedNumericValues.recipe ] + + runGenerator + indir + "CaseFolding.txt" + CF.parse + outdir + [ CaseFoldings.recipe ] diff --git a/ucd2haskell/exe/UCD2Haskell/Generator/Names.hs b/ucd2haskell/exe/UCD2Haskell/Generator/Names.hs new file mode 100644 index 00000000..bc50819b --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Generator/Names.hs @@ -0,0 +1,33 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Generator.Names + ( generateModules + ) where + +import System.FilePath (()) +import qualified Unicode.CharacterDatabase.Parser.Extracted.DerivedName as N +import qualified Unicode.CharacterDatabase.Parser.NameAliases as NA + +import qualified UCD2Haskell.Modules.UnicodeData.DerivedNames as Names +import qualified UCD2Haskell.Modules.UnicodeData.NameAliases as NameAliases +import UCD2Haskell.Generator (runGenerator) + +generateModules :: FilePath -> FilePath -> IO () +generateModules indir outdir = do + runGenerator + indir + ("extracted" "DerivedName.txt") + N.parse + outdir + [ Names.recipe ] + + runGenerator + indir + "NameAliases.txt" + NA.parse + outdir + [ NameAliases.recipe ] diff --git a/ucd2haskell/exe/UCD2Haskell/Generator/Scripts.hs b/ucd2haskell/exe/UCD2Haskell/Generator/Scripts.hs new file mode 100644 index 00000000..156d1ac4 --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Generator/Scripts.hs @@ -0,0 +1,33 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Generator.Scripts + ( generateModules + ) where + +import qualified Data.ByteString as B +import System.FilePath (()) +import qualified Unicode.CharacterDatabase.Parser.Properties.Single as Prop + +import qualified UCD2Haskell.Modules.Scripts as Scripts +import qualified UCD2Haskell.Modules.ScriptsExtensions as ScriptsExtensions +import UCD2Haskell.Generator (runGenerator) + +generateModules :: FilePath -> FilePath -> IO () +generateModules indir outdir = do + scriptAliases <- Scripts.parseScriptAliases + <$> B.readFile (indir "PropertyValueAliases.txt") + + extensions <- ScriptsExtensions.parseScriptExtensions + <$> B.readFile (indir "ScriptExtensions.txt") + + runGenerator + indir + "Scripts.txt" + Prop.parse + outdir + [ Scripts.recipe scriptAliases + , ScriptsExtensions.recipe scriptAliases extensions ] diff --git a/ucd2haskell/exe/UCD2Haskell/Generator/Security.hs b/ucd2haskell/exe/UCD2Haskell/Generator/Security.hs new file mode 100644 index 00000000..ec30e353 --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Generator/Security.hs @@ -0,0 +1,47 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Generator.Security + ( generateModules + ) where + +import qualified Unicode.CharacterDatabase.Parser.Properties.Single as Prop + +import qualified UCD2Haskell.Modules.Security.Confusables as Confusables +import qualified UCD2Haskell.Modules.Security.IdentifierStatus as IdentifierStatus +import qualified UCD2Haskell.Modules.Security.IdentifierType as IdentifierType +import qualified UCD2Haskell.Modules.Security.IntentionalConfusables as IntentionalConfusables +import UCD2Haskell.Generator (runGenerator) + +generateModules :: FilePath -> FilePath -> IO () +generateModules indir outdir = do + runGenerator + indir + "IdentifierStatus.txt" + Prop.parse + outdir + [IdentifierStatus.recipe] + + runGenerator + indir + "IdentifierType.txt" + Prop.parse + outdir + [IdentifierType.recipe] + + runGenerator + indir + "confusables.txt" + Prop.parseMultipleValues + outdir + [Confusables.recipe] + + runGenerator + indir + "intentional.txt" + Prop.parse + outdir + [IntentionalConfusables.recipe] diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/Blocks.hs b/ucd2haskell/exe/UCD2Haskell/Modules/Blocks.hs new file mode 100644 index 00000000..f465aa6b --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/Blocks.hs @@ -0,0 +1,161 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.Blocks (recipe) where + +import qualified Data.ByteString.Builder as BB +import Data.Char (ord) +import Data.Bits (Bits(..)) +import qualified Data.List as L +import Data.Word (Word8, Word32) +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.Properties.Single as Prop + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense, word32ToWord8s, enumMapToAddrLiteral) +import UCD2Haskell.Common (Fold (..), showPaddedHexB, showPaddedHeXB, mkHaskellConstructor) + +recipe :: FileRecipe Prop.Entry +recipe = ModuleRecipe + "Unicode.Internal.Char.Blocks" + genBlocksModule + +data Acc = Acc + { blocks :: ![BB.Builder] + , defs :: ![BB.Builder] + , ranges :: ![(Int, Int)] } + +genBlocksModule :: BB.Builder -> Fold Prop.Entry BB.Builder +genBlocksModule moduleName = Fold step initial done + where + + done Acc{..} = let ranges' = reverse ranges in unlinesBB + [ apacheLicense 2022 moduleName + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(Block(..), BlockDefinition(..), block, blockDefinition)" + , "where" + , "" + , "import Data.Ix (Ix)" + , "import Data.Word (Word32)" + , "import GHC.Exts" + , "import Unicode.Internal.Bits (lookupWord32#)" + , "" + , "-- | Unicode [block](https://www.unicode.org/glossary/#block)." + , "--" + , "-- There is a total of " <> BB.intDec (length blocks) <> " blocks." + , "--" + , "-- @since 0.3.1" + , "data Block" + , " = " <> mconcat (L.intersperse "\n | " (reverse blocks)) + , " deriving (Enum, Bounded, Eq, Ord, Ix, Show)" + , "" + , "-- | Block definition: range and name." + , "--" + , "-- @since 0.3.1" + , "data BlockDefinition = BlockDefinition" + , " { blockRange :: !(Int, Int) -- ^ Range" + , " , blockName :: !String -- ^ Name" + , " } deriving (Eq, Ord, Show)" + , "" + , "-- | Block definition" + , "--" + , "-- @since 0.3.1" + , "blockDefinition :: Block -> BlockDefinition" + , "blockDefinition b = case b of" + , mconcat (reverse defs) + , "-- | Character block, if defined." + , "--" + , "-- @since 0.3.1" + , "block :: Char -> Maybe Int" + , "block (C# c#) = getBlock 0# " <> BB.intDec (length ranges - 1) <> BB.char7 '#' + , " where" + , " -- [NOTE] Encoding" + , " -- A range is encoded as two LE Word32:" + , " -- • First one is the lower bound, where the higher 11 bits are the block" + , " -- index and the lower 21 bits are the codepoint." + , " -- • Second one is the upper bound, which correspond to the codepoint." + , "" + , " cp# = int2Word# (ord# c#)" + , "" + , " -- Binary search" + , " getBlock l# u# = if isTrue# (l# ># u#)" + , " then Nothing" + , " else" + , " let k# = l# +# uncheckedIShiftRL# (u# -# l#) 1#" + , " j# = k# `uncheckedIShiftL#` 1#" + , " cpL0# = getRawCodePoint# j#" + , " cpL# = cpL0# `and#` 0x1fffff## -- Mask for codepoint: [0..0x10fff]" + , " cpU# = getRawCodePoint# (j# +# 1#)" + , " in if isTrue# (cpU# `ltWord#` cp#)" + , " -- cp > upper bound" + , " then getBlock (k# +# 1#) u#" + , " -- check lower bound" + , " else if isTrue# (cp# `ltWord#` cpL#)" + , " -- cp < lower bound" + , " then getBlock l# (k# -# 1#)" + , " -- cp in block: get block index" + , " else let block# = cpL0# `uncheckedShiftRL#` 21#" + , " in Just (I# (word2Int# block#))" + , "" + , " getRawCodePoint# = lookupWord32# ranges#" + , "" + , " -- Encoded ranges" + , " !(Ptr ranges#) = rangesBitmap" + , "" + , "rangesBitmap :: Ptr Word32" + , "rangesBitmap = Ptr" + , " \"" <> enumMapToAddrLiteral 4 0xff (mkRanges ranges') "\"#" + ] + + initial = Acc mempty mempty mempty + + step Acc{..} (Prop.Entry range blockName) = case range of + U.SingleChar c -> error ("genBlocksModule: expected range, got: " <> show c) + U.CharRange start end -> + let blockID = mkHaskellConstructor blockName + blockRange = (ord start, ord end) + blockName' = BB.shortByteString blockName + in Acc + { blocks = mkBlockConstructor blockID blockName' blockRange : blocks + , defs = mkBlockDef blockID blockName' blockRange : defs + , ranges = blockRange : ranges } + + mkBlockConstructor blockID blockName (l, u) = mconcat + [ blockID + , " -- ^ @U+" + , showPaddedHeXB l + , "..U+" + , showPaddedHeXB u + , "@: " + , blockName + , "." + ] + + mkBlockDef blockID blockName (l, u) = mconcat + [ " " + , blockID + , " -> BlockDefinition (0x" + , showPaddedHexB l + , ", 0x" + , showPaddedHexB u + , ") \"" + , blockName + , "\"\n" + ] + + -- [NOTE] Encoding: a range is encoded as two LE Word32: + -- • First one is the lower bound, where the higher 11 bits are the block + -- index and the lower 21 bits are the codepoint. + -- • Second one is upper bound, which correspond to the codepoint. + mkRanges :: [(Int, Int)] -> [Word8] + mkRanges = foldMap (uncurry mkBlockRange) . zip [0..] + + mkBlockRange :: Word32 -> (Int, Int) -> [Word8] + mkBlockRange idx (l, u) = encodeBound idx l <> encodeBound 0 u + + encodeBound :: Word32 -> Int -> [Word8] + encodeBound idx n = word32ToWord8s ((idx `shiftL` 21) .|. fromIntegral n) diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/CaseFoldings.hs b/ucd2haskell/exe/UCD2Haskell/Modules/CaseFoldings.hs new file mode 100644 index 00000000..30cec03f --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/CaseFoldings.hs @@ -0,0 +1,67 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.CaseFoldings (recipe) where + +import Control.Applicative (Alternative(..)) +import Data.Bits (Bits(..)) +import qualified Data.ByteString.Builder as BB +import Data.Char (ord) +import qualified Data.Map.Strict as Map +import qualified Unicode.CharacterDatabase.Parser.CaseFolding as C + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense) +import UCD2Haskell.Common (Fold (..), showB) + +recipe :: FileRecipe C.Entry +recipe = ModuleRecipe + "Unicode.Internal.Char.CaseFolding" + genCaseFolding + +-- [NOTE] Case folding encodes up to 3 code points on 21 bits each in an Int64. +genCaseFolding :: BB.Builder -> Fold C.Entry BB.Builder +genCaseFolding moduleName = Fold step mempty done + where + step acc C.Entry{..} = Map.alter + (Just . \case + Nothing -> Map.singleton caseFoldingType caseFolding + Just cf -> Map.insert caseFoldingType caseFolding cf) + char + acc + + done acc = unlinesBB + [ apacheLicense 2022 moduleName + , "{-# LANGUAGE LambdaCase #-}" + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(toCasefold)" + , "where" + , "" + , "import Data.Int (Int64)" + , "" + , "{-# NOINLINE toCasefold #-}" + , "toCasefold :: Char -> Int64" + , "toCasefold = \\case" <> Map.foldlWithKey' addEntry mempty acc + , " _ -> 0" + ] + + addEntry acc c cfs = + case Map.lookup C.FullCaseFolding cfs <|> Map.lookup C.CommonCaseFolding cfs of + Nothing -> acc + Just cf -> acc <> mconcat + [ "\n " + , showB c + , " -> 0x" + , BB.wordHex (fromIntegral (encode cf)) + ] + + encode :: String -> Int + encode + = foldr (\(k, c) -> (+) (ord c `shiftL` k)) 0 + . zip [0, 21, 42] + -- Check min 1 character, max 3 characters + . (\cs -> if null cs || length cs > 3 then error (show cs) else cs) diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/DerivedNumericValues.hs b/ucd2haskell/exe/UCD2Haskell/Modules/DerivedNumericValues.hs new file mode 100644 index 00000000..a5f6f516 --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/DerivedNumericValues.hs @@ -0,0 +1,59 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.DerivedNumericValues (recipe) where + +import qualified Data.ByteString.Builder as BB +import Data.Semigroup (Arg(..)) +import qualified Data.Set as Set +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.Extracted.DerivedNumericValues as N + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense) +import UCD2Haskell.Common (Fold (..), showB) + +recipe :: FileRecipe N.Entry +recipe = ModuleRecipe + "Unicode.Internal.Char.DerivedNumericValues" + genNumericValuesModule + +genNumericValuesModule :: BB.Builder -> Fold N.Entry BB.Builder +genNumericValuesModule moduleName = Fold step mempty done + where + + step acc (N.Entry range value) = case range of + U.SingleChar c -> Set.insert (Arg c value) acc + U.CharRange{..} -> acc <> Set.fromDistinctAscList + ((`Arg` value) <$> [start..end]) + + mkNumericValue char value = mconcat + [ "\n " + , showB char + , " -> " + , case value of + N.Integer i -> "Just " <> BB.integerDec i + N.Rational r -> showB (Just r) + ] + + mkEntries = foldr + (\(Arg c value) -> (<>) (mkNumericValue c value)) + mempty + + done values = unlinesBB + [ apacheLicense 2022 moduleName + , "{-# LANGUAGE LambdaCase #-}" + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(numericValue)" + , "where" + , "" + , "import Data.Ratio ((%))" + , "" + , "numericValue :: Char -> Maybe Rational" + , "numericValue = \\case" <> mkEntries values + , " _ -> Nothing" + ] diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/Properties.hs b/ucd2haskell/exe/UCD2Haskell/Modules/Properties.hs new file mode 100644 index 00000000..ce4d698c --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/Properties.hs @@ -0,0 +1,95 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.Properties + ( propList + , derivedCoreProperties) + where + +import Control.Exception (assert) +import qualified Data.ByteString.Builder as BB +import qualified Data.ByteString.Short as BS +import Data.Char (ord) +import qualified Data.List as L +import qualified Data.Map.Strict as Map +import Data.Maybe (isNothing) +import qualified Data.IntSet as IntSet +import qualified Data.Set as Set +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.Properties.Multiple as Props + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense, genBitmap, unwordsBB) +import UCD2Haskell.Common (Fold (..)) + +propList :: Set.Set BS.ShortByteString -> FileRecipe Props.Entry +propList props = ModuleRecipe + "Unicode.Internal.Char.PropList" + (`genCorePropertiesModule` (`elem` props)) + +derivedCoreProperties :: Set.Set BS.ShortByteString -> FileRecipe Props.Entry +derivedCoreProperties props = ModuleRecipe + "Unicode.Internal.Char.DerivedCoreProperties" + (`genCorePropertiesModule` (`elem` props)) + +data Acc = Acc + { properties :: ![BS.ShortByteString] + , values :: !(Map.Map BS.ShortByteString IntSet.IntSet) } + +genCorePropertiesModule :: BB.Builder -> (BS.ShortByteString -> Bool) -> Fold Props.Entry BB.Builder +genCorePropertiesModule moduleName isProp = Fold step initial done + where + + prop2FuncName = ("is" <>) . BB.shortByteString + + initial = Acc mempty mempty + + step acc@Acc{..} (Props.Entry range property pValues) + | not (isProp property) = acc -- Skip property + | otherwise = assert (isNothing pValues) case range of + U.SingleChar c -> Acc + { properties = if property `elem` properties + then properties + else property : properties + , values = addChar property (ord c) values } + U.CharRange{..} -> Acc + { properties = if property `elem` properties + then properties + else property : properties + , values = addChars + property + (IntSet.fromDistinctAscList [ord start..ord end]) + values } + + addChar property c = flip Map.alter property \case + Nothing -> Just (IntSet.singleton c) + Just cs -> Just (IntSet.insert c cs) + + addChars property xs = flip Map.alter property \case + Nothing -> Just xs + Just ys -> Just (xs <> ys) + + done Acc{..} = unlinesBB (header properties <> genBitmaps values properties) + + genBitmaps values = foldr addBitMap mempty + where + addBitMap property = (:) + (genBitmap (prop2FuncName property) + (IntSet.toAscList (values Map.! property))) + + header exports = + [ apacheLicense 2020 moduleName + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(" <> unwordsBB (L.intersperse "," (map prop2FuncName exports)) <> ")" + , "where" + , "" + , "import Data.Char (ord)" + , "import Data.Word (Word8)" + , "import GHC.Exts (Ptr(..))" + , "import Unicode.Internal.Bits (lookupBit64)" + , "" + ] diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/Scripts.hs b/ucd2haskell/exe/UCD2Haskell/Modules/Scripts.hs new file mode 100644 index 00000000..cf4f231d --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/Scripts.hs @@ -0,0 +1,227 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.Scripts (recipe, parseScriptAliases) where + +import qualified Data.ByteString as B +import qualified Data.ByteString.Builder as BB +import qualified Data.ByteString.Short as BS +import Data.Char (ord) +import Data.Bits (Bits(..)) +import Data.Foldable (Foldable(..)) +import Data.Function (on) +import qualified Data.List as L +import qualified Data.List.NonEmpty as NE +import Data.Maybe (fromMaybe) +import Data.Semigroup (Semigroup(..)) +import qualified Data.Set as Set +import qualified Data.Map.Strict as Map +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.Properties.Defaults as Defaults +import qualified Unicode.CharacterDatabase.Parser.Properties.Single as Prop +import qualified Unicode.CharacterDatabase.Parser.PropertyValueAliases as PVA + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense, word32ToWord8s, genEnumBitmap, splitPlanes) +import UCD2Haskell.Common (Fold (..), mkHaskellConstructor) + +recipe :: PropertyValuesAliases -> FileRecipe Prop.Entry +recipe aliases = ModuleRecipe + "Unicode.Internal.Char.Scripts" + (`genScriptsModule` aliases) + +type PropertyValuesAliases = Map.Map BS.ShortByteString (NE.NonEmpty BS.ShortByteString) + +genScriptsModule :: BB.Builder -> PropertyValuesAliases -> Fold Prop.Entry BB.Builder +genScriptsModule moduleName aliases = Fold step mempty done + -- done <$> Fold.foldl' addRange mempty + where + + done ranges = + let scripts = Set.toList + (foldr addScript (Set.singleton Defaults.defaultScript) ranges) + in unlinesBB + [ apacheLicense 2022 moduleName + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(Script(..), script, scriptDefinition)" + , "where" + , "" + , "import Data.Char (ord)" + , "import Data.Int (Int32)" + , "import Data.Ix (Ix)" + , "import Data.Word (Word8)" + , "import GHC.Exts (Ptr(..))" + , "import Unicode.Internal.Bits (lookupIntN)" + , "" + , "-- | Unicode [script](https://www.unicode.org/reports/tr24/)." + , "--" + , "-- The constructors descriptions are the original Unicode values" + , "-- (short and long forms)." + , "--" + , "-- There is a total of " <> BB.intDec (length scripts) <> " scripts." + , "--" + , "-- @since 0.1.0" + , "data Script" + , " = " <> mkScripts scripts + , " deriving (Enum, Bounded, Eq, Ord, Ix, Show)" + , "" + , "-- | Script definition: list of corresponding characters." + , "--" + , "-- @since 0.1.0" + , "scriptDefinition :: Script -> (Ptr Int32, Int)" + , "scriptDefinition b = case b of" + , mkScriptDefinitions ranges + , "-- | Script of a character." + , "--" + , "-- @since 0.1.0" + , if length scripts <= 0xff + then mkCharScripts scripts ranges + else error "Cannot encode scripts" + , "" + ] + + step :: [Prop.Entry] -> Prop.Entry -> [Prop.Entry] + step acc l@(Prop.Entry r script) = case acc of + Prop.Entry r' script':acc' -> if script == script' + then case combineRanges r r' of + Left r'' -> Prop.Entry r'' script : acc + Right r'' -> Prop.Entry r'' script : acc' + else l : acc + _ -> [l] + + combineRanges :: U.CodePointRange -> U.CodePointRange -> Either U.CodePointRange U.CodePointRange + combineRanges r = case r of + U.SingleChar c1 -> \case + U.SingleChar c2 -> if c1 == succ c2 + then Right (U.CharRange c2 c1) + else Left r + U.CharRange c2 c3 -> if c1 == succ c3 + then Right (U.CharRange c2 c1) + else Left r + U.CharRange c1 c2 -> \case + U.SingleChar c3 -> if c1 == succ c3 + then Right (U.CharRange c3 c2) + else Left r + U.CharRange c3 c4 -> if c1 == succ c4 + then Right (U.CharRange c3 c2) + else Left r + + addScript :: Prop.Entry -> Set.Set BS.ShortByteString -> Set.Set BS.ShortByteString + addScript (Prop.Entry _ script) = Set.insert script + + mkScripts :: [BS.ShortByteString] -> BB.Builder + mkScripts + = mconcat + . L.intersperse "\n | " + . fmap (\script -> mconcat + [ mkHaskellConstructor script + , " -- ^ " + , case Map.lookup script aliases of + Just as -> mkAliases as + Nothing -> error ("No abbreviation for script: " <> show script) + , ": @" + , BB.shortByteString script + , "@" + ]) + + mkAliases + = sconcat + . NE.intersperse ", " + . fmap (\abbr -> mconcat ["@", BB.shortByteString abbr, "@"]) + + mkScriptDefinitions :: [Prop.Entry] -> BB.Builder + mkScriptDefinitions + = foldMap mkScriptDefinition + . NE.groupBy ((==) `on` Prop.value) + . NE.fromList + . reverse + . addUnknownRanges + + addUnknownRanges :: [Prop.Entry] -> [Prop.Entry] + addUnknownRanges ls = + let addUnknown (acc, expected) (c, _) = case mkMissingRange expected c of + Just r -> (,succ c) $ case acc of + r':acc' -> either (:acc) (:acc') (combineRanges r r') + _ -> [r] + Nothing -> (acc, succ expected) + addRest (acc@(r':acc'), expected) = + let r = U.CharRange expected maxBound + in either (:acc) (:acc') (combineRanges r r') + addRest _ = error "impossible" + unknown = fmap (`Prop.Entry` Defaults.defaultScript) . addRest $ foldl' + addUnknown + (mempty, '\0') + (L.sort (foldMap (rangeToCharScripts id) ls)) + in unknown <> ls + + mkMissingRange :: Char -> Char -> Maybe U.CodePointRange + mkMissingRange expected c + | c == expected = Nothing + | c == succ expected = Just (U.SingleChar expected) + | otherwise = Just (U.CharRange expected (pred c)) + + mkScriptDefinition :: NE.NonEmpty Prop.Entry -> BB.Builder + mkScriptDefinition ranges = mconcat + [ " " + , mkHaskellConstructor (Prop.value (NE.head ranges)) + , " -> (Ptr \"" + , foldMap encodeRange ranges + , "\"#, " + , BB.wordDec + (foldr + (\(Prop.Entry r _) -> case r of + U.SingleChar{} -> (+1) + U.CharRange{} -> (+2)) + 0 + ranges) + , ")\n" + ] + + -- Encoding: + -- • A single char is encoded as an LE Int32. + -- • A range is encoded as two LE Int32 (first is lower bound, second is + -- upper bound), which correspond to the codepoints with the 32th bit set. + encodeRange :: Prop.Entry -> BB.Builder + encodeRange (Prop.Entry r _) = case r of + U.SingleChar c -> encodeBytes (fromIntegral (ord c)) + U.CharRange l u -> encodeBytes (setBit (fromIntegral (ord l)) 31) + <> encodeBytes (setBit (fromIntegral (ord u)) 31) + encodeBytes = foldr addByte "" . word32ToWord8s + addByte n acc = BB.char7 '\\' <> BB.word8Dec n <> acc + + mkCharScripts :: [BS.ShortByteString] -> [Prop.Entry] -> BB.Builder + mkCharScripts scripts scriptsRanges = + let charScripts = L.sort (foldMap (rangeToCharScripts getScript) scriptsRanges) + charScripts' = reverse (fst (foldl' addMissing (mempty, '\0') charScripts)) + addMissing (acc, expected) x@(c, script) = if expected < c + then addMissing (def:acc, succ expected) x + else (script:acc, succ c) + def = getScript Defaults.defaultScript + getScript s = fromMaybe (error "script not found") (L.elemIndex s scripts) + -- [TODO] simplify + (planes0To3, plane14) = splitPlanes "Cannot generate: genScriptsModule" (== def) charScripts' + in genEnumBitmap + "script" + (def, BB.intDec (fromEnum def)) + (def, BB.intDec (fromEnum def)) + planes0To3 + plane14 + + rangeToCharScripts :: (BS.ShortByteString -> b) -> Prop.Entry -> [(Char, b)] + rangeToCharScripts f (Prop.Entry r script) = case r of + U.SingleChar cp -> [(cp, f script)] + U.CharRange l u -> (, f script) <$> [l..u] + +-- Map: script long form → short forms +parseScriptAliases :: B.ByteString -> PropertyValuesAliases +parseScriptAliases = foldr addScript mempty . PVA.parse + where + addScript PVA.Entry{..} + | property /= "sc" = id + | otherwise = Map.insert + (PVA.longName value) + (PVA.shortName value NE.:| PVA.aliases value) diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/ScriptsExtensions.hs b/ucd2haskell/exe/UCD2Haskell/Modules/ScriptsExtensions.hs new file mode 100644 index 00000000..b3f9a9f4 --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/ScriptsExtensions.hs @@ -0,0 +1,172 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.ScriptsExtensions + ( recipe + , parseScriptExtensions + ) where + +import qualified Data.ByteString as B +import qualified Data.ByteString.Builder as BB +import qualified Data.ByteString.Short as BS +import Data.Foldable (Foldable(..)) +import Data.Function (on) +import qualified Data.List as L +import qualified Data.List.NonEmpty as NE +import Data.Maybe (fromMaybe) +import Data.Semigroup (Semigroup(..)) +import qualified Data.Set as Set +import qualified Data.Map.Strict as Map +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.Properties.Defaults as Defaults +import qualified Unicode.CharacterDatabase.Parser.Properties.Single as Prop + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense, genEnumBitmap, splitPlanes) +import UCD2Haskell.Common (Fold (..), mkHaskellConstructor) + +recipe :: PropertyValuesAliases -> ScriptExtensions -> FileRecipe Prop.Entry +recipe aliases extensions = ModuleRecipe + "Unicode.Internal.Char.ScriptExtensions" + (\m -> genScriptExtensionsModule m aliases extensions) + +type PropertyValuesAliases = Map.Map BS.ShortByteString (NE.NonEmpty BS.ShortByteString) + +genScriptExtensionsModule :: BB.Builder -> PropertyValuesAliases -> ScriptExtensions -> Fold Prop.Entry BB.Builder +genScriptExtensionsModule moduleName aliases extensions = Fold step mempty done + where + -- [NOTE] We rely on all the scripts having a short form + + -- Map: abbreviation -> script + scripts = Map.foldlWithKey' + (\acc s as -> Map.insert (NE.head as) s acc) + mempty + aliases + + -- Map: script → short form + getScriptAbbr :: BS.ShortByteString -> BS.ShortByteString + getScriptAbbr = maybe (error "script not found") NE.head . (aliases Map.!?) + + -- All possible values: extensions + scripts + extensionsSet :: Set.Set (NE.NonEmpty BS.ShortByteString) + extensionsSet = Set.fromList (Map.elems extensions) + <> Set.map pure (Map.keysSet scripts) + extensionsList = L.sortBy + (compare `on` fmap (scripts Map.!)) + (Set.toList extensionsSet) + + encodeExtensions :: NE.NonEmpty BS.ShortByteString -> Int + encodeExtensions e = fromMaybe + (error ("extension not found: " <> show e)) + (L.elemIndex e extensionsList) + + encodedExtensions :: Map.Map (NE.NonEmpty BS.ShortByteString) Int + encodedExtensions = + let l = length extensionsSet + in if length extensionsSet > 0xff + then error ("Too many script extensions: " <> show l) + else Map.fromSet encodeExtensions extensionsSet + + step + :: (Set.Set (NE.NonEmpty BS.ShortByteString), Map.Map Char Int) -- used exts, encoded char exts + -> Prop.Entry + -> (Set.Set (NE.NonEmpty BS.ShortByteString), Map.Map Char Int) + step acc (Prop.Entry range script) = case range of + U.SingleChar c -> addChar script c acc + U.CharRange c1 c2 -> foldr (addChar script) acc [c1..c2] + + addChar + :: BS.ShortByteString -- script + -> Char -- processed char + -> (Set.Set (NE.NonEmpty BS.ShortByteString), Map.Map Char Int) + -> (Set.Set (NE.NonEmpty BS.ShortByteString), Map.Map Char Int) + addChar script c (extsAcc, charAcc) = case Map.lookup c extensions of + -- Char has explicit extensions + Just exts -> ( Set.insert exts extsAcc + , Map.insert c (encodedExtensions Map.! exts) charAcc) + -- Char has no explicit extensions: use its script + Nothing -> + let exts = getScriptAbbr script NE.:| [] + in ( Set.insert exts extsAcc + , Map.insert c (encodedExtensions Map.! exts) charAcc) + + done (usedExts, exts) = unlinesBB + [ apacheLicense 2022 moduleName + , "{-# LANGUAGE OverloadedLists #-}" + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(scriptExtensions, decodeScriptExtensions)" + , "where" + , "" + , "import Data.Char (ord)" + , "import Data.List.NonEmpty (NonEmpty)" + , "import Data.Word (Word8)" + , "import GHC.Exts (Ptr(..))" + , "import Unicode.Internal.Char.Scripts (Script(..))" + , "import Unicode.Internal.Bits (lookupIntN)" + , "" + , "-- | Useful to decode the output of 'scriptExtensions'." + , "decodeScriptExtensions :: Int -> NonEmpty Script" + , "decodeScriptExtensions = \\case" <> mkDecodeScriptExtensions usedExts + , " _ -> [" <> mkHaskellConstructor Defaults.defaultScript <> "]" + , "" + , "-- | Script extensions of a character." + , "--" + , "-- @since 0.1.0" + , genEnumBitmap + "scriptExtensions" + (def, BB.intDec (fromEnum def)) + (def, BB.intDec (fromEnum def)) + planes0To3 + plane14 + ] + where + scriptExtensions = mkScriptExtensions exts + -- [TODO] simplify + (planes0To3, plane14) = splitPlanes + "Cannot generate: genScriptExtensionsModule" + (== def) + scriptExtensions + + mkDecodeScriptExtensions :: Set.Set (NE.NonEmpty BS.ShortByteString) -> BB.Builder + mkDecodeScriptExtensions + = mkDecodeScriptExtensions' + . Set.map (\exts -> (encodedExtensions Map.! exts, exts)) + mkDecodeScriptExtensions' = foldMap $ \(v, exts) -> mconcat + [ "\n " + , BB.intDec v + , " -> [" + , sconcat (NE.intersperse ", " (mkScript <$> exts)) + , "]" + ] + mkScript :: BS.ShortByteString -> BB.Builder + mkScript = mkHaskellConstructor . (scripts Map.!) + + def :: Int + def = encodedExtensions Map.! (getScriptAbbr Defaults.defaultScript NE.:| []) + + mkScriptExtensions + = reverse + . snd + . Map.foldlWithKey addCharExt ('\0', mempty) + addCharExt (expected, acc) c v = if expected < c + then addCharExt (succ expected, def : acc) c v + else (succ c, v : acc) + +type ScriptExtensions = Map.Map Char (NE.NonEmpty BS.ShortByteString) + +parseScriptExtensions :: B.ByteString -> ScriptExtensions +parseScriptExtensions = foldr addExt mempty . Prop.parse + where + addExt Prop.Entry{..} = case range of + U.SingleChar c -> Map.insertWith (<>) c values + U.CharRange{..} -> \acc -> foldl' + (\a c -> Map.insertWith (<>) c values a) + acc + [start..end] + where + values = NE.fromList (U.parseList value) + diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/Security/Confusables.hs b/ucd2haskell/exe/UCD2Haskell/Modules/Security/Confusables.hs new file mode 100644 index 00000000..dc42bede --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/Security/Confusables.hs @@ -0,0 +1,69 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.Security.Confusables (recipe) where + +import qualified Data.ByteString.Builder as BB +import qualified Data.ByteString.Lazy as BL +import qualified Data.List.NonEmpty as NE +import qualified Data.Map.Strict as Map +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.Properties.Single as Prop + +import UCD2Haskell.Common (Fold (..), showB) +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense) + +recipe :: FileRecipe Prop.EntryMultipleValues +recipe = ModuleRecipe + "Unicode.Internal.Char.Security.Confusables" + genConfusablesModule + +genConfusablesModule :: BB.Builder -> Fold Prop.EntryMultipleValues BB.Builder +genConfusablesModule moduleName = Fold step mempty done + where + + step acc Prop.EntryMultipleValues{..} = case range of + U.SingleChar c -> Map.insert + c + (U.parseCodePointList (NE.head values)) + acc + U.CharRange{} -> error ("unexpected range: " <> show (range, values)) + + mkConfusable :: Char -> String -> BB.Builder + mkConfusable c s = mconcat + [ "\n " + , showB c + , " -> Just (Ptr \"" + , stringToAddrLiteral s + , "\\0\"#)" + ] + + -- Encode string with UTF-8 + stringToAddrLiteral = foldMap toEscapedWord8 . encodeUtf8 + encodeUtf8 = BL.unpack . BB.toLazyByteString . BB.stringUtf8 + toEscapedWord8 = (BB.char7 '\\' <>) . BB.word8Dec + + done confusables = unlinesBB + [ apacheLicense 2022 moduleName + , "{-# LANGUAGE LambdaCase #-}" + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(confusablePrototype)" + , "where" + , "" + , "import Foreign.C.String (CString)" + , "import GHC.Exts (Ptr(..))" + , "" + , "-- | Returns the /prototype/ of a character, if it is confusable." + , "--" + , "-- The resulting 'CString' is null-terminated and encoded in UTF-8." + , "--" + , "-- @since 0.1.0" + , "confusablePrototype :: Char -> Maybe CString" + , "confusablePrototype = \\case" <> Map.foldMapWithKey mkConfusable confusables + , " _ -> Nothing" + ] diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/Security/IdentifierStatus.hs b/ucd2haskell/exe/UCD2Haskell/Modules/Security/IdentifierStatus.hs new file mode 100644 index 00000000..dc87f96f --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/Security/IdentifierStatus.hs @@ -0,0 +1,50 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.Security.IdentifierStatus (recipe) where + +import qualified Data.ByteString.Builder as BB +import Data.Char (ord) +import Data.Foldable (Foldable(..)) +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.Properties.Single as Prop + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense, genBitmap) +import UCD2Haskell.Common (Fold (..)) + +recipe :: FileRecipe Prop.Entry +recipe = ModuleRecipe + "Unicode.Internal.Char.Security.IdentifierStatus" + genIdentifierStatusModule + +genIdentifierStatusModule :: BB.Builder -> Fold Prop.Entry BB.Builder +genIdentifierStatusModule moduleName = Fold step mempty done + where + + step acc Prop.Entry{..} = case range of + U.SingleChar c -> addAllowed value acc c + U.CharRange{..} -> foldl' (addAllowed value) acc [start..end] + + addAllowed = \case + "Allowed" -> \acc -> (: acc) . ord + x -> error ("Unexpected " <> show x) + + done values = unlinesBB + [ apacheLicense 2022 moduleName + , "{-# LANGUAGE LambdaCase #-}" + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(isAllowedInIdentifier)" + , "where" + , "" + , "import Data.Char (ord)" + , "import Data.Word (Word8)" + , "import GHC.Exts (Ptr(..))" + , "import Unicode.Internal.Bits (lookupBit64)" + , "" + , genBitmap "isAllowedInIdentifier" (reverse values) + ] diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/Security/IdentifierType.hs b/ucd2haskell/exe/UCD2Haskell/Modules/Security/IdentifierType.hs new file mode 100644 index 00000000..46240ae3 --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/Security/IdentifierType.hs @@ -0,0 +1,210 @@ +{-# OPTIONS_GHC -Wno-unrecognised-pragmas #-} +{-# HLINT ignore "Use camelCase" #-} +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.Security.IdentifierType (recipe) where + +import Control.Exception (assert) +import qualified Data.ByteString.Builder as BB +import qualified Data.ByteString.Short as BS +import Data.Foldable (Foldable(..)) +import qualified Data.List as L +import qualified Data.List.NonEmpty as NE +import qualified Data.Map.Strict as Map +import Data.Maybe (fromMaybe) +import qualified Data.Set as Set +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.Properties.Single as Prop + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense, genEnumBitmap, splitPlanes) +import UCD2Haskell.Common (Fold (..)) + +recipe :: FileRecipe Prop.Entry +recipe = ModuleRecipe + "Unicode.Internal.Char.Security.IdentifierType" + genIdentifierTypeModule + +data IdentifierType + = Not_Character + | Deprecated + | Default_Ignorable + | Not_NFKC + | Not_XID + | Exclusion + | Obsolete + | Technical + | Uncommon_Use + | Limited_Use + | Inclusion + | Recommended + deriving (Eq, Ord, Bounded, Enum, Show) + +-- TODO: We should really use Set instead of NonEmpty. +-- This is just to maintain compatibility during the migration +newtype IdentifierTypes = IdentifierTypes + -- TODO { unIdentifierTypes :: Set.Set IdentifierType } + { unIdentifierTypes :: NE.NonEmpty IdentifierType } + deriving newtype (Eq, Ord, Semigroup) + +parseIdentifierType :: BS.ShortByteString -> IdentifierType +parseIdentifierType = \case + "Not_Character" -> Not_Character + "Deprecated" -> Deprecated + "Default_Ignorable" -> Default_Ignorable + "Not_NFKC" -> Not_NFKC + "Not_XID" -> Not_XID + "Exclusion" -> Exclusion + "Obsolete" -> Obsolete + "Technical" -> Technical + "Uncommon_Use" -> Uncommon_Use + "Limited_Use" -> Limited_Use + "Inclusion" -> Inclusion + "Recommended" -> Recommended + raw -> error ("parseIdentifierType: Cannot parse: " <> show raw) + +genIdentifierTypeModule :: BB.Builder -> Fold Prop.Entry BB.Builder +genIdentifierTypeModule moduleName = Fold step mempty done + where + + step + :: Map.Map Char IdentifierTypes + -> Prop.Entry + -> Map.Map Char IdentifierTypes + step acc Prop.Entry{..} = case range of + U.SingleChar c -> addIdentifierType values acc c + U.CharRange{..} -> foldl' (addIdentifierType values) acc [start..end] + where values = NE.fromList (U.parseList value) + + addIdentifierType types acc c = Map.insertWith + -- TODO (<>) + (flip (<>)) + c + -- TODO (IdentifierTypes (Set.fromList (parseIdentifierType <$> NE.toList types))) + (IdentifierTypes (parseIdentifierType <$> types)) + acc + + mkIdentifiersTypes + :: Map.Map Char IdentifierTypes + -> (BB.Builder, [Int], Int) + mkIdentifiersTypes types = + let encoding = Set.toList (Set.fromList (def : Map.elems types)) + defIdx = case L.elemIndex def encoding of + Nothing -> error "impossible" + Just i -> assert (i == 0) i + in assert (length encoding < 0xff) + ( foldMap addEncoding (zip [0..] encoding) + , snd (Map.foldlWithKey' (addChar defIdx encoding) ('\0', mempty) types) + , defIdx ) + + -- Default value + -- TODO def = IdentifierTypes (Set.singleton Not_Character) + def = IdentifierTypes (NE.singleton Not_Character) + + addEncoding :: (Int, IdentifierTypes) -> BB.Builder + addEncoding (n, e) = mconcat + [ "\n " + , BB.intDec n + , " -> " + , mkHaskellConstructorsList e ] + + addChar + :: Int + -> [IdentifierTypes] + -> (Char, [Int]) + -> Char + -> IdentifierTypes + -> (Char, [Int]) + addChar defIdx encoding (expected, acc) c types = if expected < c + then + let acc' = encodeTypes defIdx encoding def : acc + in addChar defIdx encoding (succ expected, acc') c types + else (succ c, encodeTypes defIdx encoding types : acc) + + encodeTypes :: Int -> [IdentifierTypes] -> IdentifierTypes -> Int + encodeTypes defIdx encoding types = + fromMaybe defIdx (L.elemIndex types encoding) + + -- HACK: remove underscore to get constructor + mkHaskellConstructorsList + = BB.string7 + . filter (/= '_') + . show + -- TODO . Set.toList + . NE.toList + . unIdentifierTypes + + done acc = unlinesBB + [ apacheLicense 2022 moduleName + , "{-# LANGUAGE LambdaCase, OverloadedLists #-}" + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(IdentifierType(..), identifierTypes, decodeIdentifierTypes)" + , "where" + , "" + , "import Data.Char (ord)" + , "import Data.List.NonEmpty (NonEmpty)" + , "import Data.Word (Word8)" + , "import GHC.Exts (Ptr(..))" + , "import Unicode.Internal.Bits (lookupIntN)" + , "" + , "-- | Identifier type" + , "--" + , "-- @since 0.1.0" + , "data IdentifierType" + , " = NotCharacter" + , " -- ^ Unassigned characters, private use characters, surrogates," + , " -- non-whitespace control characters." + , " | Deprecated" + , " -- ^ Characters with the Unicode property @Deprecated=Yes@." + , " | DefaultIgnorable" + , " -- ^ Characters with the Unicode property \ + \@Default_Ignorable_Code_Point=Yes@." + , " | NotNFKC" + , " -- ^ Characters that cannot occur in strings normalized to NFKC." + , " | NotXID" + , " -- ^ Characters that do not qualify as default Unicode identifiers;" + , " -- that is, they do not have the Unicode property XID_Continue=True." + , " | Exclusion" + , " -- ^ Characters with @Script_Extensions@ values containing a script" + , " -- in /Excluded Scripts/, and no script from /Limited Use Scripts/" + , " -- or /Recommended Scripts/, other than “Common” or “Inherited”." + , " | Obsolete" + , " -- ^ Characters that are no longer in modern use, or that are not" + , " -- commonly used in modern text." + , " | Technical" + , " -- ^ Specialized usage: technical, liturgical, etc." + , " | UncommonUse" + , " -- ^ Characters that are uncommon, or are limited in use, or" + , " -- whose usage is uncertain." + , " | LimitedUse" + , " -- ^ Characters from scripts that are in limited use." + , " | Inclusion" + , " -- ^ Exceptionally allowed characters." + , " | Recommended" + , " -- ^ Characters from scripts that are in widespread everyday common use." + , " deriving (Eq, Ord, Bounded, Enum, Show)" + , "" + , "-- | Useful to decode the output of 'identifierTypes'." + , "decodeIdentifierTypes :: Int -> NonEmpty IdentifierType" + , "decodeIdentifierTypes = \\case" <> encoding + , " _ -> " <> mkHaskellConstructorsList def + , "" + , "-- | Returns the 'IdentifierType's corresponding to a character." + , genEnumBitmap + "identifierTypes" + (defIdx, BB.intDec (fromEnum defIdx)) + (defIdx, BB.intDec (fromEnum defIdx)) + planes0To3 + plane14 + ] + where + (planes0To3, plane14) = splitPlanes + "Cannot generate: genIdentifierTypeModule" + (== defIdx) + (reverse identifiersTypes) + (encoding, identifiersTypes, defIdx) = mkIdentifiersTypes acc diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/Security/IntentionalConfusables.hs b/ucd2haskell/exe/UCD2Haskell/Modules/Security/IntentionalConfusables.hs new file mode 100644 index 00000000..9d7d6b30 --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/Security/IntentionalConfusables.hs @@ -0,0 +1,69 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.Security.IntentionalConfusables (recipe) where + +import qualified Data.ByteString.Builder as BB +import qualified Data.ByteString.Lazy as BL +import qualified Data.Map.Strict as Map +import qualified Data.Set as Set +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.Properties.Single as Prop + +import UCD2Haskell.Common (Fold (..), showB) +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense) + +recipe :: FileRecipe Prop.Entry +recipe = ModuleRecipe + "Unicode.Internal.Char.Security.IntentionalConfusables" + generate + +generate :: BB.Builder -> Fold Prop.Entry BB.Builder +generate moduleName = Fold step mempty done + where + + step acc Prop.Entry{..} = case range of + U.SingleChar c1 -> Map.insertWith (flip (<>)) c1 (Set.singleton c2) + . Map.insertWith (flip (<>)) c2 (Set.singleton c1) + $ acc + U.CharRange{} -> error ("unexpected range: " <> show (range, value)) + where c2 = U.parseCodePoint value + + mkConfusable :: Char -> Set.Set Char -> BB.Builder + mkConfusable c cs = mconcat + [ "\n " + , showB c + , " -> Just (Ptr \"" + , stringToAddrLiteral (Set.toList cs) + , "\\0\"#)" + ] + + -- Encode string with UTF-8 + stringToAddrLiteral = foldMap toEscapedWord8 . encodeUtf8 + encodeUtf8 = BL.unpack . BB.toLazyByteString . BB.stringUtf8 + toEscapedWord8 = (BB.char7 '\\' <>) . BB.word8Dec + + done confusables = unlinesBB + [ apacheLicense 2022 moduleName + , "{-# LANGUAGE LambdaCase #-}" + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(intentionalConfusables)" + , "where" + , "" + , "import Foreign.C.String (CString)" + , "import GHC.Exts (Ptr(..))" + , "" + , "-- | Returns the /intentional/ confusables of a character, if any." + , "--" + , "-- The resulting 'CString' is null-terminated and encoded in UTF-8." + , "--" + , "-- @since 0.1.0" + , "intentionalConfusables :: Char -> Maybe CString" + , "intentionalConfusables = \\case" <> Map.foldMapWithKey mkConfusable confusables + , " _ -> Nothing" + ] diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/SpecialCasings.hs b/ucd2haskell/exe/UCD2Haskell/Modules/SpecialCasings.hs new file mode 100644 index 00000000..9482d1fe --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/SpecialCasings.hs @@ -0,0 +1,133 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.SpecialCasings + ( upperRecipe + , lowerRecipe + , titleRecipe + , parse + ) where + +import qualified Data.ByteString.Builder as BB +import Data.Char (ord) +import Data.Functor ((<&>)) +import qualified Data.Map.Strict as Map +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.SpecialCasing as SC +import qualified Unicode.CharacterDatabase.Parser.UnicodeData as UD + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense) +import UCD2Haskell.Common (Fold (..)) +import Control.Applicative (Alternative(..)) +import Data.Bits (Bits(..)) +import qualified Data.ByteString as B + +type SpecialCasings = Map.Map Char SC.SpecialCasing + +parse :: B.ByteString -> SpecialCasings +parse = foldr addCasings mempty . SC.parse + where + addCasings (SC.Entry ch sc) + -- Do not add locale-specific casings + | null (SC.conditions sc) = Map.insert ch sc + | otherwise = id + +upperRecipe :: SpecialCasings -> FileRecipe UD.Entry +upperRecipe sc = ModuleRecipe + "Unicode.Internal.Char.SpecialCasing.UpperCaseMapping" + (\m -> genSpecialCaseMappingModule m + "toSpecialUpperCase" + sc + SC.upper + UD.simpleUpperCaseMapping ) + +lowerRecipe :: SpecialCasings -> FileRecipe UD.Entry +lowerRecipe sc = ModuleRecipe + "Unicode.Internal.Char.SpecialCasing.LowerCaseMapping" + (\m -> genSpecialCaseMappingModule m + "toSpecialLowerCase" + sc + SC.lower + UD.simpleLowerCaseMapping ) + +titleRecipe :: SpecialCasings -> FileRecipe UD.Entry +titleRecipe sc = ModuleRecipe + "Unicode.Internal.Char.SpecialCasing.TitleCaseMapping" + (\m -> genSpecialCaseMappingModule m + "toSpecialTitleCase" + sc + SC.title + UD.simpleTitleCaseMapping ) + +-- [NOTE] Case mapping encodes up to 3 code points on 21 bits each in an Int64. +genSpecialCaseMappingModule + :: BB.Builder + -> BB.Builder + -> SpecialCasings + -- ^ Special casings + -> (SC.SpecialCasing -> String) + -- ^ Special case selector + -> (UD.CharDetails -> Maybe Char) + -- ^ Simple case selector + -> Fold UD.Entry BB.Builder +genSpecialCaseMappingModule moduleName funcName specialCasings special simple = + Fold step initial done + + where + + genHeader = + [ apacheLicense 2022 moduleName + , "{-# LANGUAGE LambdaCase #-}" + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(" <> funcName <> ")" + , "where" + , "" + , "import Data.Int (Int64)" + , "" + , "{-# NOINLINE " <> funcName <> " #-}" + , funcName <> " :: Char -> Int64" + , funcName <> " = \\case" + ] + initial = [] + + step xs (UD.Entry range dc) = case range of + U.SingleChar ch -> case mkEntry ch dc of + Nothing -> xs + Just x -> x : xs + _ -> xs + + after = [" _ -> 0"] + + done st = + let body = mconcat [genHeader, reverse st, after] + in unlinesBB body + + mkEntry ch dc = (mkSpecial ch <|> mkSimple dc) <&> \cp -> mconcat + [ " " + , BB.string7 . show $ ch + , " -> 0x" + , BB.wordHex . fromIntegral $ cp + ] + -- TODO: switch to hexadecimal formatting for better debugging? + -- what about code size increase? + -- [ " '\\x" + -- , showHexCodePoint (ord ch) + -- , "' -> '\\x" + -- , showHexCodePoint cp + -- , "'" + -- ] + -- showHexCodePoint = BB.wordHex . fromIntegral + + mkSimple = fmap ord . simple + mkSpecial = fmap (encode . special) . (specialCasings Map.!?) + encode :: String -> Int + encode + = foldr (\(k, c) -> (+) (ord c `shiftL` k)) 0 + . zip [0, 21, 42] + -- Check min 1 character, max 3 characters + . (\cs -> if null cs || length cs > 3 then error (show cs) else cs) diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/CombiningClass.hs b/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/CombiningClass.hs new file mode 100644 index 00000000..e4aa4782 --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/CombiningClass.hs @@ -0,0 +1,85 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.UnicodeData.CombiningClass + ( recipe + , parseCombining) where + +import qualified Data.ByteString.Builder as BB +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.UnicodeData as UD +import qualified Unicode.CharacterDatabase.Parser.Properties.Single as Props + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense, genBitmap) +import UCD2Haskell.Common (Fold (..), showB) +import Data.Char (ord) +import Data.Foldable (Foldable(..)) +import qualified Data.ByteString as B +import qualified Data.Set as Set + +recipe :: FileRecipe UD.Entry +recipe = ModuleRecipe + "Unicode.Internal.Char.UnicodeData.CombiningClass" + genCombiningClassModule + +data Acc = Acc + { combiningClasses :: ![BB.Builder] + , combiningCodePoints :: ![Int] } + +genCombiningClassModule :: BB.Builder -> Fold UD.Entry BB.Builder +genCombiningClassModule moduleName = Fold step initial done + where + + initial = Acc [] [] + + step acc (UD.Entry range details) + -- Skip non combining + | cc == 0 = acc + | otherwise = case range of + U.SingleChar c -> step' cc acc c + U.CharRange start end -> foldl' (step' cc) acc [start..end] + where cc = UD.combiningClass details + step' cc Acc{..} c = Acc + { combiningClasses = genCombiningClassDef c cc : combiningClasses + , combiningCodePoints = ord c : combiningCodePoints } + + done Acc{..} = + unlinesBB + [ apacheLicense 2020 moduleName + , "{-# LANGUAGE LambdaCase #-}" + , "{-# OPTIONS_HADDOCK hide #-}" + , "module " <> moduleName + , "(combiningClass, isCombining)" + , "where" + , "" + , "import Data.Char (ord)" + , "import Data.Word (Word8)" + , "import GHC.Exts (Ptr(..))" + , "import Unicode.Internal.Bits (lookupBit64)" + , "" + , "combiningClass :: Char -> Int" + , "combiningClass = \\case" + , unlinesBB (reverse combiningClasses) + , " _ -> 0\n" + , "" + , genBitmap "isCombining" (reverse combiningCodePoints) + ] + + genCombiningClassDef c cc = mconcat + [ " " + , showB c + , " -> " + , BB.word8Dec cc + ] + +parseCombining :: B.ByteString -> Set.Set Char +parseCombining = foldr addCombining mempty . Props.parse + where + addCombining Props.Entry{..} + | value == "0" = id + | otherwise = case range of + U.SingleChar c -> Set.insert c + U.CharRange{..} -> (Set.fromList [start..end] <>) diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/Composition.hs b/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/Composition.hs new file mode 100644 index 00000000..ce94ebb6 --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/Composition.hs @@ -0,0 +1,135 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.UnicodeData.Composition + ( recipe + , parseFullCompositionExclusion) where + +import qualified Data.ByteString.Builder as BB +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.UnicodeData as UD +import qualified Unicode.CharacterDatabase.Parser.Properties.Multiple as Props + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense, genBitmap) +import UCD2Haskell.Common (Fold (..), showB, isHangulRange, allRange) +import Data.Char (ord) +import Data.Foldable (Foldable(..)) +import qualified Data.Set as Set +import qualified Data.ByteString as B + +recipe :: Set.Set Char -> Set.Set Char -> FileRecipe UD.Entry +recipe excluded combiningChars = ModuleRecipe + "Unicode.Internal.Char.UnicodeData.Compositions" + (\m -> genCompositionsModule m excluded combiningChars) + +data Acc = Acc + { decompositions :: ![BB.Builder] + , starters :: ![BB.Builder] + , secondStarters :: !(Set.Set Int) } + +data Decomposition2 = Decomposition2 + { first :: !Char + , second :: !Char } + +genCompositionsModule + :: BB.Builder + -> Set.Set Char + -> Set.Set Char + -> Fold UD.Entry BB.Builder +genCompositionsModule moduleName excluded combiningChars = + Fold step initial done + + where + + isNotExcluded = allRange (not . (`elem` excluded)) + + genComposePairDef name c Decomposition2{..} = mconcat + [ name + , " " + , showB first + , " " + , showB second + , " = Just " + , showB c ] + + initial = Acc mempty mempty mempty + + step acc (UD.Entry range (UD.decomposition -> decomp)) + | not (isHangulRange range) && isNotExcluded range = + case decomp of + UD.Decomposition UD.Canonical [c1, c2] -> + stepRange acc (Decomposition2 c1 c2) range + _ -> acc + -- Filtered out + | otherwise = acc + stepRange acc decomp = \case + U.SingleChar c -> step' decomp acc c + U.CharRange{..} -> foldl' (step' decomp) acc [start..end] + step' decomp@Decomposition2{..} Acc{..} c = Acc + { decompositions = decompositions' + , starters= starters' + , secondStarters = secondStarters' } + + where + + secondCP = ord second + decompositions' = genComposePairDef "compose" c decomp : decompositions + starters' = + if second `notElem` combiningChars + then genComposePairDef "composeStarters" c decomp : starters + else starters + secondStarters' = + if second `notElem` combiningChars + then Set.insert secondCP secondStarters + else secondStarters + + header = + [ apacheLicense 2020 moduleName + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(compose, composeStarters, isSecondStarter)" + , "where" + , "" + , "import Data.Char (ord)" + , "import Data.Word (Word8)" + , "import GHC.Exts (Ptr(..))" + , "import Unicode.Internal.Bits (lookupBit64)" + , "" + ] + + composePair decomps = + [ "{-# NOINLINE compose #-}" + , "compose :: Char -> Char -> Maybe Char" + , unlinesBB decomps + , "compose _ _ = " <> "Nothing" <> "\n" + , "" + ] + + composeStarterPair starterPairs = + [ "composeStarters :: Char -> Char -> Maybe Char" + , unlinesBB starterPairs + , "composeStarters _ _ = " <> "Nothing" <> "\n" + , "" + ] + + isSecondStarter secondStarters = + [genBitmap "isSecondStarter" (Set.toAscList secondStarters)] + + done Acc{..} = unlinesBB . mconcat $ + [ header + , composePair (reverse decompositions) + , composeStarterPair (reverse starters) + , isSecondStarter secondStarters ] + +parseFullCompositionExclusion :: B.ByteString -> Set.Set Char +parseFullCompositionExclusion = foldr addExcluded mempty . Props.parse + where + addExcluded Props.Entry{..} + | property /= "Full_Composition_Exclusion" = id + | otherwise = case range of + U.SingleChar c -> Set.insert c + U.CharRange{..} -> (Set.fromList [start..end] <>) diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/Decomposition.hs b/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/Decomposition.hs new file mode 100644 index 00000000..b2425d09 --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/Decomposition.hs @@ -0,0 +1,174 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.UnicodeData.Decomposition + ( -- * Recipes + decomposable + , decomposableK + , decompositions + , decompositionsK2 + , decompositionsK + -- * Helpers + , DType(..) + , hasDecomposableType + ) where + +import qualified Data.ByteString.Builder as BB +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.UnicodeData as UD + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense, genBitmap, unwordsBB) +import UCD2Haskell.Common (Fold (..), filterFold, filterNonHangul, showB, allRange) +import Data.Char (ord) +import Data.Foldable (Foldable(..)) + +data DType = Canonical | Kompat + +-------------------------------------------------------------------------------- +-- Predicates +-------------------------------------------------------------------------------- + +decomposable :: FileRecipe UD.Entry +decomposable = ModuleRecipe + "Unicode.Internal.Char.UnicodeData.Decomposable" + (`genDecomposableModule` Canonical) + +decomposableK :: FileRecipe UD.Entry +decomposableK = ModuleRecipe + "Unicode.Internal.Char.UnicodeData.DecomposableK" + (`genDecomposableModule` Kompat) + +genDecomposableModule + :: BB.Builder + -> DType + -> Fold UD.Entry BB.Builder +genDecomposableModule moduleName dtype + = filterNonHangul + . filterDecomposableType dtype + $ Fold step initial done + + where + + initial :: [Int] + initial = [] + + step acc (UD.Entry range _) = + case range of + U.SingleChar c -> ord c : acc + U.CharRange{..} -> foldl' (\a c -> ord c : a) acc [start..end] + + done st = + unlinesBB + [ apacheLicense 2020 moduleName + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(isDecomposable)" + , "where" + , "" + , "import Data.Char (ord)" + , "import Data.Word (Word8)" + , "import GHC.Exts (Ptr(..))" + , "import Unicode.Internal.Bits (lookupBit64)" + , "" + , genBitmap "isDecomposable" (reverse st) + ] + +filterDecomposableType :: DType -> Fold UD.Entry a -> Fold UD.Entry a +filterDecomposableType dtype = + filterFold (hasDecomposableType dtype . UD.decomposition . UD.details) + +hasDecomposableType :: DType -> UD.Decomposition -> Bool +hasDecomposableType dtype = \case + UD.Self -> False + UD.Decomposition t _ -> case dtype of + Canonical -> t == UD.Canonical + Kompat -> True + +-------------------------------------------------------------------------------- +-- Decompositions +-------------------------------------------------------------------------------- + +decompositions :: FileRecipe UD.Entry +decompositions = ModuleRecipe + "Unicode.Internal.Char.UnicodeData.Decompositions" + let post = [" c -> [c]"] + in (\m -> genDecomposeDefModule m [] post Canonical (const True)) + +decompositionsK2 :: FileRecipe UD.Entry +decompositionsK2 = ModuleRecipe + "Unicode.Internal.Char.UnicodeData.DecompositionsK2" + let post = [" c -> [c]"] + in (\m -> genDecomposeDefModule m [] post Kompat (>= '\60000')) + +decompositionsK :: FileRecipe UD.Entry +decompositionsK = ModuleRecipe + "Unicode.Internal.Char.UnicodeData.DecompositionsK" + let pre = [ unwordsBB [ "import qualified" + , BB.string7 (moduleName decompositionsK2) + , "as DK2" ] + , "" ] + post = [" c -> DK2.decompose c"] + in (\m -> genDecomposeDefModule m pre post Kompat (< '\60000')) + +genDecomposeDefModule + :: BB.Builder + -> [BB.Builder] + -> [BB.Builder] + -> DType + -> (Char -> Bool) + -> Fold UD.Entry BB.Builder +genDecomposeDefModule moduleName before after dtype predicate + = filterFold predicate' + . filterNonHangul + . filterDecomposableType dtype + $ Fold step initial done + + where + + predicate' = allRange predicate . UD.range + + decomposeChar c = \case + UD.Self -> [c] + UD.Decomposition _ ds -> ds + + genHeader = + [ apacheLicense 2020 moduleName + , "{-# LANGUAGE LambdaCase #-}" + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(decompose)" + , "where" + , "" + ] + genSign = + [ "{-# NOINLINE decompose #-}" + , "decompose :: Char -> [Char]" + , "decompose = \\case" + ] + + initial :: [BB.Builder] + initial = [] + + step acc (UD.Entry range details) = case range of + U.SingleChar c -> step' (UD.decomposition details) acc c + U.CharRange start end -> foldl' + (step' (UD.decomposition details)) + acc + [start..end] + step' decomp acc c = genDecomposeDef c decomp : acc + + done st = + let body = mconcat [genHeader, before, genSign, reverse st, after] + in unlinesBB body + + genDecomposeDef c decomp = mconcat + [ " " + , showB c + , " -> " + , showB (decomposeChar c decomp) + ] diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/DerivedNames.hs b/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/DerivedNames.hs new file mode 100644 index 00000000..2c2a1cbd --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/DerivedNames.hs @@ -0,0 +1,249 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.UnicodeData.DerivedNames (recipe) where + +import Data.Bits (Bits(..)) +import qualified Data.ByteString.Lazy as BL +import qualified Data.ByteString.Builder as BB +import qualified Data.ByteString.Short as BS +import Data.Char (ord) +import Data.Foldable (Foldable(..)) +import Data.Word (Word8) +import qualified Unicode.CharacterDatabase.Parser.Extracted.DerivedName as N + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense, enumMapToAddrLiteral, chunkAddrLiteral) +import UCD2Haskell.Common (Fold (..), showPaddedHeXB, showHexCodepoint, showPaddedHeX, showHexCodepointB, showHexCodepointBS) + +recipe :: FileRecipe N.Entry +recipe = ModuleRecipe + "Unicode.Internal.Char.UnicodeData.DerivedName" + genNamesModule + +data Acc = Acc + { expected :: !Char + , names :: ![BS.ShortByteString] + , offsets :: ![[Int]] + , offset :: !Int + , maxPlanes03 :: !Char + , maxPlane14 :: !Char + } + +genNamesModule :: BB.Builder -> Fold N.Entry BB.Builder +genNamesModule moduleName = Fold step initial done + where + + initial = Acc '\0' mempty mempty 0 '\0' '\0' + + step acc = \case + N.SingleChar{..} -> step' acc char name + N.CharRange{..} -> foldl' + (\a c -> step' a c (mkName prefix c)) + acc + [start..end] + + mkName prefix c = prefix <> showHexCodepointBS c + + step' Acc{..} char name = if expected < char + then if expected < '\xE0000' && char >= '\xE0000' + then step' + Acc { expected = '\xE0000' + , names + , offsets + , offset + , maxPlanes03 + , maxPlane14 } + char name + else step' + Acc { expected = succ expected + , names + , offsets = encodeOffset 0 0 : offsets + , offset + , maxPlanes03 + , maxPlane14 } + char name + else + let !(name', len, len', compressed) = encodeName name + in if (char < '\x40000' || char >= '\xE0000') && + offset <= 0xffffff && (len < hangul || compressed) + then Acc + { expected = succ expected + , names = name' : names + , offsets = encodeOffset offset len : offsets + , offset = offset + len' + , maxPlanes03 = if char < '\x40000' + then max maxPlanes03 char + else maxPlanes03 + , maxPlane14 = max maxPlane14 char } + else error (mconcat + [ "genNamesModule: Cannot encode '\\x" + , showHexCodepoint char + , "' ", show name + , " (offset: 0x" + , showPaddedHeX offset + , ", length: 0x" + , showPaddedHeX len + , ")" ]) + + cjkCompat = 0xf0 + cjkUnified = 0xf1 + tangut = 0xf2 + hangul = 0x80 + + encodeName name + | BS.take 28 name == "CJK COMPATIBILITY IDEOGRAPH-" = ("", cjkCompat, 0, True) + | BS.take 22 name == "CJK UNIFIED IDEOGRAPH-" = ("", cjkUnified, 0, True) + | BS.take 17 name == "TANGUT IDEOGRAPH-" = ("", tangut, 0, True) + | BS.take 16 name == "HANGUL SYLLABLE " = + let !name' = BS.drop 16 name; !len = BS.length name' + in if len <= 12 + then (name', hangul + len, len, True) + else error ("genNamesModule: cannot encode Hangul: " <> show len) + | otherwise = let !len = BS.length name in (name, len, len, False) + + encodeOffset offset len = encode32LE offset' mempty + where !offset' = len .|. (offset `shiftL` 8) + encode32LE v acc + = (v .&. 0xff) + : (v `shiftR` 8 .&. 0xff) + : (v `shiftR` 16 .&. 0xff) + : v `shiftR` 24 + : acc + + done Acc{..} = unlinesBB + [ apacheLicense 2022 moduleName + , "{-# LANGUAGE PatternSynonyms #-}" + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , " ( name" + , " , pattern NoName" + , " , pattern CjkCompatibilityIdeograph" + , " , pattern CjkUnifiedIdeograph" + , " , pattern TangutIdeograph" + , " , pattern HangulSyllable" + , " ) where" + , "" + , "import Data.Int (Int32)" + , "import Foreign.C (CChar)" + , "import GHC.Exts" + , " ( Addr#, Char#, Int#, Ptr(..)," + , " ord#, (-#), (<#)," + , " uncheckedIShiftRL#, andI#," + , " plusAddr#, isTrue# )" + , "import Unicode.Internal.Bits.Names (lookupInt32#)" + , "" + , "-- | No name. Used to test length returned by 'name'." + , "--" + , "-- @since 0.3.0" + , "pattern NoName :: Int#" + , "pattern NoName = 0#" + , "" + , "-- | CJK compatibility ideograph. Used to test the length returned by 'name'." + , "--" + , "-- @since 0.3.0" + , "pattern CjkCompatibilityIdeograph :: Int#" + , "pattern CjkCompatibilityIdeograph = 0x" <> intHex cjkCompat <> "#" + , "" + , "-- | CJK unified ideograph. Used to test the length returned by 'name'." + , "--" + , "-- @since 0.3.0" + , "pattern CjkUnifiedIdeograph :: Int#" + , "pattern CjkUnifiedIdeograph = 0x" <> intHex cjkUnified <> "#" + , "" + , "-- | Tangut ideograph. Used to test the length returned by 'name'." + , "--" + , "-- @since 0.3.0" + , "pattern TangutIdeograph :: Int#" + , "pattern TangutIdeograph = 0x" <> intHex tangut <> "#" + , "" + , "-- | Hangul syllable. Used to test the length returned by 'name'." + , "--" + , "-- @since 0.3.0" + , "pattern HangulSyllable :: Int#" + , "pattern HangulSyllable = 0x" <> intHex hangul <> "#" + , "" + , "-- | Name of a character, if defined." + , "--" + , "-- The return value represents: (ASCII string, string length or special value)." + , "--" + , "-- Some characters require specific processing:" + , "--" + , "-- * If length = @'CjkCompatibilityIdeograph'@," + , "-- then the name is generated from the pattern “CJK COMPATIBILITY IDEOGRAPH-*”," + , "-- where * is the hexadecimal codepoint." + , "-- * If length = @'CjkUnifiedIdeograph'@," + , "-- then the name is generated from the pattern “CJK UNIFIED IDEOGRAPH-*”," + , "-- where * is the hexadecimal codepoint." + , "-- * If length = @'TangutIdeograph'@," + , "-- then the name is generated from the pattern “TANGUT IDEOGRAPH-*”," + , "-- where * is the hexadecimal codepoint." + , "-- * If length ≥ @'HangulSyllable'@," + , "-- then the name is generated by prepending “HANGUL SYLLABLE ”" + , "-- to the returned string." + , "--" + , "-- See an example of such implementation using 'String's in 'Unicode.Char.General.Names.name'." + , "--" + , "-- @since 0.1.0" + , "{-# INLINE name #-}" + , "name :: Char# -> (# Addr#, Int# #)" + , "name c#" + , " | isTrue# (cp# <# 0x" + <> showHexCodepointB (succ maxPlanes03) + <> "#) = getName cp#" + , " | isTrue# (cp# <# 0xE0000#) = (# \"\\0\"#, 0# #)" + , " | isTrue# (cp# <# 0x" + <> showHexCodepointB (succ maxPlane14) + <> "#) = getName (cp# -# 0x" + <> showPaddedHeXB (0xE0000 - ord (succ maxPlanes03)) + <> "#)" + , " | otherwise = (# \"\\0\"#, 0# #)" + , "" + , " where" + , "" + , " -- [NOTE] Encoding" + , " -- • The names are ASCII. Each name is encoded as a raw bytes literal." + , " -- • The names are concatenated in names#." + , " -- There are exceptions (see function’s doc)." + , " -- • The name of a character, if defined, is referenced by an offset in names#." + , " -- • The offsets are stored in offsets#. A character entry is composed of:" + , " -- • a LE Word24 for the offset;" + , " -- • a Word8 for the length of the name or a special value." + , "" + , " !cp# = ord# c#" + , "" + , " {-# INLINE getName #-}" + , " getName k# =" + , " let !entry# = lookupInt32# offsets# k#" + , " !offset# = entry# `uncheckedIShiftRL#` 8#" + , " !name# = names# `plusAddr#` offset#" + , " !len# = entry# `andI#` 0xff#" + , " in (# name#, len# #)" + , "" + , " !(Ptr names#) = namesBitmap" + , " !(Ptr offsets#) = offsetsBitmap" + , "" + , "namesBitmap :: Ptr CChar" + , "namesBitmap = Ptr" + , " \"" + <> chunkAddrLiteral 4 0xff showsB (shortByteStringsToString names) "\"#" + , "" + , "offsetsBitmap :: Ptr Int32" + , "offsetsBitmap = Ptr" + , " \"" + <> enumMapToAddrLiteral 4 0xff (mconcat (reverse offsets)) "\"#" + ] + where + showsB :: Word8 -> BB.Builder -> BB.Builder + showsB = \case + 0 -> (BB.string7 "\\0" <>) + c -> (BB.word8 c <>) -- Note: names are ASCII + intHex = BB.wordHex . fromIntegral + shortByteStringsToString + = BL.unpack + . BB.toLazyByteString + . foldMap BB.shortByteString + . reverse diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/GeneralCategory.hs b/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/GeneralCategory.hs new file mode 100644 index 00000000..6bba97b6 --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/GeneralCategory.hs @@ -0,0 +1,224 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.UnicodeData.GeneralCategory + ( recipe + ) where + +import qualified Data.ByteString.Short as BS +import qualified Data.ByteString.Builder as BB +import Data.Foldable (Foldable(..)) +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.UnicodeData as UD + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense, genEnumBitmap) +import UCD2Haskell.Common (Fold (..), showHexCodepointB, showB) +import qualified Data.List as L + +recipe :: FileRecipe UD.Entry +recipe = ModuleRecipe + "Unicode.Internal.Char.UnicodeData.GeneralCategory" + genGeneralCategoryModule + +data GeneralCategoryAcc = GeneralCategoryAcc + { _categories0_3 :: ![UD.GeneralCategory] + , _categories14 :: ![UD.GeneralCategory] + , _expectedChar :: !Char + , _bounds :: !CharBounds + } + +genGeneralCategoryModule :: BB.Builder -> Fold UD.Entry BB.Builder +genGeneralCategoryModule moduleName = Fold step initial done + + where + + -- (categories planes 0-3, categories plane 14, expected char) + initial = GeneralCategoryAcc + [] + [] + minBound + (CharBounds minBound minBound minBound minBound minBound minBound minBound) + + step :: GeneralCategoryAcc -> UD.Entry -> GeneralCategoryAcc + step acc (UD.Entry range details) = + case range of + U.SingleChar c -> step' acc c details + U.CharRange{..} -> foldl' (\a c -> step' a c details) acc [start..end] + + step' :: GeneralCategoryAcc -> Char -> UD.CharDetails -> GeneralCategoryAcc + step' acc@(GeneralCategoryAcc acc1 acc2 p bounds) c details + -- Plane 0 to 3, missing char + -- Fill missing char entry with default category Cn + -- See: https://www.unicode.org/reports/tr44/#Default_Values_Table + | plane0To3 && p < c = step' (GeneralCategoryAcc (UD.Cn : acc1) acc2 (succ p) bounds) + c details + -- Plane 0 to 3, Regular entry + | plane0To3 = GeneralCategoryAcc + (generalCategory : acc1) + acc2 + (succ c) + (updateCharBounds bounds c generalCategory) + -- Plane 4 to 13: no entry expected + | plane4To13 = error ("Unexpected char in plane 4-13: " <> show (c, details)) + -- Plane 15 to 16: skip if PUA + | plane15To16 = case generalCategory of + UD.Co -> acc -- skip + _ -> error ("Unexpected char in plane 15-16: " <> show (c, details)) + -- Leap to plane 14 + | p < '\xE0000' = step' (GeneralCategoryAcc acc1 acc2 '\xE0000' bounds) c details + -- Plane 14, missing char + | p < c = step' (GeneralCategoryAcc acc1 (UD.Cn : acc2) (succ p) bounds) c details + -- Plane 14, regular entry + | otherwise = GeneralCategoryAcc + acc1 + (generalCategory : acc2) + (succ c) + (updateCharBounds bounds c generalCategory) + where + generalCategory = UD.generalCategory details + plane0To3 = c <= '\x3FFFF' + plane4To13 = c <= '\xDFFFF' + plane15To16 = c >= '\xF0000' + + done (GeneralCategoryAcc acc1 acc2 _ CharBounds{..}) = unlinesBB + [ apacheLicense 2020 moduleName + , "{-# OPTIONS_HADDOCK hide #-}" + , "{-# LANGUAGE PatternSynonyms #-}" + , "" + , "module " <> moduleName + , "( -- * Lookup functions" + , " generalCategory" + , ", generalCategoryPlanes0To3" + , "" + , " -- * General categories" + , foldMapWithNewLine mkGeneralCategoryPatternExport [minBound..maxBound] + , "" + , " -- * Characters bounds for predicates" + , foldMapWithNewLine mkCharBoundPatternExport charBoundPatterns + , ") where" + , "" + , "import Data.Char (ord)" + , "import Data.Word (Word8)" + , "import GHC.Exts (Ptr(..))" + , "import Unicode.Internal.Bits (lookupIntN)" + , "" + , "--------------------------------------------------------------------------------" + , "-- General category patterns" + , "--------------------------------------------------------------------------------" + , foldMap mkGeneralCategoryPattern [minBound..maxBound] + , "--------------------------------------------------------------------------------" + , "-- Characters bounds for predicates" + , "--------------------------------------------------------------------------------" + , foldMap mkCharBoundPattern charBoundPatterns + , "--------------------------------------------------------------------------------" + , "-- Lookup functions" + , "--------------------------------------------------------------------------------" + , "" + , "-- | Return the general category of a code point in planes 0 to 3" + , "--" + , "-- The caller of this function must ensure its parameter is \\< @0x40000@." + , "{-# INLINE generalCategoryPlanes0To3 #-}" + , "generalCategoryPlanes0To3 :: Int -> Int" + , "generalCategoryPlanes0To3 = lookupIntN bitmap#" + , " where" + , " !(Ptr bitmap#) = generalCategoryBitmap" + , "" + , "-- | Return the general category of a character" + , genEnumBitmap + "generalCategory" + (UD.Co, generalCategoryConstructor UD.Co) + (UD.Cn, generalCategoryConstructor UD.Cn) + (reverse acc1) + (reverse acc2) + ] + where + foldMapWithNewLine f = mconcat . L.intersperse "\n" . fmap f + mkExport p = ", pattern " <> p + mkGeneralCategoryPatternExport = mkExport . generalCategoryConstructor + mkGeneralCategoryPattern gc = mconcat + [ "\n-- | General category ", showB gc, "\n" + , "pattern ", generalCategoryConstructor gc, " :: Int\n" + , "pattern ", generalCategoryConstructor gc + , " = " + , BB.intDec (fromEnum gc), "\n"] + mkCharBoundPatternExport = mkExport . BB.shortByteString . fst + mkCharBoundPattern :: (BS.ShortByteString, Char) -> BB.Builder + mkCharBoundPattern (p, c) = mconcat + [ "\n-- | Maximum codepoint satisfying @i", BB.shortByteString (BS.drop 4 p), "@\n" + , "pattern ", BB.shortByteString p, " :: Int\n" + , "pattern ", BB.shortByteString p, " = 0x", showHexCodepointB c, "\n"] + charBoundPatterns :: [(BS.ShortByteString, Char)] + charBoundPatterns = + [ ("MaxIsLetter" , maxIsLetter ) + , ("MaxIsAlphaNum" , maxIsAlphaNum ) + , ("MaxIsLower" , maxIsLower ) + , ("MaxIsUpper" , maxIsUpper ) + , ("MaxIsNumber" , maxIsNumber ) + , ("MaxIsSpace" , maxIsSpace ) + , ("MaxIsSeparator", maxIsSeparator) ] + +data CharBounds = CharBounds + { maxIsLetter :: !Char + , maxIsAlphaNum :: !Char + , maxIsLower :: !Char + , maxIsUpper :: !Char + , maxIsNumber :: !Char + , maxIsSpace :: !Char + , maxIsSeparator :: !Char } + +updateCharBounds :: CharBounds -> Char -> UD.GeneralCategory -> CharBounds +updateCharBounds acc@CharBounds{..} c = \case + UD.Lu -> acc{ maxIsAlphaNum = max maxIsAlphaNum c + , maxIsLetter = max maxIsLetter c + , maxIsUpper = max maxIsUpper c } + UD.Ll -> acc{ maxIsAlphaNum = max maxIsAlphaNum c + , maxIsLetter = max maxIsLetter c + , maxIsLower = max maxIsLower c } + UD.Lt -> acc{ maxIsAlphaNum = max maxIsAlphaNum c + , maxIsLetter = max maxIsLetter c + , maxIsUpper = max maxIsUpper c } + UD.Lm -> acc{maxIsAlphaNum=max maxIsAlphaNum c, maxIsLetter=max maxIsLetter c} + UD.Lo -> acc{maxIsAlphaNum=max maxIsAlphaNum c, maxIsLetter=max maxIsLetter c} + UD.Nd -> acc{maxIsAlphaNum=max maxIsAlphaNum c, maxIsNumber=max maxIsNumber c} + UD.Nl -> acc{maxIsAlphaNum=max maxIsAlphaNum c, maxIsNumber=max maxIsNumber c} + UD.No -> acc{maxIsAlphaNum=max maxIsAlphaNum c, maxIsNumber=max maxIsNumber c} + UD.Zs -> acc{maxIsSeparator=max maxIsAlphaNum c, maxIsSpace=max maxIsSpace c} + UD.Zl -> acc{maxIsSeparator=max maxIsAlphaNum c} + UD.Zp -> acc{maxIsSeparator=max maxIsAlphaNum c} + _ -> acc + +generalCategoryConstructor :: UD.GeneralCategory -> BB.Builder +generalCategoryConstructor = \case + UD.Lu -> "UppercaseLetter" + UD.Ll -> "LowercaseLetter" + UD.Lt -> "TitlecaseLetter" + UD.Lm -> "ModifierLetter" + UD.Lo -> "OtherLetter" + UD.Mn -> "NonSpacingMark" + UD.Mc -> "SpacingCombiningMark" + UD.Me -> "EnclosingMark" + UD.Nd -> "DecimalNumber" + UD.Nl -> "LetterNumber" + UD.No -> "OtherNumber" + UD.Pc -> "ConnectorPunctuation" + UD.Pd -> "DashPunctuation" + UD.Ps -> "OpenPunctuation" + UD.Pe -> "ClosePunctuation" + UD.Pi -> "InitialQuote" + UD.Pf -> "FinalQuote" + UD.Po -> "OtherPunctuation" + UD.Sm -> "MathSymbol" + UD.Sc -> "CurrencySymbol" + UD.Sk -> "ModifierSymbol" + UD.So -> "OtherSymbol" + UD.Zs -> "Space" + UD.Zl -> "LineSeparator" + UD.Zp -> "ParagraphSeparator" + UD.Cc -> "Control" + UD.Cf -> "Format" + UD.Cs -> "Surrogate" + UD.Co -> "PrivateUse" + UD.Cn -> "NotAssigned" diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/NameAliases.hs b/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/NameAliases.hs new file mode 100644 index 00000000..a7d1fe46 --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/NameAliases.hs @@ -0,0 +1,159 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.UnicodeData.NameAliases (recipe) where + +import qualified Data.ByteString.Builder as BB +import qualified Data.ByteString.Short as BS +import qualified Data.Map.Strict as Map +import Data.Word (Word8) +import qualified Unicode.CharacterDatabase.Parser.NameAliases as N + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense, enumMapToAddrLiteral) +import UCD2Haskell.Common (Fold (..), showHexCodepointB) +import Data.Foldable (Foldable(..)) + +recipe :: FileRecipe N.Entry +recipe = ModuleRecipe + "Unicode.Internal.Char.UnicodeData.NameAliases" + genAliasesModule + +type CharAliases = Map.Map N.AliasType [BS.ShortByteString] +type Aliases = Map.Map Char CharAliases + +genAliasesModule :: BB.Builder -> Fold N.Entry BB.Builder +genAliasesModule moduleName = Fold step mempty done + where + + step :: Aliases -> N.Entry -> Aliases + step acc N.Entry{..} = Map.alter + (Just . \case + Nothing -> Map.singleton nameAliasType [nameAlias] + Just as -> Map.insertWith (flip (<>)) nameAliasType [nameAlias] as) + char + acc + + mkCharAliases :: Char -> CharAliases -> BB.Builder + mkCharAliases char aliases = mconcat + [ "\n '\\x" + , showHexCodepointB char + , "'# -> \"" + , mkCharAliasesLiteral char aliases + , "\"#" + ] + + mkCharAliasesLiteral :: Char -> CharAliases -> BB.Builder + mkCharAliasesLiteral char aliasesMap = + enumMapToAddrLiteral 0 0xfff + (reverse index) + (mconcat (reverse ("\\0":aliases))) + where + (index, aliases, _) = foldl' + (\acc ty -> addAliasType char acc ty (Map.findWithDefault mempty ty aliasesMap)) + (mempty, mempty, typesCount) + [minBound..maxBound] + + typesCount = fromEnum (maxBound :: N.AliasType) + - fromEnum (minBound :: N.AliasType) + + 1 + + -- [FIXME] [(Word8:AliasType,Word8:index of first alias)] [CString] + addAliasType + :: Char + -> ([Word8], [BB.Builder], Int) -- (index, aliases, last alias index) + -> N.AliasType + -> [BS.ShortByteString] + -> ([Word8], [BB.Builder], Int) + addAliasType char (index, aliasesAcc, lastAliasIndex) _ty = \case + [] -> + ( 0 : index + , aliasesAcc + , lastAliasIndex ) + aliases -> if lastAliasIndex < 0xff + then + ( fromIntegral lastAliasIndex : index + , encodedAliases + , lastAliasIndex' ) + else error . mconcat $ + [ "Cannot encode char ", show char + , "aliases. Offset: ", show lastAliasIndex, " >= 0xff" ] + where + (encodedAliases, lastAliasIndex') = + addEncodedAliases (aliasesAcc, lastAliasIndex) aliases + addEncodedAliases (as, offset) = \case + alias : rest -> addEncodedAliases + ( mconcat ["\\", BB.intDec len, BB.shortByteString alias] : as + , offset' ) + rest + where + len = BS.length alias + offset' = offset + len + 1 + [] -> ("\\0" : as, offset + 1) + + done names = unlinesBB + [ apacheLicense 2022 moduleName + , "{-# LANGUAGE DeriveGeneric, PatternSynonyms #-}" + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(NameAliasType(..), pattern MaxNameAliasType, nameAliases)" + , "where" + , "" + , "import Data.Ix (Ix)" + , "import GHC.Exts (Addr#, Char#, Int#)" + , "import GHC.Generics (Generic)" + , "" + , "-- | Type of name alias. See Unicode Standard 15.0.0, section 4.8." + , "--" + , "-- @since 0.1.0" + , "data NameAliasType" + , " = Correction" + , " -- ^ Corrections for serious problems in the character names." + , " | Control" + , " -- ^ ISO 6429 names for @C0@ and @C1@ control functions, and other" + , " -- commonly occurring names for control codes." + , " | Alternate" + , " -- ^ A few widely used alternate names for format characters." + , " | Figment" + , " -- ^ Several documented labels for @C1@ control code points which" + , " -- were never actually approved in any standard." + , " | Abbreviation" + , " -- ^ Commonly occurring abbreviations (or acronyms) for control codes," + , " -- format characters, spaces, and variation selectors." + , " deriving (Generic, Enum, Bounded, Eq, Ord, Ix, Show)" + , "" + , "-- $setup" + , "-- >>> import GHC.Exts (Int(..))" + , "" + , "-- |" + , "-- >>> I# MaxNameAliasType == fromEnum (maxBound :: NameAliasType)" + , "-- True" + , "pattern MaxNameAliasType :: Int#" + , "pattern MaxNameAliasType = " + <> BB.intDec (fromEnum (maxBound :: N.AliasType)) <> "#" + , "" + , "-- | Detailed character names aliases." + , "-- The names are listed in the original order of the UCD." + , "--" + , "-- Encoding:" + , "--" + , "-- * If there is no alias, return @\"\\\\xff\"#@." + , "-- * For each type of alias, the aliases are encoded as list of (length, alias)." + , "-- The list terminates with @\\\\0@." + , "-- * The list are then concatenated in order of type of alias and" + , "-- terminates with @\\\\0@." + , "-- * The first " + <> BB.intDec (fromEnum (maxBound :: N.AliasType) + 1) + <> " bytes represent each one the index of the first element of the" + , "-- corresponding list of aliases. When the list is empty, then the index is 0." + , "-- * Example: @\\\"\\\\5\\\\0\\\\13\\\\0\\\\0\\\\3XXX\\\\2YY\\\\0\\\\4ZZZZ\\\\0\\\\0\\\"#@" + , "-- represents: @[('Correction',[\\\"XXX\\\", \\\"YY\\\"]),('Alternate', [\\\"ZZZZ\\\"])]@." + , "--" + , "-- @since 0.1.0" + , "nameAliases :: Char# -> Addr#" + , "nameAliases = \\case" <> Map.foldMapWithKey mkCharAliases names + , " _ -> \"\\xff\"#" + ] diff --git a/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/SimpleCaseMappings.hs b/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/SimpleCaseMappings.hs new file mode 100644 index 00000000..d2e67fd9 --- /dev/null +++ b/ucd2haskell/exe/UCD2Haskell/Modules/UnicodeData/SimpleCaseMappings.hs @@ -0,0 +1,90 @@ +-- | +-- Copyright : (c) 2024 Pierre Le Marre +-- License : Apache-2.0 +-- Maintainer : streamly@composewell.com +-- Stability : experimental +-- +module UCD2Haskell.Modules.UnicodeData.SimpleCaseMappings + ( upperRecipe + , lowerRecipe + , titleRecipe + ) where + +import qualified Data.ByteString.Builder as BB +import Data.Functor ((<&>), ($>)) +import qualified Unicode.CharacterDatabase.Parser.Common as U +import qualified Unicode.CharacterDatabase.Parser.UnicodeData as UD + +import UCD2Haskell.Generator (FileRecipe (..), unlinesBB, apacheLicense) +import UCD2Haskell.Common (Fold (..)) + +upperRecipe :: FileRecipe UD.Entry +upperRecipe = ModuleRecipe + "Unicode.Internal.Char.UnicodeData.SimpleUpperCaseMapping" + (\m -> genSimpleCaseMappingModule m "toSimpleUpperCase" UD.simpleUpperCaseMapping) + +lowerRecipe :: FileRecipe UD.Entry +lowerRecipe = ModuleRecipe + "Unicode.Internal.Char.UnicodeData.SimpleLowerCaseMapping" + (\m -> genSimpleCaseMappingModule m "toSimpleLowerCase" UD.simpleLowerCaseMapping) + +titleRecipe :: FileRecipe UD.Entry +titleRecipe = ModuleRecipe + "Unicode.Internal.Char.UnicodeData.SimpleTitleCaseMapping" + (\m -> genSimpleCaseMappingModule m "toSimpleTitleCase" UD.simpleTitleCaseMapping) + +genSimpleCaseMappingModule + :: BB.Builder + -> BB.Builder + -> (UD.CharDetails -> Maybe Char) + -> Fold UD.Entry BB.Builder +genSimpleCaseMappingModule moduleName funcName field = + Fold step initial done + + where + + genHeader = + [ apacheLicense 2020 moduleName + , "{-# LANGUAGE LambdaCase #-}" + , "{-# OPTIONS_HADDOCK hide #-}" + , "" + , "module " <> moduleName + , "(" <> funcName <> ")" + , "where" + , "" + ] + genSign = + [ "{-# NOINLINE " <> funcName <> " #-}" + , funcName <> " :: Char -> Char" + , funcName <> " = \\case" + ] + initial = [] + + step ds dc = case mkEntry dc of + Nothing -> ds + Just d -> d : ds + + after = [" c -> c"] + + done st = + let body = mconcat [genHeader, genSign, reverse st, after] + in unlinesBB body + + mkEntry (UD.Entry r dc) = case r of + U.SingleChar ch -> field dc <&> \c -> mconcat + [ " " + , BB.string7 . show $ ch + , " -> " + , BB.string7 . show $ c + ] + -- TODO: switch to hexadecimal formatting for better debugging? + -- what about code size increase? + -- [ " '\\x" + -- , showHexChar ch + -- , "' -> '\\x" + -- , showHexChar c + -- , "'" + -- ] + -- where showHexChar = BB.wordHex . fromIntegral . ord + U.CharRange{} -> field dc $> error ("genSimpleCaseMappingModule: unexpected char range: " <> show r) + diff --git a/ucd2haskell/ucd2haskell.cabal b/ucd2haskell/ucd2haskell.cabal index 750c91dc..15d188b4 100644 --- a/ucd2haskell/ucd2haskell.cabal +++ b/ucd2haskell/ucd2haskell.cabal @@ -1,6 +1,6 @@ cabal-version: 2.2 name: ucd2haskell -version: 0.4.0 +version: 0.5.0 synopsis: Generate the code of the unicode-data* packages description: Internal package for the generation of @unicode-data*@ packages. @@ -48,8 +48,13 @@ common default-extensions ScopedTypeVariables TupleSections FlexibleContexts - - -- Experimental, may lead to issues + OverloadedStrings + LambdaCase + GeneralizedNewtypeDeriving + BlockArguments + ViewPatterns + NamedFieldPuns + DerivingStrategies DeriveAnyClass UnboxedTuples @@ -66,16 +71,41 @@ executable ucd2haskell ghc-options: -O2 hs-source-dirs: exe main-is: UCD2Haskell.hs - other-modules: Parser.Text + other-modules: + UCD2Haskell.Common + UCD2Haskell.Generator + UCD2Haskell.Generator.Core + UCD2Haskell.Generator.Names + UCD2Haskell.Generator.Scripts + UCD2Haskell.Generator.Security + UCD2Haskell.Modules.Blocks + UCD2Haskell.Modules.CaseFoldings + UCD2Haskell.Modules.DerivedNumericValues + UCD2Haskell.Modules.Properties + UCD2Haskell.Modules.Scripts + UCD2Haskell.Modules.ScriptsExtensions + UCD2Haskell.Modules.Security.Confusables + UCD2Haskell.Modules.Security.IdentifierStatus + UCD2Haskell.Modules.Security.IdentifierType + UCD2Haskell.Modules.Security.IntentionalConfusables + UCD2Haskell.Modules.SpecialCasings + UCD2Haskell.Modules.UnicodeData.CombiningClass + UCD2Haskell.Modules.UnicodeData.Composition + UCD2Haskell.Modules.UnicodeData.Decomposition + UCD2Haskell.Modules.UnicodeData.DerivedNames + UCD2Haskell.Modules.UnicodeData.GeneralCategory + UCD2Haskell.Modules.UnicodeData.NameAliases + UCD2Haskell.Modules.UnicodeData.SimpleCaseMappings if flag(ucd2haskell) buildable: True build-depends: - base >= 4.7 && < 4.21 - , containers >= 0.5 && < 0.7 - , directory >= 1.3.6 && < 1.4 - , filepath >= 1.4.100 && < 1.5 - , getopt-generics >= 0.13 && < 0.14 - , streamly >= 0.8.0 && < 0.8.1 + base >= 4.7 && < 4.21 + , bytestring >= 0.11 && < 0.13 + , containers >= 0.5 && < 0.7 + , directory >= 1.3.6 && < 1.4 + , filepath >= 1.4.100 && < 1.5 + , getopt-generics >= 0.13 && < 0.14 + , unicode-data-parser >= 0.3.0 && < 0.4 else buildable: False