bytestring-0.10.0.0: Fast, compact, strict and lazy byte strings with a list interface

PortabilityGHC
MaintainerSimon Meier <iridcode@gmail.com>

Data.ByteString.Lazy.Builder.BasicEncoding

Contents

Description

This module provides the types of fixed-size and bounded-size encodings, which are the basic building blocks for constructing Builders. They are used for application specific performance tuning of Builders. For example, libraries such as blaze-html or aeson use the functions provided by this module to implement efficient encodings that combine escaping and character encoding. We explain fixed-size and bounded-size encodings in three steps. First, we define them formally. Then, we explain how they can improve the performance of a Builder. Finally, we give two examples to illustrate their use.

Fixed(-size) encodings are encodings that always result in a sequence of bytes of a predetermined, fixed length. An example for a fixed encoding is the big-endian encoding of a Word64, which always results in exactly 8 bytes. Bounded(-size) encodings are encodings that always result in a sequence of bytes that is no larger than a predetermined bound. An example for a bounded encoding is the UTF-8 encoding of a Char, which results always in less or equal to 4 bytes.

Note that every fixed encoding is also a bounded encoding. This module does not expose functions that exploit the special properties of fixed-size encodings. However, they are exploited in the functions encodeSizePrefixed and encodeChunked from Data.ByteString.Lazy.Builder.Extras, which prefix a Builder with its (chunk) size. In the following, we therefore only refer to bounded encodings.

The goal of bounded encodings is to improve the performance of Builders. These improvements stem from making the two most common steps performed by a Builder more efficient.

The first most common step is the concatentation of two Builders. Internally, concatentation corresponds to function composition. (Note that Builders can be seen as difference-lists of buffer-filling functions; cf. http://hackage.haskell.org/cgi-bin/hackage-scripts/package/dlist. ) Function composition is a fast O(1) operation. However, we can use bounded encodings to remove some of these function compositions altoghether, which is obviously more efficient.

The second most common step performed by a Builder is to fill a buffer using a bounded encoding, which works as follows. The Builder checks whether there is enough space left to execute the bounded encoding. If there is, then the Builder executes the bounded encoding and calls the next Builder with the updated buffer. Otherwise, the Builder signals its driver that it requires a new buffer. This buffer must be at least as large as the bound of the encoding. We can use bounded encodings to reduce the number of buffer-free checks by fusing the buffer-free checks of consecutive Builders. We can also use bounded encodings to simplify the control flow for signalling that a buffer is full by ensuring that we check first that there is enough space left and only then decide on how to encode a given value.

Let us illustrate these improvements on the CSV-table rendering example from Data.ByteString.Lazy.Builder. Its "hot code" is the rendering of a table's cells, which we implement as follows using only the functions from the Builder API.

import           Data.ByteString.Lazy.Builder         as B
import           Data.ByteString.Lazy.Builder.ASCII   as B

renderCell :: Cell -> Builder
renderCell (StringC cs) = renderString cs
renderCell (IntC i)     = B.intDec i

renderString :: String -> Builder
renderString cs = B.charUtf8 '"' <> foldMap escape cs <> B.charUtf8 '"'
  where
    escape '\\' = B.charUtf8 '\\' <> B.charUtf8 '\\'
    escape '\"' = B.charUtf8 '\\' <> B.charUtf8 '\"'
    escape c    = B.charUtf8 c

Efficient encoding of Ints as decimal numbers is performed by intDec from Data.ByteString.Lazy.Builder.ASCII. Optimization potential exists for the escaping of Strings. The above implementation has two optimization opportunities. First, we can use a single buffer free check for 4 bytes before escaping a character and only then decide on how to escape the character. This simplifies the control flow and allows to avoid a closure construction, which improves the performance of the Builder. Second, the concatenations performed by foldMap can be eliminated. The following implementation exploits these optimizations.

import qualified Data.ByteString.Lazy.Builder.BasicEncoding  as E
import           Data.ByteString.Lazy.Builder.BasicEncoding
                 ( ifB, fromF, (>*<), (>$<) )

renderString :: String -> Builder
renderString cs =
    B.charUtf8 '"' <> E.encodeListWithB escape cs <> B.charUtf8 '"'
  where
    escape :: E.BoundedEncoding Char
    escape =
      ifB (== '\\') (fixed2 ('\\', '\\')) $
      ifB (== '\"') (fixed2 ('\\', '\"')) $
      E.charUtf8
     
    {-# INLINE fixed2 #-}
    fixed2 x = fromF $ const x >$< E.char7 >*< E.char7

The code should be mostly self-explanatory. The slightly awkward syntax is because the combinators are written such that the sizeBound of the resulting BoundedEncoding can be computed at compile time. We also explicitly inline the fixed2 encoding, which encodes a fixed tuple of characters, to ensure that the bound compuation happens at compile time. When encoding the following list of Strings, the optimized implementation of renderString is two times faster.

maxiStrings :: [String]
maxiStrings = take 1000 $ cycle ["hello", "\"1\"", "λ-wörld"]

Most of the performance gain stems from using encodeListWithB, which encodes a list of values from left-to-right with a BoundedEncoding. It exploits the Builder internals to avoid unnecessary function compositions (i.e., concatentations). In the future, we would expect the compiler to perform the optimizations implemented in encodeListWithB by himself. However, it seems that the code is currently to complicated for the compiler to see through. Therefore, we provide the BoundedEncoding escape hatch, which allows data structures to provide very efficient encoding traversals, like encodeListWithB for lists.

Note that BoundedEncodings are a bit verbose, but quite versatile. Here is an example of a BoundedEncoding for combined HTML escaping and UTF-8 encoding. It exploits that the escaped character with the maximal Unicode codepoint is '>'.

{-# INLINE charUtf8HtmlEscaped #-}
charUtf8HtmlEscaped :: E.BoundedEncoding Char
charUtf8HtmlEscaped =
    ifB (>  '>' ) E.charUtf8 $
    ifB (== '<' ) (fixed4 ('&',('l',('t',';')))) $        -- &lt;
    ifB (== '>' ) (fixed4 ('&',('g',('t',';')))) $        -- &gt;
    ifB (== '&' ) (fixed5 ('&',('a',('m',('p',';'))))) $  -- &amp;
    ifB (== '"' ) (fixed5 ('&',('#',('3',('4',';'))))) $  -- &#34;
    ifB (== '\'') (fixed5 ('&',('#',('3',('9',';'))))) $  -- &#39;
    (fromF E.char7)         -- fallback for Chars smaller than '>'
  where
    {-# INLINE fixed4 #-}
    fixed4 x = fromF $ const x >$<
      E.char7 >*< E.char7 >*< E.char7 >*< E.char7
     
    {-# INLINE fixed5 #-}
    fixed5 x = fromF $ const x >$<
      E.char7 >*< E.char7 >*< E.char7 >*< E.char7 >*< E.char7

Synopsis

Fixed-size encodings

data FixedEncoding a Source

An encoding that always results in a sequence of bytes of a pre-determined, fixed size.

Instances

Contravariant FixedEncoding 

size :: FixedEncoding a -> IntSource

The fixed size of the sequences of bytes generated by this FixedEncoding.

Combinators

The combinators for FixedEncodings are implemented such that the size of the resulting FixedEncoding is computed at compile time.

emptyF :: FixedEncoding aSource

The FixedEncoding that always results in the zero-length sequence.

pairF :: FixedEncoding a -> FixedEncoding b -> FixedEncoding (a, b)Source

Encode a pair by encoding its first component and then its second component.

contramapF :: (b -> a) -> FixedEncoding a -> FixedEncoding bSource

Change an encoding such that it first applies a function to the value to be encoded.

Note that encodings are Contrafunctors http://hackage.haskell.org/package/contravariant. The following laws hold.

contramapF id = id
contramapF f . contramapF g = contramapF (g . f)

Builder construction

In terms of expressivity, the function encodeWithF would be sufficient for constructing Builders from FixedEncodings. The fused variants of this function are provided because they allow for more efficient implementations. Our compilers are just not smart enough yet; and for some of the employed optimizations (see the code of encodeByteStringWithF) they will very likely never be.

Note that functions marked with "Heavy inlining." are forced to be inlined because they must be specialized for concrete encodings, but are rather heavy in terms of code size. We recommend to define a top-level function for every concrete instantiation of such a function in order to share its code. A typical example is the function byteStringHexFixed from Data.ByteString.Lazy.Builder.ASCII, which is implemented as follows.

 {-# NOINLINE byteStringHexFixed #-}
 byteStringHexFixed :: S.ByteString -> Builder
 byteStringHexFixed = encodeByteStringWithF word8HexFixed

encodeWithF :: FixedEncoding a -> a -> BuilderSource

Encode a value with a FixedEncoding.

encodeListWithF :: FixedEncoding a -> [a] -> BuilderSource

Encode a list of values from left-to-right with a FixedEncoding.

encodeUnfoldrWithF :: FixedEncoding b -> (a -> Maybe (b, a)) -> a -> BuilderSource

Encode a list of values represented as an unfoldr with a FixedEncoding.

encodeByteStringWithF :: FixedEncoding Word8 -> ByteString -> BuilderSource

Heavy inlining. Encode all bytes of a strict ByteString from left-to-right with a FixedEncoding. This function is quite versatile. For example, we can use it to construct a Builder that maps every byte before copying it to the buffer to be filled.

mapToBuilder :: (Word8 -> Word8) -> S.ByteString -> Builder
mapToBuilder f = encodeByteStringWithF (contramapF f word8)

We can also use it to hex-encode a strict ByteString as shown in the byteStringHexFixed example above.

encodeLazyByteStringWithF :: FixedEncoding Word8 -> ByteString -> BuilderSource

Heavy inlining. Encode all bytes of a lazy ByteString from left-to-right with a FixedEncoding.

Bounded-size encodings

data BoundedEncoding a Source

An encoding that always results in sequence of bytes that is no longer than a pre-determined bound.

Instances

Contravariant BoundedEncoding 

sizeBound :: BoundedEncoding a -> IntSource

The bound on the size of sequences of bytes generated by this BoundedEncoding.

Combinators

The combinators for BoundedEncodings are implemented such that the sizeBound of the resulting BoundedEncoding is computed at compile time.

emptyB :: BoundedEncoding aSource

The BoundedEncoding that always results in the zero-length sequence.

pairB :: BoundedEncoding a -> BoundedEncoding b -> BoundedEncoding (a, b)Source

Encode a pair by encoding its first component and then its second component.

eitherB :: BoundedEncoding a -> BoundedEncoding b -> BoundedEncoding (Either a b)Source

Encode an Either value using the first BoundedEncoding for Left values and the second BoundedEncoding for Right values.

Note that the functions eitherB, pairB, and contramapB (written below using >$<) suffice to construct BoundedEncodings for all non-recursive algebraic datatypes. For example,

maybeB :: BoundedEncoding () -> BoundedEncoding a -> BoundedEncoding (Maybe a)
maybeB nothing just = maybe (Left ()) Right >$< eitherB nothing just

ifB :: (a -> Bool) -> BoundedEncoding a -> BoundedEncoding a -> BoundedEncoding aSource

Conditionally select a BoundedEncoding. For example, we can implement the ASCII encoding that drops characters with Unicode codepoints above 127 as follows.

charASCIIDrop = ifB (< '\128') (fromF char7) emptyB

contramapB :: (b -> a) -> BoundedEncoding a -> BoundedEncoding bSource

Change a BoundedEncoding such that it first applies a function to the value to be encoded.

Note that BoundedEncodings are Contrafunctors http://hackage.haskell.org/package/contravariant. The following laws hold.

contramapB id = id
contramapB f . contramapB g = contramapB (g . f)

We provide an overloaded operator for the contramapF and contramapB combinators to allow for a more convenient syntax. The operator is source-compatible with the one provided by the contravariant library. Once this library is part of the Haskell platform, we will make FixedEncoding and BoundedEncoding instances of its Contravariant type-class.

(>$<) :: Contravariant f => (a -> b) -> f b -> f aSource

An overloaded infix operator for contramapF and contramapB.

We can use it for example to prepend and/or append fixed values to an encoding. The following implementation surrounds an UTF-8 encoded character with singlequotes; e.g., showF singleQuotedUtf8 'a' = "'a'".

 singleQuotedUtf8 :: FixedEncoding Char
 singleQuotedUtf8 =
   ((x -> ('\'', (x, '\''))) >$< pairF char7 (pairF charUtf8 char7)

Note that the rather verbose syntax for composition stems from the requirement to be able to compute the sizes and sizeBounds at compile time.

Builder construction

encodeWithB :: BoundedEncoding a -> a -> BuilderSource

Encode a value with a BoundedEncoding.

Note that consecutive uses of encodeWithB and encodeWithF are rewritten such that their bounds-checks are fused; i.e., we rewrite using the rules

  encodeWithF 
= encodeWithB . fromF
  encodeWithB be1 x1 `mappend` (encodeWithB be2 x2)
= encodeWithB (pairB be1 be2) (x1, x2)

For example,

encodeWithB (word32 x1) `mappend` encodeWithB (word32 x2)

is rewritten such that the resulting Builder checks only once, if ther are at least 8 free bytes, instead of checking twice, if there are 4 free bytes. This rewrite rule is not observationally equivalent, as it may change the boundaries of the generated chunks. We deem this acceptable, as for all use-cases of Builders known to us the precise location of chunk boundaries does not matter.

encodeListWithB :: BoundedEncoding a -> [a] -> BuilderSource

Encodes a list of values from left-to-right using a BoundedEncoding.

encodeUnfoldrWithB :: BoundedEncoding b -> (a -> Maybe (b, a)) -> a -> BuilderSource

Create a Builder that encodes a list of values represented as an unfoldr with a BoundedEncoding.

encodeByteStringWithB :: BoundedEncoding Word8 -> ByteString -> BuilderSource

Heavy inlining. Encode all bytes of a strict ByteString from left-to-right with a BoundedEncoding.

For example, we can use this function to construct a Builder that filters every byte before copying it to the buffer to be filled.

filterToBuilder :: (Word8 -> Bool) -> S.ByteString -> Builder
filterToBuilder p =
  encodeByteStringWithB (ifB p (fromF word8) emptyB)

encodeLazyByteStringWithB :: BoundedEncoding Word8 -> ByteString -> BuilderSource

Heavy inlining. Encode all bytes of a lazy ByteString from left-to-right with a BoundedEncoding.

Standard encodings of Haskell values

Binary encodings

int8 :: FixedEncoding Int8Source

Encoding single signed bytes as-is.

word8 :: FixedEncoding Word8Source

Encoding single unsigned bytes as-is.

Fixed-length

Big-endian

int16BE :: FixedEncoding Int16Source

Encoding Int16s in big endian format.

int32BE :: FixedEncoding Int32Source

Encoding Int32s in big endian format.

int64BE :: FixedEncoding Int64Source

Encoding Int64s in big endian format.

word16BE :: FixedEncoding Word16Source

Encoding Word16s in big endian format.

word32BE :: FixedEncoding Word32Source

Encoding Word32s in big endian format.

word64BE :: FixedEncoding Word64Source

Encoding Word64s in big endian format.

floatBE :: FixedEncoding FloatSource

Encode an IEEE Float in big endian format.

doubleBE :: FixedEncoding DoubleSource

Encode an IEEE Double in big endian format.

Little-endian

int16LE :: FixedEncoding Int16Source

Encoding Int16s in little endian format.

int32LE :: FixedEncoding Int32Source

Encoding Int32s in little endian format.

int64LE :: FixedEncoding Int64Source

Encoding Int64s in little endian format.

word16LE :: FixedEncoding Word16Source

Encoding Word16s in little endian format.

word32LE :: FixedEncoding Word32Source

Encoding Word32s in little endian format.

word64LE :: FixedEncoding Word64Source

Encoding Word64s in little endian format.

floatLE :: FixedEncoding FloatSource

Encode an IEEE Float in little endian format.

doubleLE :: FixedEncoding DoubleSource

Encode an IEEE Double in little endian format.

Non-portable, host-dependent

intHost :: FixedEncoding IntSource

Encode a single native machine Int. The Ints is encoded in host order, host endian form, for the machine you are on. On a 64 bit machine the Int is an 8 byte value, on a 32 bit machine, 4 bytes. Values encoded this way are not portable to different endian or integer sized machines, without conversion.

int16Host :: FixedEncoding Int16Source

Encoding Int16s in native host order and host endianness.

int32Host :: FixedEncoding Int32Source

Encoding Int32s in native host order and host endianness.

int64Host :: FixedEncoding Int64Source

Encoding Int64s in native host order and host endianness.

wordHost :: FixedEncoding WordSource

Encode a single native machine Word. The Words is encoded in host order, host endian form, for the machine you are on. On a 64 bit machine the Word is an 8 byte value, on a 32 bit machine, 4 bytes. Values encoded this way are not portable to different endian or word sized machines, without conversion.

word16Host :: FixedEncoding Word16Source

Encoding Word16s in native host order and host endianness.

word32Host :: FixedEncoding Word32Source

Encoding Word32s in native host order and host endianness.

word64Host :: FixedEncoding Word64Source

Encoding Word64s in native host order and host endianness.

floatHost :: FixedEncoding FloatSource

Encode an IEEE Float in native host order and host endianness. Values written this way are not portable to different endian machines, without conversion.

doubleHost :: FixedEncoding DoubleSource

Encode an IEEE Double in native host order and host endianness.

Variable-length

Little-endian, base-128

word8Base128LE :: BoundedEncoding Word8Source

Variable-length, little-endian, base-128 encoding of a Word8.

word16Base128LE :: BoundedEncoding Word16Source

Variable-length, little-endian, base-128 encoding of a Word16.

word32Base128LE :: BoundedEncoding Word32Source

Variable-length, little-endian, base-128 encoding of a Word32.

word64Base128LE :: BoundedEncoding Word64Source

Variable-length, little-endian, base-128 encoding of a Word64.

wordBase128LE :: BoundedEncoding WordSource

Variable-length, little-endian, base-128 encoding of a Word.

Note that in contrast to the fixed-width binary encoding of a Word, whose width depends on the register-width of a machine, this encoding is machine-independent for values small enough to be represented using a Word on all relevant machines.

Little-endian, base-128, zig-zag

int8ZigZagBase128LE :: BoundedEncoding Int8Source

Variable-length, little-endian, base-128, zig-zag encoding of an Int8.

int16ZigZagBase128LE :: BoundedEncoding Int16Source

Variable-length, little-endian, base-128, zig-zag encoding of an Int16.

int32ZigZagBase128LE :: BoundedEncoding Int32Source

Variable-length, little-endian, base-128, zig-zag encoding of an Int32.

int64ZigZagBase128LE :: BoundedEncoding Int64Source

Variable-length, little-endian, base-128, zig-zag encoding of an Int64.

intZigZagBase128LE :: BoundedEncoding IntSource

Variable-length, little-endian, base-128, zig-zag encoding of an Int.

Note that in contrast to the fixed-width binary encoding of an Int, whose width depends on the register-width of a machine, this encoding is machine-independent for values small enough to be represented using an Int on all relevant machines.

Character encodings

ASCII

char7 :: FixedEncoding CharSource

Encode the least 7-bits of a Char using the ASCII encoding.

Decimal numbers

Decimal encoding of numbers using ASCII encoded characters.

int8Dec :: BoundedEncoding Int8Source

Decimal encoding of an Int8.

int16Dec :: BoundedEncoding Int16Source

Decimal encoding of an Int16.

int32Dec :: BoundedEncoding Int32Source

Decimal encoding of an Int32.

int64Dec :: BoundedEncoding Int64Source

Decimal encoding of an Int64.

intDec :: BoundedEncoding IntSource

Decimal encoding of an Int.

word8Dec :: BoundedEncoding Word8Source

Decimal encoding of a Word8.

word16Dec :: BoundedEncoding Word16Source

Decimal encoding of a Word16.

word32Dec :: BoundedEncoding Word32Source

Decimal encoding of a Word32.

word64Dec :: BoundedEncoding Word64Source

Decimal encoding of a Word64.

wordDec :: BoundedEncoding WordSource

Decimal encoding of a Word.

Hexadecimal numbers

Encoding positive integers as hexadecimal numbers using lower-case ASCII characters. The shortest possible representation is used. For example,

 showB word16Hex 0x0a10 = "a10"

Note that there is no support for using upper-case characters. Please contact the maintainer if your application cannot work without hexadecimal encodings that use upper-case characters.

word8Hex :: BoundedEncoding Word8Source

Hexadecimal encoding of a Word8.

word16Hex :: BoundedEncoding Word16Source

Hexadecimal encoding of a Word16.

word32Hex :: BoundedEncoding Word32Source

Hexadecimal encoding of a Word32.

word64Hex :: BoundedEncoding Word64Source

Hexadecimal encoding of a Word64.

wordHex :: BoundedEncoding WordSource

Hexadecimal encoding of a Word.

Fixed-width hexadecimal numbers

Encoding the bytes of fixed-width types as hexadecimal numbers using lower-case ASCII characters. For example,

 showF word16HexFixed 0x0a10 = "0a10"

int8HexFixed :: FixedEncoding Int8Source

Encode a Int8 using 2 nibbles (hexadecimal digits).

int16HexFixed :: FixedEncoding Int16Source

Encode a Int16 using 4 nibbles.

int32HexFixed :: FixedEncoding Int32Source

Encode a Int32 using 8 nibbles.

int64HexFixed :: FixedEncoding Int64Source

Encode a Int64 using 16 nibbles.

word8HexFixed :: FixedEncoding Word8Source

Encode a Word8 using 2 nibbles (hexadecimal digits).

word16HexFixed :: FixedEncoding Word16Source

Encode a Word16 using 4 nibbles.

word32HexFixed :: FixedEncoding Word32Source

Encode a Word32 using 8 nibbles.

word64HexFixed :: FixedEncoding Word64Source

Encode a Word64 using 16 nibbles.

floatHexFixed :: FixedEncoding FloatSource

Encode an IEEE Float using 8 nibbles.

doubleHexFixed :: FixedEncoding DoubleSource

Encode an IEEE Double using 16 nibbles.

ISO/IEC 8859-1 (Char8)

The ISO/IEC 8859-1 encoding is an 8-bit encoding often known as Latin-1. The Char8 encoding implemented here works by truncating the Unicode codepoint to 8-bits and encoding them as a single byte. For the codepoints 0-255 this corresponds to the ISO/IEC 8859-1 encoding. Note that the Char8 encoding is equivalent to the ASCII encoding on the Unicode codepoints 0-127. Hence, functions such as intDec can also be used for encoding Ints as a decimal number with Char8 encoded characters.

char8 :: FixedEncoding CharSource

Char8 encode a Char.

UTF-8

The UTF-8 encoding can encode all Unicode codepoints. It is equivalent to the ASCII encoding on the Unicode codepoints 0-127. Hence, functions such as intDec can also be used for encoding Ints as a decimal number with UTF-8 encoded characters.

charUtf8 :: BoundedEncoding CharSource

UTF-8 encode a Char.

Testing support

The following four functions are intended for testing use only. They are not efficient. FixedEncodings and BoundedEncodings are efficently executed by creating Builders from them using the encodeXXX functions explained at the top of this module.

evalF :: FixedEncoding a -> a -> [Word8]Source

For testing use only. Evaluate a FixedEncoding on a given value.

evalB :: BoundedEncoding a -> a -> [Word8]Source

For testing use only. Evaluate a BoundedEncoding on a given value.

showF :: FixedEncoding a -> a -> StringSource

For testing use only. Show the result of a FixedEncoding of a given value as a String by interpreting the resulting bytes as Unicode codepoints.

showB :: BoundedEncoding a -> a -> StringSource

For testing use only. Show the result of a BoundedEncoding of a given value as a String by interpreting the resulting bytes as Unicode codepoints.