Serializing a Data.Text value to a ByteString without unnecessary \NUL bytes

With the following code, I want to serialize a Data.Text value to a ByteString.
Unfortunately my text is prepended with unnecessary NUL bytes and an EOT byte:

GHCi, version 9.4.4: https://www.haskell.org/ghc/  :? for help
ghci> import qualified Data.Text as T
ghci> import Data.Binary
ghci> import Data.Binary.Put
ghci> let txt = T.pack "Text"
ghci> runPut $ put txt
"\NUL\NUL\NUL\NUL\NUL\NUL\NUL\EOTText"
ghci>

Questions:

  • Why are these NUL and EOT bytes generated?
  • How can I avoid them in the resulting ByteString?

PS: In the real code I put the length in front of the text myself:

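    import qualified Data.ByteString.Lazy as BL

    foo :: T.Text -> BL.ByteString
    foo txt = runPut $ do
        -- T.length returns an Int; putWord32host expects a Word32
        putWord32host (fromIntegral (T.length txt))
        put txt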

Solution:

The Binary instance for Text already encodes the length in the output. If we look at the source code of that instance, we see [src]:

instance Binary Text where
    put t = put (encodeUtf8 t)
    get   = do
      bs <- get
      case decodeUtf8' bs of
        P.Left exn -> P.fail (P.show exn)
        P.Right a -> P.return a

That’s not much of a surprise: the text is encoded to UTF-8, which produces a ByteString, and put is then called on that ByteString. The length is added when we put the ByteString itself. Indeed, the ByteString instance of Binary looks like [src]:

instance Binary B.ByteString where
    put bs = put (B.length bs)
             <> putByteString bs
    get    = get >>= getByteString

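The put for the ByteString produced by encodeUtf8 thus first writes the length as an Int, which the binary package serializes as eight big-endian bytes. For "Text" that is seven \NUL bytes followed by \EOT (byte value 4). Note that this is the number of bytes in the UTF-8 encoding, which is not necessarily the same as the number of characters in the Text.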
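To make the difference between byte count and character count concrete, here is a small sketch. The names sample, charCount, utf8ByteCount and encodedBytes are made up for illustration, and the sample string is an assumption chosen because it contains one non-ASCII character:

```haskell
import Data.Binary (encode)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)
import Data.Word (Word8)

-- Hypothetical sample value: "café", where 'é' (U+00E9) needs two bytes in UTF-8.
sample :: T.Text
sample = T.pack "caf\233"

-- Number of characters in the Text.
charCount :: Int
charCount = T.length sample

-- Number of bytes after UTF-8 encoding; this is what the length prefix stores.
utf8ByteCount :: Int
utf8ByteCount = B.length (encodeUtf8 sample)

-- Default Binary encoding of the Text: an 8-byte big-endian length
-- followed by the UTF-8 payload.
encodedBytes :: [Word8]
encodedBytes = BL.unpack (encode sample)
```

Here charCount is 4 while utf8ByteCount is 5, and the first eight entries of encodedBytes hold the value 5, not 4.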

If you want the same encoding but without the length prefix, you can use:

import Data.Text.Encoding

runPut (putByteString (encodeUtf8 txt))

This omits the length header.
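If you still want a length prefix of your own, as in the question's real code, one possible approach is to write the UTF-8 byte count yourself and then the raw bytes. The following is only a sketch: putTextW32, getTextW32, serialize and deserialize are hypothetical names, and the choice of a 4-byte big-endian prefix is an assumption (the question uses putWord32host, which follows the machine's byte order instead):

```haskell
import Data.Binary.Get (Get, getByteString, getWord32be, runGet)
import Data.Binary.Put (Put, putByteString, putWord32be, runPut)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)

-- Hypothetical helper: 4-byte big-endian byte count, then the UTF-8 payload.
-- fromIntegral silently wraps for payloads of 4 GiB or more.
putTextW32 :: T.Text -> Put
putTextW32 txt = do
  let bytes = encodeUtf8 txt
  putWord32be (fromIntegral (B.length bytes))
  putByteString bytes

-- Matching reader: read the byte count, then that many bytes, then decode.
getTextW32 :: Get T.Text
getTextW32 = do
  n <- getWord32be
  bytes <- getByteString (fromIntegral n)
  either (fail . show) pure (decodeUtf8' bytes)

serialize :: T.Text -> BL.ByteString
serialize = runPut . putTextW32

-- runGet raises an error on malformed input; runGetOrFail is the total variant.
deserialize :: BL.ByteString -> T.Text
deserialize = runGet getTextW32
```

The prefix counts bytes of the encoded text rather than characters, so the reader knows exactly how many bytes to consume even when the text contains non-ASCII characters.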
