Convert strings with an unknown number of hex strings embedded in them to strings using regex

January 27, 2022

So I have a list of strings (content from Snort rules), and I am trying to convert the hex portions of them to UTF-8/ASCII, so I can send the content over netcat.

The method I have now works fine for strings with a single hex byte (e.g. 3A), but breaks when there’s a series of hex bytes (e.g. 3A 4B 00 FF).

My current solution is:

import re
import codecs

def convert_hex(match):
  string = match.group(1)
  string = string.replace(" ", "")
  decode_hex = codecs.getdecoder("hex_codec")
  try:
    result = decode_hex(string)[0]
  except:
    result = bytes.fromhex((lambda s: ("%s%s00" * (len(s)//2)) % tuple(s))(string)).decode('utf-16-le')
  return result.decode("utf-8")


strings = ['|0A|Referer|3A| res|3A|/C|3A|', 'RemoteNC Control Password|3A|', '/bbs/search.asp', 'User-Agent|3A| Mozilla/4.0 |28|compatible|3B| MSIE 5.0|3B| Windows NT 5.0|29|']

converted_strings = []

for string in strings:
    for i in range(len(string)):
        string = re.sub(r"\|(.{2})\|", convert_hex, string)
    converted_strings.append(string)

For the strings in strings, this works, but for a string like:

|08 00 00 00 27 C7 CC 6B C2 FD 13 0E|

it breaks.

I tried changing the regex to:

re.sub(r"\|.*([A-Fa-f0-9]{2}).*\|")

but that only converts the last hex pair.

I need this solution to work for strings like Hello|3A|World, |3A 00 FF|, and Hello|3A 00|World

I know it’s an issue with the regexp, but I’m not sure what exactly.

Any help would be much appreciated.

Solution:

It looks like the text between two | symbols is either entirely hex, i.e. it matches (?:[A-Fa-f0-9]{2}\s)+[A-Fa-f0-9]{2} or a single hex pair, or not hex at all. Is that right?
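To see why the two patterns from the question fall short, here is a small sketch. The sample string is made up for illustration; only the two patterns come from the question.

```python
import re

sample = "|3A 4B 00 41|"

# Original pattern: exactly two characters between the pipes, so a run of
# several hex pairs never matches at all.
first_attempt = re.search(r"\|(.{2})\|", sample)
print(first_attempt)  # None

# Second attempt: the greedy ".*" swallows as much as it can, so the
# capturing group is left with only the final pair before the closing pipe.
second_attempt = re.search(r"\|.*([A-Fa-f0-9]{2}).*\|", sample)
last_pair = second_attempt.group(1)
print(last_pair)  # 41
```

In both cases the capturing group can hold at most one pair, so the fix is to let the group itself cover the whole run of pairs.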

A pattern whose group captures the whole run of hex pairs works:

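# One re.sub call already replaces every match, so no inner loop over the
# characters is needed. Re-running it could also re-decode output that
# happens to look like hex (e.g. |41 42| becomes |AB|).
for string in strings:
    string = re.sub(r"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)", convert_hex, string)
    converted_strings.append(string)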

(The extra parentheses create capturing group 1. You could leave out that pair of parentheses and change your function to act on group(0) instead.)
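A minimal sketch of that group(0) variant, assuming the hex run is valid UTF-8. The names HEX_RUN and decode_run are made up for this example, and bytes.fromhex stands in for the codecs-based decoding in convert_hex:

```python
import re

# Same pattern as above, minus the outer capturing parentheses.
HEX_RUN = re.compile(r"(?<=\|)(?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2}(?=\|)")

def decode_run(match):
    # group(0) is the whole match; bytes.fromhex skips the spaces between pairs.
    return bytes.fromhex(match.group(0)).decode("utf-8")

converted = HEX_RUN.sub(decode_run, "Hello|3A|World")
print(converted)  # Hello|:|World
```

Because the pipes are only checked by the lookbehind and lookahead, they are not part of the match and stay in the output.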

However, it still fails on your example |08 00 00 00 27 C7 CC 6B C2 FD 13 0E|, because those bytes are not a valid UTF-8 sequence. The resulting error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 5: invalid continuation byte
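The failure can be reproduced without any regex. The sketch below also shows one possible workaround if a str is really required: decoding as Latin-1, which is my own suggestion rather than something from the question, and which only makes sense if the text is encoded as Latin-1 again before it is sent.

```python
raw = bytes.fromhex("08 00 00 00 27 C7 CC 6B C2 FD 13 0E")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xC7 announces a two-byte sequence, but 0xCC is not a continuation byte.
    utf8_ok = False

# Latin-1 maps every byte 0-255 to one code point, so decoding never fails
# and encoding the text again gives back the original bytes.
as_text = raw.decode("latin-1")
round_trip = as_text.encode("latin-1")
print(utf8_ok, round_trip == raw)  # False True
```

The Latin-1 text is not meaningful to read; it is only a lossless way to carry arbitrary bytes inside a str.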

However, a valid UTF-8 encoded multi-byte string like '|74 65 73 74 20 f0 9f 98 80|' works just fine:

import re
import codecs

def convert_hex(match):
  string = match.group(1)
  string = string.replace(" ", "")
  decode_hex = codecs.getdecoder("hex_codec")
  try:
    result = decode_hex(string)[0]
  except:
    result = bytes.fromhex((lambda s: ("%s%s00" * (len(s)//2)) % tuple(s))(string)).decode('utf-16-le')
  return result.decode("utf-8")


strings = ['|74 65 73 74 20 f0 9f 98 80|']

converted_strings = []

for string in strings:
    # A single re.sub call replaces every match in the string.
    string = re.sub(r"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)", convert_hex, string)
    converted_strings.append(string)

print(converted_strings)

Result:

['|test 😀|']

If you don’t really need a printable representation of the data, your function could simply return the bytes object, and you would apply it only to the matching parts instead of constructing a new string.

Based on what @Selcuk was saying, perhaps a result with byte-strings makes more sense – this works on all three types of input:

import re
import codecs

def convert_hex(match):
  string = match.group(1)
  string = string.replace(b" ", b"")
  decode_hex = codecs.getdecoder("hex_codec")
  try:
    result = decode_hex(string)[0]
  except:
    result = bytes.fromhex((lambda s: ("%s%s00" * (len(s)//2)) % tuple(s))(string)).decode('utf-16-le')
  return result


strings = ['|0A|Referer|3A| res|3A|/C|3A|', '|74 65 73 74 20 f0 9f 98 80|', '|08 00 00 00 27 C7 CC 6B C2 FD 13 0E|']

converted_strings = []

for string in strings:
    string = re.sub(rb"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)", convert_hex, string.encode())
    converted_strings.append(string)

print(converted_strings)

Result:

[b'|\n|Referer|:| res|:|/C|:|', b'|test \xf0\x9f\x98\x80|', b"|\x08\x00\x00\x00'\xc7\xcck\xc2\xfd\x13\x0e|"]

There are no decoding errors, because the bytes are never decoded into text. (Note that I changed convert_hex as little as possible. The except branch still does some encoding juggling that you may want to revisit; I only made the main path work with bytes.)
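If you do revisit convert_hex, a simpler variant might look like the sketch below. It rests on two assumptions of mine: that the | symbols are only delimiters and should not be sent, and that bytes.fromhex is an acceptable replacement for the codecs decoder. The names HEX_BLOCK and content_to_bytes are hypothetical.

```python
import re

# One or more hex pairs, optionally separated by whitespace, between pipes.
# Unlike the lookaround version, the pipes are part of the match, so they
# are dropped from the output (assumption: they are only delimiters).
HEX_BLOCK = re.compile(rb"\|\s*((?:[0-9A-Fa-f]{2}\s*)+)\|")

def content_to_bytes(content):
    """Hypothetical helper: turn Snort-style content into raw bytes."""
    def decode_block(match):
        # bytes.fromhex needs a str and ignores the whitespace between pairs.
        return bytes.fromhex(match.group(1).decode("ascii"))
    return HEX_BLOCK.sub(decode_block, content.encode("utf-8"))

for item in ["Hello|3A|World", "|3A 00 FF|", "Hello|3A 00|World"]:
    print(content_to_bytes(item))
```

Since the result is a bytes object, it can be written to a socket or piped to netcat as is, with no decode step that could fail.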