Why does the size of any HTML page become 6 bytes greater after saving it to a file?

March 3, 2024

I use Python’s requests module to download HTML pages. For each URL I execute response = requests.get(URL), so the result of the GET request is stored in the response variable. To find the number of bytes in the downloaded HTML page, I call len(response.text).

My idea is to save a page to the hard drive only if no file with the same name exists there, or if one exists but its size is different. If the file exists, I get its size with Path(filepath).stat().st_size.

This is where the problem arises. For every downloaded page, the file is 6 bytes larger than the value len() returns for the text attribute of the response object. If len() returns 7282, st_size is 7288; if len() returns 7216, st_size is 7222, and so on. I don’t understand the reason for this behavior. I could add 6 to the result of len() before comparing the sizes, and I guess it would work, but then I wouldn’t know the actual cause. It seems like a hack.

I also tried downloading the page with the curl command, and the result is the same: the magical 6 bytes are added. I’ve checked 10 different pages, and the difference is always 6 bytes.

>Solution :

The discrepancy you’re observing most likely comes from comparing two different units. Python’s len() function, applied to a string, counts the number of characters in it, while the file system reports the number of bytes used to store the file’s contents.

In Python, a string is a sequence of Unicode characters. When you call len() on a string, it returns the number of characters in the string, not the number of bytes.

On the other hand, when you save a string to a file, it’s stored as a sequence of bytes. Depending on the encoding used, a single character can take up more than one byte. In UTF-8, for example, ASCII characters take 1 byte each, while all other characters take 2 to 4 bytes.
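To make the difference between characters and bytes concrete, here is a minimal sketch. The helper name extra_bytes and the sample string are made up for illustration; the sample simply shows one way a gap of exactly 6 bytes can arise, and is not taken from the pages in the question.

```python
def extra_bytes(text: str, encoding: str = "utf-8") -> int:
    """Return how many more bytes than characters the text needs
    when encoded (hypothetical helper, for illustration only)."""
    return len(text.encode(encoding)) - len(text)


# Three typographic quote characters, each 3 bytes long in UTF-8,
# plus three ASCII letters, each 1 byte long.
sample = "\u201cIt\u2019s\u201d"

print(len(sample))                  # number of characters: 6
print(len(sample.encode("utf-8")))  # number of bytes: 12
print(extra_bytes(sample))          # difference: 6
```

A string made only of ASCII characters gives a difference of 0, which is why the gap does not grow with the size of the page.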

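When you check the file size with Path(filepath).stat().st_size, you get the size of the file’s contents in bytes. It does not include file-system metadata such as the file name or timestamps, so metadata cannot explain the gap.

A consistent difference of 6 bytes more likely means that every page you checked contains the same few non-ASCII characters, for example in a header or footer shared by all pages of the site. For instance, three characters that take 3 bytes each in UTF-8 (such as typographic quotes), or six characters that take 2 bytes each, would add exactly 6 bytes. The fact that curl, which saves the raw bytes, produces the same file size suggests that the extra bytes are part of the response itself and are not added when the file is saved.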

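To compare like with like, measure the page in bytes before comparing it with the file size. You could encode the string to bytes before calling len(), like this: len(response.text.encode('utf-8')). This matches the file size only if the page really is UTF-8 and requests decoded it as such. A simpler and more robust option is len(response.content), because response.content holds the raw bytes exactly as they were received, which is also what curl writes to disk.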
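The size check described in the question could then be sketched as follows. This assumes the page is written to disk in binary mode from response.content, so that the file size and len() count the same bytes. The function names needs_saving and save_if_changed are hypothetical, not part of requests or pathlib.

```python
from pathlib import Path


def needs_saving(filepath, content: bytes) -> bool:
    """Return True if the file is missing or its size differs from the content."""
    path = Path(filepath)
    if not path.exists():
        return True
    # Both sides are byte counts, so they are directly comparable.
    return path.stat().st_size != len(content)


def save_if_changed(filepath, content: bytes) -> bool:
    """Write the content in binary mode when needed; return True if written."""
    if needs_saving(filepath, content):
        Path(filepath).write_bytes(content)
        return True
    return False


# Intended use with requests (URL and filepath are placeholders):
#   response = requests.get(URL)
#   save_if_changed(filepath, response.content)
```

Keep in mind that equal sizes do not guarantee equal contents. If that matters, comparing a hash of the bytes would be a stricter check than comparing lengths.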