Python – print(df) Only Showing First Row

I am a beginner in Python. This seems like something that would have been asked before, but I have been searching for an answer for three days and can’t find one.

I created a DataFrame with pandas after running pytesseract on an image. Everything is fine except for one ‘minor’ thing: when I print the DataFrame and ‘Date’ is the first column added, only the first row is shown:

df['Date'] = pd.Series(date_date)
df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)

df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)


print(df)

        Date   In     Out
0 2022-05-31  0.0  7700.0

If I change the column order and put the ‘Date’ column in any other position, it comes out fine:

df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)
df['Date'] = pd.Series(date_date)

df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)


print(df)

   In       Out       Date
0  0.0    7700.0 2022-05-31
1  0.0    4232.0 2022-05-31
2  0.0   16056.0 2022-05-31
3  0.0   80000.0 2022-05-31
4  0.0   40000.0 2022-05-31
5  0.0  105805.0 2022-05-31
6  0.0  185500.0 2022-05-31
7  0.0   52188.0 2022-05-31

Can anyone explain why this is happening and how to fix it? I would like Date to remain the first column, but of course I want all the rows!

Thank you in advance.

Here is the complete code if that helps:

import cv2
import pytesseract
import pandas as pd
from datetime import datetime

pytesseract.pytesseract.tesseract_cmd=r'C:\Program Files\Tesseract-OCR\tesseract.exe'

img = cv2.imread("C:\\Users\\Fast Computer\\Documents\\Python test\\Images\\page-0.png")
thresh = 255

#Coordinates and ROI for Amount Out
x3,y3,w3,h3 = 577, 495, 172, 815
ROI_3 = img[y3:y3+h3,x3:x3+w3]

#Coordinates and ROI for Amount In
x4,y4,w4,h4 = 754, 495, 175, 815
ROI_4 = img[y4:y4+h4,x4:x4+w4]

#Coordinates and ROI for Date
x5,y5,w5,h5 = 833, 174, 80, 22
ROI_5 = img[y5:y5+h5,x5:x5+w5]


#OCR and convert to strings
text_amount_out = pytesseract.image_to_string(ROI_3)
text_amount_in = pytesseract.image_to_string(ROI_4)
text_date = pytesseract.image_to_string(ROI_5)

text_amount_out = text_amount_out.replace(',', '')
text_amount_in = text_amount_in.replace(',', '')

cv2.waitKey(0)
cv2.destroyAllWindows()

#Convert Strings to Lists
list_amount_out = text_amount_out.split()
list_amount_in = text_amount_in.split()
list_date = text_date.split()

float_out = []
for item in list_amount_out:
    float_out.append(float(item))

float_in = []
for item in list_amount_in:
    float_in.append(float(item))
    
date_date = datetime.strptime(text_date, '%d/%m/%Y ')


#Creating columns
df = pd.DataFrame()
df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)
df['Date'] = pd.Series(date_date)

df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)


print(df)

Solution:

The problem lies in how you initialize the empty pd.DataFrame() and then add columns to it: the first column you assign fixes the index, and therefore the number of rows.

import pandas as pd
from datetime import datetime

float_in = [0.0,0.5,1.0]
float_out = [0.0,0.5,1.0,1.5]

# this line just gives you 1 value:
date_date = datetime.strptime('01/01/2022 ', '%d/%m/%Y ')
# date_date = datetime.strptime(text_date, '%d/%m/%Y ')

# creates an empty df
df = pd.DataFrame()

print(df.shape)
# (0, 0)

Now, when the first column you add to the empty df is a Series that contains only date_date, the df takes over that Series' one-row index:

df['Date'] = pd.Series(date_date) # 1 row

print(df.shape)
# (1, 1)

print(df)
#         Date
# 0 2022-01-01

Adding any other (longer) pd.Series() to this will not add rows to the df. Column assignment aligns the new Series with the existing index of the df (here just 0), so only the value at index 0 is kept and the rest is dropped:

df['In'] = pd.Series(float_in)

print(df)
#         Date   In
# 0 2022-01-01  0.0
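
To see that alignment directly, here is a small sketch that compares assigning a longer Series to a column with reindexing it by hand. The names `longer` and `aligned` are made up for illustration and are not part of the original code; assuming pandas aligns on the index as described above, both routes keep the same single value:

```python
import pandas as pd

# A one-row df, like the one created by adding 'Date' first
df = pd.DataFrame({'Date': [pd.Timestamp('2022-01-01')]})  # index: [0]

# Hypothetical longer Series, standing in for pd.Series(float_in)
longer = pd.Series([0.0, 0.5, 1.0])  # index: [0, 1, 2]

# Reindexing by hand keeps only the labels that exist in df.index
aligned = longer.reindex(df.index)

# Column assignment aligns on df.index as well
df['In'] = longer

print(aligned.tolist())   # [0.0]
print(df['In'].tolist())  # [0.0]
print(len(df))            # 1
```

The values at index 1 and 2 have no matching row in the df, so they are dropped without any warning.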

One way to avoid this is to initialize your df with an index that spans the length of your longest list:

max_length = max(map(len, [float_in, float_out])) # 4

df = pd.DataFrame(index=range(max_length))

print(df.shape)
# (4, 0), so now we start with 4 rows

df['Date'] = pd.Series(date_date)

print(df)
#         Date
# 0 2022-01-01
# 1        NaT
# 2        NaT
# 3        NaT

df['In'] = pd.Series(float_in)
df['Out'] = pd.Series(float_out)

df['Date'] = df['Date'].fillna(date_date)
df['Out'] = df['Out'].fillna(0)
df['In'] = df['In'].fillna(0)

print(df)

#         Date   In  Out
# 0 2022-01-01  0.0  0.0
# 1 2022-01-01  0.5  0.5
# 2 2022-01-01  1.0  1.0
# 3 2022-01-01  0.0  1.5
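
Another option, offered only as a sketch and not taken from the original answer, is to build the numeric columns in one step from a dict of Series and then insert the single date as a scalar in the first position. This assumes, as in the question, that date_date is one value that applies to every row; the sample lists are the same made-up ones used above:

```python
import pandas as pd
from datetime import datetime

# Sample data standing in for the OCR results
float_in = [0.0, 0.5, 1.0]
float_out = [0.0, 0.5, 1.0, 1.5]
date_date = datetime.strptime('01/01/2022', '%d/%m/%Y')

# A dict of Series is aligned on the union of their indexes,
# so the df gets as many rows as the longest list; gaps become NaN
df = pd.DataFrame({'In': pd.Series(float_in),
                   'Out': pd.Series(float_out)}).fillna(0)

# A scalar is broadcast to every row; position 0 makes 'Date' the first column
df.insert(0, 'Date', date_date)

print(df.shape)  # (4, 3)
print(df)
```

Because the date is added last as a scalar, there is no need to fill the Date column afterwards, and it still ends up as the first column.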