Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Get data from specific sheets in google sheet and extract specific columns from it using gspread

I am trying to get specific column data out of specific sheets in one big google sheet.
For example, I have a list of sheets

Sheets = ['Sheet 1', 'Sheet 2', 'Sheet 3']

and from these sheets, I want to retrieve specific columns like

Column_headers = ['A', 'B']

What I am doing right now is getting the data from

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

import gspread
from gspread_dataframe import set_with_dataframe
import pandas as pd
pd.set_option("display.max_columns", None)
pd.set_option('display.max_rows', None)

sa = gspread.service_account(filename='file.json')
book = sa.open("book")
Sheets = ['Sheet 1', 'Sheet 2', 'Sheet 3']
Column_headers = ['A', 'B']

for i in Sheets:
    2022_sheet = book.worksheet(i)
    records = 2022_sheet.get_all_records()
    data_2022 = zip(*(e for e in zip(*record) if e[0] in Column_headers))
    getdata_2022 = pd.DataFrame(data_2022, columns = Column_headers)
    print(getdata_2022)

I am getting the following error

GSpreadException: the given 'expected_headers' are not uniques

that is because the headers are not unique (obv) that’s why I am retrieving specific columns, also I can’t understand the bit where I loop through the ‘Sheets’ to get the data only from specific sheets. Eventually the end result should be two columns ‘A’ and ‘B’ with all the data from the specific 3 sheets.

>Solution :

IIUC try doing the following:

import gspread
import gspread_dataframe as gd
import pandas as pd

# connect to the service account
sa = gspread.service_account(filename='file.json')

# open the Google Sheet
book = sa.open('book')

# The list of sheet names and columns that you want
Sheets = ['Sheet 1', 'Sheet 2', 'Sheet 3']
Column_headers = ['A', 'B']

# list comprehension using gd.get_as_dataframe 
# to create a list of DataFrames for each sheet
# then you concat all frames together into one DataFrame
df = pd.concat([gd.get_as_dataframe(book.worksheet(sheet))[Column_headers]
                for sheet in Sheets])

You may need to filter out NaNs if you have empty rows in your Google Sheet. Something like:

df = df[~df.isna().all(axis=1)]

update

replace df = pd.concat([gd.get_as_dataframe(book.worksheet(sheet))[Column_headers] for sheet in Sheets]) with the following

dfs = []
for sheet in Sheets:
    df = gd.get_as_dataframe(book.worksheet(sheet))
    try:
        df = df[Column_headers]
    except KeyError:
        df = df[['A']]
        df['B'] = np.nan
    # the commented out code below will add the sheet name to each frame
    # df['Sheet_Name'] = sheet
    # you can also remove all nan rows here if you need
    # df = df[~df.isna().all(axis=1)]
    dfs.append(df)
    
final_df = pd.concat(dfs)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading