How to replace a for loop with all() in a pandas DataFrame?

I have a university activity that makes the following dataframe available:

      import pandas as pd
      from fuzzywuzzy import fuzz
      from fuzzywuzzy import process

      df1 = pd.DataFrame({'id': [0,1,2,3],
                          'name_city': ['RIO DE JANEIRO', 'SAO PAULO', 
                                        'ITU', 'CURITIBA'],
                          'city':['RIO JANEIRO', 'SAOO PAULO', 
                                  'FLORIANOPOLIS', 'BELO HORIZONTE']})

      print(df1)

            id  name_city             city
            0   RIO DE JANEIRO     RIO JANEIRO
            1   SAO PAULO          SAOO PAULO
            2   ITU                FLORIANOPOLIS
            3   CURITIBA           BELO HORIZONTE

I need to assess the similarity between the city names. So I’m using the fuzzywuzzy library, and I wrote the following code:

      df1['value_fuzzy'] = 0  # create the column before filling it row by row
      for i in range(len(df1)):
         df1.loc[i, 'value_fuzzy'] = fuzz.ratio(df1.loc[i, 'name_city'], df1.loc[i, 'city'])

This code works, and the output is as desired:

      print(df1)

           id   name_city           city               value_fuzzy
           0    RIO DE JANEIRO     RIO JANEIRO             88
           1    SAO PAULO          SAOO PAULO              95
           2    ITU                FLORIANOPOLIS           12
           3    CURITIBA           BELO HORIZONTE          27

However, I would like to replace the for loop. So I tried to use all(), which I found in this example: Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

        df1['value_fuzzy'] = fuzz.ratio(df1['name_city'].all(), df1['city'].all())

But this code with all() returns the wrong output:

       id   name_city        city          value_fuzzy
       0    RIO DE JANEIRO  RIO JANEIRO     27
       1    SAO PAULO       SAOO PAULO      27
       2    ITU             FLORIANOPOLIS   27
       3    CURITIBA        BELO HORIZONTE  27

Is there another way to generate the desired output without using for()?

>Solution :

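You can’t use fuzz.ratio this way directly because the function is not vectorized: it compares two strings, not two columns. all() reduces each column to a single value, so fuzz.ratio runs only once and that one score is assigned to every row.
You need to pass it to apply: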

df1['value_fuzzy'] = df1.apply(lambda r: fuzz.ratio(r['name_city'], r['city']), axis=1)

NB. apply still calls fuzz.ratio once per row, so it is no more efficient than the for loop. Efficiency only improves if the function itself is rewritten to handle vectorized input.

output:

   id       name_city            city  value_fuzzy
0   0  RIO DE JANEIRO     RIO JANEIRO           88
1   1       SAO PAULO      SAOO PAULO           95
2   2             ITU   FLORIANOPOLIS           12
3   3        CURITIBA  BELO HORIZONTE           27
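
A list comprehension over zip is another common way to drop the explicit index loop. The sketch below assumes the same df1 as in the question. The name `score` and the difflib fallback are assumptions added here so the example still runs where fuzzywuzzy is not installed; they are not part of the original answer.

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    score = fuzz.ratio
except ImportError:
    # Fallback for environments without fuzzywuzzy (an assumption of this sketch):
    # a 0-100 similarity score built on the standard library.
    from difflib import SequenceMatcher

    def score(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))

df1 = pd.DataFrame({'id': [0, 1, 2, 3],
                    'name_city': ['RIO DE JANEIRO', 'SAO PAULO', 'ITU', 'CURITIBA'],
                    'city': ['RIO JANEIRO', 'SAOO PAULO', 'FLORIANOPOLIS', 'BELO HORIZONTE']})

# Walk both columns in parallel; score() is still called once per row.
df1['value_fuzzy'] = [score(a, b) for a, b in zip(df1['name_city'], df1['city'])]

print(df1)
```

This usually has less overhead than apply with axis=1, since no row Series is built for each call, but it is still a Python-level loop and not a vectorized operation.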