Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Split strings in a Series, convert to array and average the values

I have a Pandas Series that has these unique values:

array(['17', '19', '21', '20', '22', '23', '12', '13', '15', '24', '25',
       '18', '16', '14', '26', '11', '10', '12/16', '27', '10/14',
       '16/22', '16/21', '13/17', '14/19', '11/15', '10/15', '15/21',
       '13/19', '13/18', '32', '28', '12/15', '29', '42', '30', '31',
       '34', '46', '11/14', '18/25', '19/26', '17/24', '19/24', '17/23',
       '13/16', '11/16', '15/20', '36', '17/25', '19/25', '17/22',
       '18/26', '39', '41', '35', '50', '9/13', '33', '10/13', '9/12',
       '93/37', '14/20', '10/16', '14/18', '16/23', '37', '9/11', '37/94',
       '20/54', '22/31', '22/30', '23/33', '44', '40', '50/95', '38',
       '16/24', '15/23', '15/22', '18/23', '16/20', '37/98', '19/27',
       '38/88', '23/31', '14/22', '45', '39/117', '28/76', '33/82',
       '15/19', '23/30', '47', '46/115', '14/21', '17/18', '25/50',
       '12/18', '12/17', '21/28', '20/27', '26/58', '22/67', '22/47',
       '25/51', '35/83', '39/86', '31/72', '24/56', '30/80', '32/85',
       '42/106', '40/99', '30/51', '21/43', '52', '56', '25/53', '34/83',
       '30/71', '27/64', '35/111', '26/62', '32/84', '39/95', '18/24',
       '22/29', '42/97', '48', '55', '58', '39/99', '49', '43', '40/103',
       '22/46', '54/133', '25/54', '36/83', '29/72', '28/67', '35/109',
       '25/62', '14/17', '42/110', '52/119', '20/60', '46/105', '25/56',
       '27/65', '25/74', '21/49', '29/71', '26/59', '27/62'], dtype=object)

The ones that have the ‘/’, I want to split these into arrays and then average their values. One simpler but a flawed approach is to simply extract the first value:
master_data["Cmb MPG"].str.split('/').str[0].astype('int8')

However, what I truly require is the two values being averaged.
I have tried several commands and this one:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

np.array(master_data["Cmb MPG"].str.split('/')).astype('int8').mean()

Should ideally do the job, but I get a ValueError followed by a TypeError:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[88], line 1
----> 1 np.array(master_data["Cmb MPG"].str.split('/')).astype('int8')

ValueError: setting an array element with a sequence.

The slice() method returns a Series but it won’t proceed either with the splitting of strings.

What is required is:

'18/25' ---> [18, 25] ---> 22 (rounded)

>Solution :

I would use extractall and groupby.mean:

s = pd.Series(['10', '12/16', '27', '10/14', '16/22', '16/21', '13/17'])

out = (s.str.extractall(r'(\d+)')[0].astype(int).groupby(level=0).mean()
        .round().astype(int)
       )

You could also go with split and mean, but this generates a more expensive intermediate and will not scale as well if you have many items (1/2/3/4/5):

out = (s.str.split('/', expand=True).astype(float).mean(axis=1)
        .round().astype(int)
      )

Output:

0    10
1    14
2    27
3    12
4    19
5    18
6    15
dtype: int64
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading