Easy way to do group by with multiple output in pandas

I am a long-time SAS/SQL user and have always defaulted to using SQL for my group-bys, for example: select region, case when age < 5 then 'Low' when age >= 5 and age <= 10 then 'Middle' else 'High' end as duration, sum(1) as total, sum(profit) as profit, sum(profit)/sum(1) as avg_profit, max(revenue) as…
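
A rough pandas equivalent of the query in the excerpt might look like the sketch below. The sample data is invented for illustration, and because the excerpt is cut off before the alias for max(revenue), the name `max_revenue` is only a placeholder. It uses named aggregation, which is one of several ways to get multiple outputs from a single group-by.

```python
import pandas as pd

# Invented sample data; the column names follow the SQL in the excerpt.
df = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "age": [3, 7, 12, 4, 4],
    "profit": [10.0, 20.0, 30.0, 40.0, 60.0],
    "revenue": [100.0, 150.0, 300.0, 250.0, 400.0],
})


def duration_band(age):
    """Mirror the CASE WHEN expression from the SQL."""
    if age < 5:
        return "Low"
    if age <= 10:
        return "Middle"
    return "High"


summary = (
    df.assign(duration=df["age"].map(duration_band))
    .groupby(["region", "duration"], as_index=False)
    .agg(
        total=("profit", "size"),         # sum(1): number of rows per group
        profit=("profit", "sum"),         # sum(profit)
        max_revenue=("revenue", "max"),   # placeholder alias, see note above
    )
)

# sum(profit) / sum(1)
summary["avg_profit"] = summary["profit"] / summary["total"]
print(summary)
```

Computing avg_profit after the aggregation keeps it tied to the same totals, much as the SQL does with sum(profit)/sum(1).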

May 24, 2024 MR

How to select the scala dataframe column with special character in it?

I am reading a JSON file where the key contains some special characters, e.g. [{ "ABB/aws:1.0/CustomerId:2.0": [{ "id": 20, "namehash": "de8cfcde-95c5-47ac-a544-13db50557eaa" }] }]. I am creating a Scala dataframe and then trying to select the column "ABB/aws:1.0/CustomerId:2.0" using spark.sql. That's when it complains about the special characters. The dataframe looks like this: > Solution: Use backtick…
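
To illustrate the backtick quoting the solution refers to, here is a small sketch in plain Python that only builds the query string. The helper `quote_identifier` and the view name `customers` are hypothetical, not part of Spark. Spark SQL treats a backtick-delimited name as a single identifier and expects a literal backtick inside it to be doubled.

```python
def quote_identifier(name: str) -> str:
    """Wrap a column name in backticks, doubling any backtick inside it."""
    return "`" + name.replace("`", "``") + "`"


column = "ABB/aws:1.0/CustomerId:2.0"
quoted = quote_identifier(column)

# `customers` is a hypothetical temporary view name used for illustration.
query = f"SELECT {quoted} FROM customers"
print(query)
```

The resulting string could then be passed to spark.sql, assuming the dataframe has been registered as a view under that name.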

March 18, 2024 MR

Back-ticks in DataFrame.colRegex?

For PySpark, I find back-ticks enclosing regular expressions for DataFrame.colRegex() here, here, and in this SO question. Here is the example from the DataFrame.colRegex doc string: df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"]) df.select(df.colRegex("`(Col1)?+.+`")).show() +----+ |Col2| +----+ | 1| | 2| | 3| +----+ The answer to the SO question doesn’t show…

October 31, 2023 MR

map columns of two dataframes based on array intersection of their individual columns and based on highest common element match Pyspark/Pandas

I have a dataframe df1 like this: A B AA [a,b,c,d] BB [a,f,g,c] CC [a,b,l,m] And another one as df2 like: C D XX [a,b,c,n] YY [a,m,r,s] UU [e,h,I,j] I want to find out and map column C of df2 with column A of df1 based on the highest element match between the items of…

October 17, 2023 MR

Regular Expression – have at least n different digits

I want to use a regular expression to check whether a number has more than 2 different digits. For example, AB1000002 is allowed but AB1000000 is not allowed. My question is similar to this one but seems to be more complicated. Reference: Regular Expression- have different digits Thanks in advance! I am not good at coding,…
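
One possible way to express "more than 2 different digits" is sketched below in Python. The pattern uses backreferences inside negative lookaheads, and a set-based variant covers the general "at least n" case. Both function names are made up for this example, and this is not necessarily the approach the original answer took.

```python
import re

# A digit d1, then a later digit d2 that differs from d1,
# then a later digit that differs from both d1 and d2.
THREE_DISTINCT_DIGITS = re.compile(r"(\d).*?(?!\1)(\d).*?(?!\1|\2)\d")


def has_three_distinct_digits(text: str) -> bool:
    """Regex-only check for at least three different digits."""
    return THREE_DISTINCT_DIGITS.search(text) is not None


def has_n_distinct_digits(text: str, n: int = 3) -> bool:
    """General variant: collect the digits and count the distinct ones."""
    return len(set(re.findall(r"\d", text))) >= n


for sample in ("AB1000002", "AB1000000"):
    print(sample, has_three_distinct_digits(sample), has_n_distinct_digits(sample))
```

The regex grows by one group per extra digit, so for larger n the set-based check is probably easier to maintain.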

October 6, 2023 MR

how to specify different types of DataFrames in python?

Let’s say that I have a PySpark DataFrame which I consider to be "Users". Then I have another one which I consider to be "Cars". Now let’s say that I have a function which returns a dataframe of type "Cars". Usually I see code like this: def get_cars() -> Dataframe: pass However "Dataframe" is not very expressive….is too…

September 16, 2023 MR

pyspark split a Column of variable length Array type into two smaller arrays

If I have a Column of Array type of variable lengths such as: [ [1,2,3,4,6] ] [ [0,4,5,4,6,8,9,1] ] [ [1,2,3,4,6,2,4,5,6] ] … How can I split this such that the first index is separated from the rest such as: [ [1] ], [ [2,3,4,6] ] [ [0] ], [ [4,5,4,6,8,9,1] ] [ [1] ],…

August 28, 2023 MR

PySpark – Erratic behaviour of SampleBy

I’m seeing apparently erratic behaviour when using the PySpark SQL ‘sampleBy’ function. Just to understand how it works, I’m trying to apply stratified sampling to a sample of 100 numbers with values 0, 1 and 2, which are distributed nearly 1/3rd each. I’m applying a fraction of 10% for the value zero and…

July 19, 2023 MR

How can I make only one file in spark to s3?

I have lots of CSV files. After using Spark SQL, I want to produce a single CSV file. For example, I have news1.csv, news2.csv, news3.csv, etc. in S3. I load them from S3 into Spark SQL and create a dataframe. After using Spark SQL, I want to upload only one CSV file to S3. At first I tried…

May 28, 2023 MR

pyspark dataframe limiting on multiple columns

I wonder if anyone can point me in the right direction with the following problem. In a rather large pyspark dataframe with about 50 odd columns, two of them represent say ‘make’ and ‘model’. Something like 21234234322(unique id) .. .. .. Nissan Navara .. .. .. 73647364736 .. .. .. BMW X5 .. .. .. What…

May 22, 2023 MR

Dev solutions

Solutions for development problems

Tag: pyspark