PySpark – compare column strings, grouping if alphabetic character sets are the same, but avoid similar words?

I'm working on a project where I have a PySpark DataFrame with two columns (word, word count) of type string and bigint respectively. The dataset is dirty: some words have a non-letter character attached to them (e.g. 'date', '[date', 'date]' and '_date' are all separate items but should be just 'date'). print(dirty_df.schema) output-->…
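One way to read the goal described above is: strip every non-letter character from each word, then sum the counts of words that collapse to the same string. The sketch below shows that idea in plain Python rather than PySpark. The sample rows and the names `normalize` and `merge_counts` are made up for illustration, and it assumes ASCII letters only. Because only non-letters are removed, a genuinely different word such as 'dates' stays in its own group. In PySpark the analogous step would likely be `regexp_replace` on the word column followed by `groupBy` and a sum, which is not shown here.

```python
import re
from collections import Counter

# Hypothetical sample rows: (word, word_count). Not taken from the original dataset.
rows = [("date", 10), ("[date", 2), ("date]", 3), ("_date", 1), ("dates", 4)]


def normalize(word):
    """Drop every character that is not an ASCII letter (assumes English-only text)."""
    return re.sub(r"[^A-Za-z]", "", word)


def merge_counts(pairs):
    """Sum the counts of words that are identical after normalization."""
    totals = Counter()
    for word, count in pairs:
        key = normalize(word)
        if key:  # skip tokens that contain no letters at all
            totals[key] += count
    return dict(totals)


merged = merge_counts(rows)
print(merged)
```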

R – how to replace strings in a column based on 2 or more values

I am trying to replace a string for specific groups of customer ids/order dates. The example below may better illustrate my question. I have a dataframe: customerid <- c("A1", "A1", "A2", "A2", "A3", "A3", "A3", "A4") orderdate <- c("2018-09-14", "2018-09-14", "2018-09-15", "2018-09-15", "2020-08-21", "2020-08-21","2020-08-21", "2018-08-10") status <- c("review", "review", "review", "negative", "positive", "review", "review", "review")…

Pandas data frame with time-series data – grouping without aggregating data

I have the following pandas dataframe: import pandas as pd df4 = pd.DataFrame({'timestamp':['2022-10-01 01:00:00', '2022-10-02 01:00:00', '2022-10-03 01:00:00', '2022-10-04 01:00:00', '2022-10-05 01:00:00', '2022-10-01 02:00:00', '2022-10-02 02:00:00', '2022-10-03 02:00:00', '2022-10-04 02:00:00', '2022-10-05 02:00:00'], 'A': [1,2,3,4,5,6,7,8,9,10], 'B': [10,9,8,7,6,5,4,3,2,1]} ) df4['timestamp'] = df4['timestamp'].astype('datetime64[ns]') df4 that gives the following data frame: | timestamp | A | B | |---|---|---|…

Filtering DataFrame on groups where all elements of one group fulfill one of various conditions

I need to filter a data frame with different groups. The data frame looks as follows: df = pd.DataFrame({"group":[1,1,1, 2,2,2,2, 3,3,3, 4,4], "percentage":[70,70,70, 45,80,60,70, 71,85,90, np.nan, np.nan]}) My goal is to return a data frame containing only groups that satisfy one of the two following conditions: All observations of the group have percentage > 70…
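The first condition stated above (every observation in a group has percentage > 70) can be sketched with a grouped `transform("all")`. The excerpt is cut off before the second condition, so that one is deliberately left out here. The names `above`, `all_above` and `result` are illustrative choices, not part of the original question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4],
    "percentage": [70, 70, 70, 45, 80, 60, 70, 71, 85, 90, np.nan, np.nan],
})

# Row-wise check; NaN > 70 evaluates to False, so all-NaN groups fail this test.
above = df["percentage"] > 70

# Broadcast "every row in my group passes" back onto each row of that group.
all_above = above.groupby(df["group"]).transform("all")

# Keep only the rows belonging to groups that satisfy the first condition.
result = df[all_above]
print(result)
```

Note that the comparison is strict, so a group whose values are all exactly 70 is not kept under this condition.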