How to filter a list of dataframes based on a unique count of categorical factors in each dataframe?


I have a dataframe that I split into a list of dataframes based on a categorical variable in the dataframe:

library(ggplot2)  # provides the mpg dataset
list <- split(mpg, mpg$manufacturer)

I want to filter the list so that it only includes dataframes where a given categorical column contains at least 5 unique values, and drop the dataframes with fewer than 5.
I have tried lapply and filter over the dataset, but that filters the rows of each dataframe rather than the list itself. I have also tried:
filteredlist <- lapply(list, function(x) length(unique(x$class) >= 5))
and am stumped.

Thanks, any help would be appreciated!

>Solution :

First let’s take a look at how many unique classes there are:

sapply(list, \(x) length(unique(x$class)))
   #    audi  chevrolet      dodge       ford      honda    hyundai       jeep land rover    lincoln 
   #       2          3          3          3          1          2          1          1          1 
   # mercury     nissan    pontiac     subaru     toyota volkswagen 
   #       1          3          1          3          4          3 

So, with this data, >= 5 isn’t a great example: the largest count is 4 (toyota), so the filtered list would be empty. Let’s use >= 3 instead so we can expect a non-empty result.

## with Filter
filteredlist <- Filter(list, f = function(x) length(unique(x$class)) >= 3)
length(filteredlist)
# [1] 7

## or with sapply and `[`
sapply_filter = list[sapply(list, \(x) length(unique(x$class))) >= 3]
length(sapply_filter)
# [1] 7
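If you need this more than once, the sapply approach can be wrapped in a small function. This is only a sketch in base R: the helper name keep_by_unique_count and the toy data are made up for illustration (they are not from the question), and vapply is used in place of sapply so the counts are guaranteed to be integers.

```r
# Toy data standing in for mpg (hypothetical values, for illustration only)
cars <- data.frame(
  manufacturer = c("a", "a", "a", "b", "b", "c"),
  class        = c("suv", "compact", "midsize", "suv", "suv", "pickup")
)
dfs <- split(cars, cars$manufacturer)

# Hypothetical helper: keep the data frames whose column `col`
# has at least `n` unique values
keep_by_unique_count <- function(dfs, col, n) {
  counts <- vapply(dfs, function(x) length(unique(x[[col]])), integer(1))
  dfs[counts >= n]
}

kept <- keep_by_unique_count(dfs, "class", 3)
names(kept)
# [1] "a"
```

Only group "a" has three distinct classes, so it is the only element kept. With the real data the call would look like keep_by_unique_count(list, "class", 3).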

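Note that your attempt, lapply(list, function(x) length(unique(x$class) >= 5)), has a misplaced parenthesis: you want length(unique(x$class)) >= 5, not length(unique(x$class) >= 5). With the comparison inside length(), each call returns the length of a logical vector (a count) instead of TRUE or FALSE. Even with that fixed, lapply only returns a list of TRUE/FALSE values, so you still need Filter or `[` to subset the list.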
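To see the effect of the parenthesis placement in isolation, here is a minimal sketch on a made-up character vector (classes is a stand-in for x$class, not data from the question):

```r
classes <- c("suv", "compact", "suv")

# Comparison inside length(): unique(classes) >= 5 is a logical vector
# with one entry per unique value, so length() just counts them
wrong <- length(unique(classes) >= 5)
wrong
# [1] 2

# Comparison outside length(): a single TRUE/FALSE, usable as a filter
right <- length(unique(classes)) >= 5
right
# [1] FALSE
```

The first form silently returns a number, which is why it raises no error but cannot be used as a predicate.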
