I’m trying to find the frequency of strings from the field "Select Investors" on this website: https://www.cbinsights.com/research-unicorn-companies
Is there a way to pull out the frequency of each of the comma-separated strings?
For example, how often does the term "Sequoia Capital China" show up?
>Solution :
import pandas as pd

# Extract data
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)  # returns a list of DataFrames, one per HTML table
first_df = df[0]
column = "Select Investors"

# Collect every comma-separated investor name, lowercased
all_investor = []
for i in first_df[column]:
    all_investor += str(i).lower().split(',')

# Calculate frequency
for string in all_investor:
    string = string.strip()
    # Substring test: "sequoia capital" also matches "sequoia capital china"
    frequency = first_df[column].apply(
        lambda x: string in str(x).lower()).sum()
    print(string, frequency)
Output:
andreessen horowitz 41
new enterprise associates 21
battery ventures 14
index ventures 30
dst global 19
ribbit capital 8
forerunner ventures 4
crosslink capital 4
homebrew 2
sequoia capital 115
thoma bravo 3
softbank 50
tencent holdings 28
lightspeed india partners 4
sequoia capital india 25
ggv capital 14
....
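
Note that the loop above counts substring matches, so a row listing "sequoia capital china" is also counted under "sequoia capital", which is likely why that name scores so high in the output. If exact matches per name are wanted instead, one option is to split each cell, explode the pieces into separate rows, and let value_counts do the tally. The following is a minimal sketch under that assumption, not the original answer's method; `investor_counts` and `sample` are hypothetical names, and the sample rows are made up to stand in for first_df["Select Investors"].

```python
import pandas as pd


def investor_counts(series: pd.Series) -> pd.Series:
    """Count exact occurrences of each comma-separated name in a column."""
    names = (
        series.dropna()        # skip rows with no investors listed
        .astype(str)
        .str.lower()
        .str.split(",")        # one list of names per row
        .explode()             # one name per row
        .str.strip()           # remove spaces left around the commas
    )
    names = names[names != ""]  # drop empty pieces from trailing commas
    return names.value_counts()


# Hypothetical sample rows standing in for first_df["Select Investors"]
sample = pd.Series([
    "Sequoia Capital China, SIG Asia Investments, Sina Weibo",
    "Sequoia Capital, Thoma Bravo, Softbank",
    "Sequoia Capital China, Sequoia Capital India",
    None,
])

counts = investor_counts(sample)
print(counts["sequoia capital china"])  # 2
print(counts["sequoia capital"])        # 1
```

With the real table, the same call would be investor_counts(first_df["Select Investors"]), and each name appears once in the result instead of being printed once per occurrence.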