Find row names containing only dashes or only periods in R

I have a big dataframe (70k rows by 200k columns) with some of the row names having dashes, some having periods, and some having both, something like this:

df <- data.frame(cell1 = c(0,1,2,3,4,5,6), cell2 = c(0,1,2,3,4,5,6))
rownames(df) <- c("CMP21-97G8.1", "RP11-34P13.7", "HLA.A", "HLA-A", "HLA-E", "HLA.E", "RP11.442N24--B.1")

                   cell1 cell2
CMP21-97G8.1         0     0
RP11-34P13.7         1     1
HLA.A                2     2
HLA-A                3     3
HLA-E                4     4
HLA.E                5     5
RP11.442N24--B.1     6     6

I want to make three df subgroups where one subgroup has the rownames with only periods (HLA.A/HLA.E), one with dash-only rownames (HLA-A/HLA-E), and one with both (CMP21-97G8.1/RP11-34P13.7/RP11.442N24--B.1). Something like this:

df1
                 cell1 cell2
CMP21-97G8.1         0     0
RP11-34P13.7         1     1
RP11.442N24--B.1     6     6

df2
                 cell1 cell2
HLA.A                2     2
HLA.E                5     5

df3
                 cell1 cell2
HLA-A                3     3
HLA-E                4     4

When I search for periods and dashes, though, my patterns only check whether a name contains a period or a dash; they do not distinguish names that contain just one of the two from names that contain both.

# Look for either a dash or a period: returns every row
df <- df[grepl("[-]|[.]",rownames(df)),]
# Attempt to match only names containing both: still returns names that do not contain both
df <- df[grepl("[^-]*-([^.]+).*",rownames(df)),]
# Attempts to combine the two with & or &&: return nothing
df <- df[grepl("[-]&[.]",rownames(df)),]
df <- df[grepl("[-]&&[.]",rownames(df)),]

Hopefully this makes sense and thanks for reading!

Solution:

You can use the following to get the first dataframe:

df1 <- df[grepl("-[^.]*\\.|\\.[^-]*-",rownames(df)),]

Output:

> df1
                 cell1 cell2
CMP21-97G8.1         0     0
RP11-34P13.7         1     1
RP11.442N24--B.1     6     6

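The -[^.]*\\.|\\.[^-]*- regex has two alternatives: a hyphen followed later by a dot with no other dot in between, or a dot followed later by a hyphen with no other hyphen in between. Because grepl() looks for a match anywhere in the string, any row name that contains both characters matches one of the two alternatives. (The backslashes are doubled because the pattern is written as an R string literal.)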
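To see how the two alternatives cover every name that contains both characters, they can be tested separately. This is only an illustrative sketch on the sample row names from the question; the variable names (nms, dash_then_dot, dot_then_dash, both) are made up for this example.

```r
# Sample row names from the question
nms <- c("CMP21-97G8.1", "RP11-34P13.7", "HLA.A", "HLA-A",
         "HLA-E", "HLA.E", "RP11.442N24--B.1")

# Alternative 1: a hyphen, then any non-dot characters, then a dot
dash_then_dot <- grepl("-[^.]*\\.", nms)
# Alternative 2: a dot, then any non-hyphen characters, then a hyphen
dot_then_dash <- grepl("\\.[^-]*-", nms)

# A name contains both characters when at least one alternative matches
both <- dash_then_dot | dot_then_dash
nms[both]
# "CMP21-97G8.1" "RP11-34P13.7" "RP11.442N24--B.1"
```

In this sample, the first two names match only through the first alternative, while RP11.442N24--B.1 matches through both.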

The second dataframe can be obtained with:

df2 <- df[grepl("^[^-.]*\\.[^-]*$", rownames(df)),]

Here, ^[^-.]*\.[^-]*$ matches a full string that contains no hyphens and at least one dot.

See the output:

> df2
      cell1 cell2
HLA.A     2     2
HLA.E     5     5

And the following to get the third dataframe:

df3 <- df[grepl("^[^-.]*-[^.]*$", rownames(df)),]

See the output:

> df3
      cell1 cell2
HLA-A     3     3
HLA-E     4     4

Here, ^[^-.]*-[^.]*$ matches a full string that contains no dots and at least one hyphen.
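As an alternative to the regexes above (and not part of the original answer), the same three groups can be sketched by testing for each character separately and combining the results with R's logical operators. This also explains why the "[-]&[.]" attempt in the question returns nothing: inside a regular expression, & is an ordinary character, so that pattern looks for the literal text "-&.". The helper names has_dash and has_dot are assumptions made for this example.

```r
# Sample data from the question
df <- data.frame(cell1 = c(0, 1, 2, 3, 4, 5, 6), cell2 = c(0, 1, 2, 3, 4, 5, 6))
rownames(df) <- c("CMP21-97G8.1", "RP11-34P13.7", "HLA.A", "HLA-A",
                  "HLA-E", "HLA.E", "RP11.442N24--B.1")

# fixed = TRUE treats the pattern as a literal string, so "." needs no escaping
has_dash <- grepl("-", rownames(df), fixed = TRUE)
has_dot  <- grepl(".", rownames(df), fixed = TRUE)

# Combine the logical vectors in R rather than inside the regex;
# drop = FALSE keeps the result a data frame even if only one column remains
df1 <- df[has_dash & has_dot, , drop = FALSE]    # both characters
df2 <- df[has_dot & !has_dash, , drop = FALSE]   # periods only
df3 <- df[has_dash & !has_dot, , drop = FALSE]   # dashes only
```

Row names containing neither character would fall into none of the three groups, which matches the behavior of the regex-based version.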
