I have a text and I want to get the value and index of every word that occurs two or more times (not necessarily in a row).
I thought about using a regex, but I'm very new to this topic, so I don't really know how to write a working pattern.
The programming language is C#.
>Solution :
You can use regular expressions to match words. If we define a word as
a non-empty sequence of Unicode letters
the pattern will be \p{L}+. Try it yourself:
using System.Text.RegularExpressions;
...
string text =
"some words, with punctuations, with digits and Russian Words: слово и слово";
var words = Regex.Matches(text, @"\p{L}+");
// some, words, with, punctuations, with, digits, and,
// Russian, Words, слово, и, слово
Console.Write(string.Join(", ", words));
Then query these words with the help of LINQ to find the repeated ones:
using System.Linq;
using System.Text.RegularExpressions;
...
string text =
"some words, with punctuations, with digits and Russian Words: слово и слово";
...
string[] words = Regex
.Matches(text, @"\p{L}+")
.Cast<Match>()
.GroupBy(match => match.Value, StringComparer.OrdinalIgnoreCase)
.Where(group => group.Count() > 1)
.Select(group => group.Key)
.ToArray();
// Let's have a look
Console.Write(string.Join(", ", words));
Outcome:
words, with, слово
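The question also asks for the index of each repeated word, which the query above discards when it keeps only `group.Key`. A minimal sketch of one way to retain the positions, assuming "index" means the character offset in the text (`Match.Index`); the `RepeatedWords` class and its `Find` method are hypothetical names introduced here for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RepeatedWords
{
    // Hypothetical helper: maps each repeated word (case-insensitive) to the
    // character offsets of all its occurrences in the text.
    public static Dictionary<string, int[]> Find(string text)
    {
        return Regex
            .Matches(text, @"\p{L}+")
            .Cast<Match>()
            // Group occurrences that differ only by case ("words" / "Words").
            .GroupBy(match => match.Value, StringComparer.OrdinalIgnoreCase)
            // Keep only words that occur at least twice.
            .Where(group => group.Count() > 1)
            .ToDictionary(
                group => group.Key,
                // Match.Index is the zero-based position of the match in text.
                group => group.Select(match => match.Index).ToArray(),
                StringComparer.OrdinalIgnoreCase);
    }
}
```

Calling `RepeatedWords.Find(text)` on the sample text should give offsets 5 and 55 for `words`, 12 and 31 for `with`, and 62 and 70 for `слово`. If the position in the word sequence is wanted instead of the character offset, one option is the `Select((match, i) => ...)` overload applied before grouping.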