Extracting speaker interventions from a text using R? Or something else?



























We're working on a text mining project for school on the proportion of environment-oriented speech in Quebec's National Assembly. We want to extract a list of every speaker's interventions throughout the years.



Our documents are all formatted this way:



Mr. Smith : Blablabla

Mrs. Jones : Blablabla


What I would like to do is write the simplest thing possible that would allow me to extract these interventions. I'm thinking something along the lines of:



"Every time you see [Mr. **** : ] OR [Mrs. **** : ], extract ALL the text until you see another occurrence of [Mr. **** : ] OR [Mrs. **** : ]." And, ideally, extract all the Mr. Smiths and the Mrs. Joneses and the Mr. Williamses into separate files while keeping track of which file the interventions came from.



I started with a very basic gsub line that replaces the occurrences I want with an @, only to realize that I don't want to replace them completely. Adding an @ in front of each one would probably make it easier to write something that splits the text at the @s into distinct files.



gsub("(Mr.|Mrs.)\\s\\w*\\s:\\s", "@", test)


I've just started teaching myself R for this project and I need some insight on how I should proceed next. Or should I use something else instead?



































  • Probably tokenize words and then cumsum(grepl(...)), like how chapters are IDed here: tidytextmining.com/tidytext.html#tidyausten

    – alistaire
    Nov 19 '18 at 4:36
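The cumsum(grepl(...)) idea from this comment could look roughly like the sketch below. It works on whole lines rather than word tokens, which is an assumption made to keep it short. The sample lines and the header pattern are hypothetical too, so adjust them to the real transcripts.

```r
# Hypothetical transcript: one element per line, as readLines() would return.
lines <- c(
  "Mr. Smith : First point.",
  "It continues on a second line.",
  "Mrs. Jones : A reply.",
  "Mr. Smith : A rebuttal."
)

# TRUE on every line that opens a new intervention.
# Assumed header shape: "Mr." or "Mrs.", a name, then " : ".
is_header <- grepl("^(Mr|Mrs)\\. [^:]+ : ", lines)

# The running count of headers gives every line an intervention id.
id <- cumsum(is_header)

# Glue the lines that share an id back into one intervention each.
interventions <- unname(
  vapply(split(lines, id), paste, character(1), collapse = " ")
)
interventions
```

Lines that appear before the first header get id 0, so they end up in their own group and can be dropped or inspected separately.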











  • Can you provide a link to an actual document?

    – hrbrmstr
    Nov 19 '18 at 8:16











  • The "." you use in your regex is actually a metacharacter; as a metacharacter it does not mean "period" but "any character". If you want it to match a literal period, you have to escape it: \. in the regex, which is written "\\." inside an R string.

    – Chris Ruehlemann
    Nov 19 '18 at 10:52
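A small check makes the difference visible. The three test strings are made up for illustration; grepl and its fixed argument are base R.

```r
x <- c("Mr. Smith", "Mrs. Jones", "Mrx Smith")

# Unescaped: "." matches any character, so "Mrs" and "Mrx" match as well.
unescaped <- grepl("Mr.", x)
unescaped
# [1] TRUE TRUE TRUE

# Escaped: only "Mr" followed by a literal period matches.
escaped <- grepl("Mr\\.", x)
escaped
# [1]  TRUE FALSE FALSE

# fixed = TRUE treats the whole pattern as a literal string.
literal <- grepl("Mr.", x, fixed = TRUE)
literal
# [1]  TRUE FALSE FALSE
```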
















r text-mining














edited Nov 19 '18 at 10:39
snoram

asked Nov 19 '18 at 4:12
François Côté

























1 Answer














If you don't want to replace the speaker names, you can use what is called a 'positive lookahead', which matches the position in front of a speaker name without consuming the name itself, like this:



# some example data:
bla <- c("Mr. X : blablabla bla bla bla. Mrs. Y : bla bla blablablab Mr. XY : bla bla balblabla blabl abl")

# insert "@ " in front of each speaker with a lookahead
# (dots escaped so they match literal periods; result assigned back to bla):
bla <- gsub("(?=(Mr\\.|Mrs\\.))", "@ ", bla, perl = TRUE)
bla
[1] "@ Mr. X : blablabla bla bla bla. @ Mrs. Y : bla bla blablablab @ Mr. XY : bla bla balblabla blabl abl"


The @ is a good starting point for extracting the individual interventions. This can be done thus:



pattern <- "@.[^@]*"
matches <- gregexpr(pattern, bla)
interventions <- regmatches(bla, matches)
interventions <- unlist(interventions)
interventions
[1] "@ Mr. X : blablabla bla bla bla. " "@ Mrs. Y : bla bla blablablab " "@ Mr. XY : bla bla balblabla blabl abl"
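If the transcripts might already contain an @ somewhere in the speech itself, a marker-free variant may be safer. The sketch below is not part of the answer's method: it locates where each speaker header starts and cuts the string at those positions. The header pattern (title, one word, then " : ") is an assumption based on the example data.

```r
# Same example data as in the answer, before any "@" is inserted.
bla <- "Mr. X : blablabla bla bla bla. Mrs. Y : bla bla blablablab Mr. XY : bla bla balblabla blabl abl"

# Start position of every speaker header (assumed shape: "Mr. Name : ").
starts <- as.integer(gregexpr("(Mr|Mrs)\\. \\w+ : ", bla, perl = TRUE)[[1]])

# Each intervention ends just before the next header, or at the end of the text.
ends <- c(starts[-1] - 1L, nchar(bla))

# Cut the string at those positions and drop surrounding whitespace.
interventions <- trimws(substring(bla, starts, ends))
interventions
# [1] "Mr. X : blablabla bla bla bla."
# [2] "Mrs. Y : bla bla blablablab"
# [3] "Mr. XY : bla bla balblabla blabl abl"
```

Note that gregexpr returns -1 when nothing matches, so a real script would want to check for that before calling substring.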































  • Thank you so much, this worked! I had to work a bit on my regex so it only caught what I needed, but your lines for extracting the interventions worked perfectly! Now my colleague and I may proceed to the next step, which is extracting the interventions into separate files for each speaker. We'll try to find it out by ourselves but if you have any idea of how we should proceed, feel free to guide us :)

    – François Côté
    Nov 23 '18 at 20:21











  • You can proceed thus:
    1. Split interventions into 2 columns (1 for speaker ID, 1 for intervention) using the package stringr:
    require(stringr)
    dat <- data.frame(speaker_name = str_extract(interventions, "@.*[a-z|A-Z].*:"), intervention_text = str_extract(interventions, ":.*[a-z|A-Z].*"))
    2. Remove "@" & ":":
    dat$speaker_name <- gsub("@ ", "", dat$speaker_name, perl = T)
    dat$intervention_text <- gsub(": ", "", dat$intervention_text, perl = T)
    3. Use by to split dat into dfs per distinct ID:
    df_list <- by(dat, dat$speaker_name, function(unique) unique)
    df_list

    – Chris Ruehlemann
    Nov 23 '18 at 23:17













  • Step 2 can be skipped by using a positive lookbehind in Step 1: dat <- data.frame(speaker_name = str_extract(interventions, "(?<=@).*[a-z|A-Z].*:"), intervention_text = str_extract(interventions, "(?<=:).*[a-z|A-Z].*"))

    – Chris Ruehlemann
    Nov 26 '18 at 18:08













  • After a lot of trial and error I went with this: require(stringr) dat <- data.frame(speaker_name=str_extract(interventions, "((?<=@)(.*?):)"), intervention_text=str_extract(interventions, "(?<=:).*[a-z|A-Z].*")) df_list <- by(dat, dat$speaker_name, function(unique) unique) which is pretty much exactly what you had given me, except that, in order to stop at the first colon, I had to change the regex to ((?<=@)(.*?):), which I adapted from stackoverflow.com/a/42457235/10665396.

    – François Côté
    Dec 12 '18 at 5:27
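The first-colon issue comes down to greedy versus lazy quantifiers. The sketch below shows it in base R rather than stringr, purely to stay dependency-free. The sample intervention, with extra colons in the speech, is made up, and the patterns assume the "@ Name : text" shape from the answer.

```r
# Hypothetical intervention whose text contains further colons.
x <- "@ Mr. X : the vote was 10:30 : noted"

# Greedy: ".*" runs to the LAST colon, swallowing part of the speech.
greedy <- regmatches(x, regexpr("(?<=@ ).*:", x, perl = TRUE))
greedy
# [1] "Mr. X : the vote was 10:30 :"

# Lazy: ".*?" stops at the FIRST " :", leaving only the speaker name.
speaker <- regmatches(x, regexpr("(?<=@ ).*?(?= :)", x, perl = TRUE))
speaker
# [1] "Mr. X"

# The same lazy idea removes the header and keeps the full speech.
text <- sub("^@ .*? : ", "", x, perl = TRUE)
text
# [1] "the vote was 10:30 : noted"
```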













  • So I take it you're fine? Or is there anything you need help with?

    – Chris Ruehlemann
    Dec 12 '18 at 7:27
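For the remaining step from the question, one file per speaker while keeping track of the source document, a sketch along these lines might work. The data frame, the source_file column, the file names, and the output directory are all hypothetical. It assumes the interventions have already been split into speaker and text as in the comments above.

```r
# Hypothetical input: one row per intervention, with the file it came from.
dat <- data.frame(
  source_file = c("1998-03-10.txt", "1998-03-10.txt", "1998-03-11.txt"),
  speaker_name = c("Mr. Smith", "Mrs. Jones", "Mr. Smith"),
  intervention_text = c("First point.", "A reply.", "A rebuttal."),
  stringsAsFactors = FALSE
)

# Assumed output location; replace with a real project folder.
out_dir <- file.path(tempdir(), "speakers")
dir.create(out_dir, showWarnings = FALSE)

# One data frame per distinct speaker.
by_speaker <- split(dat, dat$speaker_name)

for (speaker in names(by_speaker)) {
  # Replace spaces and punctuation so the name is safe as a file name.
  file_name <- paste0(gsub("[[:space:][:punct:]]+", "_", speaker), ".csv")
  # Each file keeps the source_file column, so the origin stays traceable.
  write.csv(by_speaker[[speaker]], file.path(out_dir, file_name), row.names = FALSE)
}

list.files(out_dir)
```

Writing CSV rather than plain text is a design choice here: it keeps the source file next to each intervention, which is what lets the later analysis count speech per document and per year.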












































edited Nov 19 '18 at 10:36

answered Nov 19 '18 at 10:05
Chris Ruehlemann





































