| Title: | Tools for Language Data Analysis |
|---|---|
| Description: | Support functions and datasets to facilitate the analysis of linguistic data. The current focus is on the calculation of corpus-linguistic dispersion measures as described in Gries (2021) <doi:10.1007/978-3-030-46216-1_5> and Soenning (2025) <doi:10.3366/cor.2025.0326>. The most commonly used parts-based indices are implemented, including different formulas and modifications that are found in the literature, with the additional option to obtain frequency-adjusted scores. Dispersion scores can be computed based on individual count variables or a term-document matrix. |
| Authors: | Lukas Soenning [aut, cre, cph] (ORCID: <https://orcid.org/0000-0002-2705-395X>), German Research Foundation (DFG) [fnd] (ROR: <https://ror.org/018mejw64>, Grant number 548274092) |
| Maintainer: | Lukas Soenning <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0.9000 |
| Built: | 2026-06-06 07:16:44 UTC |
| Source: | https://github.com/lsoenning/tlda |
This function adds a new column sampling_weights to a data frame. The purpose of sampling weights is to adjust for a mismatch between sample and target population with regard to the distribution of one or more categorical variables. The population distribution can be specified, and sampling weights are then calculated to work as adjustment factors. The default is to assume a balanced population distribution (all levels represented equally), but custom population distributions can be specified.
add_sampling_weights(data, variable, population_distribution = NULL)
| Argument | Description |
|---|---|
| data | A data frame |
| variable | Character string indicating the variable for whose levels sampling weights should be calculated |
| population_distribution | List (or data frame) specifying the population distribution of the levels; default is NULL |
This function takes as input a data frame, where observations (rows) represent a sample from a population. If the distribution of a categorical variable in the sample does not match the (known or assumed) distribution in the population, the function calculates sampling weights for the observations (rows). These account for the mismatch by up-weighting rows belonging to levels that are underrepresented in the sample (relative to the population) and down-weighting rows belonging to levels that are overrepresented in the sample (relative to the population). If no population distribution is specified, all levels are assumed to be represented equally in the target population. Sampling weights are calculated on the basis of (i) the observed distribution of the variable in the sample, and (ii) the population distribution. For instance, if a specific subgroup (i.e. level) has a share of 10% in the sample, compared to 20% in the population, the sampling weight is 2.0 (20% divided by 10%). Sampling weights above 1 indicate up-weighting, sampling weights below 1 indicate down-weighting. The function prints out information on the sample and population distribution and the resulting weights.
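The arithmetic described above can be sketched in base R. This is an illustration under stated assumptions, not the function's implementation: the toy data frame, the level names "A" and "B", and the population shares are all hypothetical, chosen to mirror the 10% versus 20% example.

```r
# Hypothetical sample: level "A" makes up 10% of the rows, "B" 90%
d <- data.frame(
  id    = 1:10,
  group = c("A", rep("B", 9))
)

# (i) observed distribution of the variable in the sample
sample_share <- prop.table(table(d$group))

# (ii) assumed population distribution (hypothetical values)
population_share <- c(A = 0.2, B = 0.8)

# weight per level = population share divided by sample share
level_weight <- population_share[names(sample_share)] / as.numeric(sample_share)

# attach each row's level weight as a new column
d$sampling_weights <- unname(level_weight[as.character(d$group)])

level_weight
```

Level "A" is underrepresented and receives a weight above 1; level "B" is overrepresented and receives a weight below 1. The weights sum to the number of rows, so the weighted sample keeps its original size.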
A data frame
Lukas Soenning
add_sampling_weights(data = metadata_ice_gb, variable = "mode")
This dataset contains text-level frequencies for the Brown Corpus (Francis & Kučera 1979) for a set of 150 word forms. The list of items was compiled by Biber et al. (2016) for methodological purposes, that is, to study the behavior of dispersion measures in different distributional settings. The items are intended to cover a broad range of frequency and dispersion levels.
biber150_brown
A matrix with 151 rows and 500 columns:
Rows: Length of text (word_count), followed by the set of 150 items in alphabetical order (a, able, ..., you, your)
Columns: 500 texts, ordered by file name (e.g. "A01", "A02", ..., "R08", "R09")
While Biber et al. (2016: 446) used 153 target items, the 150 word forms included in the present data set correspond to the slightly narrower selection of forms used in Burch et al. (2017: 214-216). These 150 word forms are listed next, in alphabetical order:
a, able, actually, after, against, ah, aha, all, among, an, and, another, anybody, at, aye, be, became, been, began, bet, between, bloke, both, bringing, brought, but, charles, claimed, cor, corp, cos, da, day, decided, did, do, doo, during, each, economic, eh, eighty, england, er, erm, etcetera, everybody, fall, fig, for, forty, found, from, full, get, government, ha, had, has, have, having, held, hello, himself, hm, however, hundred, i, ibm, if, important, in, inc, including, international, into, it, just, know, large, later, latter, let, life, ltd, made, may, methods, mhm, minus, mm, most, mr, mum, new, nineteen, ninety, nodded, nought, oh, okay, on, ooh, out, pence, percent, political, presence, provides, put, really, reckon, say, seemed, seriously, sixty, smiled, so, social, somebody, system, take, talking, than, the, they, thing, think, thirteen, though, thus, time, tt, tv, twenty, uk, under, urgh, us, usa, wants, was, we, who, with, world, yeah, yes, you, your
The data are provided in the form of a term-document matrix, where rows denote the 150 items and columns denote the 500 texts in the corpus. Seven items do not occur in Brown (aha, cor, cos, ltd, mhm, nought, pence). These are included in the term-document matrix with frequencies of 0 for all texts. Further, seven items are spelled differently in Brown (compared to the BNC, on which Biber et al.'s (2016) study is based): "u.s.a." (Brown) instead of "usa" (BNC), "inc." instead of "inc", "mr." instead of "mr", "ugh" instead of "urgh", "uh" instead of "er", "um" instead of "erm", and "hmm" instead of "hm".
The first row of the term-document matrix gives the length of the text (i.e. number of word and nonword tokens).
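Because the first row holds the text lengths, the matrix has to be split before item counts are analyzed. A minimal sketch, assuming a toy matrix in the same layout (the counts below are hypothetical, not taken from Brown):

```r
# Toy term-document matrix: first row = text length, other rows = items
tdm <- rbind(
  word_count = c(2000, 2100, 1900),
  a          = c(46, 50, 41),
  able       = c(1, 0, 2)
)
colnames(tdm) <- c("A01", "A02", "A03")

# Part sizes: the word_count row
partsize <- tdm["word_count", ]

# Item counts: everything except the word_count row
item_counts <- tdm[rownames(tdm) != "word_count", , drop = FALSE]

# Subfrequencies of a single item across the texts
subfreq_able <- item_counts["able", ]

# Normalized frequencies per 1,000 words (each column divided by its text length)
rate_per_1000 <- sweep(item_counts, 2, partsize, "/") * 1000
rate_per_1000
```

The resulting pair of vectors (subfreq_able and partsize) has the shape that disp() expects for its subfreq and partsize arguments.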
Biber, Douglas, Randi Reppen, Erin Schnur & Romy Ghanem. 2016. On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics 21(4). 439–464.
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216.
Francis, W. Nelson & Henry Kučera. 1979. A Standard Corpus of Present-Day Edited American English, for Use with Digital Computers (Brown). Providence, RI: Brown University.
This dataset contains genre-level frequencies for the Brown Corpus (Francis & Kučera 1979) for a set of 150 word forms. The list of items was compiled by Biber et al. (2016) for methodological purposes, that is, to study the behavior of dispersion measures in different distributional settings. The items are intended to cover a broad range of frequency and dispersion levels.
biber150_brown_genre
A matrix with 151 rows and 15 columns:
Rows: Size of the genre (word_count), followed by the set of 150 items in alphabetical order (a, able, ..., you, your)
Columns: 15 genres, ordered based on the sampling frame ("press_reportage", "press_editorial", ..., "romance_love_story", "humour")
While Biber et al. (2016: 446) used 153 target items, the 150 word forms included in the present data set correspond to the slightly narrower selection of forms used in Burch et al. (2017: 214-216). These 150 word forms are listed next, in alphabetical order:
a, able, actually, after, against, ah, aha, all, among, an, and, another, anybody, at, aye, be, became, been, began, bet, between, bloke, both, bringing, brought, but, charles, claimed, cor, corp, cos, da, day, decided, did, do, doo, during, each, economic, eh, eighty, england, er, erm, etcetera, everybody, fall, fig, for, forty, found, from, full, get, government, ha, had, has, have, having, held, hello, himself, hm, however, hundred, i, ibm, if, important, in, inc, including, international, into, it, just, know, large, later, latter, let, life, ltd, made, may, methods, mhm, minus, mm, most, mr, mum, new, nineteen, ninety, nodded, nought, oh, okay, on, ooh, out, pence, percent, political, presence, provides, put, really, reckon, say, seemed, seriously, sixty, smiled, so, social, somebody, system, take, talking, than, the, they, thing, think, thirteen, though, thus, time, tt, tv, twenty, uk, under, urgh, us, usa, wants, was, we, who, with, world, yeah, yes, you, your
The data are provided in the form of a term-document matrix, where rows denote the 150 items and columns denote the 15 genres in the corpus. Seven items do not occur in Brown (aha, cor, cos, ltd, mhm, nought, pence). These are included in the term-document matrix with frequencies of 0 for all genres. Further, seven items are spelled differently in Brown (compared to the BNC, on which Biber et al.'s (2016) study is based): "u.s.a." (Brown) instead of "usa" (BNC), "inc." instead of "inc", "mr." instead of "mr", "ugh" instead of "urgh", "uh" instead of "er", "um" instead of "erm", and "hmm" instead of "hm".
The first row of the term-document matrix gives the size of the genre (i.e. number of word and nonword tokens).
Biber, Douglas, Randi Reppen, Erin Schnur & Romy Ghanem. 2016. On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics 21(4). 439–464.
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216.
Francis, W. Nelson & Henry Kučera. 1979. A Standard Corpus of Present-Day Edited American English, for Use with Digital Computers (Brown). Providence, RI: Brown University.
This dataset contains frequencies at the macro-genre level for the Brown Corpus (Francis & Kučera 1979) for a set of 150 word forms. The list of items was compiled by Biber et al. (2016) for methodological purposes, that is, to study the behavior of dispersion measures in different distributional settings. The items are intended to cover a broad range of frequency and dispersion levels.
biber150_brown_macro_genre
A matrix with 151 rows and 4 columns:
Rows: Size of the macro genre (word_count), followed by the set of 150 items in alphabetical order (a, able, ..., you, your)
Columns: 4 macro genres, ordered based on the sampling frame ("press", "general_prose", "learned", "fiction")
While Biber et al. (2016: 446) used 153 target items, the 150 word forms included in the present data set correspond to the slightly narrower selection of forms used in Burch et al. (2017: 214-216). These 150 word forms are listed next, in alphabetical order:
a, able, actually, after, against, ah, aha, all, among, an, and, another, anybody, at, aye, be, became, been, began, bet, between, bloke, both, bringing, brought, but, charles, claimed, cor, corp, cos, da, day, decided, did, do, doo, during, each, economic, eh, eighty, england, er, erm, etcetera, everybody, fall, fig, for, forty, found, from, full, get, government, ha, had, has, have, having, held, hello, himself, hm, however, hundred, i, ibm, if, important, in, inc, including, international, into, it, just, know, large, later, latter, let, life, ltd, made, may, methods, mhm, minus, mm, most, mr, mum, new, nineteen, ninety, nodded, nought, oh, okay, on, ooh, out, pence, percent, political, presence, provides, put, really, reckon, say, seemed, seriously, sixty, smiled, so, social, somebody, system, take, talking, than, the, they, thing, think, thirteen, though, thus, time, tt, tv, twenty, uk, under, urgh, us, usa, wants, was, we, who, with, world, yeah, yes, you, your
The data are provided in the form of a term-document matrix, where rows denote the 150 items and columns denote the 4 macro genres in the corpus. Seven items do not occur in Brown (aha, cor, cos, ltd, mhm, nought, pence). These are included in the term-document matrix with frequencies of 0 for all macro genres. Further, seven items are spelled differently in Brown (compared to the BNC, on which Biber et al.'s (2016) study is based): "u.s.a." (Brown) instead of "usa" (BNC), "inc." instead of "inc", "mr." instead of "mr", "ugh" instead of "urgh", "uh" instead of "er", "um" instead of "erm", and "hmm" instead of "hm".
The first row of the term-document matrix gives the size of the macro genre (i.e. number of word and nonword tokens).
Biber, Douglas, Randi Reppen, Erin Schnur & Romy Ghanem. 2016. On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics 21(4). 439–464.
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216.
Francis, W. Nelson & Henry Kučera. 1979. A Standard Corpus of Present-Day Edited American English, for Use with Digital Computers (Brown). Providence, RI: Brown University.
This dataset contains text-level frequencies for ICE-GB (Nelson et al. 2002) for a set of 150 word forms. The list of items was compiled by Biber et al. (2016) for methodological purposes, that is, to study the behavior of dispersion measures in different distributional settings. The items are intended to cover a broad range of frequency and dispersion levels.
biber150_ice_gb
A matrix with 151 rows and 500 columns:
Rows: Length of text (word_count), followed by the set of 150 items in alphabetical order (a, able, ..., you, your)
Columns: 500 texts, ordered by file name ("s1a-001", "s1a-002", ..., "w2f-019", "w2f-020")
While Biber et al. (2016: 446) used 153 target items, the 150 word forms included in the present data set correspond to the slightly narrower selection of forms used in Burch et al. (2017: 214-216). These 150 word forms are listed next, in alphabetical order:
a, able, actually, after, against, ah, aha, all, among, an, and, another, anybody, at, aye, be, became, been, began, bet, between, bloke, both, bringing, brought, but, charles, claimed, cor, corp, cos, da, day, decided, did, do, doo, during, each, economic, eh, eighty, england, er, erm, etcetera, everybody, fall, fig, for, forty, found, from, full, get, government, ha, had, has, have, having, held, hello, himself, hm, however, hundred, i, ibm, if, important, in, inc, including, international, into, it, just, know, large, later, latter, let, life, ltd, made, may, methods, mhm, minus, mm, most, mr, mum, new, nineteen, ninety, nodded, nought, oh, okay, on, ooh, out, pence, percent, political, presence, provides, put, really, reckon, say, seemed, seriously, sixty, smiled, so, social, somebody, system, take, talking, than, the, they, thing, think, thirteen, though, thus, time, tt, tv, twenty, uk, under, urgh, us, usa, wants, was, we, who, with, world, yeah, yes, you, your
The data are provided in the form of a term-document matrix, where rows denote the 150 items and columns denote the 500 texts in the corpus. Four items do not occur in ICE-GB (aye, corp, ltd, tt). These are included in the term-document matrix with frequencies of 0 for all texts.
The first row of the term-document matrix gives the length of the text (i.e. number of word tokens).
Biber, Douglas, Randi Reppen, Erin Schnur & Romy Ghanem. 2016. On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics 21(4). 439–464.
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216.
Nelson, Gerald, Sean Wallis & Bas Aarts. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins.
This dataset contains genre-level frequencies for ICE-GB (Nelson et al. 2002) for a set of 150 word forms. The list of items was compiled by Biber et al. (2016) for methodological purposes, that is, to study the behavior of dispersion measures in different distributional settings. The items are intended to cover a broad range of frequency and dispersion levels.
biber150_ice_gb_genre
A matrix with 151 rows and 32 columns:
Rows: Size of the genre (word_count), followed by the set of 150 items in alphabetical order (a, able, ..., you, your)
Columns: 32 genres, ordered alphabetically ("acad_humanities", "acad_natural_sciences", "acad_social_sciences", ..., "student_essays", "unscripted_speeches")
While Biber et al. (2016: 446) used 153 target items, the 150 word forms included in the present data set correspond to the slightly narrower selection of forms used in Burch et al. (2017: 214-216). These 150 word forms are listed next, in alphabetical order:
a, able, actually, after, against, ah, aha, all, among, an, and, another, anybody, at, aye, be, became, been, began, bet, between, bloke, both, bringing, brought, but, charles, claimed, cor, corp, cos, da, day, decided, did, do, doo, during, each, economic, eh, eighty, england, er, erm, etcetera, everybody, fall, fig, for, forty, found, from, full, get, government, ha, had, has, have, having, held, hello, himself, hm, however, hundred, i, ibm, if, important, in, inc, including, international, into, it, just, know, large, later, latter, let, life, ltd, made, may, methods, mhm, minus, mm, most, mr, mum, new, nineteen, ninety, nodded, nought, oh, okay, on, ooh, out, pence, percent, political, presence, provides, put, really, reckon, say, seemed, seriously, sixty, smiled, so, social, somebody, system, take, talking, than, the, they, thing, think, thirteen, though, thus, time, tt, tv, twenty, uk, under, urgh, us, usa, wants, was, we, who, with, world, yeah, yes, you, your
The data are provided in the form of a term-document matrix, where rows denote the 150 items and columns denote the 32 genres in the corpus. Four items do not occur in ICE-GB (aye, corp, ltd, tt). These are included in the term-document matrix with frequencies of 0 for all genres.
The first row of the term-document matrix gives the size of the genre (i.e. number of word tokens).
Biber, Douglas, Randi Reppen, Erin Schnur & Romy Ghanem. 2016. On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics 21(4). 439–464.
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216.
Nelson, Gerald, Sean Wallis & Bas Aarts. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins.
This dataset contains frequencies at the macro-genre level for ICE-GB (Nelson et al. 2002) for a set of 150 word forms. The list of items was compiled by Biber et al. (2016) for methodological purposes, that is, to study the behavior of dispersion measures in different distributional settings. The items are intended to cover a broad range of frequency and dispersion levels.
biber150_ice_gb_macro_genre
A matrix with 151 rows and 12 columns:
Rows: Size of the macro-genre (word_count), followed by the set of 150 items in alphabetical order (a, able, ..., you, your)
Columns: 12 macro-genres, ordered alphabetically ("academic_writing", "creative_writing", "instructional_writing", ..., "student_writing", "unscripted_monologues")
While Biber et al. (2016: 446) used 153 target items, the 150 word forms included in the present data set correspond to the slightly narrower selection of forms used in Burch et al. (2017: 214-216). These 150 word forms are listed next, in alphabetical order:
a, able, actually, after, against, ah, aha, all, among, an, and, another, anybody, at, aye, be, became, been, began, bet, between, bloke, both, bringing, brought, but, charles, claimed, cor, corp, cos, da, day, decided, did, do, doo, during, each, economic, eh, eighty, england, er, erm, etcetera, everybody, fall, fig, for, forty, found, from, full, get, government, ha, had, has, have, having, held, hello, himself, hm, however, hundred, i, ibm, if, important, in, inc, including, international, into, it, just, know, large, later, latter, let, life, ltd, made, may, methods, mhm, minus, mm, most, mr, mum, new, nineteen, ninety, nodded, nought, oh, okay, on, ooh, out, pence, percent, political, presence, provides, put, really, reckon, say, seemed, seriously, sixty, smiled, so, social, somebody, system, take, talking, than, the, they, thing, think, thirteen, though, thus, time, tt, tv, twenty, uk, under, urgh, us, usa, wants, was, we, who, with, world, yeah, yes, you, your
The data are provided in the form of a term-document matrix, where rows denote the 150 items and columns denote the 12 macro-genres in the corpus. Four items do not occur in ICE-GB (aye, corp, ltd, tt). These are included in the term-document matrix with frequencies of 0 for all macro-genres.
The first row of the term-document matrix gives the size of the macro-genre (i.e. number of word tokens).
Biber, Douglas, Randi Reppen, Erin Schnur & Romy Ghanem. 2016. On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics 21(4). 439–464.
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216.
Nelson, Gerald, Sean Wallis & Bas Aarts. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins.
This dataset contains speaker-level frequencies for the demographically sampled part of the Spoken BNC1994 (Crowdy 1995) for a set of 150 word forms. The list of items was compiled by Biber et al. (2016) for methodological purposes, that is, to study the behavior of dispersion measures in different distributional settings. The items are intended to cover a broad range of frequency and dispersion levels.
biber150_spokenBNC1994
A matrix with 151 rows and 1,017 columns:
Rows: Total number of words by speaker (word_count), followed by the set of 150 items in alphabetical order (a, able, ..., you, your)
Columns: 1,017 speakers, ordered by ID ("PS002", "PS003", ..., "PS6SM", "PS6SN")
While Biber et al. (2016: 446) used 153 target items, the 150 word forms included in the present data set correspond to the slightly narrower selection of forms used in Burch et al. (2017: 214-216). These 150 word forms are listed next, in alphabetical order:
a, able, actually, after, against, ah, aha, all, among, an, and, another, anybody, at, aye, be, became, been, began, bet, between, bloke, both, bringing, brought, but, charles, claimed, cor, corp, cos, da, day, decided, did, do, doo, during, each, economic, eh, eighty, england, er, erm, etcetera, everybody, fall, fig, for, forty, found, from, full, get, government, ha, had, has, have, having, held, hello, himself, hm, however, hundred, i, ibm, if, important, in, inc, including, international, into, it, just, know, large, later, latter, let, life, ltd, made, may, methods, mhm, minus, mm, most, mr, mum, new, nineteen, ninety, nodded, nought, oh, okay, on, ooh, out, pence, percent, political, presence, provides, put, really, reckon, say, seemed, seriously, sixty, smiled, so, social, somebody, system, take, talking, than, the, they, thing, think, thirteen, though, thus, time, tt, tv, twenty, uk, under, urgh, us, usa, wants, was, we, who, with, world, yeah, yes, you, your
The data are provided in the form of a term-document matrix, where rows denote the 150 items and columns denote the 1,017 speakers in the demographically sampled part of the corpus. This dataset only includes speakers for whom information on both age and sex is available.
The first row of the term-document matrix gives the total number of words (i.e. number of word tokens) the speaker contributed to the corpus.
Biber, Douglas, Randi Reppen, Erin Schnur & Romy Ghanem. 2016. On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics 21(4). 439–464.
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216.
Crowdy, Steve. 1995. The BNC spoken corpus. In Geoffrey Leech, Greg Myers & Jenny Thomas (eds.), Spoken English on Computer: Transcription, Mark-Up and Annotation, 224–234. Harlow: Longman.
This dataset contains speaker-level frequencies for the Spoken BNC2014 (Love et al. 2017) for a set of 150 word forms. The list of items was compiled by Biber et al. (2016) for methodological purposes, that is, to study the behavior of dispersion measures in different distributional settings. The items are intended to cover a broad range of frequency and dispersion levels.
biber150_spokenBNC2014
A matrix with 151 rows and 668 columns:
Rows: Total number of words by speaker (word_count), followed by the set of 150 items in alphabetical order (a, able, ..., you, your)
Columns: 668 speakers, ordered by ID ("S0001", "S0002", ..., "S0691", "S0692")
While Biber et al. (2016: 446) used 153 target items, the 150 word forms included in the present data set correspond to the slightly narrower selection of forms used in Burch et al. (2017: 214-216). These 150 word forms are listed next, in alphabetical order:
a, able, actually, after, against, ah, aha, all, among, an, and, another, anybody, at, aye, be, became, been, began, bet, between, bloke, both, bringing, brought, but, charles, claimed, cor, corp, cos, da, day, decided, did, do, doo, during, each, economic, eh, eighty, england, er, erm, etcetera, everybody, fall, fig, for, forty, found, from, full, get, government, ha, had, has, have, having, held, hello, himself, hm, however, hundred, i, ibm, if, important, in, inc, including, international, into, it, just, know, large, later, latter, let, life, ltd, made, may, methods, mhm, minus, mm, most, mr, mum, new, nineteen, ninety, nodded, nought, oh, okay, on, ooh, out, pence, percent, political, presence, provides, put, really, reckon, say, seemed, seriously, sixty, smiled, so, social, somebody, system, take, talking, than, the, they, thing, think, thirteen, though, thus, time, tt, tv, twenty, uk, under, urgh, us, usa, wants, was, we, who, with, world, yeah, yes, you, your
The data are provided in the form of a term-document matrix, where rows denote the 150 items and columns denote the 668 speakers in the corpus. Speakers with the label "UNKFEMALE", "UNKMALE", and "UNKMULTI" are not included in the dataset.
The first row of the term-document matrix gives the total number of words (i.e. number of word tokens) the speaker contributed to the corpus.
Biber, Douglas, Randi Reppen, Erin Schnur & Romy Ghanem. 2016. On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics 21(4). 439–464.
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216.
Love, Robbie, Claire Dembry, Andrew Hardie, Vaclav Brezina & Tony McEnery. 2017. The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics 22(3). 319–344.
This function calculates a number of parts-based dispersion measures and allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution. It also offers the option of calculating frequency-adjusted dispersion scores.
disp(subfreq, partsize, directionality = "conventional", freq_adjust = FALSE, freq_adjust_method = "even", unit_interval = TRUE, digits = NULL, verbose = TRUE, print_score = TRUE, suppress_warning = FALSE)
| Argument | Description |
|---|---|
| subfreq | A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part |
| partsize | A numeric vector specifying the size of the corpus parts |
| directionality | Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries" |
| freq_adjust | Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
| freq_adjust_method | Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
| unit_interval | Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
| digits | Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
| verbose | Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
| print_score | Logical. Whether the dispersion score should be printed to the console; default is TRUE |
| suppress_warning | Logical. Whether warning messages should be suppressed; default is FALSE |
This function calculates dispersion measures based on two vectors: a set of subfrequencies (number of occurrences of the item in each corpus part) and a matching set of part sizes (the size of the corpus parts, i.e. number of word tokens).
Directionality: The scores for all measures range from 0 to 1. The conventional scaling of dispersion measures (see Juilland & Chang-Rodriguez 1964; Carroll 1970; Rosengren 1971) assigns higher values to more even/dispersed/balanced distributions of subfrequencies across corpus parts. This is the default. Gries (2008) uses the reverse scaling, with higher values denoting a more uneven/bursty/concentrated distribution; use directionality = "gries" to choose this option.
Frequency adjustment: Dispersion scores can be adjusted for frequency using the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208). The frequency-adjusted score for an item considers the lowest and highest possible levels of dispersion it can obtain given its overall corpus frequency as well as the number (and size) of corpus parts. The unadjusted score is then expressed relative to these endpoints, where the dispersion minimum is set to 0 and the dispersion maximum to 1 (expressed in terms of conventional scaling). The frequency-adjusted score falls between these bounds and expresses how close the observed distribution is to the theoretical maximum and minimum. This adjustment therefore requires a maximally and a minimally dispersed distribution of the item across the parts. These hypothetical extremes can be built in different ways. Gries (2022, 2024) uses a computationally expensive procedure that finds the distribution that produces the highest value on the dispersion measure of interest. The current function constructs extreme distributions in a different way, based on the distributional features pervasiveness ("pervasive") or evenness ("even"). You can choose between these with the argument freq_adjust_method; the default is "even". For details and explanations, see vignette("frequency-adjustment").
To obtain the lowest possible level of dispersion, the occurrences are either allocated to as few corpus parts as possible ("pervasive"), or they are assigned to the smallest corpus part(s) ("even").
To obtain the highest possible level of dispersion, the occurrences are either spread as broadly across corpus parts as possible ("pervasive"), or they are allocated to corpus parts in proportion to their size ("even"). The choice between these methods is particularly relevant if corpus parts differ considerably in size. See documentation for find_max_disp() and vignette("frequency-adjustment").
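The min-max logic of the "even" method can be illustrated with a small base-R sketch. It is a simplification under stated assumptions, not the package's procedure: it uses the unmodified deviation of proportions in conventional scaling as the measure, the helper dp_conventional and all numbers are hypothetical, and the part sizes are chosen so that proportional allocation yields whole occurrences (the general case requires rounding, which is not handled here).

```r
# Hypothetical helper: unmodified DP, conventional scaling (1 = even)
dp_conventional <- function(subfreq, partsize) {
  t_i <- subfreq / sum(subfreq)    # share of occurrences per part
  w_i <- partsize / sum(partsize)  # share of corpus size per part
  1 - sum(abs(t_i - w_i)) / 2
}

partsize <- c(1000, 2000, 3000, 4000)
observed <- c(0, 5, 5, 0)
n_occ    <- sum(observed)

# Lowest dispersion ("even"): all occurrences in the smallest part
min_disp <- ifelse(seq_along(partsize) == which.min(partsize), n_occ, 0)

# Highest dispersion ("even"): occurrences in proportion to part size
max_disp <- n_occ * partsize / sum(partsize)

d_obs <- dp_conventional(observed, partsize)
d_min <- dp_conventional(min_disp, partsize)
d_max <- dp_conventional(max_disp, partsize)

# Min-max transformation: position of the observed score between the extremes
adjusted <- (d_obs - d_min) / (d_max - d_min)

# Counterpart of unit_interval = TRUE: clip to the unit interval
adjusted <- min(max(adjusted, 0), 1)
adjusted
```

Under the "pervasive" method the two extreme distributions would be constructed differently (as few parts as possible versus as many parts as possible), while the final transformation step stays the same.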
The following measures are computed, listed in chronological order (see details below):
$R_{rel}$ (Keniston 1920)
$D$ (Juilland & Chang-Rodriguez 1964)
$D_2$ (Carroll 1970)
$S$ (Rosengren 1971)
$D_P$ (Gries 2008; modification: Egbert et al. 2020)
$D_A$ (Burch et al. 2017)
$D_{KL}$ (Gries 2024)
In the formulas given below, the following notation is used:
$k$ the number of corpus parts
$T_i$ the absolute subfrequency in part $i$
$t_i$ a proportional quantity; the subfrequency in part $i$ divided by the total number of occurrences of the item in the corpus (i.e. the sum of all subfrequencies)
$W_i$ the absolute size of corpus part $i$
$w_i$ a proportional quantity; the size of corpus part $i$ divided by the size of the corpus (i.e. the sum of the part sizes)
$R_i$ the normalized subfrequency in part $i$, i.e. the subfrequency divided by the size of the corpus part
$r_i$ a proportional quantity; the normalized subfrequency in part $i$ divided by the sum of all normalized subfrequencies
$N$ corpus frequency, i.e. the total number of occurrences of the item in the corpus
Note that the formulas cited below differ in their scaling, i.e. whether 1 reflects an even or an uneven distribution. In the current function, this behavior is overridden by the argument directionality. The specific scaling used in the formulas below is therefore irrelevant.
$R_{rel}$ refers to the relative range, i.e. the proportion of corpus parts containing at least one occurrence of the item.
$D$ denotes Juilland's D and is calculated as follows (this formula uses conventional scaling); $\bar{R}$ refers to the average over the normalized subfrequencies:
$$D = 1 - \sqrt{\frac{\sum_{i=1}^{k} (R_i - \bar{R})^2}{k}} \times \frac{1}{\bar{R} \sqrt{k - 1}}$$
$D_2$ denotes the index proposed by Carroll (1970); the following formula uses conventional scaling:
$$D_2 = \frac{\sum_{i=1}^{k} r_i \log_2 \frac{1}{r_i}}{\log_2 k}$$
$S$ is the dispersion measure proposed by Rosengren (1971); the formula uses conventional scaling:
$$S = \frac{\left(\sum_{i=1}^{k} \sqrt{w_i T_i}\right)^2}{N}$$
$D_P$ represents Gries's deviation of proportions; the following formula is the modified version suggested by Egbert et al. (2020: 99); it implements conventional scaling (0 = uneven, 1 = even) and the notation $\min\{w_i: t_i > 0\}$ refers to the smallest $w_i$ value among those corpus parts that include at least one occurrence of the item.
$$D_P = 1 - \frac{\sum_{i=1}^{k} |t_i - w_i|}{2} \times \frac{1}{1 - \min\{w_i: t_i > 0\}}$$
$D_A$ is a measure introduced into dispersion analysis by Burch et al. (2017). The following formula is the one used by Egbert et al. (2020: 98); it relies on normalized frequencies and therefore works with corpus parts of different size. The formula represents conventional scaling (0 = uneven, 1 = even):
$$D_A = 1 - \frac{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} |R_i - R_j|}{\frac{k(k-1)}{2}} \times \frac{1}{2\bar{R}}$$
The current function uses a formula that may be found in Wilcox (1973: 343). It relies on the proportional $r_i$ values instead of the normalized subfrequencies $R_i$:
$$D_A = 1 - \frac{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} |r_i - r_j|}{k - 1}$$
Since this formula is computationally expensive, the function actually uses the computational shortcut given in Wilcox (1973: 343). Critically, the proportional quantities $r_i$ must first be sorted in decreasing order. Only after this rearrangement can the shortcut version be applied. We will refer to this rearranged version of $r_i$ as $r_i^{sorted}$:
$$D_A = \frac{2\left(\sum_{i=1}^{k} (i \times r_i^{sorted}) - 1\right)}{k - 1}$$
(Wilcox 1973: 343)
$D_{KL}$ refers to a measure proposed by Gries (2020, 2021); for standardization, it uses the odds-to-probability transformation (Gries 2024: 90) and represents Gries scaling (0 = even, 1 = uneven):
$$D_{KL} = \frac{KLD}{1 + KLD} \quad \text{with} \quad KLD = \sum_{i=1}^{k} t_i \log_2 \frac{t_i}{w_i}$$
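To make the notation concrete, the following base-R sketch computes five of the measures by hand for a small set of made-up counts. It is written from the standard definitions of these measures in the cited literature and is not the package's own code; all object names are chosen here for illustration, and every score is in conventional scaling.

```r
subfreq  <- c(0, 0, 1, 2, 5)      # T_i: absolute subfrequencies
partsize <- rep(1000, 5)          # W_i: absolute part sizes

k   <- length(subfreq)            # number of corpus parts
N   <- sum(subfreq)               # corpus frequency
t_i <- subfreq / N                # proportional subfrequencies
w_i <- partsize / sum(partsize)   # proportional part sizes
R_i <- subfreq / partsize         # normalized subfrequencies
r_i <- R_i / sum(R_i)             # proportional normalized subfrequencies

# Relative range: share of parts with at least one occurrence
R_rel <- mean(subfreq > 0)

# Juilland's D: based on the coefficient of variation of R_i
D <- 1 - sqrt(mean((R_i - mean(R_i))^2)) / (mean(R_i) * sqrt(k - 1))

# Carroll's D2: normalized entropy; empty parts contribute 0
D2 <- sum(ifelse(r_i > 0, r_i * log2(1 / r_i), 0)) / log2(k)

# Rosengren's S
S <- sum(sqrt(w_i * subfreq))^2 / N

# DP, with the normalization based on the smallest occupied part
DP <- 1 - (sum(abs(t_i - w_i)) / 2) / (1 - min(w_i[t_i > 0]))

round(c(R_rel = R_rel, D = D, D2 = D2, S = S, DP = DP), 3)
```

Because the five parts are equal in size here, the absolute and normalized subfrequencies lead to the same proportions (`t_i` equals `r_i`); with parts of unequal size the two would differ.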
A numeric vector of seven dispersion scores
Lukas Soenning
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216. doi:10.1558/jrds.33066
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x
Egbert, Jesse, Brent Burch & Douglas Biber. 2020. Lexical dispersion and corpus design. International Journal of Corpus Linguistics 25(1). 89–115. doi:10.1075/ijcl.18010.egb
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri
Gries, Stefan Th. 2020. Analyzing dispersion. In Magali Paquot & Stefan Th. Gries (eds.), A practical handbook of corpus linguistics, 99–118. New York: Springer. doi:10.1007/978-3-030-46216-1_5
Gries, Stefan Th. 2021. A new approach to (key) keywords analysis: Using frequency, and now also dispersion. Research in Corpus Linguistics 9(2). 1–33. doi:10.32714/ricl.09.02.02
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467
Keniston, Hayward. 1920. Common words in Spanish. Hispania 3(2). 85–96. doi:10.2307/331305
Lijffijt, Jefrey & Stefan Th. Gries. 2012. Correction to Stefan Th. Gries’ ‘Dispersions and adjusted frequencies in corpora’. International Journal of Corpus Linguistics 17(1). 147–149. doi:10.1075/ijcl.17.1.08lij
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
For finer control over the calculation of several dispersion measures:
disp_R() for $R_{rel}$
disp_DP() for $D_P$
disp_DA() for $D_A$
disp_DKL() for $D_{KL}$
disp_DP(subfreq = c(0,0,1,2,5), partsize = rep(1000, 5), directionality = "conventional", freq_adjust = FALSE)
This function calculates the dispersion measure $D_A$. It allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution. It also provides the option of calculating frequency-adjusted dispersion scores.
disp_DA(subfreq, partsize, directionality = "conventional", freq_adjust = FALSE, freq_adjust_method = "even", unit_interval = TRUE, digits = NULL, verbose = TRUE, print_score = TRUE, suppress_warning = FALSE)
subfreq |
A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part |
partsize |
A numeric vector specifying the size of the corpus parts |
directionality |
Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries" |
freq_adjust |
Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
unit_interval |
Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
digits |
Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
verbose |
Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
print_score |
Logical. Whether the dispersion score should be printed to the console; default is TRUE |
suppress_warning |
Logical. Whether warning messages should be suppressed; default is FALSE |
The function calculates the dispersion measure $D_A$ based on a set of subfrequencies (number of occurrences of the item in each corpus part) and a matching set of part sizes (the size of the corpus parts, i.e. number of word tokens). The function uses the shortcut formula ("computational" procedure) given in Wilcox (1973: 343), where $D_A$ is referred to as MDA.
Directionality: $D_A$ ranges from 0 to 1. The conventional scaling of dispersion measures (see Juilland & Chang-Rodriguez 1964; Carroll 1970; Rosengren 1971) assigns higher values to more even/dispersed/balanced distributions of subfrequencies across corpus parts. This is the default. Gries (2008) uses the reverse scaling, with higher values denoting a more uneven/bursty/concentrated distribution; use directionality = "gries" to choose this option.
Frequency adjustment: Dispersion scores can be adjusted for frequency using the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208). The frequency-adjusted score for an item takes into account the lowest and highest level of dispersion the item can attain given its overall corpus frequency and the number (and size) of corpus parts. The unadjusted score is expressed relative to these endpoints, with the dispersion minimum set to 0 and the dispersion maximum set to 1 (in terms of conventional scaling). The frequency-adjusted score therefore indicates where the observed distribution lies between the theoretical minimum and maximum. The adjustment requires a maximally and a minimally dispersed distribution of the item across the parts, and these hypothetical extremes can be built in different ways. Gries (2022, 2024) uses a computationally expensive procedure that searches for the distribution producing the highest value on the dispersion measure of interest. The current function constructs the extreme distributions differently, based on the distributional features pervasiveness ("pervasive") or evenness ("even"). You can choose between these with the argument freq_adjust_method; the default is "even". For details and explanations, see vignette("frequency-adjustment").
To obtain the lowest possible level of dispersion, the occurrences are either allocated to as few corpus parts as possible ("pervasive"), or they are assigned to the smallest corpus part(s) ("even").
To obtain the highest possible level of dispersion, the occurrences are either spread as broadly across corpus parts as possible ("pervasive"), or they are allocated to corpus parts in proportion to their size ("even"). The choice between these methods is particularly relevant if corpus parts differ considerably in size. See documentation for find_max_disp() and vignette("frequency-adjustment").
In the formulas given below, the following notation is used:
$k$ the number of corpus parts
$R_i$ the normalized subfrequency in part $i$, i.e. the number of occurrences of the item divided by the size of the part
$r_i$ a proportional quantity; the normalized subfrequency in part $i$ ($R_i$) divided by the sum of all normalized subfrequencies
The basic formula for $D_A$ (see Wilcox 1973: 329, 343; Burch et al. 2017: 194; Egbert et al. 2020: 98) can be applied to absolute frequencies or normalized frequencies. For dispersion analysis, absolute frequencies only make sense if the corpus parts are identical in size. Wilcox (1973: 343, 'MDA', column 1 and 2) gives both variants of the basic version. The first use of $D_A$ for corpus-linguistic dispersion analysis appears in Burch et al. (2017: 194), a paper that deals with equal-sized parts and therefore uses the variant for absolute frequencies. Egbert et al. (2020: 98) rely on the variant using normalized frequencies. Since this variant of the basic version of $D_A$ works irrespective of the length of the corpus parts (equal or variable), we will only give this version of the formula. Note that while the formula represents conventional scaling (0 = uneven, 1 = even), in the current function the directionality is controlled separately using the argument directionality.
$$D_A = 1 - \frac{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} |R_i - R_j|}{\frac{k(k-1)}{2}} \times \frac{1}{2 \frac{\sum_{i=1}^{k} R_i}{k}}$$
(Egbert et al. 2020: 98)
The function uses a different version of the same formula, which relies on the proportional $r_i$ values instead of the normalized subfrequencies $R_i$. This version yields identical results; the $r_i$ quantities are also the key to using the computational shortcut given in Wilcox (1973: 343), on which the calculations in the {tlda} package rely. This is the basic formula for $D_A$ using $r_i$ instead of $R_i$ values:
$$D_A = 1 - \frac{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} |r_i - r_j|}{k - 1}$$
(Wilcox 1973: 343; see also Soenning 2022)
Functions for $D_A$ in the {tlda} package use the computational shortcut given in Wilcox (1973: 343). Critically, the proportional quantities $r_i$ must first be sorted in decreasing order. Only after this rearrangement can the shortcut version be applied. We will refer to this rearranged version of $r_i$ as $r_i^{sorted}$:
$$D_A = \frac{2\left(\sum_{i=1}^{k} (i \times r_i^{sorted}) - 1\right)}{k - 1}$$
(Wilcox 1973: 343)
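The equivalence of the basic version and the shortcut can be checked with a few lines of base R. This is a hand-written sketch of the two formulas as described in the text, not the code used inside the package; the object names are chosen here for illustration.

```r
subfreq  <- c(0, 0, 1, 2, 5)
partsize <- rep(1000, 5)

k   <- length(subfreq)       # number of corpus parts
R_i <- subfreq / partsize    # normalized subfrequencies
r_i <- R_i / sum(R_i)        # proportional quantities

# Basic version: absolute differences over all pairs with i < j
pairwise <- abs(outer(r_i, r_i, "-"))
DA_basic <- 1 - sum(pairwise[upper.tri(pairwise)]) / (k - 1)

# Shortcut: sort in decreasing order, then weight each value by its rank
r_sorted <- sort(r_i, decreasing = TRUE)
DA_short <- 2 * (sum(seq_len(k) * r_sorted) - 1) / (k - 1)

c(basic = DA_basic, shortcut = DA_short)
```

Both versions give 0.25 for these counts (conventional scaling). The basic version needs all k(k-1)/2 pairwise differences, whereas the shortcut needs only one sort and one weighted sum, which is why it scales better to large numbers of corpus parts.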
A numeric value
Lukas Soenning
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216. doi:10.1558/jrds.33066
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x
Egbert, Jesse, Brent Burch & Douglas Biber. 2020. Lexical dispersion and corpus design. International Journal of Corpus Linguistics 25(1). 89–115. doi:10.1075/ijcl.18010.egb
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
Soenning, Lukas. 2022. Evaluation of text-level measures of lexical dispersion: Robustness and consistency. PsyArXiv preprint. https://osf.io/preprints/psyarxiv/h9mvs/
Wilcox, Allen R. 1973. Indices of qualitative variation and political measurement. The Western Political Quarterly 26(2). 325–343. doi:10.2307/446831
disp_DA(subfreq = c(0,0,1,2,5), partsize = rep(1000, 5), directionality = "conventional", freq_adjust = FALSE)
This function calculates the dispersion measure $D_A$ for a term-document matrix. It allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution. It also provides the option of calculating frequency-adjusted dispersion scores.
disp_DA_tdm(tdm, row_partsize = "first", directionality = "conventional", freq_adjust = FALSE, freq_adjust_method = "even", add_frequency = TRUE, unit_interval = TRUE, digits = NULL, verbose = TRUE, print_scores = TRUE)
tdm |
A term-document matrix, where rows represent items and columns represent corpus parts; must also contain a row giving the size of the corpus parts (first or last row in the term-document matrix) |
row_partsize |
Character string indicating which row in the term-document matrix contains the size of the corpus parts. Possible values are "first" (default) and "last" |
directionality |
Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries" |
freq_adjust |
Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
add_frequency |
Logical. Whether to add a column that gives the total number of occurrences of the item across all corpus parts; default is TRUE |
unit_interval |
Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
digits |
Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
verbose |
Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
print_scores |
Logical. Whether the dispersion scores should be printed to the console; default is TRUE |
This function takes as input a term-document matrix and returns, for each item (i.e. each row), the dispersion measure $D_A$. The rows in the input matrix represent the items, and the columns the corpus parts. Importantly, the term-document matrix must include an additional row that records the size of the corpus parts. For a proper term-document matrix, which includes all items that appear in the corpus, this row can be added as a column margin, which sums the frequencies in each column. If the matrix only includes a selection of items drawn from the corpus, this information cannot be derived from the matrix and must be provided as a separate row. The function uses the shortcut formula ("computational" procedure) given in Wilcox (1973: 343), where $D_A$ is referred to as MDA.
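The expected layout of the input can be illustrated with a toy matrix built in base R. All counts, item names and part sizes below are made up, and the row name `word_count` is an arbitrary choice for this sketch.

```r
# Toy counts for three items (rows) across four corpus parts (columns)
counts <- rbind(
  the    = c(52, 61, 48, 70),
  of     = c(20, 25, 18, 31),
  whilst = c(0, 3, 0, 1)
)
colnames(counts) <- paste0("text_", 1:4)

# The three rows are only a selection of items, so the part sizes cannot
# be recovered from column sums and have to be supplied separately
word_count <- c(1000, 1200, 900, 1500)

# Put the part sizes in the first row, matching row_partsize = "first"
tdm <- rbind(word_count = word_count, counts)
tdm
```

For a complete term-document matrix, `colSums(counts)` could take the place of the separately supplied `word_count` vector.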
Directionality: $D_A$ ranges from 0 to 1. The conventional scaling of dispersion measures (see Juilland & Chang-Rodriguez 1964; Carroll 1970; Rosengren 1971) assigns higher values to more even/dispersed/balanced distributions of subfrequencies across corpus parts. This is the default. Gries (2008) uses the reverse scaling, with higher values denoting a more uneven/bursty/concentrated distribution; use directionality = "gries" to choose this option.
Frequency adjustment: Dispersion scores can be adjusted for frequency using the min-max transformation proposed by Gries (2022, 2024). The frequency-adjusted score for an item takes into account the lowest and highest level of dispersion the item can attain given its overall corpus frequency and the number (and size) of corpus parts. The unadjusted score is expressed relative to these endpoints, with the dispersion minimum set to 0 and the dispersion maximum set to 1 (in terms of conventional scaling). The frequency-adjusted score therefore indicates where the observed distribution lies between the theoretical minimum and maximum. The adjustment requires a maximally and a minimally dispersed distribution of the item across the parts, and these hypothetical extremes can be built in different ways. Gries (2022, 2024) uses a computationally expensive procedure that searches for the distribution producing the highest value on the dispersion measure of interest. The current function constructs the extreme distributions differently, based on the distributional features pervasiveness ("pervasive") or evenness ("even"). You can choose between these with the argument freq_adjust_method; the default is "even". For details and explanations, see vignette("frequency-adjustment").
To obtain the lowest possible level of dispersion, the occurrences are either allocated to as few corpus parts as possible ("pervasive"), or they are assigned to the smallest corpus part(s) ("even").
To obtain the highest possible level of dispersion, the occurrences are either spread as broadly across corpus parts as possible ("pervasive"), or they are allocated to corpus parts in proportion to their size ("even"). The choice between these methods is particularly relevant if corpus parts differ considerably in size. See documentation for find_max_disp().
In the formulas given below, the following notation is used:
$k$ the number of corpus parts
$R_i$ the normalized subfrequency in part $i$, i.e. the number of occurrences of the item divided by the size of the part
$r_i$ a proportional quantity; the normalized subfrequency in part $i$ ($R_i$) divided by the sum of all normalized subfrequencies
The basic formula for $D_A$ (see Wilcox 1973: 329, 343; Burch et al. 2017: 194; Egbert et al. 2020: 98) can be applied to absolute frequencies or normalized frequencies. For dispersion analysis, absolute frequencies only make sense if the corpus parts are identical in size. Wilcox (1973: 343, 'MDA', column 1 and 2) gives both variants of the basic version. The first use of $D_A$ for corpus-linguistic dispersion analysis appears in Burch et al. (2017: 194), a paper that deals with equal-sized parts and therefore uses the variant for absolute frequencies. Egbert et al. (2020: 98) rely on the variant using normalized frequencies. Since this variant of the basic version of $D_A$ works irrespective of the length of the corpus parts (equal or variable), we will only give this version of the formula. Note that while the formula represents conventional scaling (0 = uneven, 1 = even), in the current function the directionality is controlled separately using the argument directionality.
$$D_A = 1 - \frac{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} |R_i - R_j|}{\frac{k(k-1)}{2}} \times \frac{1}{2 \frac{\sum_{i=1}^{k} R_i}{k}}$$
(Egbert et al. 2020: 98)
The function uses a different version of the same formula, which relies on the proportional $r_i$ values instead of the normalized subfrequencies $R_i$. This version yields identical results; the $r_i$ quantities are also the key to using the computational shortcut given in Wilcox (1973: 343), on which the calculations in the {tlda} package rely. This is the basic formula for $D_A$ using $r_i$ instead of $R_i$ values:
$$D_A = 1 - \frac{\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} |r_i - r_j|}{k - 1}$$
(Wilcox 1973: 343; see also Soenning 2022)
Functions for $D_A$ in the {tlda} package use the computational shortcut given in Wilcox (1973: 343). Critically, the proportional quantities $r_i$ must first be sorted in decreasing order. Only after this rearrangement can the shortcut version be applied. We will refer to this rearranged version of $r_i$ as $r_i^{sorted}$:
$$D_A = \frac{2\left(\sum_{i=1}^{k} (i \times r_i^{sorted}) - 1\right)}{k - 1}$$
(Wilcox 1973: 343)
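Applied row by row, the shortcut yields one score per item. The sketch below is an illustration in base R under stated assumptions: `DA_shortcut()` is a hypothetical helper written for this example, the matrix contains made-up counts, and the column names of the resulting data frame are not meant to reproduce the function's actual output.

```r
# Toy term-document matrix; the first row holds the part sizes
tdm <- rbind(
  word_count = c(1000, 1000, 1000, 1000),
  common     = c(10, 10, 10, 10),
  bursty     = c(0, 0, 0, 8)
)

# Hypothetical helper: shortcut formula for one row of counts
# (conventional scaling: 1 = even, 0 = uneven)
DA_shortcut <- function(subfreq, partsize) {
  R_i <- subfreq / partsize
  r_sorted <- sort(R_i / sum(R_i), decreasing = TRUE)
  k <- length(subfreq)
  2 * (sum(seq_len(k) * r_sorted) - 1) / (k - 1)
}

items  <- tdm[-1, , drop = FALSE]
scores <- apply(items, 1, DA_shortcut, partsize = tdm[1, ])

# One row per item, with the total frequency added as a further column
result <- data.frame(
  item      = names(scores),
  DA        = unname(scores),
  frequency = unname(rowSums(items))
)
result
```

The evenly distributed item receives a score of 1 and the item confined to a single part a score of 0, the two endpoints of the conventional scale.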
A data frame with one row per item
Lukas Soenning
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216. doi:10.1558/jrds.33066
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x
Egbert, Jesse, Brent Burch & Douglas Biber. 2020. Lexical dispersion and corpus design. International Journal of Corpus Linguistics 25(1). 89–115. doi:10.1075/ijcl.18010.egb
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
Soenning, Lukas. 2022. Evaluation of text-level measures of lexical dispersion: Robustness and consistency. PsyArXiv preprint. https://osf.io/preprints/psyarxiv/h9mvs/
Wilcox, Allen R. 1973. Indices of qualitative variation and political measurement. The Western Political Quarterly 26(2). 325–343. doi:10.2307/446831
disp_DA_tdm(tdm = biber150_spokenBNC2014[1:20,], row_partsize = "first", directionality = "conventional", freq_adjust = FALSE)
This function calculates the dispersion measure $D_{KL}$, which is based on the Kullback-Leibler divergence (Gries 2020, 2021, 2024). It offers three options for standardization to the unit interval [0,1] (see Gries 2024: 90-92) and allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution. It also offers the option of calculating frequency-adjusted dispersion scores.
disp_DKL(subfreq, partsize, directionality = "conventional", standardization = "o2p", custom_base = NULL, freq_adjust = FALSE, freq_adjust_method = "even", unit_interval = TRUE, digits = NULL, verbose = TRUE, print_score = TRUE, suppress_warning = FALSE)
subfreq |
A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part |
partsize |
A numeric vector specifying the size of the corpus parts |
directionality |
Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries" |
standardization |
Character string indicating which standardization method to use. See details below. Possible values are "o2p" (default), "base_e", "base_2", and "custom" |
custom_base |
A numeric value specifying the custom base for standardization; only works with standardization = "custom" |
freq_adjust |
Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
unit_interval |
Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
digits |
Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
verbose |
Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
print_score |
Logical. Whether the dispersion score should be printed to the console; default is TRUE |
suppress_warning |
Logical. Whether warning messages should be suppressed; default is FALSE |
The function calculates the dispersion measure $D_{KL}$ based on a set of subfrequencies (number of occurrences of the item in each corpus part) and a matching set of part sizes (the size of the corpus parts, i.e. number of word tokens).
Directionality: $D_{KL}$ ranges from 0 to 1. The conventional scaling of dispersion measures (see Juilland & Chang-Rodriguez 1964; Carroll 1970; Rosengren 1971) assigns higher values to more even/dispersed/balanced distributions of subfrequencies across corpus parts. This is the default. Gries (2008) uses the reverse scaling, with higher values denoting a more uneven/bursty/concentrated distribution; use directionality = "gries" to choose this option.
Standardization: Irrespective of the directionality of scaling, three ways of standardizing the Kullback-Leibler divergence to the unit interval [0,1] are mentioned in Gries (2024: 90-92). The choice between these transformations can have an appreciable effect on the standardized dispersion score. In Gries (2020: 103-104), the Kullback-Leibler divergence is not standardized. In Gries (2021: 20), the transformation "base_e" is used (see (1) below), and in Gries (2024), the default strategy is "o2p", the odds-to-probability transformation (see (3) below).
Frequency adjustment: Dispersion scores can be adjusted for frequency using the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208). The frequency-adjusted score for an item takes into account the lowest and highest level of dispersion the item can attain given its overall corpus frequency and the number (and size) of corpus parts. The unadjusted score is expressed relative to these endpoints, with the dispersion minimum set to 0 and the dispersion maximum set to 1 (in terms of conventional scaling). The frequency-adjusted score therefore indicates where the observed distribution lies between the theoretical minimum and maximum. The adjustment requires a maximally and a minimally dispersed distribution of the item across the parts, and these hypothetical extremes can be built in different ways. Gries (2022, 2024) uses a computationally expensive procedure that searches for the distribution producing the highest value on the dispersion measure of interest. The current function constructs the extreme distributions differently, based on the distributional features pervasiveness ("pervasive") or evenness ("even"). You can choose between these with the argument freq_adjust_method; the default is "even". For details and explanations, see vignette("frequency-adjustment").
To obtain the lowest possible level of dispersion, the occurrences are either allocated to as few corpus parts as possible ("pervasive"), or they are assigned to the smallest corpus part(s) ("even").
To obtain the highest possible level of dispersion, the occurrences are either spread as broadly across corpus parts as possible ("pervasive"), or they are allocated to corpus parts in proportion to their size ("even"). The choice between these methods is particularly relevant if corpus parts differ considerably in size. See documentation for find_max_disp() and vignette("frequency-adjustment").
In the formulas given below, the following notation is used:
$t_i$ a proportional quantity; the subfrequency in part $i$ divided by the total number of occurrences of the item in the corpus (i.e. the sum of all subfrequencies)
$w_i$ a proportional quantity; the size of corpus part $i$ divided by the size of the corpus (i.e. the sum of the part sizes)
The first step is to calculate the Kullback-Leibler divergence based on the proportional subfrequencies ($t_i$) and the size of the corpus parts ($w_i$):
$$KLD = \sum_{i} t_i \log_2 \frac{t_i}{w_i}$$
with $\log_2(0) = 0$
This KLD score is then standardized (i.e. transformed) to the conventional unit interval [0,1]. Three options are discussed in Gries (2024: 90-92). The following formulas represent Gries scaling (0 = even, 1 = uneven):
(1) $1 - e^{-KLD}$ (Gries 2021: 20), represented by the value "base_e"
(2) $1 - 2^{-KLD}$ (Gries 2024: 90), represented by the value "base_2"
(3) $\frac{KLD}{1 + KLD}$ (Gries 2024: 90), represented by the value "o2p" (default)
A fourth option is available which allows the user to select a custom base for standardization (i.e. a value other than $e$ ("base_e") and 2 ("base_2")). If the argument standardization is set to "custom", a numeric value must be supplied to the argument custom_base.
(4) $1 - b^{-KLD}$ (with $b$ representing a numeric base), represented by the value "custom" and custom_base = b
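The two steps (divergence, then standardization) can be traced in base R. This sketch follows the definitions given in the text and assumes a base-2 logarithm for the divergence; it is not the package's own code, the object names are chosen for illustration, and all three scores are in Gries scaling (0 = even, 1 = uneven).

```r
subfreq  <- c(0, 0, 1, 2, 5)
partsize <- rep(1000, 5)

t_i <- subfreq / sum(subfreq)      # proportional subfrequencies
w_i <- partsize / sum(partsize)    # proportional part sizes

# Step 1: Kullback-Leibler divergence (base-2 logarithm assumed);
# parts without occurrences contribute 0 to the sum
KLD <- sum(ifelse(t_i > 0, t_i * log2(t_i / w_i), 0))

# Step 2: standardization to the unit interval
DKL_base_e <- 1 - exp(-KLD)        # option (1), "base_e"
DKL_base_2 <- 1 - 2^(-KLD)         # option (2), "base_2"
DKL_o2p    <- KLD / (1 + KLD)      # option (3), "o2p"

round(c(KLD = KLD, base_e = DKL_base_e, base_2 = DKL_base_2, o2p = DKL_o2p), 3)
```

For the same divergence, the three transformations return noticeably different scores, which illustrates why the choice of standardization should be reported alongside the result.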
A numeric value
Lukas Soenning
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x
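Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri
Gries, Stefan Th. 2020. Analyzing dispersion. In Magali Paquot & Stefan Th. Gries (eds.), A practical handbook of corpus linguistics, 99–118. New York: Springer. doi:10.1007/978-3-030-46216-1_5
Gries, Stefan Th. 2021. A new approach to (key) keywords analysis: Using frequency, and now also dispersion. Research in Corpus Linguistics 9(2). 1–33. doi:10.32714/ricl.09.02.02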
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
disp_DKL(subfreq = c(0,0,1,2,5), partsize = rep(1000, 5), standardization = "base_e", directionality = "conventional")
This function calculates, for a term-document matrix, the dispersion measure $D_{KL}$, which is based on the Kullback-Leibler divergence (Gries 2020, 2021, 2024). It offers three different options for standardization to the unit interval [0,1] (see Gries 2024: 90-92) and allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution. It also offers the option of calculating frequency-adjusted dispersion scores.
disp_DKL_tdm(tdm, row_partsize = "first", directionality = "conventional", standardization = "o2p", custom_base = NULL, freq_adjust = FALSE, freq_adjust_method = "even", add_frequency = TRUE, unit_interval = TRUE, digits = NULL, verbose = TRUE, print_scores = TRUE)
tdm |
A term-document matrix, where rows represent items and columns represent corpus parts; must also contain a row giving the size of the corpus parts (first or last row in the term-document matrix) |
row_partsize |
Character string indicating which row in the term-document matrix contains the size of the corpus parts. Possible values are "first" (default) and "last" |
directionality |
Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries" |
standardization |
Character string indicating which standardization method to use. See details below. Possible values are "o2p" (default), "base_e", "base_2", and "custom" |
custom_base |
A numeric value specifying the custom base for standardization; only works with standardization = "custom" |
freq_adjust |
Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
add_frequency |
Logical. Whether to add a column that gives the total number of occurrences of the item across all corpus parts; default is TRUE |
unit_interval |
Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
digits |
Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
verbose |
Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
print_scores |
Logical. Whether the dispersion scores should be printed to the console; default is TRUE |
This function takes as input a term-document matrix and returns, for each item (i.e. each row), the dispersion measure D~KL~. The rows in the input matrix represent the items, and the columns the corpus parts. Importantly, the term-document matrix must include an additional row that records the size of the corpus parts. For a proper term-document matrix, which includes all items that appear in the corpus, this can be added as a column margin, which sums the frequencies in each column. If the matrix only includes a selection of items drawn from the corpus, this information cannot be derived from the matrix and must be provided as a separate row.
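As a minimal sketch of the input layout described above, the following base R code builds a small term-document matrix with the part sizes in the first row. The item names, part names, counts, and part sizes are invented for illustration and are not taken from the package data.

```r
# Hypothetical counts: 2 items (rows) across 3 corpus parts (columns)
counts <- matrix(
  c(3, 0, 1,
    0, 2, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("item_a", "item_b"), c("part_1", "part_2", "part_3"))
)

# The matrix holds only a selection of items, so the part sizes cannot be
# derived from column sums and are supplied separately (hypothetical values)
part_sizes <- c(part_1 = 1000, part_2 = 1500, part_3 = 800)

# Place the part sizes in the first row, matching row_partsize = "first"
tdm <- rbind(word_count = part_sizes, counts)

# For a complete matrix (all items in the corpus), the column margin
# would give the part sizes instead (counts_full is hypothetical):
# tdm_full <- rbind(word_count = colSums(counts_full), counts_full)
```

A matrix of this shape could then be supplied as tdm together with row_partsize = "first".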
Directionality: D~KL~ ranges from 0 to 1. The conventional scaling of dispersion measures (see Juilland & Chang-Rodriguez 1964; Carroll 1970; Rosengren 1971) assigns higher values to more even/dispersed/balanced distributions of subfrequencies across corpus parts. This is the default. Gries (2008) uses the reverse scaling, with higher values denoting a more uneven/bursty/concentrated distribution; use directionality = "gries" to choose this option.
Standardization: Irrespective of the directionality of scaling, three ways of standardizing the Kullback-Leibler divergence to the unit interval [0,1] are mentioned in Gries (2024: 90-92). The choice between these transformations can have an appreciable effect on the standardized dispersion score. In Gries (2020: 103-104), the Kullback-Leibler divergence is not standardized. In Gries (2021: 20), the transformation 'base_e' is used (see (1) below), and in Gries (2024), the default strategy is 'o2p', the odds-to-probability transformation (see (3) below).
Frequency adjustment: Dispersion scores can be adjusted for frequency using the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208). The frequency-adjusted score for an item considers the lowest and highest possible level of dispersion it can obtain given its overall corpus frequency as well as the number (and size) of corpus parts. The unadjusted score is then expressed relative to these endpoints, where the dispersion minimum is set to 0, and the dispersion maximum to 1 (expressed in terms of conventional scaling). The frequency-adjusted score falls between these bounds and expresses how close the observed distribution is to the theoretical maximum and minimum. This adjustment therefore requires a maximally and a minimally dispersed distribution of the item across the parts. These hypothetical extremes can be built in different ways. Gries (2022, 2024) uses a computationally expensive procedure that finds the distribution that produces the highest value on the dispersion measure of interest. The current function constructs extreme distributions in a different way, based on the distributional features pervasiveness ("pervasive") or evenness ("even"). You can choose between these with the argument freq_adjust_method; the default is "even". For details and explanations, see vignette("frequency-adjustment").
To obtain the lowest possible level of dispersion, the occurrences are either allocated to as few corpus parts as possible (pervasive), or they are assigned to the smallest corpus part(s) (even).
To obtain the highest possible level of dispersion, the occurrences are either spread as broadly across corpus parts as possible (pervasive), or they are allocated to corpus parts in proportion to their size (even). The choice between these methods is particularly relevant if corpus parts differ considerably in size. See documentation for find_max_disp().
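The min-max rescaling step can be sketched in a few lines of base R. This is only an illustration of the transformation, not the package's implementation: the function name adjust_for_frequency and its arguments disp_min and disp_max are hypothetical, and the two extreme scores are supplied by hand here rather than derived from constructed distributions.

```r
# Min-max rescaling of a dispersion score (conventional scaling).
# disp_min and disp_max are the scores of the minimally and maximally
# dispersed hypothetical distributions; here they are supplied by hand.
adjust_for_frequency <- function(score, disp_min, disp_max,
                                 unit_interval = TRUE) {
  adjusted <- (score - disp_min) / (disp_max - disp_min)
  if (unit_interval) {
    # Replace values outside [0, 1] with the nearest limit
    adjusted <- pmin(pmax(adjusted, 0), 1)
  }
  adjusted
}

# Invented values: observed score 0.50, attainable range 0.20 to 0.80
adjusted_score <- adjust_for_frequency(0.50, disp_min = 0.20, disp_max = 0.80)
```

An observed score halfway between the two extremes is mapped to the midpoint of the unit interval; scores outside the constructed range are clipped when unit_interval = TRUE.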
In the formulas given below, the following notation is used:
$t_i$: a proportional quantity; the subfrequency in part $i$ divided by the total number of occurrences of the item in the corpus (i.e. the sum of all subfrequencies)
$w_i$: a proportional quantity; the size of corpus part $i$ divided by the size of the corpus (i.e. the sum of the part sizes)
The first step is to calculate the Kullback-Leibler divergence based on the proportional subfrequencies ($t_i$) and the size of the corpus parts ($w_i$):
$KLD = \sum_{i} t_i \log_2 \frac{t_i}{w_i}$
with parts where $t_i = 0$ contributing 0 to the sum
This KLD score is then standardized (i.e. transformed) to the conventional unit interval [0,1]. Three options are discussed in Gries (2024: 90-92). The following formulas represent Gries scaling (0 = even, 1 = uneven):
(1) $1 - e^{-KLD}$ (Gries 2021: 20), represented by the value 'base_e'
(2) $1 - 2^{-KLD}$ (Gries 2024: 90), represented by the value 'base_2'
(3) $\frac{KLD}{1 + KLD}$ (Gries 2024: 90), represented by the value 'o2p' (default)
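The two steps (divergence, then standardization) can be sketched in base R as follows. This is a hedged illustration rather than the package's code: the helper names kld and standardize_kld are hypothetical, and the sketch assumes a base-2 logarithm and the transformations $1 - e^{-KLD}$, $1 - 2^{-KLD}$, $KLD / (1 + KLD)$ and, for a custom base $b$, $1 - b^{-KLD}$. Obtaining conventional scaling as 1 minus the Gries-scaled score is likewise an assumption.

```r
# Kullback-Leibler divergence of the proportional subfrequencies (t_i)
# from the proportional part sizes (w_i), using log base 2 (assumed)
kld <- function(subfreq, partsize) {
  t_i <- subfreq / sum(subfreq)
  w_i <- partsize / sum(partsize)
  # Parts without occurrences (t_i = 0) contribute 0 to the sum
  terms <- ifelse(t_i > 0, t_i * log2(t_i / w_i), 0)
  sum(terms)
}

# Standardization to [0, 1] in Gries scaling (0 = even, 1 = uneven)
standardize_kld <- function(kld_value, method = "o2p", base = NULL) {
  switch(method,
    base_e = 1 - exp(-kld_value),
    base_2 = 1 - 2^(-kld_value),
    o2p    = kld_value / (1 + kld_value),
    custom = 1 - base^(-kld_value),
    stop("unknown method")
  )
}

k <- kld(subfreq = c(0, 0, 1, 2, 5), partsize = rep(1000, 5))
gries_scores <- sapply(c("base_e", "base_2", "o2p"),
                       standardize_kld, kld_value = k)

# Assumed relation between the two directionalities
conventional_scores <- 1 - gries_scores
```

Comparing the three entries of gries_scores for the same KLD value shows how much the choice of standardization can shift the resulting score.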
A data frame with one row per item
Lukas Soenning
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
disp_DKL_tdm(tdm = biber150_spokenBNC2014[1:20,], row_partsize = "first", standardization = "base_e", directionality = "conventional")
This function calculates D~MB~, a generalized version of the Poisson-based dispersion measure MB proposed by Nelson (2025). It allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution, and it can optionally return a confidence interval for D~MB~ (conf_int = TRUE).
disp_DMB(subfreq, partsize, directionality = "conventional", conf_int = FALSE, conf_level = 0.95, digits = NULL, verbose = TRUE, print_score = TRUE, suppress_warning = FALSE)
subfreq |
A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part |
partsize |
A numeric vector specifying the size of the corpus parts |
directionality |
Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries" |
conf_int |
Logical. Whether a (profile likelihood) confidence interval should be computed; default: FALSE |
conf_level |
Scalar giving the confidence level; default 0.95 |
digits |
Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
verbose |
Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
print_score |
Logical. Whether the dispersion score should be printed to the console; default is TRUE |
suppress_warning |
Logical. Whether warning messages should be suppressed; default is FALSE |
The function calculates the dispersion measure D~MB~ based on a set of subfrequencies (number of occurrences of the item in each corpus part) and a matching set of part sizes (the size of the corpus parts, i.e. number of word tokens). D~MB~ can be considered a generalization of the method proposed by Nelson (2025). In contrast to the original measure, MB, D~MB~ works with a pre-determined set of corpus parts, which may also differ in size. To provide this additional flexibility, D~MB~ is constructed based on a Poisson regression model that considers the corpus parts as observations and allows them to differ in length through its incorporation of an offset parameter.
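The modeling building block mentioned above, a Poisson regression with an offset for part size, can be sketched with stats::glm(). This illustrates only the offset mechanism on invented data; it does not reproduce D~MB~ itself, and how the score is derived from the fitted model is not shown here.

```r
subfreq  <- c(0, 0, 1, 2, 5)
partsize <- rep(1000, 5)

# Intercept-only Poisson regression; log(partsize) enters as an offset so
# that the model describes occurrence rates rather than raw counts
fit <- glm(subfreq ~ 1, family = poisson(link = "log"),
           offset = log(partsize))

# The back-transformed intercept is the estimated rate per word token
rate <- unname(exp(coef(fit)))
```

Because the offset carries the part sizes, the same model specification also applies when the corpus parts differ in length.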
Directionality: D~MB~ ranges from 0 to 1. The conventional scaling of dispersion measures (see Juilland & Chang-Rodriguez 1964; Carroll 1970; Rosengren 1971) assigns higher values to more even/dispersed/balanced distributions of subfrequencies across corpus parts. This is the default. Gries (2008) uses the reverse scaling, with higher values denoting a more uneven/bursty/concentrated distribution; use directionality = "gries" to choose this option.
A vector of numeric values
Lukas Soenning
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri
Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467
Nelson, Robert N. Jr. 2025. Groundhog Day is not a good model for corpus dispersion. Journal of Quantitative Linguistics 32(2). 103–127. doi:10.1080/09296174.2024.2423415
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
disp_DMB(subfreq = c(0, 0, 1, 2, 5), partsize = rep(1000, 5), directionality = "conventional", conf_int = TRUE)
This function calculates Gries's dispersion measure DP (deviation of proportions). It offers three different formulas and allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution. It also offers the option of calculating frequency-adjusted dispersion scores.
disp_DP(subfreq, partsize, directionality = "conventional", formula = "egbert_etal_2020", freq_adjust = FALSE, freq_adjust_method = "even", unit_interval = TRUE, digits = NULL, verbose = TRUE, print_score = TRUE, suppress_warning = FALSE)
subfreq |
A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part |
partsize |
A numeric vector specifying the size of the corpus parts |
directionality |
Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries" |
formula |
Character string indicating which formula to use for the calculation of DP. See details below. Possible values are "egbert_etal_2020" (default), "gries_2008", and "lijffit_gries_2012" |
freq_adjust |
Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
unit_interval |
Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
digits |
Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
verbose |
Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
print_score |
Logical. Whether the dispersion score should be printed to the console; default is TRUE |
suppress_warning |
Logical. Whether warning messages should be suppressed; default is FALSE |
The function calculates the dispersion measure DP based on a set of subfrequencies (number of occurrences of the item in each corpus part) and a matching set of part sizes (the size of the corpus parts, i.e. number of word tokens).
Directionality: DP ranges from 0 to 1. The conventional scaling of dispersion measures (see Juilland & Chang-Rodriguez 1964; Carroll 1970; Rosengren 1971) assigns higher values to more even/dispersed/balanced distributions of subfrequencies across corpus parts. This is the default. Gries (2008) uses the reverse scaling, with higher values denoting a more uneven/bursty/concentrated distribution; use directionality = "gries" to choose this option.
Formula: Irrespective of the directionality of scaling, four formulas for DP exist in the literature (see below for details). This is because the original version proposed by Gries (2008: 415), which is commonly denoted as DP (and here referenced by the value "gries_2008"), does not always reach its theoretical limits of 0 and 1. For this reason, modifications have been suggested, starting with Gries (2008: 419) himself, who referred to this version as DPnorm. This version is not implemented in the current package, because Lijffijt & Gries (2012) updated DPnorm to ensure that it also works as intended when corpus parts differ in size; this version is represented by the value "lijffit_gries_2012" and often denoted using subscript notation, DP~norm~. Finally, Egbert et al. (2020: 99) suggest a further modification to ensure proper behavior in settings where the item occurs in only one corpus part. They label this version D~P~. In the current function, it is the default and represented by the value "egbert_etal_2020".
Frequency adjustment: Dispersion scores can be adjusted for frequency using the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208). The frequency-adjusted score for an item considers the lowest and highest possible level of dispersion it can obtain given its overall corpus frequency as well as the number (and size) of corpus parts. The unadjusted score is then expressed relative to these endpoints, where the dispersion minimum is set to 0, and the dispersion maximum to 1 (expressed in terms of conventional scaling). The frequency-adjusted score falls between these bounds and expresses how close the observed distribution is to the theoretical maximum and minimum. This adjustment therefore requires a maximally and a minimally dispersed distribution of the item across the parts. These hypothetical extremes can be built in different ways. Gries (2022, 2024) uses a computationally expensive procedure that finds the distribution that produces the highest value on the dispersion measure of interest. The current function constructs extreme distributions in a different way, based on the distributional features pervasiveness ("pervasive") or evenness ("even"). You can choose between these with the argument freq_adjust_method; the default is "even". For details and explanations, see vignette("frequency-adjustment").
To obtain the lowest possible level of dispersion, the occurrences are either allocated to as few corpus parts as possible ("pervasive"), or they are assigned to the smallest corpus part(s) ("even").
To obtain the highest possible level of dispersion, the occurrences are either spread as broadly across corpus parts as possible ("pervasive"), or they are allocated to corpus parts in proportion to their size ("even"). The choice between these methods is particularly relevant if corpus parts differ considerably in size. See documentation for find_max_disp().
In the formulas given below, the following notation is used:
$k$: the number of corpus parts
$t_i$: a proportional quantity; the subfrequency in part $i$ divided by the total number of occurrences of the item in the corpus (i.e. the sum of all subfrequencies)
$w_i$: a proportional quantity; the size of corpus part $i$ divided by the size of the corpus (i.e. the sum of the part sizes)
The value "gries_2008" implements the original version proposed by Gries (2008: 415). Note that while the following formula represents Gries scaling (0 = even, 1 = uneven), in the current function the directionality is controlled separately using the argument directionality.
$DP = \frac{\sum_{i=1}^{k} |t_i - w_i|}{2}$ (Gries 2008)
The value "lijffit_gries_2012" implements the modified version described by Lijffijt & Gries (2012). Again, the following formula represents Gries scaling (0 = even, 1 = uneven), but the directionality is handled separately in the current function. The notation $\min\{w_i\}$ refers to the $w_i$ value of the smallest corpus part.
$DP_{norm} = \frac{\sum_{i=1}^{k} |t_i - w_i|}{2} \times \frac{1}{1 - \min\{w_i\}}$ (Lijffijt & Gries 2012)
The value "egbert_etal_2020" (default) selects the modification suggested by Egbert et al. (2020: 99). The following formula represents conventional scaling (0 = uneven, 1 = even). The notation $\min\{w_i : t_i > 0\}$ refers to the smallest $w_i$ value among those corpus parts that include at least one occurrence of the item.
$D_P = 1 - \frac{\sum_{i=1}^{k} |t_i - w_i|}{2} \times \frac{1}{1 - \min\{w_i : t_i > 0\}}$ (Egbert et al. 2020)
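The three versions can be compared side by side with a short base R sketch. It is an illustration under stated assumptions, not the package's implementation: the helper name dp_scores is hypothetical, and the formulas assumed are half the sum of absolute differences between proportional subfrequencies and proportional part sizes (Gries 2008), that quantity divided by 1 minus the smallest proportional part size (Lijffijt & Gries 2012), and 1 minus the same ratio with the minimum taken only over parts that contain the item (Egbert et al. 2020).

```r
dp_scores <- function(subfreq, partsize) {
  t_i <- subfreq / sum(subfreq)    # proportional subfrequencies
  w_i <- partsize / sum(partsize)  # proportional part sizes
  dp  <- sum(abs(t_i - w_i)) / 2

  c(
    # Original DP, Gries scaling (0 = even, 1 = uneven)
    gries_2008 = dp,
    # Normalized by the smallest part, Gries scaling
    lijffit_gries_2012 = dp / (1 - min(w_i)),
    # Normalized by the smallest part that contains the item;
    # conventional scaling (0 = uneven, 1 = even)
    egbert_etal_2020 = 1 - dp / (1 - min(w_i[t_i > 0]))
  )
}

scores <- dp_scores(subfreq = c(0, 0, 1, 2, 5), partsize = rep(1000, 5))
```

Note that the first two entries are in Gries scaling and the third in conventional scaling, so they are not directly comparable without reversing one of them.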
A numeric value
Lukas Soenning
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x
Egbert, Jesse, Brent Burch & Douglas Biber. 2020. Lexical dispersion and corpus design. International Journal of Corpus Linguistics 25(1). 89–115. doi:10.1075/ijcl.18010.egb
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467
Lijffijt, Jefrey & Stefan Th. Gries. 2012. Correction to Stefan Th. Gries’ ‘Dispersions and adjusted frequencies in corpora’. International Journal of Corpus Linguistics 17(1). 147–149. doi:10.1075/ijcl.17.1.08lij
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
disp_DP(subfreq = c(0, 0, 1, 2, 5), partsize = rep(1000, 5), directionality = "conventional", formula = "gries_2008", freq_adjust = FALSE)
This function offers facilities for bootstrapping and weighting Gries's dispersion measure DP (deviation of proportions). In addition to the full functionality offered by the function disp_DP(), it can be used to obtain weighted dispersion scores as well as bootstrap confidence intervals.
disp_DP_boot(subfreq, partsize, n_boot = 500, boot_ci = FALSE, conf_level = 0.95, return_distribution = FALSE, partweight = NULL, directionality = "conventional", formula = "egbert_etal_2020", freq_adjust = FALSE, freq_adjust_method = "even", unit_interval = TRUE, digits = NULL, verbose = TRUE, print_score = TRUE, suppress_warning = FALSE)
subfreq |
A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part |
partsize |
A numeric vector specifying the size of the corpus parts |
n_boot |
Integer value specifying the number of bootstrap samples to draw; default: 500 |
boot_ci |
Logical. Whether a percentile bootstrap confidence interval should be computed; default: FALSE |
conf_level |
Scalar giving the confidence level; default 0.95 |
return_distribution |
Logical. Whether the function should return a vector of all n_boot bootstrap estimates; default is FALSE |
partweight |
A numeric vector specifying the weights of the corpus parts; if not specified, the function returns the unweighted estimate |
directionality |
Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries" |
formula |
Character string indicating which formula to use for the calculation of DP. See details below. Possible values are "egbert_etal_2020" (default), "gries_2008", and "lijffit_gries_2012" |
freq_adjust |
Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
unit_interval |
Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
digits |
Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
verbose |
Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
print_score |
Logical. Whether the dispersion score should be printed to the console; default is TRUE |
suppress_warning |
Logical. Whether warning messages should be suppressed; default is FALSE |
This function calculates weighted dispersion measures and bootstrap confidence intervals.
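A percentile bootstrap interval of this kind can be sketched in base R. The sketch assumes that corpus parts are resampled with replacement and that DP is computed with the original Gries (2008) formula in conventional scaling; the helper name dp_conventional and the data are invented, and part weights are not covered.

```r
# DP in conventional scaling, using the original formula (Gries 2008)
dp_conventional <- function(subfreq, partsize) {
  if (sum(subfreq) == 0) return(NA_real_)  # item absent from the resample
  t_i <- subfreq / sum(subfreq)
  w_i <- partsize / sum(partsize)
  1 - sum(abs(t_i - w_i)) / 2
}

# Invented data: 8 corpus parts
subfreq  <- c(3, 0, 1, 2, 5, 4, 1, 2)
partsize <- c(1000, 1200, 900, 1100, 1000, 950, 1050, 1000)

set.seed(1)
n_boot <- 500
boot_scores <- replicate(n_boot, {
  idx <- sample.int(length(subfreq), replace = TRUE)  # resample corpus parts
  dp_conventional(subfreq[idx], partsize[idx])
})

# Percentile interval for a 95% confidence level
ci <- quantile(boot_scores, probs = c(0.025, 0.975), na.rm = TRUE)
```

The vector boot_scores corresponds to the kind of output one would expect from return_distribution = TRUE, and ci to the interval requested with boot_ci = TRUE.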
Lukas Soenning
disp_DP() for finer control over the calculation of DP
disp_DP_boot(subfreq = biber150_ice_gb[3,], partsize = biber150_ice_gb[1,], digits = 2, freq_adjust = TRUE, directionality = "conventional", formula = "gries_2008")
This function implements stratified bootstrapping (and weighting) for Gries's dispersion measure DP (deviation of proportions). In addition to the full functionality offered by the function disp_DP(), it can be used to obtain weighted dispersion scores as well as bootstrap confidence intervals.
disp_DP_sboot(text_freq, text_size, corpus_parts = NULL, n_boot = 500, boot_ci = FALSE, conf_level = 0.95, return_distribution = FALSE, partweight = NULL, directionality = "conventional", formula = "egbert_etal_2020", freq_adjust = FALSE, freq_adjust_method = "even", unit_interval = TRUE, digits = NULL, verbose = TRUE, print_score = TRUE, suppress_warning = FALSE)
text_freq |
Integer value giving the frequency of the item in the text |
text_size |
Integer value giving the size (or length) of the text |
corpus_parts |
The corpus parts that form the basis of dispersion analysis; must be higher-level categories, above the text files |
n_boot |
Integer value specifying the number of bootstrap samples to draw; default: 500 |
boot_ci |
Logical. Whether a percentile bootstrap confidence interval should be computed; default: FALSE |
conf_level |
Scalar giving the confidence level; default 0.95 |
return_distribution |
Logical. Whether the function should return a vector of all n_boot bootstrap estimates; default is FALSE |
partweight |
A numeric vector specifying the weights of the corpus parts; if not specified, the function returns the unweighted estimate |
directionality |
Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries" |
formula |
Character string indicating which formula to use for the calculation of DP. See details below. Possible values are "egbert_etal_2020" (default), "gries_2008", and "lijffit_gries_2012" |
freq_adjust |
Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
unit_interval |
Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
digits |
Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
verbose |
Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
print_score |
Logical. Whether the dispersion score should be printed to the console; default is TRUE |
suppress_warning |
Logical. Whether warning messages should be suppressed; default is FALSE |
This function performs stratified bootstrapping on dispersion measures. Stratified bootstrapping is used when the corpus parts represent text categories (e.g. genres, registers) that in turn consist of texts or text files. Since the resampling scheme implemented in bootstrapping should be aligned as closely as possible with the data layout (and data-generation procedure), resampling should not take place at the level of the text categories. Instead, it is the sampling units in corpus compilation – texts, text files, or speakers – that should be resampled. Stratified bootstrapping therefore respects the structure of the corpus and data.
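One stratified resampling step can be sketched in base R as follows. The sketch assumes that texts are drawn with replacement within each text category and then aggregated to part-level subfrequencies and part sizes; the helper name resample_within_parts and the data are invented for illustration.

```r
# Invented data: 6 texts nested in 2 text categories (corpus parts)
text_freq    <- c(2, 0, 5, 1, 3, 0)
text_size    <- c(2000, 1800, 2200, 2100, 1900, 2000)
corpus_parts <- c("news", "news", "news", "fiction", "fiction", "fiction")

# Resample texts with replacement within each corpus part, so every
# bootstrap sample keeps the original number of texts per part
resample_within_parts <- function(corpus_parts) {
  by_part <- split(seq_along(corpus_parts), corpus_parts)
  unlist(lapply(by_part, function(i) {
    i[sample.int(length(i), replace = TRUE)]
  }), use.names = FALSE)
}

set.seed(1)
idx <- resample_within_parts(corpus_parts)

# Aggregate the resampled texts to the level of the corpus parts
subfreq  <- tapply(text_freq[idx], corpus_parts[idx], sum)
partsize <- tapply(text_size[idx], corpus_parts[idx], sum)
```

Repeating this step n_boot times and computing the dispersion score from subfreq and partsize each time would yield the bootstrap distribution; indexing via sample.int() avoids the surprising behavior of sample() when a category contains a single text.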
Lukas Soenning
disp_DP() for finer control over the calculation of DP
disp_DP_sboot(text_freq = biber150_brown[87,], text_size = biber150_brown[1,], corpus_parts = as.character(metadata_brown$genre), digits = 2, freq_adjust = TRUE, directionality = "conventional", formula = "gries_2008")
This function calculates Gries's dispersion measure DP (deviation of proportions). It offers three different formulas and allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution. It also offers the option of calculating frequency-adjusted dispersion scores.
disp_DP_tdm(tdm, row_partsize = "first", directionality = "conventional", formula = "egbert_etal_2020", freq_adjust = FALSE, freq_adjust_method = "even", add_frequency = TRUE, unit_interval = TRUE, digits = NULL, verbose = TRUE, print_scores = TRUE)
tdm |
A term-document matrix, where rows represent items and columns represent corpus parts; must also contain a row giving the size of the corpus parts (first or last row in the term-document matrix) |
row_partsize |
Character string indicating which row in the term-document matrix contains the size of the corpus parts. Possible values are "first" (default) and "last" |
directionality |
Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries" |
formula |
Character string indicating which formula to use for the calculation of DP. See details below. Possible values are "egbert_etal_2020" (default), "gries_2008", and "lijffit_gries_2012" |
freq_adjust |
Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
add_frequency |
Logical. Whether to add a column that gives the total number of occurrences of the item across all corpus parts; default is TRUE |
unit_interval |
Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
digits |
Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
verbose |
Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
print_scores |
Logical. Whether the dispersion scores should be printed to the console; default is TRUE |
This function takes as input a term-document matrix and returns, for each item (i.e. each row) the dispersion measure DP. The rows in the input matrix represent the items, and the columns the corpus parts. Importantly, the term-document matrix must include an additional row that records the size of the corpus parts. For a proper term-document matrix, which includes all items that appear in the corpus, this can be added as a column margin, which sums the frequencies in each column. If the matrix only includes a selection of items drawn from the corpus, this information cannot be derived from the matrix and must be provided as a separate row.
Directionality: DP ranges from 0 to 1. The conventional scaling of dispersion measures (see Juilland & Chang-Rodriguez 1964; Carroll 1970; Rosengren 1971) assigns higher values to more even/dispersed/balanced distributions of subfrequencies across corpus parts. This is the default. Gries (2008) uses the reverse scaling, with higher values denoting a more uneven/bursty/concentrated distribution; use directionality = "gries" to choose this option.
Formula: Irrespective of the directionality of scaling, four formulas for DP exist in the literature (see below for details). This is because the original version proposed by Gries (2008: 415), which is commonly denoted as DP (and here referenced by the value "gries_2008"), does not always reach its theoretical limits of 0 and 1. For this reason, modifications have been suggested, starting with Gries (2008: 419) himself, who referred to this version as DPnorm. This version is not implemented in the current package, because Lijffijt & Gries (2012) updated DPnorm to ensure that it also works as intended when corpus parts differ in size; this version is represented by the value "lijffit_gries_2012" and often denoted using subscript notation, DP~norm~. Finally, Egbert et al. (2020: 99) suggest a further modification to ensure proper behavior in settings where the item occurs in only one corpus part. They label this version D~P~. In the current function, it is the default and represented by the value "egbert_etal_2020".
Frequency adjustment: Dispersion scores can be adjusted for frequency using the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208). The frequency-adjusted score for an item considers the lowest and highest possible level of dispersion it can obtain given its overall corpus frequency as well as the number (and size) of corpus parts. The unadjusted score is then expressed relative to these endpoints, where the dispersion minimum is set to 0, and the dispersion maximum to 1 (expressed in terms of conventional scaling). The frequency-adjusted score falls between these bounds and expresses how close the observed distribution is to the theoretical maximum and minimum. This adjustment therefore requires a maximally and a minimally dispersed distribution of the item across the parts. These hypothetical extremes can be built in different ways. Gries (2022, 2024) uses a computationally expensive procedure that finds the distribution that produces the highest value on the dispersion measure of interest. The current function constructs extreme distributions in a different way, based on the distributional features pervasiveness ("pervasive") or evenness ("even"). You can choose between these with the argument freq_adjust_method; the default is "even". For details and explanations, see vignette("frequency-adjustment").
To obtain the lowest possible level of dispersion, the occurrences are either allocated to as few corpus parts as possible ("pervasive"), or they are assigned to the smallest corpus part(s) ("even").
To obtain the highest possible level of dispersion, the occurrences are either spread as broadly across corpus parts as possible ("pervasive"), or they are allocated to corpus parts in proportion to their size ("even"). The choice between these methods is particularly relevant if corpus parts differ considerably in size. See documentation for find_max_disp().
In the formulas given below, the following notation is used:
$k$: the number of corpus parts
$t_i$: a proportional quantity; the subfrequency in part $i$ divided by the total number of occurrences of the item in the corpus (i.e. the sum of all subfrequencies)
$w_i$: a proportional quantity; the size of corpus part $i$ divided by the size of the corpus (i.e. the sum of the part sizes)
The value "gries_2008" implements the original version proposed by Gries (2008: 415). Note that while the following formula represents Gries scaling (0 = even, 1 = uneven), in the current function the directionality is controlled separately using the argument directionality.
$\frac{\sum_{i=1}^{k} |t_i - w_i|}{2}$ (Gries 2008)
The value "lijffit_gries_2012" implements the modified version described by Lijffijt & Gries (2012). Again, the following formula represents Gries scaling (0 = even, 1 = uneven), but the directionality is handled separately in the current function. The notation $w_{min}$ refers to the $w_i$ value of the smallest corpus part.
$\frac{\sum_{i=1}^{k} |t_i - w_i|}{2} \times \frac{1}{1 - w_{min}}$ (Lijffijt & Gries 2012)
The value "egbert_etal_2020" (default) selects the modification suggested by Egbert et al. (2020: 99). The following formula represents conventional scaling (0 = uneven, 1 = even). The notation $\min\{w_i : t_i > 0\}$ refers to the smallest $w_i$ value among those corpus parts that include at least one occurrence of the item.
$1 - \frac{\sum_{i=1}^{k} |t_i - w_i|}{2} \times \frac{1}{1 - \min\{w_i : t_i > 0\}}$ (Egbert et al. 2020)
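The three variants can be sketched in a few lines of base R. This is an illustration of the published formulas, not the package's implementation: the function name dp_sketch is hypothetical, the argument values simply mirror the ones documented here, and the sketch assumes the item occurs at least once.

```r
# Illustrative sketch of the three DP variants (not the package code)
dp_sketch <- function(subfreq, partsize,
                      formula = c("egbert_etal_2020", "lijffit_gries_2012",
                                  "gries_2008"),
                      directionality = c("conventional", "gries")) {
  formula <- match.arg(formula)
  directionality <- match.arg(directionality)

  t_i <- subfreq / sum(subfreq)    # share of the item's occurrences per part
  w_i <- partsize / sum(partsize)  # share of the corpus per part

  dp <- sum(abs(t_i - w_i)) / 2    # Gries (2008); Gries scaling (0 = even)

  if (formula == "lijffit_gries_2012") {
    dp <- dp / (1 - min(w_i))            # normalise by the smallest part
  } else if (formula == "egbert_etal_2020") {
    dp <- dp / (1 - min(w_i[t_i > 0]))   # smallest part containing the item
  }

  # dp is on Gries scaling here; flip it for conventional scaling
  if (directionality == "conventional") 1 - dp else dp
}

dp_sketch(c(0, 0, 1, 2, 5), rep(1000, 5), formula = "gries_2008")  # 0.525
```

For an item that occurs in a single part which is not the smallest one, only the "egbert_etal_2020" variant reaches the endpoint of the scale, which is the behavior the modification is meant to secure.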
A data frame with one row per item
Lukas Soenning
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x
Egbert, Jesse, Brent Burch & Douglas Biber. 2020. Lexical dispersion and corpus design. International Journal of Corpus Linguistics 25(1). 89–115. doi:10.1075/ijcl.18010.egb
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467
Lijffijt, Jefrey & Stefan Th. Gries. 2012. Correction to Stefan Th. Gries’ ‘Dispersions and adjusted frequencies in corpora’. International Journal of Corpus Linguistics 17(1). 147–149. doi:10.1075/ijcl.17.1.08lij
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
disp_DP_tdm(tdm = biber150_spokenBNC2014[1:20,], row_partsize = "first", directionality = "conventional", formula = "gries_2008", freq_adjust = FALSE)
This function calculates the dispersion measure 'range'. It offers three different versions: 'absolute range' (the number of corpus parts containing at least one occurrence of the item), 'relative range' (the proportion of corpus parts containing at least one occurrence of the item), and 'relative range with size' (relative range that takes into account the size of the corpus parts). The function also offers the option of calculating frequency-adjusted dispersion scores.
disp_R(subfreq, partsize, type = "relative", freq_adjust = FALSE, freq_adjust_method = "pervasive", unit_interval = TRUE, digits = NULL, verbose = TRUE, print_score = TRUE, suppress_warning = FALSE)
| Argument | Description |
|---|---|
| subfreq | A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part |
| partsize | A numeric vector specifying the size of the corpus parts |
| type | Character string indicating which type of range to calculate (absolute range, relative range, or relative range with size). See details below. The default is "relative" |
| freq_adjust | Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
| freq_adjust_method | Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "pervasive" (default) and "even" |
| unit_interval | Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
| digits | Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
| verbose | Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
| print_score | Logical. Whether the dispersion score should be printed to the console; default is TRUE |
| suppress_warning | Logical. Whether warning messages should be suppressed; default is FALSE |
The function calculates the dispersion measure 'range' based on a set of subfrequencies (number of occurrences of the item in each corpus part) and a matching set of part sizes (the size of the corpus parts, i.e. number of word tokens). Three different types of range measures can be calculated:
Absolute range: The number of corpus parts containing at least one occurrence of the item
Relative range: The proportion of corpus parts containing at least one occurrence of the item; this version of 'range' follows the conventional scaling of dispersion measures (1 = widely dispersed)
Relative range with size (see Gries 2022: 179-180; Gries 2024: 27-28): Relative range that takes into account the size of the corpus parts. Each corpus part contributes to this version of range in proportion to its size. Suppose there are 100 corpus parts, and part 1 is relatively short, accounting for 1/200 of the words in the whole corpus. If the item occurs in part 1, ordinary relative range increases by 1/100, since each part receives the same weight. Relative range with size, on the other hand, increases by 1/200, i.e. the relative size of the corpus part; this version of range weights corpus parts proportionate to their size.
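The three definitions above translate directly into base R. The following is a minimal sketch, not the package's implementation; the function name range_sketch and the labels of the returned values are chosen for this illustration only.

```r
# Illustrative sketch of the three range types (labels are hypothetical)
range_sketch <- function(subfreq, partsize) {
  present <- subfreq > 0            # parts with at least one occurrence
  w_i <- partsize / sum(partsize)   # relative size of each part
  c(absolute           = sum(present),        # number of parts with the item
    relative           = mean(present),       # proportion of parts with the item
    relative_with_size = sum(w_i[present]))   # size-weighted proportion
}

# Equal part sizes: relative range and relative range with size coincide
range_sketch(c(0, 0, 1, 2, 5), rep(1000, 5))

# The scenario from the text: 100 parts, part 1 holds 1/200 of the corpus
# (99 out of 99 + 99 * 199 = 19800 words) and is the only part with the item
range_sketch(c(1, rep(0, 99)), c(99, rep(199, 99)))
```

In the second call, relative range is 1/100, whereas relative range with size is 1/200, because the single part containing the item is only half as large as an average part.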
Frequency adjustment: Dispersion scores can be adjusted for frequency using the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208). The frequency-adjusted score for an item considers the lowest and highest possible level of dispersion it can obtain given its overall corpus frequency as well as the number (and size) of corpus parts. The unadjusted score is then expressed relative to these endpoints, where the dispersion minimum is set to 0, and the dispersion maximum to 1 (expressed in terms of conventional scaling). The frequency-adjusted score falls between these bounds and expresses how close the observed distribution is to the theoretical maximum and minimum. This adjustment therefore requires a maximally and a minimally dispersed distribution of the item across the parts. These hypothetical extremes can be built in different ways. Gries (2022, 2024) uses a computationally expensive procedure that finds the distribution that produces the highest value on the dispersion measure of interest. The current function constructs extreme distributions in a different way, based on the distributional features pervasiveness ("pervasive") or evenness ("even"). You can choose between these with the argument freq_adjust_method; the default for this function is "pervasive". For details and explanations, see vignette("frequency-adjustment").
To obtain the lowest possible level of dispersion, the occurrences are either allocated to as few corpus parts as possible ("pervasive"), or they are assigned to the smallest corpus part(s) ("even").
To obtain the highest possible level of dispersion, the occurrences are either spread as broadly across corpus parts as possible ("pervasive"), or they are allocated to corpus parts in proportion to their size ("even"). The choice between these methods is particularly relevant if corpus parts differ considerably in size. See documentation for find_max_disp().
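For relative range, the "pervasive" extremes are easy to write down, which makes the min-max transformation concrete. The sketch below is an illustration under simplifying assumptions, not the package's procedure: the function name is hypothetical, and it assumes that all occurrences of the item would fit into a single corpus part, so that the minimum range is one part.

```r
# Illustrative min-max adjustment of relative range ("pervasive" extremes)
adjusted_range_sketch <- function(subfreq) {
  k <- length(subfreq)   # number of corpus parts
  n <- sum(subfreq)      # overall corpus frequency of the item
  stopifnot(n > 0)

  observed <- sum(subfreq > 0) / k

  # Assumption: all n occurrences fit into one part
  range_min <- 1 / k           # everything concentrated in a single part
  range_max <- min(n, k) / k   # spread over as many parts as possible

  # With n = 1 or k = 1 the two extremes coincide and no adjustment is possible
  if (range_max == range_min) return(NA_real_)

  (observed - range_min) / (range_max - range_min)
}

adjusted_range_sketch(c(0, 0, 1, 2, 5))
```

Here the item occurs 8 times in 5 parts, so the range can lie anywhere between 1/5 and 1. The observed range of 3/5 sits halfway between these endpoints, giving (0.6 - 0.2) / (1 - 0.2) = 0.5.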
A numeric value
Lukas Soenning
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
disp_R(subfreq = c(0, 0, 1, 2, 5), partsize = rep(1000, 5), type = "relative", freq_adjust = FALSE)
This function calculates the dispersion measure 'range'. It offers three different versions: 'absolute range' (the number of corpus parts containing at least one occurrence of the item), 'relative range' (the proportion of corpus parts containing at least one occurrence of the item), and 'relative range with size' (relative range that takes into account the size of the corpus parts). The function also offers the option of calculating frequency-adjusted dispersion scores.
disp_R_tdm(tdm, row_partsize = "first", type = "relative", freq_adjust = FALSE, freq_adjust_method = "pervasive", add_frequency = TRUE, unit_interval = TRUE, digits = NULL, verbose = TRUE, print_scores = TRUE)
| Argument | Description |
|---|---|
| tdm | A term-document matrix, where rows represent items and columns represent corpus parts; must also contain a row giving the size of the corpus parts (first or last row in the term-document matrix) |
| row_partsize | Character string indicating which row in the term-document matrix contains the size of the corpus parts. Possible values are "first" (default) and "last" |
| type | Character string indicating which type of range to calculate (absolute range, relative range, or relative range with size). See details below. The default is "relative" |
| freq_adjust | Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
| freq_adjust_method | Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "pervasive" (default) and "even" |
| add_frequency | Logical. Whether to add a column that gives the total number of occurrences of the item across all corpus parts; default is TRUE |
| unit_interval | Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
| digits | Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
| verbose | Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
| print_scores | Logical. Whether the dispersion scores should be printed to the console; default is TRUE |
This function takes as input a term-document matrix and returns, for each item (i.e. each row) the dispersion measure 'range'. The rows in the input matrix represent the items, and the columns the corpus parts. Importantly, the term-document matrix must include an additional row that records the size of the corpus parts. For a proper term-document matrix, which includes all items that appear in the corpus, this can be added as a column margin, which sums the frequencies in each column. If the matrix only includes a selection of items drawn from the corpus, this information cannot be derived from the matrix and must be provided as a separate row.
Three different types of range measures can be calculated:
Absolute range: The number of corpus parts containing at least one occurrence of the item
Relative range: The proportion of corpus parts containing at least one occurrence of the item; this version of 'range' follows the conventional scaling of dispersion measures (1 = widely dispersed)
Relative range with size (see Gries 2022: 179-180; Gries 2024: 27-28): Relative range that takes into account the size of the corpus parts. Each corpus part contributes to this version of range in proportion to its size. Suppose there are 100 corpus parts, and part 1 is relatively short, accounting for 1/200 of the words in the whole corpus. If the item occurs in part 1, ordinary relative range increases by 1/100, since each part receives the same weight. Relative range with size, on the other hand, increases by 1/200, i.e. the relative size of the corpus part; this version of range weights corpus parts proportionate to their size.
Frequency adjustment: Dispersion scores can be adjusted for frequency using the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208). The frequency-adjusted score for an item considers the lowest and highest possible level of dispersion it can obtain given its overall corpus frequency as well as the number (and size) of corpus parts. The unadjusted score is then expressed relative to these endpoints, where the dispersion minimum is set to 0, and the dispersion maximum to 1 (expressed in terms of conventional scaling). The frequency-adjusted score falls between these bounds and expresses how close the observed distribution is to the theoretical maximum and minimum. This adjustment therefore requires a maximally and a minimally dispersed distribution of the item across the parts. These hypothetical extremes can be built in different ways. Gries (2022, 2024) uses a computationally expensive procedure that finds the distribution that produces the highest value on the dispersion measure of interest. The current function constructs extreme distributions in a different way, based on the distributional features pervasiveness ("pervasive") or evenness ("even"). You can choose between these with the argument freq_adjust_method; the default for this function is "pervasive". For details and explanations, see vignette("frequency-adjustment").
To obtain the lowest possible level of dispersion, the occurrences are either allocated to as few corpus parts as possible ("pervasive"), or they are assigned to the smallest corpus part(s) ("even").
To obtain the highest possible level of dispersion, the occurrences are either spread as broadly across corpus parts as possible ("pervasive"), or they are allocated to corpus parts in proportion to their size ("even"). The choice between these methods is particularly relevant if corpus parts differ considerably in size. See documentation for find_max_disp().
A data frame with one row per item
Lukas Soenning
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
disp_R_tdm(tdm = biber150_spokenBNC2014[1:20,], row_partsize = "first", type = "relative", freq_adjust = FALSE)
This function calculates the dispersion measure S (Rosengren 1971) and allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution. It also offers the option of calculating frequency-adjusted dispersion scores.
disp_S(subfreq, partsize, directionality = "conventional", freq_adjust = FALSE, freq_adjust_method = "even", unit_interval = TRUE, digits = NULL, verbose = TRUE, print_score = TRUE, suppress_warning = FALSE)
| Argument | Description |
|---|---|
| subfreq | A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part |
| partsize | A numeric vector specifying the size of the corpus parts |
| directionality | Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries" |
| freq_adjust | Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
| freq_adjust_method | Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
| unit_interval | Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
| digits | Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
| verbose | Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
| print_score | Logical. Whether the dispersion score should be printed to the console; default is TRUE |
| suppress_warning | Logical. Whether warning messages should be suppressed; default is FALSE |
The function calculates the dispersion measure S based on a set of subfrequencies (number of occurrences of the item in each corpus part) and a matching set of part sizes (the size of the corpus parts, i.e. number of word tokens).
Directionality: S ranges from 0 to 1. The conventional scaling of dispersion measures (see Juilland & Chang-Rodríguez 1964; Carroll 1970; Rosengren 1971) assigns higher values to more even/dispersed/balanced distributions of subfrequencies across corpus parts. This is the default. Gries (2008) uses the reverse scaling, with higher values denoting a more uneven/bursty/concentrated distribution; use directionality = "gries" to choose this option.
Frequency adjustment: Dispersion scores can be adjusted for frequency using the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208). The frequency-adjusted score for an item considers the lowest and highest possible level of dispersion it can obtain given its overall corpus frequency as well as the number (and size) of corpus parts. The unadjusted score is then expressed relative to these endpoints, where the dispersion minimum is set to 0, and the dispersion maximum to 1 (expressed here in terms of conventional scaling). The frequency-adjusted score falls between these bounds and expresses how close the observed distribution is to the theoretical maximum and minimum. This adjustment therefore requires a maximally and a minimally dispersed distribution of the item across the parts. These hypothetical extremes can be built in different ways. The method used by Gries (2022, 2024) uses a computationally expensive procedure that finds the distribution that produces the highest value on the dispersion measure of interest. The current function constructs extreme distributions in a different way, based on the distributional features pervasiveness ("pervasive") or evenness ("even"). You can choose between these with the argument freq_adjust_method; the default is even. For details and explanations, see vignette("frequency-adjustment").
To obtain the lowest possible level of dispersion, the occurrences are either allocated to as few corpus parts as possible ("pervasive"), or they are assigned to the smallest corpus part(s) ("even").
To obtain the highest possible level of dispersion, the occurrences are either spread as broadly across corpus parts as possible ("pervasive"), or they are allocated to corpus parts in proportion to their size ("even"). The choice between these methods is particularly relevant if corpus parts differ considerably in size. See documentation for find_max_disp() and vignette("frequency-adjustment").
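The "even" extremes and the subsequent min-max step can be sketched independently of any particular dispersion measure. This is a simplified illustration, not the package's procedure (see find_max_disp() for that): both function names are hypothetical, and the proportional allocation is left unrounded here, whereas real subfrequencies are whole numbers.

```r
# Illustrative construction of the "even" extremes for an item with
# n occurrences (hypothetical helper, fractional counts allowed)
even_extremes_sketch <- function(n, partsize) {
  # Lowest dispersion: every occurrence in the smallest corpus part
  min_dist <- numeric(length(partsize))
  min_dist[which.min(partsize)] <- n

  # Highest dispersion: occurrences allocated in proportion to part size
  max_dist <- n * partsize / sum(partsize)

  list(min = min_dist, max = max_dist)
}

# Illustrative min-max transformation of an unadjusted score, given the
# scores obtained for the two extreme distributions
min_max_sketch <- function(score, score_min, score_max, unit_interval = TRUE) {
  adjusted <- (score - score_min) / (score_max - score_min)
  # Optionally clamp values that fall outside the unit interval
  if (unit_interval) adjusted <- min(max(adjusted, 0), 1)
  adjusted
}

ext <- even_extremes_sketch(n = 8, partsize = c(500, 1000, 1000, 1500, 2000))
ext$min   # all 8 occurrences in the smallest part
ext$max   # occurrences proportional to part size
```

The dispersion measure of interest would then be evaluated on the observed distribution and on both extremes, and the three scores passed to min_max_sketch(). Clamping mirrors what the unit_interval argument is documented to do for scores that exceed the limits of the unit interval.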
In the formula given below, the following notation is used:
$k$: the number of corpus parts
$T_i$: the absolute subfrequency in part $i$
$w_i$: a proportional quantity; the size of corpus part $i$ divided by the size of the corpus (i.e. the sum of the part sizes)
$S$ is the dispersion measure proposed by Rosengren (1971); the formula uses conventional scaling:
$S = \frac{\left(\sum_{i=1}^{k} \sqrt{w_i T_i}\right)^2}{\sum_{i=1}^{k} T_i}$
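Rosengren's measure takes the square of the summed square roots of the size-weighted subfrequencies and divides it by the item's overall frequency. A minimal base R sketch follows; the function name rosengren_s is hypothetical, the code covers conventional scaling only, and it is not the package's implementation.

```r
# Illustrative sketch of Rosengren's S (conventional scaling)
rosengren_s <- function(subfreq, partsize) {
  w_i <- partsize / sum(partsize)             # relative size of each part
  sum(sqrt(w_i * subfreq))^2 / sum(subfreq)   # squared sum over total frequency
}

# Occurrences proportional to part size: S reaches its maximum
rosengren_s(c(1, 2, 3), c(1000, 2000, 3000))

# All occurrences in one part: S equals that part's relative size
rosengren_s(c(6, 0, 0), c(1000, 2000, 3000))

rosengren_s(c(0, 0, 1, 2, 5), rep(1000, 5))
```

The first call returns 1, since the subfrequencies match the part sizes exactly. The second returns 1/6, the relative size of the only part that contains the item.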
A numeric value
Lukas Soenning
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
disp_S(subfreq = c(0, 0, 1, 2, 5), partsize = rep(1000, 5), directionality = "conventional")
This function calculates the dispersion measure S (Rosengren 1971) for a term-document matrix and allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution. It also offers the option of calculating frequency-adjusted dispersion scores.
disp_S_tdm(tdm, row_partsize = "first", directionality = "conventional", freq_adjust = FALSE, freq_adjust_method = "even", unit_interval = TRUE, add_frequency = TRUE, digits = NULL, verbose = TRUE, print_scores = TRUE)
| Argument | Description |
|---|---|
| tdm | A term-document matrix, where rows represent items and columns represent corpus parts; must also contain a row giving the size of the corpus parts (first or last row in the term-document matrix) |
| row_partsize | Character string indicating which row in the term-document matrix contains the size of the corpus parts. Possible values are "first" (default) and "last" |
| directionality | Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries" |
| freq_adjust | Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
| freq_adjust_method | Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
| unit_interval | Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
| add_frequency | Logical. Whether to add a column that gives the total number of occurrences of the item across all corpus parts; default is TRUE |
| digits | Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
| verbose | Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
| print_scores | Logical. Whether the dispersion scores should be printed to the console; default is TRUE |
This function takes as input a term-document matrix and returns, for each item (i.e. each row), the dispersion measure S. The rows in the input matrix represent the items, and the columns the corpus parts. Importantly, the term-document matrix must include an additional row that records the size of the corpus parts. For a proper term-document matrix, which includes all items that appear in the corpus, this can be added as a column margin, which sums the frequencies in each column. If the matrix only includes a selection of items drawn from the corpus, this information cannot be derived from the matrix and must be provided as a separate row.
Directionality: S ranges from 0 to 1. The conventional scaling of dispersion measures (see Juilland & Chang-Rodríguez 1964; Carroll 1970; Rosengren 1971) assigns higher values to more even/dispersed/balanced distributions of subfrequencies across corpus parts. This is the default. Gries (2008) uses the reverse scaling, with higher values denoting a more uneven/bursty/concentrated distribution; use directionality = "gries" to choose this option.
Frequency adjustment: Dispersion scores can be adjusted for frequency using the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208). The frequency-adjusted score for an item considers the lowest and highest possible level of dispersion it can obtain given its overall corpus frequency as well as the number (and size) of corpus parts. The unadjusted score is then expressed relative to these endpoints, where the dispersion minimum is set to 0, and the dispersion maximum to 1 (expressed in terms of conventional scaling). The frequency-adjusted score falls between these bounds and expresses how close the observed distribution is to the theoretical maximum and minimum. This adjustment therefore requires a maximally and a minimally dispersed distribution of the item across the parts. These hypothetical extremes can be built in different ways. Gries (2022, 2024) uses a computationally expensive procedure that finds the distribution that produces the highest value on the dispersion measure of interest. The current function constructs extreme distributions in a different way, based on the distributional features pervasiveness ("pervasive") or evenness ("even"). You can choose between these with the argument freq_adjust_method; the default is "even". For details and explanations, see vignette("frequency-adjustment").
To obtain the lowest possible level of dispersion, the occurrences are either allocated to as few corpus parts as possible ("pervasive"), or they are assigned to the smallest corpus part(s) ("even").
To obtain the highest possible level of dispersion, the occurrences are either spread as broadly across corpus parts as possible ("pervasive"), or they are allocated to corpus parts in proportion to their size ("even"). The choice between these methods is particularly relevant if corpus parts differ considerably in size. See documentation for find_max_disp().
In the formula given below, the following notation is used:
$k$: the number of corpus parts
$T_i$: the absolute subfrequency in part $i$
$w_i$: a proportional quantity; the size of corpus part $i$ divided by the size of the corpus (i.e. the sum of the part sizes)
$S$ is the dispersion measure proposed by Rosengren (1971); the formula uses conventional scaling:
$S = \frac{\left(\sum_{i=1}^{k} \sqrt{w_i T_i}\right)^2}{\sum_{i=1}^{k} T_i}$
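Applied to a term-document matrix, the same calculation runs once per item row after the part-size row has been split off. The sketch below uses a hypothetical toy matrix with the part sizes in the first row; the column names of the resulting data frame are illustrative and need not match the output of the package function.

```r
# Hypothetical toy matrix: part sizes in the first row, items below
tdm <- rbind(
  part_size = c(1000, 2000, 3000),
  item_a    = c(1, 2, 3),
  item_b    = c(6, 0, 0)
)
colnames(tdm) <- c("part1", "part2", "part3")

partsize <- tdm[1, ]                  # first row holds the part sizes
counts   <- tdm[-1, , drop = FALSE]   # remaining rows hold the subfrequencies
w_i      <- partsize / sum(partsize)  # relative size of each part

# Rosengren's S for every item (conventional scaling)
s_scores <- apply(counts, 1, function(T_i) sum(sqrt(w_i * T_i))^2 / sum(T_i))

result <- data.frame(
  item      = rownames(counts),
  S         = unname(s_scores),
  frequency = unname(rowSums(counts)),  # total occurrences per item
  stringsAsFactors = FALSE
)
result
```

item_a is distributed in proportion to the part sizes and scores 1, while item_b occurs only in the smallest part and scores 1/6, that part's share of the corpus.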
A data frame with one row per item
Lukas Soenning
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
disp_S_tdm(tdm = biber150_spokenBNC2014[1:20,], row_partsize = "first", directionality = "conventional")
This function calculates a number of parts-based dispersion measures and allows the user to choose the directionality of scaling, i.e. whether higher values denote a more even or a less even distribution. It also offers the option of calculating frequency-adjusted dispersion scores.
disp_tdm(tdm, row_partsize = "first", directionality = "conventional", freq_adjust = FALSE, freq_adjust_method = "even", add_frequency = TRUE, unit_interval = TRUE, digits = NULL, verbose = TRUE, print_scores = TRUE, suppress_warning = FALSE)
| Argument | Description |
|---|---|
| tdm | A term-document matrix, where rows represent items and columns represent corpus parts; must also contain a row giving the size of the corpus parts (first or last row in the term-document matrix) |
| row_partsize | Character string indicating which row in the term-document matrix contains the size of the corpus parts. Possible values are "first" (default) and "last" |
| directionality | Character string indicating the directionality of scaling. See details below. Possible values are "conventional" (default) and "gries" |
| freq_adjust | Logical. Whether the dispersion score should be adjusted for frequency (i.e. whether frequency should be 'partialed out'); default is FALSE |
| freq_adjust_method | Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
| add_frequency | Logical. Whether to add a column that gives the total number of occurrences of the item across all corpus parts; default is TRUE |
| unit_interval | Logical. Whether frequency-adjusted scores that exceed the limits of the unit interval should be replaced by 0 and 1; default is TRUE |
| digits | Rounding: Integer value specifying the number of decimal places to retain (default: no rounding) |
| verbose | Logical. Whether additional information (on directionality, formulas, frequency adjustment) should be printed; default is TRUE |
| print_scores | Logical. Whether the dispersion scores should be printed to the console; default is TRUE |
| suppress_warning | Logical. Whether warning messages should be suppressed; default is FALSE |
This function takes as input a term-document matrix and returns, for each item (i.e. each row), a variety of dispersion measures. The rows in the input matrix represent the items, and the columns the corpus parts. Importantly, the term-document matrix must include an additional row that records the size of the corpus parts. For a proper term-document matrix, which includes all items that appear in the corpus, this row can be added as a column margin, which sums the frequencies in each column. If the matrix only includes a selection of items drawn from the corpus, the part sizes cannot be derived from the matrix and must be provided as a separate row.
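The two ways of supplying the part-size row can be illustrated with a small base-R sketch. The matrix, the item and part names, and the part sizes below are invented for illustration; only the layout (items in rows, corpus parts in columns, part sizes in the first row) follows the description above.

```r
# Toy counts: 3 items (rows) x 4 corpus parts (columns); all names are hypothetical
counts <- matrix(
  c(2, 0, 1, 3,
    5, 5, 4, 6,
    0, 0, 0, 7),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("item_a", "item_b", "item_c"), paste0("part_", 1:4))
)

# Case 1: the matrix contains every item in the corpus,
# so the part sizes are simply the column sums
tdm_full <- rbind(part_size = colSums(counts), counts)

# Case 2: the matrix is only a selection of items,
# so the part sizes must come from an external source (made-up values here)
part_size <- c(1000, 1200, 800, 1500)
tdm_selection <- rbind(part_size = part_size, counts)

tdm_full
tdm_selection
```

Either object has the part sizes in its first row, which corresponds to row_partsize = "first".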
Directionality: The scores for all measures range from 0 to 1. The conventional scaling of dispersion measures (see Juilland & Chang-Rodriguez 1964; Carroll 1970; Rosengren 1971) assigns higher values to more even/dispersed/balanced distributions of subfrequencies across corpus parts. Gries (2008) uses the reverse scaling, with higher values denoting a more uneven/bursty/concentrated distribution; this scaling is selected with directionality = "gries".
Frequency adjustment: Dispersion scores can be adjusted for frequency using the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208). The frequency-adjusted score for an item considers the lowest and highest possible level of dispersion it can obtain given its overall corpus frequency as well as the number (and size) of corpus parts. The unadjusted score is then expressed relative to these endpoints, where the dispersion minimum is set to 0 and the dispersion maximum to 1 (expressed in terms of conventional scaling). The frequency-adjusted score falls between these bounds and expresses how close the observed distribution is to the theoretical maximum and minimum. This adjustment therefore requires a maximally and a minimally dispersed distribution of the item across the parts. These hypothetical extremes can be built in different ways. Gries (2022, 2024) uses a computationally expensive procedure that finds the distribution that produces the highest value on the dispersion measure of interest. The current function constructs extreme distributions in a different way, based on the distributional features pervasiveness ("pervasive") or evenness ("even"). You can choose between these with the argument freq_adjust_method; the default is "even". For details and explanations, see vignette("frequency-adjustment").
To obtain the lowest possible level of dispersion, the occurrences are either allocated to as few corpus parts as possible ("pervasive"), or they are assigned to the smallest corpus part(s) ("even").
To obtain the highest possible level of dispersion, the occurrences are either spread as broadly across corpus parts as possible ("pervasive"), or they are allocated to corpus parts in proportion to their size ("even"). The choice between these methods is particularly relevant if corpus parts differ considerably in size. See documentation for find_max_disp() and vignette("frequency-adjustment").
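The min-max step itself is simple arithmetic once the dispersion scores of the two hypothetical extremes are known. The sketch below assumes scores on the conventional scale; the function name adjust_score and its arguments are hypothetical and are not part of the package.

```r
# Hypothetical helper: min-max frequency adjustment of a dispersion score.
# observed: score of the item's actual distribution
# disp_min: score of the minimally dispersed hypothetical distribution
# disp_max: score of the maximally dispersed hypothetical distribution
adjust_score <- function(observed, disp_min, disp_max, unit_interval = TRUE) {
  adjusted <- (observed - disp_min) / (disp_max - disp_min)
  if (unit_interval) {
    # scores outside [0, 1] are replaced by the nearest limit
    adjusted <- pmin(pmax(adjusted, 0), 1)
  }
  adjusted
}

adjust_score(0.62, disp_min = 0.20, disp_max = 0.90)
adjust_score(0.95, disp_min = 0.20, disp_max = 0.90)
adjust_score(0.95, disp_min = 0.20, disp_max = 0.90, unit_interval = FALSE)
```

Scores can fall outside the unit interval because the extremes are built from pervasiveness or evenness rather than by searching for the true optimum of each measure; this is the situation the argument unit_interval addresses.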
The following measures are computed, listed in chronological order (see details below):
R_rel (Keniston 1920)
D (Juilland & Chang-Rodriguez 1964)
D_2 (Carroll 1970)
S (Rosengren 1971)
DP (Gries 2008; modification: Egbert et al. 2020)
D_A (Burch et al. 2017)
D_KL (Gries 2024)
In the formulas given below, the following notation is used:
k: the number of corpus parts
T_i: the absolute subfrequency in part i
t_i: a proportional quantity; the subfrequency in part i divided by the total number of occurrences of the item in the corpus (i.e. the sum of all subfrequencies)
W_i: the absolute size of corpus part i
w_i: a proportional quantity; the size of corpus part i divided by the size of the corpus (i.e. the sum of the part sizes)
R_i: the normalized subfrequency in part i, i.e. the subfrequency divided by the size of the corpus part
r_i: a proportional quantity; the normalized subfrequency in part i divided by the sum of all normalized subfrequencies
N: corpus frequency, i.e. the total number of occurrences of the item in the corpus
Note that the formulas cited below differ in their scaling, i.e. whether 1 reflects an even or an uneven distribution. In the current function, this behavior is overridden by the argument directionality. The specific scaling used in the formulas below is therefore irrelevant.
R_rel refers to the relative range, i.e. the proportion of corpus parts containing at least one occurrence of the item
D denotes Juilland's D and is calculated as follows (this formula uses conventional scaling); R̄ denotes the average over the normalized subfrequencies:
D_2 denotes the index proposed by Carroll (1970); the following formula uses conventional scaling:
S is the dispersion measure proposed by Rosengren (1971); the formula uses conventional scaling:
DP represents Gries's deviation of proportions; the following formula is the modified version suggested by Egbert et al. (2020: 99); it implements conventional scaling (0 = uneven, 1 = even) and the notation min{w_i: t_i > 0} refers to the smallest proportional part size w_i among those corpus parts that include at least one occurrence of the item.
D_A is a measure introduced into dispersion analysis by Burch et al. (2017). The following formula is the one used by Egbert et al. (2020: 98); it relies on normalized frequencies and therefore works with corpus parts of different size. The formula represents conventional scaling (0 = uneven, 1 = even):
The current function uses a formula that may be found in Wilcox (1973: 343). It relies on the proportional values r_i instead of the normalized subfrequencies R_i:
Since this formula is computationally expensive, the function actually uses the computational shortcut given in Wilcox (1973: 343). Critically, the proportional quantities r_i must first be sorted in decreasing order. Only after this rearrangement can the shortcut version be applied. We will refer to this rearranged version of r_i as r_i^sorted:
(Wilcox 1973: 343)
D_KL denotes a measure proposed by Gries (2020, 2021); for standardization, it uses the odds-to-probability transformation (Gries 2024: 90) and represents Gries scaling (0 = even, 1 = uneven):
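As a rough illustration of how the quantities defined in the notation enter the measures, here is a base-R sketch for a single item. It follows the standard textbook definitions of these indices as far as I am aware of them and is not taken from the package source; the function name disp_sketch is hypothetical, all scores are returned on the conventional scale (1 = even), and the sorted form of D_A is an algebraic rearrangement of the pairwise form derived for this sketch, which need not be the exact expression printed in Wilcox (1973).

```r
# Hypothetical helper (not part of the package): parts-based dispersion
# scores for one item, all on the conventional scale (0 = uneven, 1 = even)
disp_sketch <- function(subfreq, partsize) {
  k   <- length(partsize)            # number of corpus parts
  N   <- sum(subfreq)                # corpus frequency of the item
  t_i <- subfreq / N                 # share of the item's occurrences per part
  w_i <- partsize / sum(partsize)    # share of the corpus per part
  R_i <- subfreq / partsize          # normalized subfrequencies
  r_i <- R_i / sum(R_i)              # normalized subfrequencies as proportions

  R_rel <- sum(subfreq > 0) / k
  # Juilland's D: population SD of R_i relative to its mean, scaled by sqrt(k - 1)
  D  <- 1 - (sqrt(mean((R_i - mean(R_i))^2)) / mean(R_i)) / sqrt(k - 1)
  # Carroll's D2: entropy of r_i relative to its maximum (0 * log terms count as 0)
  D2 <- sum(ifelse(r_i > 0, r_i * log2(1 / r_i), 0)) / log2(k)
  # Rosengren's S
  S  <- sum(sqrt(w_i * subfreq))^2 / N
  # DP with the minimum taken over parts that contain the item
  DP <- 1 - (sum(abs(t_i - w_i)) / 2) / (1 - min(w_i[t_i > 0]))

  # D_A: pairwise form on proportions, and an equivalent form on sorted values
  DA_pairwise <- 1 - sum(stats::dist(r_i, method = "manhattan")) / (k - 1)
  r_sorted    <- sort(r_i, decreasing = TRUE)
  DA_sorted   <- (2 * sum(seq_len(k) * r_sorted) - 2) / (k - 1)

  # D_KL: Kullback-Leibler divergence, odds-to-probability transformed,
  # then reversed so that 1 = even
  KLD <- sum(ifelse(t_i > 0, t_i * log2(t_i / w_i), 0))
  DKL <- 1 - KLD / (1 + KLD)

  c(R_rel = R_rel, D = D, D2 = D2, S = S, DP = DP,
    DA_pairwise = DA_pairwise, DA_sorted = DA_sorted, DKL = DKL)
}

round(disp_sketch(subfreq = c(0, 0, 1, 2, 5), partsize = c(100, 100, 100, 500, 1000)), 3)
```

A perfectly even distribution over equally sized parts yields 1 on every index, which is a convenient sanity check. Sorting avoids the k(k-1)/2 pairwise comparisons, which matters when a term-document matrix has many columns.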
A data frame with one row per item
Lukas Soenning
Burch, Brent, Jesse Egbert & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189–216. doi:10.1558/jrds.33066
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x
Egbert, Jesse, Brent Burch & Douglas Biber. 2020. Lexical dispersion and corpus design. International Journal of Corpus Linguistics 25(1). 89–115. doi:10.1075/ijcl.18010.egb
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri
Gries, Stefan Th. 2020. Analyzing dispersion. In Magali Paquot & Stefan Th. Gries (eds.), A practical handbook of corpus linguistics, 99–118. New York: Springer. doi:10.1007/978-3-030-46216-1_5
Gries, Stefan Th. 2021. A new approach to (key) keywords analysis: Using frequency, and now also dispersion. Research in Corpus Linguistics 9(2). 1–33. doi:10.32714/ricl.09.02.02
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467
Keniston, Hayward. 1920. Common words in Spanish. Hispania 3(2). 85–96. doi:10.2307/331305
Lijffijt, Jefrey & Stefan Th. Gries. 2012. Correction to Stefan Th. Gries’ ‘Dispersions and adjusted frequencies in corpora’. International Journal of Corpus Linguistics 17(1). 147–149. doi:10.1075/ijcl.17.1.08lij
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
For finer control over the calculation of several dispersion measures:
disp_R_tdm() for R_rel
disp_DP_tdm() for DP
disp_DA_tdm() for D_A
disp_DKL_tdm() for D_KL
disp_tdm(tdm = biber150_spokenBNC2014[1:20, ], row_partsize = "first", directionality = "conventional", freq_adjust = FALSE)
This function returns the (hypothetical) distribution of subfrequencies that represents the highest possible level of dispersion for a given item across a particular set of corpus parts. It requires a vector of subfrequencies and a vector of corpus part sizes. This distribution is required for the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208) to obtain frequency-adjusted dispersion scores.
find_max_disp(subfreq, partsize, freq_adjust_method = "even")
subfreq |
A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part |
partsize |
A numeric vector specifying the size of the corpus parts |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
This function creates a hypothetical distribution of the total number of occurrences of the item (i.e. the sum of its subfrequencies) across corpus parts. The argument freq_adjust_method lets the user choose which distributional feature defines the highest possible level of dispersion: pervasiveness ("pervasive") or evenness ("even"). With "pervasive", the occurrences are spread as broadly across corpus parts as possible; with "even", they are allocated to corpus parts in proportion to their size. The choice between these methods is particularly relevant if corpus parts differ considerably in size. For details and explanations, see vignette("frequency-adjustment"). Since the dispersion of items that occur only once in the corpus (hapaxes) cannot be sensibly measured or manipulated, such items are disregarded; the function returns their observed subfrequencies.
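To make the "even" strategy concrete, the following base-R sketch allocates the occurrences in proportion to part size and returns whole numbers. The function name max_disp_even is hypothetical, and the largest-remainder rounding used to obtain integers is an assumption of this sketch; the package may round or break ties differently.

```r
# Hypothetical helper: proportional ("even") allocation of an item's
# occurrences across corpus parts, using largest-remainder rounding (assumed)
max_disp_even <- function(subfreq, partsize) {
  N <- sum(subfreq)
  if (N <= 1) return(subfreq)              # hapaxes are returned unchanged

  ideal <- N * partsize / sum(partsize)    # exact proportional shares
  alloc <- floor(ideal)                    # whole occurrences per part
  leftover <- N - sum(alloc)               # occurrences still to be placed
  if (leftover > 0) {
    # give one extra occurrence to the parts with the largest remainders
    idx <- order(ideal - alloc, decreasing = TRUE)[seq_len(leftover)]
    alloc[idx] <- alloc[idx] + 1
  }
  alloc
}

max_disp_even(subfreq = c(4, 0, 0), partsize = c(100, 100, 200))
max_disp_even(subfreq = c(0, 0, 1, 2, 5), partsize = c(100, 100, 100, 500, 1000))
```

The allocation always sums to the item's corpus frequency, and each part receives within one occurrence of its exact proportional share.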
An integer vector the same length as partsize
Lukas Soenning
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
find_max_disp(subfreq = c(0, 0, 1, 2, 5), partsize = c(100, 100, 100, 500, 1000), freq_adjust_method = "pervasive")
This function takes as input a term-document matrix and returns, for each item (i.e. row), the (hypothetical) distribution of subfrequencies that represents the highest possible level of dispersion for the item across the corpus parts. This distribution is required for the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208) to obtain frequency-adjusted dispersion scores.
find_max_disp_tdm(tdm, row_partsize = "first", freq_adjust_method = freq_adjust_method)
tdm |
A term-document matrix, where rows represent items and columns represent corpus parts; must also contain a row giving the size of the corpus parts (first or last row in the term-document matrix) |
row_partsize |
Character string indicating which row in the term-document matrix contains the size of the corpus parts. Possible values are "first" (default) and "last" |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" and "pervasive" |
This function takes as input a term-document matrix and creates, for each item in the matrix, a hypothetical distribution of the total number of occurrences of the item (i.e. the sum of the subfrequencies) across corpus parts. The argument freq_adjust_method lets the user choose which distributional feature defines the highest possible level of dispersion: pervasiveness ("pervasive") or evenness ("even"). With "pervasive", the occurrences are spread as broadly across corpus parts as possible; with "even", they are allocated to corpus parts in proportion to their size. The choice between these methods is particularly relevant if corpus parts differ considerably in size. For details and explanations, see vignette("frequency-adjustment"). Since the dispersion of items that occur only once in the corpus (hapaxes) cannot be sensibly measured or manipulated, such items are disregarded; the function returns their observed subfrequencies.
A matrix of integers with one row per item and one column per corpus part
Lukas Soenning
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
find_max_disp_tdm(tdm = biber150_spokenBNC2014[1:10, ], row_partsize = "first", freq_adjust_method = "even")
This function returns the (hypothetical) distribution of subfrequencies that represents the smallest possible level of dispersion for a given item across a particular set of corpus parts. It requires a vector of subfrequencies and a vector of corpus part sizes. This distribution is required for the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208) to obtain frequency-adjusted dispersion scores.
find_min_disp(subfreq, partsize, freq_adjust_method = "even")
subfreq |
A numeric vector of subfrequencies, i.e. the number of occurrences of the item in each corpus part |
partsize |
A numeric vector specifying the size of the corpus parts |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" (default) and "pervasive" |
This function creates a hypothetical distribution of the total number of occurrences of the item (i.e. the sum of its subfrequencies) across corpus parts. The argument freq_adjust_method lets the user choose which distributional feature defines the lowest possible level of dispersion: pervasiveness ("pervasive") or evenness ("even"). With "pervasive", the occurrences are allocated to as few corpus parts as possible; with "even", they are assigned to the smallest corpus part(s). For details and explanations, see vignette("frequency-adjustment"). Since the dispersion of items that occur only once in the corpus (hapaxes) cannot be sensibly measured or manipulated, such items are disregarded; the function returns their observed subfrequencies. The function reuses code segments from Gries's (2025) 'KLD4C' package (from the function most.uneven.distr()).
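The "even" strategy for the minimum can be sketched in base R as filling the smallest part first. The function name min_disp_even is hypothetical. The sketch additionally assumes that a part cannot hold more occurrences than it has tokens, so that any overflow moves on to the next-smallest part; this capacity rule is my reading of "smallest corpus part(s)" and is not confirmed by the text above.

```r
# Hypothetical helper: concentrate all occurrences in the smallest part(s).
# Assumption: a part cannot contain more occurrences than it has tokens.
min_disp_even <- function(subfreq, partsize) {
  N <- sum(subfreq)
  if (N <= 1) return(subfreq)              # hapaxes are returned unchanged

  alloc <- numeric(length(partsize))
  remaining <- N
  for (i in order(partsize)) {             # visit parts from smallest to largest
    alloc[i] <- min(remaining, partsize[i])
    remaining <- remaining - alloc[i]
    if (remaining == 0) break
  }
  alloc
}

min_disp_even(subfreq = c(0, 0, 1, 2, 5), partsize = c(100, 100, 100, 500, 1000))
min_disp_even(subfreq = c(2, 4), partsize = c(3, 10))
```

With equally sized parts, ties are resolved by position in this sketch, so the first part receives all occurrences.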
An integer vector the same length as partsize
Lukas Soenning
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Gries, Stefan Th. 2025. KLD4C: Gries 2024: Tupleization of corpus linguistics. R package version 1.01. (available from https://www.stgries.info/research/kld4c/kld4c.html)
find_min_disp(subfreq = c(0, 0, 1, 2, 5), partsize = rep(1000, 5), freq_adjust_method = "even")
This function takes as input a term-document matrix and returns, for each item (i.e. row), the (hypothetical) distribution of subfrequencies that represents the smallest possible level of dispersion for the item across the corpus parts. This distribution is required for the min-max transformation proposed by Gries (2022: 184-191; 2024: 196-208) to obtain frequency-adjusted dispersion scores.
find_min_disp_tdm(tdm, row_partsize = "first", freq_adjust_method = freq_adjust_method)
tdm |
A term-document matrix, where rows represent items and columns represent corpus parts; must also contain a row giving the size of the corpus parts (first or last row in the term-document matrix) |
row_partsize |
Character string indicating which row in the term-document matrix contains the size of the corpus parts. Possible values are "first" (default) and "last" |
freq_adjust_method |
Character string indicating which method to use for devising dispersion extremes. See details below. Possible values are "even" and "pervasive" |
This function takes as input a term-document matrix and creates, for each item in the matrix, a hypothetical distribution of the total number of occurrences of the item (i.e. the sum of the subfrequencies) across corpus parts. The argument freq_adjust_method lets the user choose which distributional feature defines the lowest possible level of dispersion: pervasiveness ("pervasive") or evenness ("even"). With "pervasive", the occurrences are allocated to as few corpus parts as possible; with "even", they are assigned to the smallest corpus part(s). For details and explanations, see vignette("frequency-adjustment"). Since the dispersion of items that occur only once in the corpus (hapaxes) cannot be sensibly measured or manipulated, such items are disregarded; the function returns their observed subfrequencies. The function reuses code segments from Gries's (2025) 'KLD4C' package (from the function most.uneven.distr()).
A matrix of integers with one row per item and one column per corpus part
Lukas Soenning
Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171–205. doi:10.1075/jsls.21029.gri
Gries, Stefan Th. 2024. Frequency, dispersion, association, and keyness: Revising and tupleizing corpus-linguistic measures. Amsterdam: Benjamins. doi:10.1075/scl.115
Gries, Stefan Th. 2025. KLD4C: Gries 2024: Tupleization of corpus linguistics. R package version 1.01. (available from https://www.stgries.info/research/kld4c/kld4c.html)
find_min_disp_tdm(tdm = biber150_spokenBNC2014[1:10, ], row_partsize = "first", freq_adjust_method = "even")
This function takes as input a vector of proportions (or, more generally, scores in the unit interval [0,1]) and re-expresses them using Tukey's folded power transformation. It allows the user to decide whether the transformed scores should be mapped to the [-1, +1] interval (default), or whether they may extend beyond these limits.
fpower(x, lambda = 1, scaling = "plus_minus_1")
x |
A numeric vector of scores in the unit interval [0,1]; 0 and 1 are allowed but throw an error message when lambda = 0 |
lambda |
Numeric value of the power transformation, which can range between 0 (limiting case: logit transformation) and 1 (no transformation) |
scaling |
Character string indicating whether scores should be re-expressed to the [-1, 1] interval ("plus_minus_1"; default) or may extend beyond these limits |
This function allows the user to apply a variety of folded power transformations to quantities bounded between 0 and 1. Different values may be specified for the power of the transformation (lambda), but only powers between 0 and 1 are supported. Two versions of the folded power transformation are available. The first maps transformed values to the [-1, +1] interval:
The second version does not impose these limits:
For lambda equal to 0, the logit transformation is implemented as a limiting case; note that input scores of 0 and 1 are not allowed when lambda is set to 0
lambda = 0.14 gives a close approximation to the probit transformation (see Fox 2016: 74) while accepting input scores of 0 and 1
lambda = 1/3 implements folded cube roots
lambda = 0.41 gives a close approximation to the arcsine-square-root (or angular) transformation (see Fox 2016: 74)
lambda = 0.5 implements folded roots
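A minimal base-R sketch of a folded power transformation may help to see how these cases relate. It assumes the textbook form p^lambda - (1 - p)^lambda for the version bounded by [-1, +1], and division by lambda for the unbounded version, which makes the logit the limit as lambda approaches 0. The function name fpower_sketch and its rescale argument are hypothetical, and the package's exact formulas may differ.

```r
# Hypothetical helper: folded power transformation of scores in [0, 1].
# rescale = TRUE  : p^lambda - (1 - p)^lambda, bounded by [-1, +1]
# rescale = FALSE : the same quantity divided by lambda (assumed unbounded form)
fpower_sketch <- function(x, lambda = 1, rescale = TRUE) {
  stopifnot(all(x >= 0 & x <= 1), lambda >= 0, lambda <= 1)
  if (lambda == 0) {
    # limiting case: the logit, which is undefined at 0 and 1
    if (any(x == 0 | x == 1)) stop("scores of 0 and 1 are not allowed when lambda = 0")
    return(log(x / (1 - x)))
  }
  y <- x^lambda - (1 - x)^lambda
  if (!rescale) y <- y / lambda
  y
}

fpower_sketch(seq(0, 1, 0.25), lambda = 0.5)      # folded roots
fpower_sketch(c(0.2, 0.5, 0.8), lambda = 0)       # logit
fpower_sketch(0.8, lambda = 1e-6, rescale = FALSE) # close to the logit of 0.8
```

With lambda = 1 the bounded version reduces to 2p - 1, a purely linear rescaling, which is why 1 corresponds to "no transformation".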
This function was written with the help of ChatGPT (version GPT-5.1; OpenAI 2025)
A numeric vector
Lukas Soenning
OpenAI. (2025). ChatGPT (GPT-5.1) Large language model. https://chat.openai.com
fpower(seq(0, 1, .1), lambda = .14, scaling = "plus_minus_1")
Tukey's folded power transformation
fpower_trans(lambda = 0)
lambda |
Numeric value of the applied power transformation, which can range between 0 (limiting case: logit transformation) and 1 (no transformation) |
This function was written with the help of ChatGPT (version GPT-5.1; OpenAI 2025)
A numeric vector
OpenAI. (2025). ChatGPT (GPT-5.1) Large language model. https://chat.openai.com
plot(fpower_trans(lambda = .5), xlim = c(0, 1))
This function takes as input a vector of transformed scores, i.e. values that were originally in the unit interval [0, 1] but were re-expressed using Tukey's folded power transformation. It allows back-transformation of two versions of folded powers: those that are mapped to the [-1, +1] interval and those that are not.
invfpower(y, lambda = 1, scaling = "plus_minus_1")
y |
A numeric vector of folded-power-transformed scores |
lambda |
Numeric value of the applied power transformation, which can range between 0 (limiting case: logit transformation) and 1 (no transformation) |
scaling |
Character string indicating whether scores were re-expressed to the [-1, 1] interval ("plus_minus_1"; default) or were allowed to extend beyond these limits |
This function was written with the help of ChatGPT (version GPT-5.1; OpenAI 2025)
A numeric vector
Lukas Soenning
OpenAI. (2025). ChatGPT (GPT-5.1) Large language model. https://chat.openai.com
invfpower(seq(-1, 1, .1), lambda = .14, scaling = "plus_minus_1")
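For most values of lambda, the folded power has no simple closed-form inverse, so one way to back-transform is numerical root finding. The base-R sketch below does this for the version bounded by [-1, +1], assuming the forward transformation p^lambda - (1 - p)^lambda; the function name invfpower_sketch is hypothetical, and the package may well use a different approach.

```r
# Hypothetical helper: numerical inverse of p^lambda - (1 - p)^lambda
invfpower_sketch <- function(y, lambda = 0.5) {
  if (lambda == 0) return(stats::plogis(y))       # inverse of the logit
  forward <- function(p) p^lambda - (1 - p)^lambda
  vapply(y, function(target) {
    if (target <= -1) return(0)                   # lower endpoint maps back to 0
    if (target >= 1) return(1)                    # upper endpoint maps back to 1
    # forward() increases from -1 to +1 on [0, 1], so the root is unique
    stats::uniroot(function(p) forward(p) - target,
                   interval = c(0, 1), tol = 1e-10)$root
  }, numeric(1))
}

p <- c(0.1, 0.5, 0.9)
y <- p^0.14 - (1 - p)^0.14        # forward transformation with lambda = 0.14
invfpower_sketch(y, lambda = 0.14) # recovers p up to numerical tolerance
```

Root finding works here because the forward transformation is strictly increasing on the unit interval, so every transformed score corresponds to exactly one original score.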
This dataset provides metadata for the text files in the Brown corpus (Francis & Kučera 1979). It maps standardized file names to the textual categories macro genre and genre, and records the length of each text file (in the total number of word and nonword tokens). Macro genres and genres are ordered based on the sampling frame informing the design of the Brown family of corpora (see https://listings.lib.msu.edu/public-corpora/cd421/manuals/brown/INDEX.HTM).
metadata_brown
A data frame with 500 rows and 4 columns:
Standardized name of the text file (e.g. "A01", "J58", "R07")
4 macro genres ("press", "general_prose", "learned", "fiction"); ordered factor
15 genres (e.g. "press_editorial", "popular_lore", "adventure_western_fiction"); ordered factor
The length of the text file, expressed as the number of (word and nonword) tokens
Francis, W. Nelson & Henry Kučera. 1979. A Standard Corpus of Present-Day Edited American English, for Use with Digital Computers (Brown). Providence, RI: Brown University.
McEnery, Tony & Andrew Hardie. 2012. Corpus linguistics. Cambridge: Cambridge University Press.
This dataset provides metadata for the text files in ICE-GB (Nelson et al. 2002). It maps standardized file names to various textual categories such as mode of production, macro genre and genre, and records the length of each text file (in the total number of word and nonword tokens). Text categories, macro genres and genres are ordered based on the sampling frame informing the design of the ICE family of corpora (see https://www.ice-corpora.uzh.ch/en/design.html).
metadata_ice_gb
A data frame with 500 rows and 7 columns:
Standardized name of the text file (e.g. "s1a-001", "w1b-008", "w2d-018")
Mode of production ("spoken" vs. "written")
4 higher-level text categories ("dialogues", "monologues", "non-printed", "printed"); ordered factor
12 macro genres (e.g. "private_dialogues", "student_writing", "reportage"); ordered factor
32 genres (e.g. "phonecalls", "unscripted_speeches", "novels_short_stories"); ordered factor
Short label for the genre (see Schützler 2023: 228); ordered factor
The length of the text file, expressed as the number of (word and nonword) tokens
https://www.ice-corpora.uzh.ch/en/design.html
Greenbaum, Sidney. 1996. Introducing ICE. In Sidney Greenbaum (ed.), Comparing English worldwide: The International Corpus of English, 3–12. Clarendon Press.
Nelson, Gerald, Sean Wallis, and Bas Aarts. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. John Benjamins.
Schützler, Ole. 2023. Concessive constructions in varieties of English. Language Science Press. doi:10.5281/zenodo.8375010
This dataset provides some metadata for speakers in the demographically sampled part of the Spoken BNC1994 (Crowdy 1995), including information on age, gender, and the total number of word tokens contributed to the corpus.
metadata_spokenBNC1994
A data frame with 1,017 rows and 7 columns:
Speaker ID (e.g. "PS002", "PS003")
Age group, based on the BNC1994 scheme ("0-14", "15-24", "25-34", "35-44", "45-59", "60+", "Unknown")
Speaker gender ("Female" vs. "Male")
Age of speaker; if actual age is not available, imputed based on age_group and age_bin
Number of word tokens the speaker contributed to the corpus
Age group, based on the BNC2014 scheme ("0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70+")
Crowdy, Steve. 1995. The BNC spoken corpus. In Geoffrey Leech, Greg Myers & Jenny Thomas (eds.), Spoken English on Computer: Transcription, Mark-Up and Annotation, 224–234. Harlow: Longman.
This dataset provides some metadata for the speakers in the Spoken BNC2014 (Love et al. 2017), including information on age, gender, and the total number of word tokens contributed to the corpus.
metadata_spokenBNC2014
A data frame with 668 rows and 6 columns:
Speaker ID (e.g. "S0001", "S0002")
Age group, based on the BNC1994 scheme ("0-14", "15-24", "25-34", "35-44", "45-59", "60+", "Unknown")
Speaker gender ("Female" vs. "Male")
Age of speaker; if actual age is not available, imputed based on age_group and age_bin
Number of word tokens the speaker contributed to the corpus
Age group, based on the BNC2014 scheme ("0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70+")
Love, Robbie, Claire Dembry, Andrew Hardie, Vaclav Brezina & Tony McEnery. 2017. The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3), 319–344.
This function can be used when plotting dispersion scores with ggplot2. It forces the x-axis to extend from 0 to 1 and adds verbal information at the endpoints to clarify the directionality of scaling. For "conventional" scaling, the endpoint labels are "(uneven) 0" and "(even) 1"; for "gries" scaling, they are "(even) 0" and "(uneven) 1".
scale_x_dispersion(directionality, n_breaks = 5, leading_zero = TRUE, ...)
directionality |
Character string indicating the directionality of scaling. Must match the way the dispersion scores were calculated. See details below. Possible values are "conventional" and "gries" |
n_breaks |
Number of major scale breaks: Integer value specifying the number of tick marks to display (default: 5) |
leading_zero |
Logical. Whether the tick mark labels should include a leading 0 ("0.50") or not (".50"); default is TRUE |
... |
Other arguments passed on to |
This function modifies the x-axis in a ggplot2 object. It forces the axis to extend from 0 to 1 and adds the labels "(even)" and "(uneven)" at the endpoints of the scale (0 and 1), to make clear which value (0 or 1) denotes a maximally even/dispersed/balanced distribution of subfrequencies across corpus parts. The conventional scaling of dispersion measures (see Juilland & Chang-Rodriguez 1964; Carroll 1970; Rosengren 1971) assigns higher values to more even/dispersed/balanced distributions of subfrequencies across corpus parts. In the {tlda} package, this is the default setting for all measures. Gries (2008) uses the reverse scaling, with higher values denoting a more uneven/bursty/concentrated distribution; use directionality = "gries" to choose this option. The function implements no default, so the user must specify which directionality was used when calculating the scores.
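The endpoint labelling can be sketched in base R without ggplot2. The helper below only builds the break positions and labels; its name dispersion_axis_labels is hypothetical, and the two-decimal formatting of the interior labels is an assumption based on the "0.50" and ".50" examples given for leading_zero.

```r
# Hypothetical helper: breaks and labels for a dispersion axis from 0 to 1
dispersion_axis_labels <- function(directionality, n_breaks = 5, leading_zero = TRUE) {
  breaks <- seq(0, 1, length.out = n_breaks)
  labels <- formatC(breaks, format = "f", digits = 2)
  if (!leading_zero) labels <- sub("^0\\.", ".", labels)

  # which verbal label belongs to 0 and which to 1
  ends <- switch(directionality,
    conventional = c("(uneven)", "(even)"),
    gries        = c("(even)", "(uneven)"),
    stop("directionality must be \"conventional\" or \"gries\"")
  )
  labels[1]        <- paste(ends[1], "0")
  labels[n_breaks] <- paste(ends[2], "1")
  labels
}

dispersion_axis_labels("conventional")
dispersion_axis_labels("gries", leading_zero = FALSE)

# With ggplot2 loaded, such labels could be passed on as, for example:
# scale_x_continuous(limits = c(0, 1),
#                    breaks = seq(0, 1, length.out = 5),
#                    labels = dispersion_axis_labels("conventional"))
```

Because the helper has no default for directionality, calling it without that argument fails, which mirrors the requirement that the user state which scaling was used.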
The ggplot2 function scale_x_continuous() with the appropriate settings for the arguments limits, breaks, and labels.
Lukas Soenning
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri
Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
if (require("ggplot2")) { ggplot(data = data.frame(dispersion = stats::runif(100, 0, 1)), aes(x = dispersion)) + geom_dotplot() + scale_x_dispersion(directionality = "conventional", n_breaks = 5) }
Position scale (x-axis) for Tukey's folded power transformation
scale_x_fpower(lambda = 0, breaks = NULL, labels = NULL, n_breaks = 6, ...)
lambda |
Numeric value of the applied power transformation |
breaks |
Numeric values indicating where the tick marks should be placed |
labels |
Character vector giving the labels that should be drawn at the tick marks |
n_breaks |
Integer specifying the number of tick marks to draw |
... |
Other arguments passed on to |
This function was written with the help of ChatGPT (version GPT-5.1; OpenAI 2025)
The ggplot2 function scale_(x|y)_continuous() with the appropriate transformation
OpenAI. (2025). ChatGPT (GPT-5.1) Large language model. https://chat.openai.com
if (require("ggplot2")) { ggplot(data = data.frame(dispersion = seq(0, 1, .01)), aes(x = dispersion)) + geom_dotplot() + scale_x_fpower(lambda = .5) }
This function can be used when plotting dispersion scores with ggplot2. It forces the y-axis to extend from 0 to 1 and adds verbal information at the endpoints to clarify the directionality of scaling. For conventional scaling, the endpoint labels are "(uneven) 0" and "(even) 1"; for Gries scaling, they are "(even) 0" and "(uneven) 1".
scale_y_dispersion(directionality, n_breaks = 5, leading_zero = TRUE, ...)
| Argument | Description |
|---|---|
| directionality | Character string indicating the directionality of scaling. Must match the way the dispersion scores were calculated. See details below. Possible values are |
| n_breaks | Number of major scale breaks: Integer value specifying the number of tick marks to display (default: |
| leading_zero | Logical. Whether the tick mark labels should include a leading 0 ("0.50") or not (".50"); default is |
| ... | Other arguments passed on to |
This function modifies the y-axis in a ggplot2 object. It forces the axis to extend from 0 to 1 and adds the labels "(even)" and "(uneven)" at the endpoints of the scale (0 and 1), to make clear which value (0 or 1) denotes a maximally even/dispersed/balanced distribution of subfrequencies across corpus parts. The conventional scaling of dispersion measures (see Juilland & Chang-Rodríguez 1964; Carroll 1970; Rosengren 1971) assigns higher values to more even/dispersed/balanced distributions of subfrequencies across corpus parts. In the {tlda} package, this is the default setting for all measures. Gries (2008) uses the reverse scaling, with higher values denoting a more uneven/bursty/concentrated distribution; use directionality = "gries" to choose this option. The function implements no default, so the user must specify which directionality was used when calculating the scores.
The ggplot2 function scale_y_continuous() with the appropriate settings for the arguments limits, breaks, and labels.
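To illustrate how the settings for limits, breaks, and labels could fit together, here is a minimal sketch. It assumes evenly spaced tick marks and two-decimal labels. The helper `dispersion_axis()` is hypothetical and is not the implementation used in {tlda}; only the endpoint labels and the meaning of the arguments are taken from the description above.

```r
# Hypothetical helper (not part of {tlda}): builds limits, breaks, and labels
# for a 0-to-1 dispersion axis. Assumes evenly spaced ticks.
dispersion_axis <- function(directionality, n_breaks = 5, leading_zero = TRUE) {
  # No default for directionality: the caller must state which scaling was used
  directionality <- match.arg(directionality, c("conventional", "gries"))
  breaks <- seq(0, 1, length.out = n_breaks)
  labels <- formatC(breaks, format = "f", digits = 2)
  if (!leading_zero) labels <- sub("^0\\.", ".", labels)
  # Endpoint labels state which end denotes an even distribution
  ends <- if (directionality == "conventional") {
    c("(uneven) 0", "(even) 1")
  } else {
    c("(even) 0", "(uneven) 1")
  }
  labels[1] <- ends[1]
  labels[n_breaks] <- ends[2]
  list(limits = c(0, 1), breaks = breaks, labels = labels)
}

ax <- dispersion_axis("gries", n_breaks = 3, leading_zero = FALSE)
ax$labels  # "(even) 0" ".50" "(uneven) 1"

# The three components map directly onto scale_y_continuous() arguments
if (requireNamespace("ggplot2", quietly = TRUE)) {
  y_scale <- ggplot2::scale_y_continuous(
    limits = ax$limits, breaks = ax$breaks, labels = ax$labels)
}
```

Leaving `directionality` without a default mirrors the documented behaviour: calling the helper without it raises an error instead of silently picking a scaling.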
Lukas Soenning
Carroll, John B. 1970. An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index. Computer Studies in the Humanities and Verbal Behaviour 3(2). 61–65. doi:10.1002/j.2333-8504.1970.tb00778.x
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437. doi:10.1075/ijcl.13.4.02gri
Juilland, Alphonse G. & Eugenio Chang-Rodríguez. 1964. Frequency dictionary of Spanish words. The Hague: Mouton de Gruyter. doi:10.1515/9783112415467
Rosengren, Inger. 1971. The quantitative concept of language and its relation to the structure of frequency dictionaries. Études de linguistique appliquée (Nouvelle Série) 1. 103–127.
if (require("ggplot2")) { ggplot( data = data.frame( frequency = c(1, 1.5, 2, 2.5, 3, 4, 5), dispersion = c(.25, .8, .34, .53, .88, .57, .9)), aes(x = frequency, y = dispersion)) + geom_point() + scale_y_dispersion( directionality = "conventional", n_breaks = 3) }
Position scale (y-axis) for Tukey's folded power transformation
scale_y_fpower(lambda = 0, breaks = NULL, labels = NULL, n_breaks = 6, ...)
| Argument | Description |
|---|---|
| lambda | Numeric value of the applied power transformation |
| breaks | Numeric values indicating where the tick marks should be placed |
| labels | Character vector giving the labels that should be drawn at the tick marks |
| n_breaks | Integer specifying the number of tick marks to draw |
| ... | Other arguments passed on to |
This function was written with the help of ChatGPT (version GPT-5.1; OpenAI 2025)
The ggplot2 function scale_(x|y)_continuous() with the appropriate transformation
OpenAI. (2025). ChatGPT (GPT-5.1) Large language model. https://chat.openai.com
if (require("ggplot2")) { ggplot( data = data.frame( frequency = rnorm(101), dispersion = seq(0, 1, .01)), aes(x = frequency, y = dispersion)) + geom_point() + scale_y_fpower(lambda = .5) }
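A continuous ggplot2 scale with a custom transformation needs an inverse as well as the forward mapping, so that positions on the transformed axis can be converted back to the original scale. The following sketch shows one way such a pair could be set up. It assumes the formulation (p^lambda - (1 - p)^lambda) / lambda for the folded power, with the logit for lambda = 0, and it inverts the transformation numerically for lambda > 0. The helpers `folded_power()` and `folded_power_inverse()` are hypothetical; how {tlda} handles this internally is not documented here.

```r
# Folded power transform in one common formulation (an assumption; the
# package's internal definition may differ in scaling).
folded_power <- function(p, lambda = 0) {
  if (lambda == 0) log(p) - log(1 - p) else (p^lambda - (1 - p)^lambda) / lambda
}

# Numerical inverse for lambda >= 0: find p in [0, 1] whose transform equals z.
# Hypothetical helper for illustration; returns NA for non-finite or
# out-of-range input instead of failing.
folded_power_inverse <- function(z, lambda = 0) {
  if (lambda == 0) return(plogis(z))  # closed-form inverse of the logit
  vapply(z, function(zi) {
    if (!is.finite(zi) || abs(zi) > 1 / lambda) return(NA_real_)
    uniroot(function(p) folded_power(p, lambda) - zi,
            interval = c(0, 1), tol = 1e-10)$root
  }, numeric(1))
}

# Round trip: transforming and back-transforming recovers the input
p <- c(0.1, 0.5, 0.9)
folded_power_inverse(folded_power(p, lambda = 0.5), lambda = 0.5)

# The pair can be bundled into a transformation object with {scales}
if (requireNamespace("scales", quietly = TRUE)) {
  fpower_half <- scales::trans_new(
    name = "fpower_half",
    transform = function(p) folded_power(p, lambda = 0.5),
    inverse = function(z) folded_power_inverse(z, lambda = 0.5))
}
```

Such an object can then be handed to scale_y_continuous() through its transformation argument, which is named `trans` in older and `transform` in newer ggplot2 releases.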