# General type-token distribution

@article{Hidaka2013GeneralTD, title={General type-token distribution}, author={Shohei Hidaka}, journal={Biometrika}, year={2013}, volume={101}, pages={999-1002} }

We consider the problem of estimating the number of types in a corpus using the number of types observed in a sample of tokens from that corpus. We derive exact and asymptotic distributions for the number of observed types, conditioned on the number of tokens and the latent type distribution. We use the asymptotic distributions to derive an estimator of the latent number of types and validate this estimator numerically.
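To make the quantity in the abstract concrete, the sketch below computes the expected number of observed types when tokens are drawn independently from a known type distribution, and checks it by simulation. It is not the paper's estimator or its exact distribution. The i.i.d. multinomial sampling model and the function names `expected_types` and `simulate_types` are assumptions made for illustration.

```python
import random


def expected_types(probs, n_tokens):
    """Expected number of distinct types among n_tokens i.i.d. draws.

    Assumes each token is drawn independently with type probabilities
    `probs` (an illustrative assumption, not the paper's derivation).
    Type i is missed by every draw with probability (1 - p_i) ** n_tokens.
    """
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)


def simulate_types(probs, n_tokens, n_trials=2000, seed=0):
    """Monte Carlo average of the number of distinct types observed."""
    rng = random.Random(seed)
    types = list(range(len(probs)))
    total = 0
    for _ in range(n_trials):
        sample = rng.choices(types, weights=probs, k=n_tokens)
        total += len(set(sample))
    return total / n_trials


if __name__ == "__main__":
    # Hypothetical Zipf-like latent distribution over 50 types.
    weights = [1.0 / rank for rank in range(1, 51)]
    norm = sum(weights)
    probs = [w / norm for w in weights]
    print(expected_types(probs, 100))
    print(simulate_types(probs, 100))
```

Under this model the expected type count grows with the number of tokens but stays below the latent number of types, which is why the observed count alone underestimates the quantity the paper sets out to estimate.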

#### 6 Citations

Estimating the latent number of types in growing corpora with reduced cost-accuracy trade-off.

- Psychology, Medicine
- Journal of child language
- 2016

This study proposes a novel technique for estimating the latent number of words from a series of words uttered by children, along with a new sampling scheme for vocabulary assessment that has lower cost and higher accuracy than existing methods.

Type-token models: a comparative study

- Mathematics, Computer Science
- J. Quant. Linguistics
- 2015

The type (V) – token (N) relationship has been studied for almost a century and a number of models have been developed to examine this relationship, but comparative studies have been rare.

Modelling Population Size Using Horvitz-Thompson Approach Based on the Zero-Truncated Poisson Lindley Distribution

- Computer Science
- NUMTA
- 2019

The simulation results show that the Horvitz-Thompson estimator based on the zero-truncated Poisson Lindley distribution for modelling the population size provides a good fit when compared to the zero-truncated Poisson distribution.

Statistical methods for biodiversity assessment

- Mathematics
- 2016

This thesis focuses on statistical methods for estimating the number of species, which is a natural index for measuring biodiversity. Both parametric and nonparametric approaches are investigated for…

Leveraging mutual exclusivity for faster cross-situational word learning: A theoretical analysis

- Psychology, Computer Science
- CogSci
- 2017

The 39th Annual Meeting of the Cognitive Science Society (CogSci) (London, UK, 26-29 July 2017) aims to advance the understanding of why language impairment is a major cause of disability in people with autism.

Quantifying temporal trends in biodiversity, and how they vary spatially

- 2015

Guppies inhabit streams in Trinidad, and their habitats can be categorised into high and low predation areas. Experimental transplants of guppies from high to low predation streams were performed in 2008…

#### References

Showing 1-10 of 40 references

Some Elaborations upon Gani's Model for the Type-Token Relationship

- Mathematics
- 1981

This paper considers certain stochastic models for token and type counts in literary texts. Elaborating on some models of Gani, it is shown that reasonable fits can be obtained to some data of Yule…

Good-Turing Frequency Estimation Without Tears

- Computer Science
- J. Quant. Linguistics
- 1995

The Simple Good–Turing estimator is defined, which is straightforward to use and performs well, absolutely and relative both to the approaches just discussed and to other, more sophisticated techniques.

How useful is the logarithmic type/token ratio?

- Sociology
- 1971

What has been described as ‘one of the most remarkable (facts) in quantitative linguistics’ is the constancy of the logarithmic type/token ratio. If V denotes vocabulary and N text length, then log…

Sampling-Based Estimation of the Number of Distinct Values of an Attribute

- Computer Science
- VLDB
- 1995

This appears to be the first extensive comparison of distinct-value estimators in either the database or statistical literature, and is certainly the first to use highly skewed data of the sort frequently encountered in database applications.

Sampling from Dirichlet partitions: estimating the number of species

- Mathematics
- 2009

The Dirichlet partition of an interval can be viewed as the generalization of several classical models in ecological statistics. We recall the unordered Ewens sampling formulae (ESF) from finite…

On the relation between the type–token and species-area problems

- Mathematics
- 1982

The species-area problem in biology and the type-token problem in literary studies are analogues of one another but have nearly disjoint literatures. Here their relationship is treated, a critique of…

How Variable May a Constant be? Measures of Lexical Richness in Perspective

- Mathematics, Computer Science
- Comput. Humanit.
- 1998

The results suggest that the empirical trajectories tap into a considerable amount of authorial structure without, however, guaranteeing that spatial separation implies a difference in authorship.

Estimating the Number of Species: A Review

- Mathematics
- 1993

How many kinds are there? Suppose that a population is partitioned into C classes. In many situations interest focuses not on estimation of the relative sizes of the classes, but on estimation of C…

Measures of Lexical Richness

- Computer Science
- 2012

Lexical richness is about the quality of vocabulary in a language sample. For some, this is equated with the variety of lexis, while for others it is a multidimensional concept.

Nonparametric estimation of the number of classes in a population

- Mathematics
- 1984

Efron's method (1981, 1982) is applied to the construction of confidence intervals based on bootstrap distributions.