
Collecting Document Statistics

The function ts_stat is useful for checking your configuration and for finding stop-word candidates.

ts_stat(sqlquery text, [ weights text, ]
        OUT word text, OUT ndoc integer,
        OUT nentry integer) returns setof record

sqlquery is a text value containing an SQL query that must return a single tsvector column. ts_stat executes the query and returns statistics about each distinct lexeme (word) contained in the tsvector data. The columns returned are:

  • word text: the value of a lexeme
  • ndoc integer: number of documents (tsvectors) the word occurred in
  • nentry integer: total number of occurrences of the word
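
For instance, if a table already stores a precomputed tsvector column, the query passed to ts_stat can simply select that column. A minimal sketch; the table and column names (docs, body_tsv) are illustrative only:

SELECT word, ndoc, nentry
FROM ts_stat('SELECT body_tsv FROM docs')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;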

If weights is supplied, only occurrences having one of those weights are counted. For example, to find the ten most frequent words in a document collection:

SELECT * FROM ts_stat('SELECT to_tsvector(''english'', sr_reason_sk) FROM tpcds.store_returns WHERE sr_customer_sk < 10') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10;
 word | ndoc | nentry 
------+------+--------
 32   |    2 |      2
 33   |    2 |      2
 1    |    1 |      1
 10   |    1 |      1
 13   |    1 |      1
 14   |    1 |      1
 15   |    1 |      1
 17   |    1 |      1
 20   |    1 |      1
 22   |    1 |      1
(10 rows)
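
As mentioned above, ts_stat is also useful for finding stop-word candidates: lexemes that appear in nearly every document contribute little to search quality. Sorting by ndoc instead surfaces them; this sketch simply reorders the sample query from the example above:

SELECT * FROM ts_stat('SELECT to_tsvector(''english'', sr_reason_sk) FROM tpcds.store_returns WHERE sr_customer_sk < 10') ORDER BY ndoc DESC, nentry DESC, word LIMIT 10;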

The same, but counting only word occurrences with weight A:

SELECT * FROM ts_stat('SELECT to_tsvector(''english'', sr_reason_sk) FROM tpcds.store_returns WHERE sr_customer_sk < 10', 'a') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10;
 word | ndoc | nentry 
------+------+--------
(0 rows)
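
An empty result is expected here when the tsvector values come from plain to_tsvector, because every position then carries the default weight D and never matches the weight-A filter. Weighted occurrences are normally produced with setweight; a minimal sketch that would yield weight-A counts for the title text, assuming a hypothetical docs table with title and body columns:

SELECT * FROM ts_stat(
  'SELECT setweight(to_tsvector(''english'', title), ''A'') ||
          setweight(to_tsvector(''english'', body), ''B'')
     FROM docs',
  'a');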