Unnest + Group by + Count #1

Open

ArturFormella opened this issue Aug 29, 2017 · 5 comments

@ArturFormella

Hello!
I love this extension!

Maybe you can help me with another issue with arrays. This is the heaviest part of my queries right now: unnest + group by + count. Let's call it:

count_tags( integer[] ) RETURNS KeyValue[]

There is also one big problem for any optimisation: PostgreSQL does not allow an aggregate function to return a set of records, which is why the result has to be packed into an array.
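
For concreteness, a minimal sketch of the KeyValue type this signature assumes (the type name comes from the signature above; the field names are illustrative):

/* hypothetical composite type matching the count_tags signature above;
   the field names are illustrative */
CREATE TYPE KeyValue AS (key integer, value bigint);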

Example:

DROP TABLE IF EXISTS news;
CREATE TABLE news(
  id serial NOT NULL,
  tags integer[] NOT NULL,
  category_id integer NOT NULL,
  CONSTRAINT news_pkey PRIMARY KEY (id)
);
CREATE INDEX ON news(category_id);

/* sample data; note that the + operator on integer[] below comes from the
   intarray extension (it appends an integer to the array) */
INSERT INTO news (tags, category_id)
	SELECT
		'{1}'::integer[] + (num%100) + (num%10) + (num%3) + (num%2),
		(num%10) + 1
	FROM generate_series(1,10000000) AS n(num);
ANALYZE news;

explain analyze
SELECT tag, count(1) as counter
FROM (
	SELECT unnest(tags) as tag
	FROM news
	where
		category_id in (5,7,9)
) f
GROUP BY f.tag;

HashAggregate  (cost=625110.57..625112.57 rows=200 width=12) (actual time=510.213..510.227 rows=33 loops=1)
  Group Key: unnest(news.tags)
  ->  Bitmap Heap Scan on news  (cost=5995.20..171060.58 rows=30270000 width=4) (actual time=20.786..305.056 rows=1500000 loops=1)
        Recheck Cond: (category_id = ANY ('{5,7,9}'::integer[]))
        Heap Blocks: exact=10310
        ->  Bitmap Index Scan on news_category_id_idx  (cost=0.00..5919.52 rows=302700 width=0) (actual time=19.191..19.191 rows=300000 loops=1)
              Index Cond: (category_id = ANY ('{5,7,9}'::integer[]))
Planning time: 0.883 ms
Execution time: 510.405 ms

Counting the same data in Node.js took 1/100 of this time, so intuition tells me there must be another way.
What do you think? Is it possible in C?

Thank you

@pjungwir (Owner)

Hi, thanks for your note! I ran your code locally and got even worse performance than you show above:

                                                                      QUERY PLAN                                                                       
-------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=6139504.03..6139506.03 rows=200 width=12) (actual time=21138.722..21138.741 rows=33 loops=1)
   Group Key: unnest(news.tags)
   ->  Bitmap Heap Scan on news  (cost=59006.20..1681526.53 rows=297198500 width=4) (actual time=332.960..11076.984 rows=15000000 loops=1)
         Recheck Cond: (category_id = ANY ('{5,7,9}'::integer[]))
         Rows Removed by Index Recheck: 3581101
         Heap Blocks: exact=50353 lossy=52740
         ->  Bitmap Index Scan on news_category_id_idx  (cost=0.00..58263.20 rows=2971985 width=0) (actual time=324.073..324.073 rows=3000000 loops=1)
               Index Cond: (category_id = ANY ('{5,7,9}'::integer[]))
 Planning time: 0.153 ms
 Execution time: 21138.822 ms

Just to make sure I understand, you want to see how many news items in category 5/7/9 have each tag? For instance:

 tag | counter 
-----+---------
   8 | 1100000
  16 |  100000
  54 |  100000
  46 |  100000
  48 |  100000
  28 |  100000
  36 |  100000
  94 |  100000
   4 | 1100000
  56 |  100000
. . .

If that's right, I'm not sure how much a C function is going to help here. On the one hand, UNNEST can be pretty slow (and easily cause a spill to disk). On the other hand, you're still going to have to load 3,000,000 rows and inspect each one.

I'll think about it a bit, but at first glance I think your best bet is a different table structure. (Do you really have only 10 categories?) If you come up with anything in the meantime, please let me know!
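
One possible restructuring, offered only as a sketch and not as something from this thread: a separate (news, tag) join table lets the per-tag counts be driven by a btree index instead of unnesting arrays.

/* hypothetical alternative structure: one row per (news, tag) pair */
CREATE TABLE news_tags (
  news_id integer NOT NULL REFERENCES news(id),
  tag     integer NOT NULL
);
CREATE INDEX ON news_tags(tag, news_id);

/* the faceted count then becomes a plain join + GROUP BY */
SELECT nt.tag, count(*) AS counter
FROM news_tags nt
JOIN news n ON n.id = nt.news_id
WHERE n.category_id IN (5, 7, 9)
GROUP BY nt.tag;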

@ArturFormella (Author) commented Aug 30, 2017

Hello!
Thank you for your reply.

> Just to make sure I understand, you want to see how many news items in category 5/7/9 have each tag? For instance:

Yes, exactly.

Of course I have more categories (about 2000), plus full-text search and other conditions in this query. It looks like faceted search (http://akorotkov.github.io/blog/2016/06/17/faceted-search/); the above was only an example :)
Loading the rows from disk took ~11076 ms, but counting them took another 10061 ms (21138.741-11076.984), and that's the main problem.

My first idea was to reduce the data while scanning the rows, instead of unnest + grouping:

  1. create a hashmap structure S (key-value)
  2. foreach row containing tags (integer[])
       foreach tag (integer)
         if( S.contains(tag) )
           S[tag]++
         else
           S[tag] = 1
  3. return S as a two-dimensional array (keys, values) or JSON (or polygon?)

But I can't find a way to implement it as an aggregate function with the PostgreSQL API. Must structure S be serialized into a single value after every row and de-serialized again for the next row?
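
For reference, a sketch of how such an aggregate is usually declared (the function names here are hypothetical): with stype = internal, the transition state is passed between rows as an in-memory pointer, so it is not serialized per row; serialization only becomes necessary for parallel aggregation.

/* hypothetical declaration: an internal-state aggregate keeps the hashmap
   as a pointer across rows, with no per-row (de)serialization; the final
   function converts the hashmap into the returned array */
CREATE AGGREGATE faceted_count(integer[]) (
  sfunc     = faceted_count_transfn,
  stype     = internal,
  finalfunc = faceted_count_finalfn
);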

@ArturFormella (Author)

Hello!
It is not easy to use a map structure in C, but I think I've got it.

ArturFormella@495d7df

Example usage:

EXPLAIN ANALYZE
SELECT key, value
	FROM (
		SELECT
			faceted_count(tags)
		FROM public.news
		WHERE
			category_id IN (5,7,9)
	) AS xx(y), unnest(y[1:1], y[2:2])
AS pair(key, value);
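
For context, and purely as a standalone illustration with values borrowed from the sample output above: unnest with two array arguments produces parallel columns, which is what splits the keys from the counts here.

/* standalone illustration: the [1:1] and [2:2] slices of a 2-D array
   unnest into parallel key/value columns */
SELECT key, value
FROM unnest(
       ('{{8,16,54},{1100000,100000,100000}}'::integer[])[1:1],
       ('{{8,16,54},{1100000,100000,100000}}'::integer[])[2:2]
     ) AS pair(key, value);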

The result is more than 50% faster than unnest + group by + count, and it should use less memory.

Could you check whether there are any errors? Do you have any suggestions?

@ArturFormella (Author)

I've made a parallel-safe version of faceted_count. It's even faster.

SET max_parallel_workers_per_gather TO 2;
SET force_parallel_mode TO on;

EXPLAIN ANALYZE
SELECT key, value
	FROM (
		SELECT
			faceted_count(tags)
		FROM public.news
		WHERE
			category_id IN (5,7,9)
	) AS xx(y), unnest(y[1:1], y[2:2])
AS pair(key, value);

https://github.com/ArturFormella/aggs_for_arrays/blob/master/faceted_count.c

The next step is to tune the hashmap and the serialization.
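
For reference, a sketch of the extra declaration pieces a parallel-safe internal-state aggregate needs (the function names are hypothetical): each worker builds its own hashmap, serializes it, and the leader deserializes and combines the partial states.

/* hypothetical parallel-safe declaration: workers serialize their partial
   hashmaps to bytea and the leader combines them before the final function */
CREATE AGGREGATE faceted_count(integer[]) (
  sfunc        = faceted_count_transfn,
  stype        = internal,
  finalfunc    = faceted_count_finalfn,
  combinefunc  = faceted_count_combinefn,
  serialfunc   = faceted_count_serialfn,
  deserialfunc = faceted_count_deserialfn,
  parallel     = safe
);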

@pjungwir (Owner)

Thank you for your work on this function! I've only skimmed the code, but I think it's a worthwhile addition. I'm trying to decide whether it belongs here or in aggs_for_vecs. It doesn't quite fit in either place, so I think I'll probably keep it here, despite the other functions here not really being aggregates. If I come up with a better idea I'll let you know. :-) Anyway, I just wanted to let you know that I'm following your progress with interest! If you'd like any help with a test file or documentation, I'm happy to add those.
