Unnest + Group by + Count #1

Open

ArturFormella opened this issue Aug 29, 2017 · 5 comments

@ArturFormella

Hello!
I love this extension!

Maybe you can help me with another issue with arrays. This is the heaviest part of my queries right now: unnest + group by + count. Let's call it:

count_tags( integer[] ) RETURNS KeyValue[]

There is also one big problem for any optimisation: PostgreSQL does not allow an aggregate function to return a set of records, which is why the result has to be packed into an array.
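
For concreteness, a minimal sketch of the KeyValue type this signature assumes (the type name comes from the signature above; the field names are illustrative):

/* hypothetical composite type matching the count_tags signature above;
   the field names are illustrative */
CREATE TYPE KeyValue AS (key integer, value bigint);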

Example:

DROP TABLE IF EXISTS news;
CREATE TABLE news(
  id serial NOT NULL,
  tags integer[] NOT NULL,
  category_id integer NOT NULL,
  CONSTRAINT news_pkey PRIMARY KEY (id)
);
CREATE INDEX ON news(category_id);

/* sample data; note that the + operator on integer[] below comes from the
   intarray extension (it appends an integer to the array) */
INSERT INTO news (tags, category_id)
	SELECT
		'{1}'::integer[] + (num%100) + (num%10) + (num%3) + (num%2),
		(num%10) + 1
	FROM generate_series(1,10000000) AS n(num);
ANALYZE news;

explain analyze
SELECT tag, count(1) as counter
FROM (
	SELECT unnest(tags) as tag
	FROM news
	where
		category_id in (5,7,9)
) f
GROUP BY f.tag;

HashAggregate  (cost=625110.57..625112.57 rows=200 width=12) (actual time=510.213..510.227 rows=33 loops=1)
  Group Key: unnest(news.tags)
  ->  Bitmap Heap Scan on news  (cost=5995.20..171060.58 rows=30270000 width=4) (actual time=20.786..305.056 rows=1500000 loops=1)
        Recheck Cond: (category_id = ANY ('{5,7,9}'::integer[]))
        Heap Blocks: exact=10310
        ->  Bitmap Index Scan on news_category_id_idx  (cost=0.00..5919.52 rows=302700 width=0) (actual time=19.191..19.191 rows=300000 loops=1)
              Index Cond: (category_id = ANY ('{5,7,9}'::integer[]))
Planning time: 0.883 ms
Execution time: 510.405 ms

Counting the same data in Node.js took 1/100 of this time, so intuition tells me there must be another way.
What do you think? Is it possible in C?

Thank you

@pjungwir (Owner)

Hi, thanks for your note! I ran your code locally and got even worse performance than you show above:

                                                                      QUERY PLAN                                                                       
-------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=6139504.03..6139506.03 rows=200 width=12) (actual time=21138.722..21138.741 rows=33 loops=1)
   Group Key: unnest(news.tags)
   ->  Bitmap Heap Scan on news  (cost=59006.20..1681526.53 rows=297198500 width=4) (actual time=332.960..11076.984 rows=15000000 loops=1)
         Recheck Cond: (category_id = ANY ('{5,7,9}'::integer[]))
         Rows Removed by Index Recheck: 3581101
         Heap Blocks: exact=50353 lossy=52740
         ->  Bitmap Index Scan on news_category_id_idx  (cost=0.00..58263.20 rows=2971985 width=0) (actual time=324.073..324.073 rows=3000000 loops=1)
               Index Cond: (category_id = ANY ('{5,7,9}'::integer[]))
 Planning time: 0.153 ms
 Execution time: 21138.822 ms

Just to make sure I understand, you want to see how many news items in category 5/7/9 have each tag? For instance:

 tag | counter 
-----+---------
   8 | 1100000
  16 |  100000
  54 |  100000
  46 |  100000
  48 |  100000
  28 |  100000
  36 |  100000
  94 |  100000
   4 | 1100000
  56 |  100000
. . .

If that's right, I'm not sure how much a C function is going to help here. On the one hand, UNNEST can be pretty slow (and easily cause a spill to disk). On the other hand, you're still going to have to load 3,000,000 rows and inspect each one.

I'll think about it a bit, but at first glance I think your best bet is a different table structure. (Do you really have only 10 categories?) If you come up with anything in the meantime, please let me know!
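
One possible restructuring, offered only as a sketch and not as something from this thread: a separate (news, tag) join table lets the per-tag counts be driven by a btree index instead of unnesting arrays.

/* hypothetical alternative structure: one row per (news, tag) pair */
CREATE TABLE news_tags (
  news_id integer NOT NULL REFERENCES news(id),
  tag     integer NOT NULL
);
CREATE INDEX ON news_tags(tag, news_id);

/* the faceted count then becomes a plain join + GROUP BY */
SELECT nt.tag, count(*) AS counter
FROM news_tags nt
JOIN news n ON n.id = nt.news_id
WHERE n.category_id IN (5, 7, 9)
GROUP BY nt.tag;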

@ArturFormella (Author) commented Aug 30, 2017

Hello!
Thank you for your reply.

> Just to make sure I understand, you want to see how many news items in category 5/7/9 have each tag? For instance:

Yes, exactly.

Of course I have more categories (about 2000), plus full-text search and other conditions in this query. It looks like faceted search (http://akorotkov.github.io/blog/2016/06/17/faceted-search/); the above was only an example :)
Loading the rows from disk took ~11076 ms, but counting them took another 10061 ms (21138.741-11076.984), and that's the main problem.

My first idea was to reduce the data while scanning the rows, instead of unnest + grouping:

  1. create a hashmap structure S (key-value)
  2. foreach row containing tags (integer[])
       foreach tag (integer)
         if( S.contains(tag) )
           S[tag]++
         else
           S[tag] = 1
  3. return S as a two-dimensional array (keys, values) or JSON (or polygon?)

But I can't find a way to implement it as an aggregate function with the PostgreSQL API. Must structure S be serialized into a single value after every row and de-serialized again for the next row?
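
For reference, a sketch of how such an aggregate is usually declared (the function names here are hypothetical): with stype = internal, the transition state is passed between rows as an in-memory pointer, so it is not serialized per row; serialization only becomes necessary for parallel aggregation.

/* hypothetical declaration: an internal-state aggregate keeps the hashmap
   as a pointer across rows, with no per-row (de)serialization; the final
   function converts the hashmap into the returned array */
CREATE AGGREGATE faceted_count(integer[]) (
  sfunc     = faceted_count_transfn,
  stype     = internal,
  finalfunc = faceted_count_finalfn
);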

@ArturFormella (Author)

Hello!
It is not easy to use a map structure in C, but I think I've got it.

ArturFormella@495d7df

Example usage:

EXPLAIN ANALYZE
SELECT key, value
	FROM (
		SELECT
			faceted_count(tags)
		FROM public.news
		WHERE
			category_id IN (5,7,9)
	) AS xx(y), unnest(y[1:1], y[2:2])
AS pair(key, value);
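
For context, and purely as a standalone illustration with values borrowed from the sample output above: unnest with two array arguments produces parallel columns, which is what splits the keys from the counts here.

/* standalone illustration: the [1:1] and [2:2] slices of a 2-D array
   unnest into parallel key/value columns */
SELECT key, value
FROM unnest(
       ('{{8,16,54},{1100000,100000,100000}}'::integer[])[1:1],
       ('{{8,16,54},{1100000,100000,100000}}'::integer[])[2:2]
     ) AS pair(key, value);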

The result is more than 50% faster than unnest + group by + count, and it should use less memory.

Could you check whether there are any errors? Do you have any suggestions?

@ArturFormella (Author)

I've made a parallel-safe version of faceted_count. It's even faster.

SET max_parallel_workers_per_gather TO 2;
SET force_parallel_mode TO on;

EXPLAIN ANALYZE
SELECT key, value
	FROM (
		SELECT
			faceted_count(tags)
		FROM public.news
		WHERE
			category_id IN (5,7,9)
	) AS xx(y), unnest(y[1:1], y[2:2])
AS pair(key, value);

https://github.com/ArturFormella/aggs_for_arrays/blob/master/faceted_count.c

The next step is to tune the hashmap and the serialization.
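
For reference, a sketch of the extra declaration pieces a parallel-safe internal-state aggregate needs (the function names are hypothetical): each worker builds its own hashmap, serializes it, and the leader deserializes and combines the partial states.

/* hypothetical parallel-safe declaration: workers serialize their partial
   hashmaps to bytea and the leader combines them before the final function */
CREATE AGGREGATE faceted_count(integer[]) (
  sfunc        = faceted_count_transfn,
  stype        = internal,
  finalfunc    = faceted_count_finalfn,
  combinefunc  = faceted_count_combinefn,
  serialfunc   = faceted_count_serialfn,
  deserialfunc = faceted_count_deserialfn,
  parallel     = safe
);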

@pjungwir (Owner)

Thank you for your work on this function! I've only skimmed the code, but I think it's a worthwhile addition. I'm trying to decide whether it belongs here or in aggs_for_vecs. It doesn't quite fit in either place, so I think I'll probably keep it here, despite the other functions here not really being aggregates. If I come up with a better idea I'll let you know. :-) Anyway, I just wanted to let you know that I'm following your progress with interest! If you'd like any help with a test file or documentation, I'm happy to add those.
