3IterationIdeas
Hashtags are represented in 2D space: the x-axis is the min-date (date
of first occurrence) and the y-axis is the max-date (date of last
occurrence). If there are too many collisions, consider representing
all hashtags in a simple grid; the previous solutions may produce fewer
overlapping edges. Hashtags are connected with a line whose thickness
is 1/distanceBetweenThem in the cluster's n-dimensional space.
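As a rough illustration of the 2D placement, the sketch below derives
each hashtag's (min-date, max-date) position from a list of uses. The
function name layout_positions and the (hashtag, date) input shape are
assumptions made for this example, not part of the notes.

```python
from datetime import date


def layout_positions(uses):
    """Map each hashtag to (first_use, last_use), i.e. its (x, y) position.

    `uses` is an iterable of (hashtag, date) pairs (hypothetical input shape).
    """
    positions = {}
    for tag, day in uses:
        first, last = positions.get(tag, (day, day))
        positions[tag] = (min(first, day), max(last, day))
    return positions


# Example usage with made-up data.
example_uses = [
    ("#a", date(2020, 1, 5)),
    ("#a", date(2020, 1, 1)),
    ("#b", date(2020, 2, 1)),
]
example_positions = layout_positions(example_uses)
```

Hashtags that share both dates land on the same point, which is the
collision case mentioned above.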
Clusters:
Each hashtag is represented by an n-dimensional vector. Each week/date
(granularity to be decided later) is a dimension, and each hashtag is a
further dimension. When a hashtag is used k times on a given date, k is
added to the vector component representing that date. The same logic
is applied when 2 hashtags are used together k times: k is added to the
component representing the other hashtag.
We then apply the k-means algorithm to calculate the clusters.
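The vector construction and clustering could be sketched as follows.
This is a minimal pure-Python version under several assumptions: posts
arrive as (date_bucket, hashtags) pairs, a hashtag counts once per
post, and k-means is initialised with a simple farthest-first pick. The
names build_vectors, farthest_first and k_means are hypothetical.

```python
from itertools import combinations


def build_vectors(posts):
    """Build one vector per hashtag: date dimensions first, then hashtag dimensions."""
    posts = list(posts)
    dates = sorted({d for d, _ in posts})
    tags = sorted({t for _, ts in posts for t in ts})
    date_dim = {d: i for i, d in enumerate(dates)}
    tag_dim = {t: len(dates) + i for i, t in enumerate(tags)}
    vectors = {t: [0] * (len(dates) + len(tags)) for t in tags}
    for day, hashtags in posts:
        unique = sorted(set(hashtags))
        for t in unique:
            vectors[t][date_dim[day]] += 1      # used once more on this date
        for a, b in combinations(unique, 2):
            vectors[a][tag_dim[b]] += 1         # a used together with b
            vectors[b][tag_dim[a]] += 1         # b used together with a
    return vectors


def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def farthest_first(vectors, names, k):
    """Assumed initialisation: start at the first name, then pick far-away points."""
    centers = [list(vectors[names[0]])]
    while len(centers) < k:
        far = max(names, key=lambda n: min(squared_distance(vectors[n], c) for c in centers))
        centers.append(list(vectors[far]))
    return centers


def k_means(vectors, k, iterations=20):
    """Plain Lloyd iterations; returns ({hashtag: cluster index}, centers)."""
    names = sorted(vectors)
    centers = farthest_first(vectors, names, k)
    assignment = {}
    for _ in range(iterations):
        new_assignment = {
            n: min(range(k), key=lambda c: squared_distance(vectors[n], centers[c]))
            for n in names
        }
        if new_assignment == assignment:
            break  # converged: no hashtag changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [vectors[n] for n in names if assignment[n] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Example usage with made-up posts.
example_posts = [("w1", ["a", "b"]), ("w1", ["a", "b"]), ("w2", ["c", "d"]), ("w2", ["c"])]
example_vectors = build_vectors(example_posts)
example_assignment, example_centers = k_means(example_vectors, k=2)
```

The initialisation strongly affects the result; the farthest-first
choice here is only one deterministic option for the sketch.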
Further representation details:
After we have our clusters, we calculate all edge widths for each
cluster: the width of each edge is 1/distanceBetweenHashtags in the
cluster's n-dimensional space. As a connection should only exist when
two hashtags have been used together, we explicitly remove the edge if
the given 2 hashtags have not been used together.
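A small sketch of this edge-width rule is shown below. The function
name edge_widths, the co_used set of alphabetically ordered pairs, and
the max_width cap for identical vectors (where 1/distance is undefined)
are all assumptions for illustration.

```python
import math
from itertools import combinations


def edge_widths(vectors, cluster_of, co_used, max_width=10.0):
    """Return {(a, b): width} for same-cluster hashtag pairs that were used together.

    `co_used` is a set of (a, b) tuples with a < b (hypothetical input shape).
    """
    edges = {}
    for a, b in combinations(sorted(vectors), 2):
        if cluster_of[a] != cluster_of[b]:
            continue  # edges are only computed within a cluster
        if (a, b) not in co_used:
            continue  # never used together: no edge
        distance = math.dist(vectors[a], vectors[b])
        # Assumption: cap the width when the vectors coincide (distance 0).
        edges[(a, b)] = 1.0 / distance if distance > 0 else max_width
    return edges


# Example usage with made-up vectors.
example_edges = edge_widths(
    vectors={"a": [0, 0], "b": [3, 4], "c": [0, 1]},
    cluster_of={"a": 0, "b": 0, "c": 0},
    co_used={("a", "b")},
)
```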
Alternatively, we can map the distance in the cluster's n-dimensional
space to an integer in the range 0-255 (8 bits), so we can use it to
create RGB colours. A bigger distance then results in a more whitish
colour, which is harder to see, while 2 hashtags that have a lot in
common (small distance) get a black edge. (play with colours, it's fun)
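One way to sketch the colour mapping is a linear scale from black to
white. The function name distance_to_colour and the max_distance
normalisation parameter are assumptions; the notes do not say how the
distance should be scaled into 0-255.

```python
def distance_to_colour(distance, max_distance):
    """Map a distance to a grey "#rrggbb" colour: 0 is black, max_distance is white."""
    if max_distance <= 0:
        raise ValueError("max_distance must be positive")
    # Clamp into [0, max_distance], then scale linearly to 0..255.
    clamped = min(max(distance, 0), max_distance)
    level = int(255 * clamped / max_distance)
    return "#{0:02x}{0:02x}{0:02x}".format(level)


# Example usage with made-up distances.
example_near = distance_to_colour(0, 10)
example_far = distance_to_colour(10, 10)
```

Using the same level for all three channels gives greys only; separate
scales per channel would be the place to play with colours.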
Implementation-specific:
- Use the DB as a buffer for calculations, otherwise the program will
take loads of RAM.
- Add a belongsToCluster:integer attribute to each hashtag, as well as
coordinatesInClusterSpace:integer[], an array of integers calculated as
described above.
- Add an Edges table. Each entry should represent an edge between 2
hashtags and its width, so Edges(hta:varchar(255), htb:varchar(255),
width:integer).
- Add a ClusterCenters table: ClusterCenters(id:integer,
coordinates:integer[]).
- Execute k-means, updating the DB after each step.
- Calculate edges by computing the Euclidean distance between each pair
of hashtags within a cluster.
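
The tables listed above could be prototyped as in the sketch below,
which uses SQLite from the Python standard library purely for
illustration. Several things here are assumptions: the notes do not
name a database, SQLite has no array type so the integer[] columns are
stored as JSON text, width is declared REAL because 1/distance is
usually fractional, and the Hashtags table name and the helper
functions open_buffer and save_step are hypothetical.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE Hashtags (
    name TEXT PRIMARY KEY,
    belongsToCluster INTEGER,
    coordinatesInClusterSpace TEXT  -- JSON array (assumption: no array type in SQLite)
);
CREATE TABLE Edges (
    hta TEXT NOT NULL,
    htb TEXT NOT NULL,
    width REAL NOT NULL,            -- assumption: REAL instead of integer
    PRIMARY KEY (hta, htb)
);
CREATE TABLE ClusterCenters (
    id INTEGER PRIMARY KEY,
    coordinates TEXT                -- JSON array
);
"""


def open_buffer(path=":memory:"):
    """Open the database used as a calculation buffer and create the tables."""
    connection = sqlite3.connect(path)
    connection.executescript(SCHEMA)
    return connection


def save_step(connection, assignment, vectors, centers):
    """Persist the state after one k-means step in a single transaction."""
    with connection:
        connection.executemany(
            "INSERT OR REPLACE INTO Hashtags VALUES (?, ?, ?)",
            [(n, assignment[n], json.dumps(vectors[n])) for n in vectors],
        )
        connection.executemany(
            "INSERT OR REPLACE INTO ClusterCenters VALUES (?, ?)",
            [(i, json.dumps(c)) for i, c in enumerate(centers)],
        )


# Example usage with made-up data.
example_db = open_buffer()
save_step(example_db, {"a": 0, "b": 1}, {"a": [1, 0], "b": [0, 1]}, [[1, 0], [0, 1]])
```

Writing each step inside one transaction keeps the stored clusters and
centers consistent with each other if a run is interrupted.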