Cory Schillaci edited this page Apr 2, 2015 · 10 revisions

Where to find the Data

The LiveJournal data is on mercury at /var/local/destress/lj-annex/data/

The data is managed with git annex. The directory /var/local/destress/lj-annex is also a "regular" git repo, so you can git clone it using this remote: <yourname>@mercury...:/var/local/destress/lj-annex. To use git annex in your clone, run git annex init <name> inside it; the name is a unique identifier scoped only to that clone. Git annex will automatically grab data from any remote it has access to. To fetch a specific file, run git annex get <filename>. More info here.
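The clone, init, and get steps above can be scripted. This is only a sketch: the function names, the clone directory, and the annex name are hypothetical, and the remote must be supplied in full by the caller, since the hostname is abbreviated on this page. The builder only constructs the commands, and the runner assumes git and git annex are installed.

```python
import subprocess


def annex_steps(remote, clone_dir, annex_name, filename):
    """Build the (command, working directory) pairs for the steps above.

    Nothing is executed here. All arguments are supplied by the caller;
    no real hostnames or paths are assumed.
    """
    return [
        (["git", "clone", remote, clone_dir], None),
        (["git", "annex", "init", annex_name], clone_dir),
        (["git", "annex", "get", filename], clone_dir),
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first failure."""
    for command, cwd in steps:
        subprocess.run(command, cwd=cwd, check=True)
```

Calling run_steps(annex_steps(remote, "lj-annex", "my-laptop", "some/file")) would clone the repo and fetch a single file, where every argument is a placeholder to replace with your own values.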

Files concatenated by folder are available on Mercury at ~pierre/combined/events. A list of all the original user files is at ~pierre/combined/files.txt.

All the concatenated files were tokenized using xmltweet. The output is saved in /var/local/destress/tokenized.

What's in the Data

/events/ This folder contains 1281 folders. There is one .xml file per user in the folder corresponding to the first two characters of their userid. This file contains information about all of the events initiated by that user.
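As a sketch of the folder layout just described, the helper below builds the expected path to a user's file. It assumes the file is named <userid>.xml and that the folder name is exactly the first two characters of the userid. Neither detail is confirmed on this page, and the function name is made up for illustration.

```python
from pathlib import PurePosixPath


def user_xml_path(events_dir, userid):
    """Return the assumed location of a user's .xml file under /events/.

    Assumption: the file is called '<userid>.xml' and sits in the folder
    named after the first two characters of the userid.
    """
    prefix = userid[:2]
    return PurePosixPath(events_dir) / prefix / f"{userid}.xml"
```

For example, user_xml_path("events", "bob") points at events/bo/bob.xml under these assumptions.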

Many of the events (how many?) are tagged with a current mood. In the xml files this looks like <current_moodid><int>87</int></current_moodid>, where 87 indicates the mood 'awake.' There are 132 moods, and it's also possible to specify a custom mood. These moods are displayed at LiveJournal Mood Tree. The csv file https://github.com/berkeley-dsc/destress/blob/master/moodids.csv contains a list of all the moodids, the mood each corresponds to, its level in the tree, and its parent mood in the tree. Just for fun, the moodids range from 1 to 134 and do not include 50 or 94, which is why there are 132 of them.
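A minimal sketch of pulling mood ids out of an xml file with Python's standard library. The <post> wrapper in the sample and the function name are assumptions; only the <current_moodid><int>...</int></current_moodid> shape comes from the data description above.

```python
import xml.etree.ElementTree as ET

# Mood ids run from 1 to 134, skipping 50 and 94 (see above).
VALID_MOOD_IDS = set(range(1, 135)) - {50, 94}


def extract_mood_ids(xml_text):
    """Return every current_moodid found in the given XML text, in order."""
    root = ET.fromstring(xml_text)
    mood_ids = []
    for node in root.iter("current_moodid"):
        value = node.findtext("int")
        if value is not None:
            mood_ids.append(int(value.strip()))
    return mood_ids


# Hypothetical sample; the real files contain many more fields.
sample = "<post><current_moodid><int>87</int></current_moodid></post>"
```

Checking an extracted id against VALID_MOOD_IDS is one way to flag values that fall outside the standard mood list.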

/index/ This folder contains the indices used for full-text searching via Apache Lucene. Lucene uses an inverted index to rank documents by how well they match the requested query.

/meta/ This folder contains some meta-data about the dataset. Here are some guesses as to what they mean:

  • /meta/activity List of each username and the number of events(?) they initiated in the form bob|2792. One line per user. Note that the numbers shown do not always correspond to the number of xml <post>s.
  • /meta/birth User supplied birth dates, e.g. chibimateo|1983-06-11. May be partial dates, for example valdorgassel|01-16.
  • /meta/graph.db ??? Since the values are all between 1 and 3577166, it is presumably some sort of connectivity graph between users. Each line has the form 'userIdNumber: otherUserIdNumber1 otherUserIdNumber2 otherUserIdNumber3 ...', e.g. '4: 109 25 110 23 107 111 90'.
  • /meta/interests Several lines per user, one per interest: bob|alternative education
  • /meta/location One line per user specifying country only: xxlilqt4evxx|US
  • /meta/schools List of users who identified themselves with schools. Format is userid schoolid startyear endyear, e.g. alicdeni83 82881 1988 1994. There is one line per user per school. I don't know how to decode school ids.
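The line formats listed above are simple enough to parse with plain string methods. The helpers below are a sketch based only on the example lines shown. The function names are made up, and the schools parser assumes all four fields are always present, which has not been checked against the full files.

```python
def parse_pipe_line(line):
    """Split a 'user|value' line (activity, birth, interests, location).

    The value is returned as a string because its meaning differs per file
    (a count, a possibly partial date, an interest, a country code).
    """
    user, _, value = line.rstrip("\n").partition("|")
    return user, value


def parse_graph_line(line):
    """Parse 'userIdNumber: id1 id2 ...' into (user number, list of numbers)."""
    head, _, rest = line.partition(":")
    return int(head), [int(token) for token in rest.split()]


def parse_school_line(line):
    """Parse 'userid schoolid startyear endyear'.

    Assumption: all four whitespace-separated fields are present.
    """
    userid, schoolid, start_year, end_year = line.split()
    return userid, int(schoolid), int(start_year), int(end_year)
```

For instance, parse_graph_line("4: 109 25 110 23 107 111 90") gives user number 4 together with its seven listed user numbers.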

/names/name2id_0 Key for decoding the user numbers to user names, e.g. 1|bob or 194|liamtheruiner. User id number: 1 to 3577166. Do these user numbers show up anywhere except in /meta/graph.db?
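A small sketch for loading this key into a dictionary, so that user numbers (for example those in /meta/graph.db) can be turned back into user names. The function name is hypothetical, and the format is taken from the two example lines above.

```python
def load_name2id(lines):
    """Build a {user number: user name} dict from 'number|name' lines."""
    id_to_name = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        number, _, name = line.partition("|")
        id_to_name[int(number)] = name
    return id_to_name
```

Since load_name2id accepts any iterable of lines, an open file handle for name2id_0 can be passed to it directly.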

Summary statistics

1,261,821 users
