Deconstructing the Facebook data disaster

For HackPSU Spring 2018, we made a tool to help you understand the breadth of data Facebook collects and stores on its users. Check it out!

UPDATE: our project took 1st PLACE OVERALL!!

How it Works

In the wake of recent news about data leakage, Facebook released a tool that lets you download a dataset containing the personal profile information it stores about you. Our tool parses part of this dataset and runs analytics on it to help you make sense of the vast swaths of data.

To download the data:

From your news feed, open the drop-down menu in the top-right corner and click Settings.

Click the hyperlink that reads "Download a copy of my Facebook data".

Unzip the downloaded file.

To use our tool, make sure that the unzipped directory containing your Facebook data is in the same directory as the Python source file fb.py. Then run the following command in a command window or terminal:

$ python fb.py --dir <facebook_data_dir_name>
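For readers curious how a `--dir` flag like the one above is typically handled, here is a minimal sketch using the standard `argparse` module. It is not taken from fb.py; the function names `parse_args` and `find_data_dir` are hypothetical, and the only assumption carried over from the instructions is that the data directory sits next to the script.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; --dir names the unzipped Facebook data directory."""
    parser = argparse.ArgumentParser(
        description="Parse a Facebook data export into CSV files."
    )
    parser.add_argument(
        "--dir",
        required=True,
        help="name of the unzipped Facebook data directory",
    )
    return parser.parse_args(argv)


def find_data_dir(name, base=None):
    """Return the data directory under `base` (default: current directory).

    Raises FileNotFoundError when the directory is missing, which is the
    most likely mistake when the export was unzipped somewhere else.
    """
    base = Path(base) if base is not None else Path.cwd()
    candidate = base / name
    if not candidate.is_dir():
        raise FileNotFoundError(f"No data directory at {candidate}")
    return candidate
```

Failing early with a clear message is friendlier than letting a later file read fail halfway through parsing.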

When the script finishes running, several CSV files will be placed in your current directory. These files can be ingested into Splunk using the provided XML, which creates a dashboard with nicely organized metrics.

Our analytics

Facebook keeps a record of every IP address that has been associated with your account. We geolocated these and placed them on a map, giving a fairly accurate picture of where the user has lived and travelled.
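Before geolocating, it usually helps to deduplicate the addresses and drop anything that cannot be placed on a map, such as private or malformed entries. The sketch below shows one way to do that with the standard `ipaddress` module. The function name `routable_ips` is hypothetical, and this is an illustration of the preprocessing step, not the project's actual code; the geolocation lookup itself is left out.

```python
import ipaddress


def routable_ips(raw_values):
    """Return unique, globally routable IP addresses in first-seen order.

    Malformed strings and private or reserved addresses are skipped,
    because they carry no usable location information.
    """
    seen = set()
    result = []
    for raw in raw_values:
        try:
            ip = ipaddress.ip_address(raw.strip())
        except ValueError:
            continue  # not a valid IPv4 or IPv6 address
        if ip.is_global and ip not in seen:
            seen.add(ip)
            result.append(str(ip))
    return result


if __name__ == "__main__":
    sample = ["8.8.8.8", "192.168.1.1", "8.8.8.8", "bogus", " 1.1.1.1 "]
    print(routable_ips(sample))
```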

Using information from the user-agent string of the connecting web browser, we can determine with high confidence the type of device used to connect to Facebook. If compromised, this information could be leveraged to launch tailored attacks or ad campaigns based on the types of devices that a user has.
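As a rough illustration of how a device type can be inferred from a user-agent string, here is a small rule-based sketch. The rules and the function names `classify_device` and `device_summary` are assumptions made for this example; they are deliberately simplistic, and a dedicated user-agent parsing library would cover far more cases.

```python
from collections import Counter


def classify_device(user_agent):
    """Guess a coarse device type from substrings of the user-agent string."""
    ua = user_agent.lower()
    if "ipad" in ua:
        return "tablet"
    if "iphone" in ua:
        return "phone"
    if "android" in ua:
        # Android phones conventionally include "Mobile"; tablets usually do not.
        return "phone" if "mobile" in ua else "tablet"
    if "windows" in ua or "macintosh" in ua or "linux" in ua:
        return "desktop"
    return "unknown"


def device_summary(user_agents):
    """Count how many sessions came from each device type."""
    return Counter(classify_device(ua) for ua in user_agents)
```

The Android check runs before the desktop check on purpose: Android user agents also contain the word "Linux", so the order of the rules matters.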

Facebook keeps a log of every friend you have, every friend you have removed, every friend request you have blocked, and every outstanding friend request that has not been accepted or rejected. Aggregating this data can reveal major life events, such as an uptick in new friends when the user moves, starts college, or joins a new social community.

The spikes in the graph tend to correspond to major life events.
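One simple way to surface such spikes outside of a dashboard is to count new friends per month and flag months that sit well above the average. The sketch below does this with a z-score. The function names and the threshold of two standard deviations are assumptions chosen for illustration, not the method used in the Splunk dashboard.

```python
from collections import Counter
from datetime import date
from statistics import mean, pstdev


def monthly_counts(dates):
    """Count events per (year, month) from a list of date objects."""
    return Counter((d.year, d.month) for d in dates)


def spike_months(counts, threshold=2.0):
    """Return months whose count is more than `threshold` standard
    deviations above the mean monthly count, sorted chronologically."""
    values = list(counts.values())
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every month is identical, so nothing stands out
    return sorted(
        month for month, n in counts.items() if (n - mu) / sigma > threshold
    )


if __name__ == "__main__":
    # One new friend each month of 2016, plus a burst in August.
    added = [date(2016, m, 1) for m in range(1, 13)]
    added += [date(2016, 8, day) for day in range(2, 13)]
    print(spike_months(monthly_counts(added)))
```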

See this guide for help installing Splunk.

Challenges & Limitations

Facebook delivers the data as .htm files. As a result, parsing the data into an ingestible format is an involved task. Sometimes parsing logic is reusable; sometimes it isn't. A small change to a .htm file could require a substantial amount of code changes to parse the data.
Splunk has an ingest limit of 500 MB/day for free accounts, so we were limited in the amount of data that we could analyze in our 24-hour timeframe.
Since we only used our own Facebook data, and we are similar in many demographics (age, gender, vocation, etc.), we ended up analyzing a fairly homogeneous dataset.
Our Splunk instance is managed on one of our own laptops, so processing power for advanced data analytics is limited.
Due to the serious privacy concerns associated with this data, we limited the ability to drill down with certain features (location, friends' names).
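To give a sense of what the .htm parsing involves, here is a minimal sketch that pulls the text out of every `<li>` element using the standard `html.parser` module. It assumes, purely for illustration, that the entries of interest are stored as list items; the real export layout may differ, and the class and function names are hypothetical rather than taken from fb.py.

```python
from html.parser import HTMLParser


class ListItemExtractor(HTMLParser):
    """Collect the text content of each <li> element in a document."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # holds text pieces while inside an <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.items.append(text)
            self._buffer = None


def extract_list_items(html_text):
    """Return the stripped text of every non-empty <li> in `html_text`."""
    parser = ListItemExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.items
```

Because the extractor keys on tag names, a change in the markup (for example, entries moving from list items to table rows) would break it, which is exactly the fragility described above.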

Future Directions

We made the parsing code modular, so new users can quickly use our tool by simply downloading their data directory, running the script, and receiving the appropriate output CSV files.
More complex mathematical operations (clustering, classification, etc.) could provide further insights on time-series data.
Analyzing more sections of Facebook data would involve writing parsing logic for the appropriate .htm files and creating new Splunk rules to manage the ingest of the appropriate parsed .csv files.
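One way to keep that kind of extension cheap is a small registry that maps each export file to a parser function and an output CSV name, so supporting a new section means adding one function. The sketch below illustrates the idea. The file names `friends.htm` and `friends.csv`, the toy parser, and all function names are placeholders invented for this example, not the structure of fb.py.

```python
import csv
from pathlib import Path

# Maps an export file name to (output CSV name, parser function).
PARSERS = {}


def register(htm_name, csv_name):
    """Decorator that adds a parser function to the registry."""
    def decorator(func):
        PARSERS[htm_name] = (csv_name, func)
        return func
    return decorator


@register("friends.htm", "friends.csv")  # placeholder file names
def parse_friends(html_text):
    """Toy parser: one CSV row per non-empty line of input.

    A real parser would walk the HTML structure instead.
    """
    return [[line.strip()] for line in html_text.splitlines() if line.strip()]


def run_all(data_dir, out_dir):
    """Run every registered parser whose input file exists.

    Returns the list of CSV paths that were written.
    """
    written = []
    for htm_name, (csv_name, parser) in PARSERS.items():
        source = Path(data_dir) / htm_name
        if not source.is_file():
            continue  # this section is absent from the export
        rows = parser(source.read_text(encoding="utf-8"))
        target = Path(out_dir) / csv_name
        with target.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        written.append(target)
    return written
```

Each new CSV would still need a matching Splunk ingest rule, as noted above; the registry only covers the parsing side.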
