
facebook #69 (Open): wants to merge 13 commits into master
100 changes: 100 additions & 0 deletions .gitignore
@@ -0,0 +1,100 @@
*.csv

# Vim
[._]*.s[a-w][a-z]
[._]s[a-w][a-z]
# session
Session.vim
.netrwhist
*~
tags

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# IPython Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# dotenv
.env

# virtualenv
venv/
ENV/

# Spyder project settings
.spyderproject

# Rope project settings
.ropeproject
64 changes: 49 additions & 15 deletions README.md
@@ -1,5 +1,13 @@
# Facebook Page Post Scraper

This is a fork of Max Woolf's [facebook-page-post-scraper](https://github.com/minimaxir/facebook-page-post-scraper).

It only works on Python 3.

This version allows you to specify the page/group you wish to scrape and where you want CSV files to be stored through command-line arguments.

It also separates your App ID and App Secret from the code; these credentials are now stored in a separate file.

![](/examples/fb_scraper_data.png)

A tool for gathering *all* the posts and comments of a Facebook Page (or Open Facebook Group) and related metadata, including post message, post links, and counts of each reaction on the post. All this data is exported as a CSV, able to be imported into any data analysis program like Excel.
@@ -10,24 +18,52 @@ The purpose of the script is to gather Facebook data for semantic analysis, whic

## Usage

### Scrape Posts from Public Page
To scrape posts from a page:

`python3 run.py --page <page name> --cred <path to credential file> --posts-output <filepath>`

To scrape both posts and comments:

```
python3 run.py --page <page name> --cred <path to credential file> --posts-output <filepath> \
--scrape-comments --comments-output <filepath>
```

To scrape from a group, change `--page` to `--group`.

To skip downloading statuses and retrieve comments using an existing CSV file, use the `--use-existing-posts-csv` flag:

```
python3 run.py --page <page name> --cred <path to credential file> --posts-output <filepath> \
--scrape-comments --comments-output <filepath> --use-existing-posts-csv
```
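
For reference, here is a minimal sketch of how a command-line interface like the one above could be declared with Python's `argparse`. This is an illustration only; the actual `run.py` may name, default, and validate its arguments differently.

```
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Scrape posts (and optionally comments) from a Facebook page or group.")
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--page", help="name of the public page to scrape")
    target.add_argument("--group", help="numeric ID of the open group to scrape")
    parser.add_argument("--cred", required=True, help="path to the credential file")
    parser.add_argument("--posts-output", required=True, help="where to write the posts CSV")
    parser.add_argument("--scrape-comments", action="store_true",
                        help="also scrape the comments on each post")
    parser.add_argument("--comments-output", help="where to write the comments CSV")
    parser.add_argument("--use-existing-posts-csv", action="store_true",
                        help="skip downloading posts and read them from --posts-output instead")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)  # e.g. Namespace(page='nytimes', cred='credentials.txt', ...)
```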


(In the original project, the Page data scraper was a standalone Python 2.7 script, `get_fb_posts_fb_page.py`, with the App ID, App Secret, and Page ID filled in at the top of the file. This fork replaces that workflow with `run.py`, command-line arguments, and a credential file.)
### Credential file format

The `--cred` command-line argument specifies where your credential file is located.

**Do not share this file with anyone.**

It should look something like this:

```
app_id = "111111111111111"
app_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
```

You need the App ID and App Secret of a Facebook app you control (I strongly recommend creating an app just for this purpose), and the ID of the Facebook Page or Group you want to scrape.
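
To illustrate how these values might be consumed, here is a small sketch that parses a file in the format shown above and builds a Graph API app access token (`app_id|app_secret`). The fork's actual loading code may differ, and the `credentials.txt` filename is just an example.

```
import re
from pathlib import Path

def load_credentials(path):
    """Parse lines of the form: key = "value" into a dict (the format shown above)."""
    creds = {}
    pattern = re.compile(r'^\s*(\w+)\s*=\s*"([^"]*)"\s*$')
    for line in Path(path).read_text().splitlines():
        match = pattern.match(line)
        if match:
            creds[match.group(1)] = match.group(2)
    return creds

creds = load_credentials("credentials.txt")  # hypothetical filename
# A Graph API *app* access token is the app ID and app secret joined by a pipe.
access_token = creds["app_id"] + "|" + creds["app_secret"]
```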

Example CSVs for CNN, NYTimes, and BuzzFeed data are not included in this repository due to size, but you can download [CNN data here](https://dl.dropboxusercontent.com/u/2017402/cnn_facebook_statuses.csv.zip) [2.7MB ZIP], [NYTimes data here](https://dl.dropboxusercontent.com/u/2017402/nytimes_facebook_statuses.csv.zip) [4.9MB ZIP], and [BuzzFeed data here](https://dl.dropboxusercontent.com/u/2017402/buzzfeed_facebook_statuses.csv.zip) [2.1MB ZIP].

### Getting the numeric group ID

The `--group` argument takes the group's *numeric ID*, not its username. (In the original project, Open Group scraping was handled by `get_fb_posts_fb_group.py`, with the App ID and App Secret filled in the same way.) For groups without a custom username, the ID will be in the address bar; for groups with custom usernames, to get the ID, do a View Source on the Group Page, search for "entity_id", and use the number to the right of that field. For example, the `group_id` of [Hackathon Hackers](https://www.facebook.com/groups/hackathonhackers/) is 759985267390294.

![](/examples/entity.png)

You can download example data for [Hackathon Hackers here](https://dl.dropboxusercontent.com/u/2017402/759985267390294_facebook_statuses.csv.zip) [4.7MB ZIP]
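
If you would rather automate that manual lookup, the sketch below mirrors the same steps: fetch the group page's source and search it for `entity_id`. It assumes Facebook serves the group HTML to an unauthenticated request, which it may not always do, so treat it as a starting point rather than a guaranteed method.

```
import re
import urllib.request

def find_entity_id(group_url):
    """Mirror the manual procedure: fetch the group page and search its source
    for "entity_id". Returns None if Facebook serves a login wall instead."""
    request = urllib.request.Request(group_url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(request).read().decode("utf-8", errors="replace")
    match = re.search(r'"entity_id"\s*:\s*"?(\d+)', html)
    return match.group(1) if match else None

print(find_entity_id("https://www.facebook.com/groups/hackathonhackers/"))
```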

### Scrape Comments From Page/Group Posts

To scrape all the user comments from the posts, either pass `--scrape-comments` as shown in the Usage section, or create a posts CSV first and then run the original `get_fb_comments_from_fb.py` script, specifying the Page/Group as the `file_id`. The output includes the original `status_id` where the comment is located, so you can map the comment to the original post with a `JOIN` or `VLOOKUP`, and also a `parent_id` if the comment is a reply to another comment.

Keep in mind that large pages such as CNN have *millions* of comments, so be careful! Scraping throughput is approximately 87k comments/hour, so a million comments takes roughly 11-12 hours.
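
For example, assuming both CSVs carry the `status_id` and `parent_id` columns described above (adjust the filenames and headers to your actual output), the mapping can be done with a pandas merge:

```
import pandas as pd

# Hypothetical filenames; use whatever you passed to --posts-output / --comments-output.
posts = pd.read_csv("page_posts.csv")
comments = pd.read_csv("page_comments.csv")

# Attach each comment to the post it was left on (the CSV equivalent of a JOIN/VLOOKUP).
merged = comments.merge(posts, on="status_id", how="left", suffixes=("_comment", "_post"))

# Replies to other comments carry a parent_id; top-level comments leave it empty.
replies = merged[merged["parent_id"].notna()]
print(len(merged), "comments,", len(replies), "of which are replies")
```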

## Privacy
@@ -38,16 +74,14 @@ Note that this script, and any variant of this script, *cannot* be used to scrap

## Maintainer

* Koh Wei Jie

Original author: Max Woolf ([@minimaxir](http://minimaxir.com)). For more information on how the script was originally created, and some tips on how to create similar scrapers yourself, see his blog post [How to Scrape Data From Facebook Page Posts for Statistical Analysis](http://minimaxir.com/2015/07/facebook-scraper/).

## Credits

This is a fork of Max Woolf's code at https://github.com/minimaxir/facebook-page-post-scraper. Parts of this README were copied verbatim.

Peeter Tintis, whose [fork](https://github.com/Digitaalhumanitaaria/facebook-page-post-scraper/blob/master/get_fb_posts_fb_page.py) of the original repo implements code for finding separate reaction counts per [this Stack Overflow answer](http://stackoverflow.com/a/37239851).

## License

MIT. Be aware that this is a fork of Max Woolf's MIT-licensed code.

If you do find this script useful, a link back to this repository would be appreciated. Thanks!
1 change: 1 addition & 0 deletions credentials.txt
@@ -0,0 +1 @@
