- zh-google-styleguide - Chinese translation of the Google open-source project style guides.
- protobuf - Protocol Buffers - Google's data interchange format.
- gflags - Commandline flags module for C++.
- glog - Logging library for C++.
- gtest - Google C++ Testing Framework.
- googlemock - Google C++ Mocking Framework.
- leveldb - A fast and lightweight key/value database library by Google.
  - cpy-leveldb - Python bindings for LevelDB using the LevelDB C API.
- The Chromium Projects - The Chromium projects include Chromium and Chromium OS, the open-source projects behind the Google Chrome browser and Google Chrome OS, respectively.
- toft - C++ Base Library for Linux server side development.
- thirdparty - Third-party libraries for toft and foxy, by chen3feng.
- folly - Folly is an open-source C++ library developed and used at Facebook.
- darts-clone - A clone of the Darts (Double-ARray Trie System).
- Darts - Double-ARray Trie System. Chinese translation of the documentation
- sparsehash - An extremely memory-efficient hash_map implementation.
- cityhash - The CityHash family of hash functions.
- stringencoders - A collection of high performance c-string transformations, frequently 2x faster than standard implementations (if they exist at all).
- Numpy - NumPy is the fundamental package for scientific computing with Python.
- NLTK - NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in Natural Language Processing. NLTK Book
- jieba - Jieba Chinese word segmentation.
- gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
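- LTP - The Language Technology Platform (LTP) is a complete, open Chinese natural language processing system developed over ten years by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology.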
- Stanford CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities.
- openNLP - The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
- SRILM - SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.
- IRSTLM - The IRST Language Modeling Toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs.
- KenLM - KenLM estimates unpruned language models with modified Kneser-Ney smoothing.
- Moses - Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair.
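- GIZA++ - GIZA++ is a statistical machine translation toolkit that is used to train IBM Models 1-5 and an HMM word alignment model.
- genius - Genius is a Chinese word segmentation component based on conditional random fields (CRF).
- sego - Chinese word segmentation in Go.
- pinyin - A tool written in Go for converting Chinese characters to pinyin.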
- ReVerb - ReVerb is a program that automatically identifies and extracts binary relationships from English sentences. ReVerb is designed for Web-scale information extraction, where the target relations cannot be specified in advance and speed is important.
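- Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources - A collection of NLP and computational linguistics resources from the Stanford NLP Group, with links to and brief introductions of tools, code, corpora, dictionaries, and courses. http://t.cn/zOfVAzs
- webdict - The WEBDICT word list project aims to use machine learning algorithms and manual annotation to build a copyright-free Chinese lexicon containing a large number of web terms, in order to improve natural language analysis of Chinese web text and open-source Chinese input methods. http://webdict.info/
- sego - Chinese word segmentation in Go. The dictionary is implemented as a prefix tree, and the segmenter uses a word-frequency-based shortest-path algorithm with dynamic programming. It supports normal and search-engine segmentation modes, user dictionaries, and part-of-speech tagging, and can run as a JSON RPC service.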
- Lemur - The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software.
- Lucene - The Apache Lucene project develops open-source search software.
- Solr - Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.
- gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
- wukong - Wukong full-text search engine.
- Scrapy - a fast high-level screen scraping and web crawling framework for Python.
- distribute_crawler - A distributed web crawler built with scrapy, redis, mongodb, and graphite. A mongodb cluster provides the underlying storage, redis handles distribution, and graphite displays crawler status.
- LASSO - LASSO is a parallel machine learning system that learns a regression model from large data. It works in either of two modes: IPM-mode and MPI-mode.
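- libsvm - A Library for Support Vector Machines.
  - An Accessible Introduction to Support Vector Machines (Three Levels of Understanding SVM), by the researcher July. The article presents SVM in three levels. Level one, getting to know SVM: a rough idea of what it is will suffice. Level two, going deep into SVM: you work through its internal principles in detail so that you can apply it with ease later. Level three, proving SVM: once you understand all the principles, you will feel the urge to pick up a pen and try to prove it.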
- liblinear - A Library for Large Linear Classification.
- RankLib - RankLib is a library of learning to rank algorithms.
- svmlight - SVMlight is an implementation of Support Vector Machines (SVMs) in C.
- plda - A parallel C++ implementation of fast Gibbs sampling of Latent Dirichlet Allocation.
- GibbsLDA++ - A C/C++ implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling technique for parameter estimation and inference.
- Yahoo_LDA - Yahoo!'s topic modelling framework using Latent Dirichlet Allocation.
- word2vec - Tool for computing continuous distributed representations of words. Parallelizing word2vec in Python
- Maximum Entropy Modeling Toolkit for Python and C++ - This package provides a (Conditional) Maximum Entropy Modeling Toolkit for Python and C++.
- maxent - A simple C++ library for maximum entropy classification.
- easyME - This is a simple implementation of Maximum Entropy model. Algorithms implemented include: GIS, SCGIS, LBFGS, Gaussian smoothing and Exponential smoothing.
- libLBFGS - This library is a C port of the implementation of Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method written by Jorge Nocedal.
- OWL-QN - The Orthant-Wise Limited-memory Quasi-Newton algorithm (OWL-QN) is a numerical optimization procedure for finding the optimum of an objective of the form {smooth function} plus {L1-norm of the parameters}. It has been used for training log-linear models (such as logistic regression) with L1-regularization.
- CRF++ - CRF++ is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. CRF++ is designed for generic purpose and will be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking.
- CRFsuite - A fast implementation of Conditional Random Fields (CRFs).
- Wapiti - Wapiti is a very fast toolkit for segmenting and labeling sequences with discriminative models. It is based on maxent models, maximum entropy Markov models and linear-chain CRF and proposes various optimization and regularization methods to improve both the computational complexity and the prediction performance of standard models.
- sofia-ml - Suite of Fast Incremental Algorithms for Machine Learning. Includes methods for learning classification and ranking models, using Pegasos SVM, SGD-SVM, ROMMA, Passive-Aggressive Perceptron, Perceptron with Margins, and Logistic Regression.
- mahout - The Apache Mahout machine learning library's goal is to build scalable machine learning libraries.
- MLTK - MLTK -- the Machine Learning Toolkit -- is a suite of C++ open source modules of Machine Learning.
- FP-growth - An implementation of the FP-growth algorithm in pure Python.
- MLcomp - MLcomp is a free website for objectively comparing machine learning programs across various datasets for multiple problem domains.
- PyBrain - PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms. PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive "Backronym".
- parameter_server - A distributed machine learning framework.
- vowpal_wabbit - John Langford's original release of Vowpal Wabbit -- a fast online learning algorithm.
- Theano - Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
- Caffe - Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Caffe is released under the BSD 2-Clause license.
- protobuf - Protocol Buffers - Google's data interchange format.
- jsoncpp - JSON data format manipulation library.
- tinyxml2 - TinyXML-2 is a simple, small, efficient C++ XML parser that can be easily integrated into other programs.
- thrift - The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.
- MySQL++ - MySQL++ is a C++ wrapper for MySQL’s C API.
- MongoDB - MongoDB (from "humongous") is an open-source document database, and the leading NoSQL database. Written in C++.
- memcached - Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
- leveldb - A fast and lightweight key/value database library by Google.
- SSDB - A fast NoSQL database server with zset data type, an alternative to Redis. SSDB is a high-performance key-value (key-string, key-zset, key-hashmap) NoSQL persistent storage server, using Google LevelDB as its storage engine. SSDB is stable, production-ready and is widely used by many Internet companies such as QIHU 360.
- RocksDB - RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database but our current focus is on embedded workloads. RocksDB builds on LevelDB to be scalable to run on servers with many CPU cores, to efficiently use fast storage, to support IO-bound, in-memory and write-once workloads, and to be flexible to allow for innovation.
- fatcache - Memcache on SSD. Think of fatcache as a cache for your big data.
- THUIRDB - THUIRDB is a base library implemented in C++ for high-performance persistent key-value storage and fast queries on a single machine. THUIRDB Paper
- thrift - The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.
- server1 - a c++ network server/client framework.
- muduo-protorpc - Google Protobuf RPC based on Muduo.
- Flask - Flask is a microframework for Python based on Werkzeug and Jinja2. It's intended for getting started very quickly and was developed with best intentions in mind. Chinese docs
- Bootstrap - Sleek, intuitive, and powerful front-end framework for faster and easier web development.
- Django - Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design.
- Hadoop - The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
- ZooKeeper - ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
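- Storm - Distributed and fault-tolerant realtime computation.
  - Storm wiki - Provides a wealth of excellent documentation about Storm and its theoretical foundations, along with tutorials on obtaining Storm and setting up new projects. It also includes practical documentation on many aspects of Storm, including running it in local mode, in cluster mode, and on Amazon.
  - A thorough class tree for Storm is available on GitHub, describing Storm's classes and interfaces in detail.
  - Process real-time big data with Twitter Storm - An introduction to streaming big data. Summary: Storm is an open-source big data processing system that, unlike other systems, is designed for distributed real-time processing and is language independent. Learn about Twitter Storm, its architecture, and the landscape of batch and stream processing solutions.
  - Storm getting-started tutorial - From the official 量子恒道 blog.
  - storm-starter - Learn to use Storm!
  - StreamCpp - A small C++ wrapper for Storm. Some documentation can be found at http://demeter.inf.ed.ac.uk/cross/stormcpp.html
  - storm-kafka - storm-kafka provides a regular spout implementation and a TransactionalSpout implementation for Apache Kafka 0.7.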
- Spark - Lightning-Fast Cluster Computing.
- Puppet - Puppet is IT automation software that helps system administrators manage infrastructure throughout its lifecycle, from provisioning and configuration to orchestration and reporting. Using Puppet, you can easily automate repetitive tasks, quickly deploy critical applications, and proactively manage change, scaling from 10s of servers to 1000s, on-premise or in the cloud.
- Skynet - Skynet is a framework for distributed services in Go.
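- Kafka - A high-throughput distributed messaging system (distributed message queue).
  - Kafka paper: Building LinkedIn’s Real-time Activity Data Pipeline
  - Kafka Clients: librdkafka, kafka-python
  - Kafka papers and presentations
- METAQ - METAQ is a messaging middleware with a complete queue model, developed by Alibaba. The server is written in Java and can be deployed on a variety of software and hardware platforms. Clients are available for Java and C++. A single server can support more than 10,000 message queues, and adding servers lets the number of queues scale horizontally almost without limit. Every queue is persistent, unbounded in length (limited only by disk space), and can be consumed starting from any position.
- Celery --- Distributed Task Queue - This framework is close to the ultimate solution for asynchronous messaging architectures in Python.
- mapreduce-lite - A C++ implementation of MapReduce without distributed filesystem.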
- GraphChi - Disk-based large-scale graph computation. Big Data - small machine. GraphChi[huahua] is a spin-off of the GraphLab[rador's retriever] project. GraphChi can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in a similar vertex-centric model to GraphLab. GraphChi runs vertex-centric programs asynchronously (i.e., changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and changing the graph structure while computing. GraphChi ppt. GraphChi Paper. GraphChi Video. GraphChi's C++ version.
- Giraph - Large-scale graph processing on Hadoop.
- Celery --- Distributed Task Queue - Celery is a simple, flexible and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system. It’s a task queue with focus on real-time processing, while also supporting task scheduling. This framework is close to the ultimate solution for asynchronous messaging architectures in Python.
- re2 - an efficient, principled regular expression library.
- SCons - SCons is an Open Source software construction tool—that is, a next-generation build tool. Think of SCons as an improved, cross-platform substitute for the classic Make utility with integrated functionality similar to autoconf/automake and compiler caches such as ccache. In short, SCons is an easier, more reliable and faster way to build software.
- CMake - the cross-platform, open-source build system.
- blade - Blade is designed to be a modern build system. Mac OS X port of Typhoon Blade
- bobo - Bobo is an easy-to-use build tool inspired by blade.
- rietveld - Code Review, hosted on Google App Engine.
- Review Board - Take the pain out of code review.
- spf13-vim - spf13-vim is a distribution of vim plugins and resources for Vim, GVim and MacVim. It is a completely cross platform distribution that stays true to the feel of vim while providing modern features like a plugin management system, autocomplete, tags and tons more.
- Maximum Awesome - Config files for vim and tmux, lovingly tended by a small subculture of peace-loving hippies. Built for Mac OS X.
- VimClojure - A filetype, syntax and indent plugin for Clojure.
- glog - Leveled execution logs for Go.
- groupcache - groupcache is a caching and cache-filling library, intended as a replacement for memcached in many cases.
- go-slab - A slab allocator library in the Go Programming Language.
- Go语言资料收集 - A collection of Go language resources.
- pycrumbs - Bits and Bytes of Python from the Internet.
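- Docker - Docker is an open-source project to easily create lightweight, portable, self-sufficient containers from any application. The same container that a developer builds and tests on a laptop can run at scale, in production, on VMs, bare metal, OpenStack clusters, public clouds and more. As an open-source automated deployment engine, it can package any application into a simple, portable container that does not depend on other components, so the application can be deployed easily in various virtual environments for debugging. It keeps the application isolated while shortening the debug-and-deploy cycle, which makes the test-package-deploy workflow easier and more convenient. Docker is still under intensive development.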
- Valgrind - Valgrind is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. You can also use Valgrind to build new tools.