Kenny Bastani: A Docker Image for Graph Analytics on Neo4j with Apache Spark GraphX

Kenny Bastani: I've just released a useful new Docker image for graph analytics on a Neo4j graph database with Apache Spark GraphX. This image deploys...

Named Entity Extraction

A Survey of named entity recognition and classification
Evaluation of Named Entity Extraction Systems
NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Extraction Tools
NERD: Evaluating Named Entity Recognition Tools in the Web of Data
NERD: an open source platform for extracting and disambiguating named entities in very diverse documents
NERD Ontology
Unsupervised Named-Entity Extraction from the Web: An Experimental Study

Stanford Named Entity Recognizer (Conditional Random Field) Whitepaper
GATE (General Architecture for Text Engineering) ANNIE (A Nearly-New Information Extraction) System
Illinois Named Entity Tagger
Balie: Multilingual Information Extraction from Text with Machine Learning and Natural Language Techniques
Mallet: Machine Learning

Apache Nutch (Web Crawler); Bixo (Web Mining); Behemoth (Hadoop Document Analysis); Apache OpenNLP (Natural Language Processing); Apache Stanbol (Semantic Content Management); Apache Tika (Metadata and text extraction); Apache UIMA (Unstructured Information Management Architecture); Apache Mahout (Machine Learning); Apache Avro (Data Serialization); Apache Solr/Lucene; Apache Clerezza (OSGi RESTful Web framework, Triplestore DB); Apache Jena (Semantic Web: RDF, Triplestore DB, OWL); Fedora (Flexible Extensible Digital Object Repository Architecture); Apache Ambari

Maui (Topic Indexing); Weka (Data Mining); LingPipe; FreeLing; OpenCalais; DBpediaSpotlight

Alchemy API; Evri API; Web ARChive (WARC) format

HBase; Bigtable: A Distributed Storage System for Structured Data; Apache Phoenix

Docker and DevOps


Docker Basics (Tutorial)
Getting started with Docker
Docker User Guide
Dockerizing Applications
Docker Network Configuration
Working with Containers; Automatically Start Containers
Docker Run Reference
Launching Containers with Fleet; Fleet Configuration and API
Getting Started with Etcd; Etcd Configuration
Getting started with systemd
Working with Docker Images
Google Compute Engine: Container Images

Microservices in a Nutshell

The following is an excerpt from an article that originally appeared on Martin Fowler's website.

"Microservices" - yet another new term on the crowded streets of software architecture. Although our natural inclination is to pass such things by with a contemptuous glance, this bit of terminology describes a style of software systems that we are finding more and more appealing. We've seen many projects use this style in the last few years, and results so far have been positive, so much so that for many of our colleagues this is becoming the default style for building enterprise applications. Sadly, however, there's not much information that outlines what the microservice style is and how to do it.
In short, the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.
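To make the definition concrete, here is a minimal, hypothetical sketch of one such service in Python, using only the standard library; the service name, route, port, and payload are illustrative assumptions, not anything prescribed by the article.

```python
# A minimal sketch of one "microservice": a single business capability,
# running in its own process, exposed over a lightweight HTTP resource API.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class OrderServiceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # One resource, one capability: look up an order (hypothetical route).
        if self.path == "/orders/42":
            body = json.dumps({"id": 42, "status": "shipped"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Each service is deployed and scaled independently of the others.
    HTTPServer(("0.0.0.0", 8080), OrderServiceHandler).serve_forever()
```

In a full system, a separate inventory or billing service would run as its own process behind its own endpoint, and the services would compose over HTTP rather than through in-process calls.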

To start explaining the microservice style it's useful to compare it to the monolithic style: a monolithic application built as a single unit. Enterprise Applications are often built in three main parts: a client-side user interface (consisting of HTML pages and JavaScript running in a browser on the user's machine), a database (consisting of many tables inserted into a common, and usually relational, database management system), and a server-side application. The server-side application will handle HTTP requests, execute domain logic, retrieve and update data from the database, and select and populate HTML views to be sent to the browser. This server-side application is a monolith - a single logical executable. Any changes to the system involve building and deploying a new version of the server-side application.

Such a monolithic server is a natural way to approach building such a system. All your logic for handling a request runs in a single process, allowing you to use the basic features of your language to divide up the application into classes, functions, and namespaces. With some care, you can run and test the application on a developer's laptop, and use a deployment pipeline to ensure that changes are properly tested and deployed into production. You can horizontally scale the monolith by running many instances behind a load-balancer.

Monolithic applications can be successful, but increasingly people are feeling frustrations with them - especially as more applications are being deployed to the cloud. Change cycles are tied together - a change made to a small part of the application requires the entire monolith to be rebuilt and deployed. Over time it's often hard to keep a good modular structure, making it harder to keep changes that ought to only affect one module within that module. Scaling requires scaling of the entire application rather than parts of it that require greater resource.




These frustrations have led to the microservice architectural style: building applications as suites of services. As well as the fact that services are independently deployable and scalable, each service also provides a firm module boundary, even allowing for different services to be written in different programming languages. They can also be managed by different teams.

We do not claim that the microservice style is novel or innovative; its roots go back at least to the design principles of Unix. But we do think that not enough people consider a microservice architecture and that many software developments would be better off if they used it.

For more information: 
James and Martin’s article goes on to define what a microservice architecture is by laying out 9 common characteristics, discussing its relationship with Service-Oriented Architecture, and considering whether this style is the future of enterprise software. Read it here: martinfowler.com/articles/microservices.html.

James Lewis is a Principal Consultant at ThoughtWorks and member of the Technology Advisory Board. James' interest in building applications out of small collaborating services stems from a background in integrating enterprise systems at scale. He's built a number of systems using microservices and has been an active participant in the growing community for a couple of years.

Martin Fowler is an author, speaker, and general loud-mouth on software development. He's long been puzzled by the problem of how to componentize software systems, having heard more vague claims than he's happy with. He hopes that microservices will live up to the early promise its advocates have found.

Pattern: Microservices Architecture
The Scale Cube
SRP: The Single Responsibility Principle (.pdf)
Decomposing Applications for deployability and scalability



Building microservices with Spring Boot: Part 1, Part 2, Part 3 (Deploying Spring Boot-based microservices with Docker)

A Quick Introduction to CoreOS
An Introduction to CoreOS System Components
CoreOS, Kubernetes, Fleet, Etcd
CoreOS Continued: etcd
Running Kubernetes on CoreOS Part 1, Part 2
CoreOS Continued: Fleet and Docker
Launching Containers with fleet
Deploying a NodeJS Application using Docker
Deploying Docker Containers on CoreOS using Fleet
Running CoreOS on Vagrant
Running CoreOS on Google Compute Engine
Running CoreOS on EC2

Toolkit: Spray; Akka; Scala; Clojure; Spring; Dropwizard (Jetty (Web Server), Jersey (RESTful), Jackson (JSON), JDBI (SQL), Logback, Yammer metrics, Guava (Core libraries), Hibernate Validator); NodeJS; Play; Python; GitHub

Apache Hue
Elasticsearch

Outlier Analysis

Outliers



Many machine learning algorithms are sensitive to the range and distribution of attribute values in the input data. Outliers in input data can skew and mislead the training process of machine learning algorithms resulting in longer training times, less accurate models and ultimately poorer results.

Even before predictive models are prepared on training data, outliers can result in misleading representations and in turn misleading interpretations of collected data. Outliers can skew the summary distribution of attribute values in descriptive statistics like mean and standard deviation and in plots such as histograms and scatterplots, compressing the body of the data.
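A tiny, self-contained illustration of that effect (plain Python; the numbers are made up for the example):

```python
# One extreme value drags the mean and inflates the standard deviation,
# while the median barely moves.
import statistics

clean = [10, 11, 9, 10, 12, 11, 10]
with_outlier = clean + [100]  # a single extreme value

print(statistics.mean(clean), statistics.mean(with_outlier))      # ~10.4 vs ~21.6
print(statistics.median(clean), statistics.median(with_outlier))  # 10 vs 10.5
print(statistics.stdev(clean), statistics.stdev(with_outlier))    # ~1.0 vs ~31.7
```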

Finally, outliers can represent examples of data instances that are relevant to the problem such as anomalies in the case of fraud detection and computer security.

Outlier Modeling



Outliers are extreme values that fall a long way outside of the other observations. For example, in a normal distribution, outliers may be values on the tails of the distribution.

The process of identifying outliers has many names in data mining and machine learning, such as outlier mining, outlier modeling, novelty detection, and anomaly detection.

In his book Outlier Analysis, Aggarwal provides a useful taxonomy of outlier detection methods, as follows:

Extreme Value Analysis: Determine the statistical tails of the underlying distribution of the data. For example, statistical methods like z-scores on univariate data.

Probabilistic and Statistical Models: Determine unlikely instances from a probabilistic model of the data. For example, Gaussian mixture models optimized using expectation-maximization (a code sketch follows this list).

Linear Models: Projection methods that model the data into lower dimensions using linear correlations, for example principal component analysis; data with large residual errors may be outliers.

Proximity-based Models: Data instances that are isolated from the mass of the data as determined by cluster, density or nearest neighbor analysis.

Information Theoretic Models: Outliers are detected as data instances that increase the complexity (minimum code length) of the dataset.

High-Dimensional Outlier Detection: Methods that search subspaces for outliers, given the breakdown of distance-based measures in higher dimensions (the curse of dimensionality).
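As one concrete instance of the probabilistic category, here is a hedged sketch that scores points under a Gaussian mixture fit by expectation-maximization. It assumes NumPy and scikit-learn are available; the synthetic data and the 2% density cutoff are arbitrary choices for the example.

```python
# Sketch: flag low-density points under a Gaussian mixture model
# fit by expectation-maximization (scikit-learn's GaussianMixture).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),      # cluster 1
               rng.normal(6, 1, (100, 2)),      # cluster 2
               rng.uniform(-10, 16, (5, 2))])   # a few scattered points

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_density = gmm.score_samples(X)  # per-sample log-likelihood

# Treat the lowest 2% of densities as outlier candidates (arbitrary cutoff).
threshold = np.quantile(log_density, 0.02)
print(X[log_density < threshold])
```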

Aggarwal comments that the interpretability of an outlier model is critically important: context or rationale is required around why a specific data instance is or is not an outlier.

In his contributing chapter to the Data Mining and Knowledge Discovery Handbook, Irad Ben-Gal proposes a taxonomy of outlier models as univariate or multivariate, and parametric or nonparametric. This is a useful way to structure methods based on what is known about the data. For example:

Are you concerned with outliers in one attribute or more than one attribute (univariate or multivariate methods)?
Can you assume a statistical distribution from which the observations were sampled or not (parametric or nonparametric)?

Get Started



There are many methods, and much research has been put into outlier detection. Start by making some assumptions and design experiments where you can clearly observe the effects of those assumptions against some performance or accuracy measure.

I recommend working through a stepped process, from extreme value analysis through proximity methods to projection methods.

Extreme Value Analysis



You do not need to know advanced statistical methods to look for, analyze and filter out outliers from your data. Start out simple with extreme value analysis.

Focus on univariate methods
Visualize the data using scatterplots, histograms and box and whisker plots and look for extreme values
Assume a distribution (e.g. Gaussian) and look for values more than 2 or 3 standard deviations from the mean, or more than 1.5 times the interquartile range below the first or above the third quartile (see the sketch after this list)
Filter out candidate outliers from the training dataset and assess your model's performance
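A minimal sketch of the two rules above, assuming NumPy; the 2-standard-deviation and 1.5×IQR multipliers are the conventional defaults mentioned in the list, and the data is made up.

```python
# Sketch: flag univariate extreme values by z-score and by the
# 1.5 * IQR box-plot whisker rule.
import numpy as np

x = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0, 9.7])

# Rule 1: more than 2 (or 3) standard deviations from the mean.
z = (x - x.mean()) / x.std()
z_outliers = x[np.abs(z) > 2]

# Rule 2: more than 1.5 * IQR below Q1 or above Q3.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)  # both flag the 25.0
```

Note that a gross outlier inflates the very standard deviation it is judged against, which is one reason the stricter 3-sigma rule can miss values the IQR rule catches.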

Proximity Methods



Once you have explored the simpler extreme value methods, consider moving on to proximity-based methods.

Use clustering methods to identify the natural clusters in the data (such as the k-means algorithm)
Identify and mark the cluster centroids
Identify data instances that are a fixed distance or percentage distance from cluster centroids
Filter out candidate outliers from the training dataset and assess your model's performance (a code sketch follows this list)
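A hedged sketch of those steps, assuming scikit-learn; the synthetic data and the 95th-percentile distance cutoff are illustrative assumptions, not standards.

```python
# Sketch: cluster with k-means, then flag points unusually far
# from their assigned centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(6, 1, (100, 2)),
               [[12.0, -6.0]]])  # one isolated point

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
nearest_centroid = km.cluster_centers_[km.labels_]  # centroid per point
dist = np.linalg.norm(X - nearest_centroid, axis=1)

# Flag the farthest 5% of points as outlier candidates (arbitrary cutoff).
cutoff = np.quantile(dist, 0.95)
print(X[dist > cutoff])
```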

Projection Methods



Projection methods are relatively simple to apply and quickly highlight extraneous values.

Use projection methods to summarize your data to two dimensions (such as PCA, SOM or Sammon’s mapping)
Visualize the mapping and identify outliers by hand
Use proximity measures from projected values or codebook vectors to identify outliers
Filter out candidate outliers from the training dataset and assess your model's performance (a code sketch follows this list)
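A sketch of the projection route, assuming scikit-learn; PCA stands in here because SOM and Sammon's mapping are not part of scikit-learn, and the top-3 cutoff is arbitrary.

```python
# Sketch: project to 2D with PCA, then reuse a simple proximity
# measure on the projected values to surface outlier candidates.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 5)),
               rng.normal(7, 1, (3, 5))])  # a few extreme rows

X2 = PCA(n_components=2).fit_transform(X)

# Distance from the projected center as a crude outlier score.
dist = np.linalg.norm(X2 - X2.mean(axis=0), axis=1)
print(np.argsort(dist)[-3:])  # indices of the 3 most distant points
```

In practice you would also plot X2 and identify outliers by hand, per the first two steps in the list.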

Methods Robust to Outliers



An alternative strategy is to move to models that are robust to outliers. There are robust forms of regression that minimize the median of the squared errors rather than the mean (so-called robust regression), but they are more computationally intensive. There are also methods, such as decision trees, that are robust to outliers.
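A hedged comparison sketch, assuming scikit-learn. Note that scikit-learn does not ship a least-median-of-squares estimator, so RANSAC (which it does provide) stands in below as a robust alternative to ordinary least squares.

```python
# Sketch: ordinary least squares vs. a robust regressor on data
# with a few grossly corrupted targets.
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 1))
y = 3.0 * X.ravel() + rng.normal(0, 0.5, 100)  # true slope = 3.0
y[:5] += 40  # corrupt a few targets

ols = LinearRegression().fit(X, y)
ransac = RANSACRegressor(random_state=0).fit(X, y)

print("OLS fit:   ", ols.coef_[0], ols.intercept_)  # pulled by the outliers
print("RANSAC fit:", ransac.estimator_.coef_[0], ransac.estimator_.intercept_)
```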

You could spot check some methods that are robust to outliers. If there are significant model accuracy benefits then there may be an opportunity to model and filter out outliers from your training data.

Resources



There are a lot of webpages that discuss outlier detection, but I recommend reading through a good book on the subject, something more authoritative. Even looking through introductory books on machine learning and data mining won’t be that useful to you. For a classical treatment of outliers by statisticians, check out:

Robust Regression and Outlier Detection by Rousseeuw and Leroy, published in 2003
Outliers in Statistical Data by Barnett and Lewis, published in 1994
Identification of Outliers, a monograph by Hawkins, published in 1980

For a modern treatment of outliers by the data mining community, see:

Outlier Analysis by Aggarwal, published in 2013

Chapter 7 by Irad Ben-Gal in Data Mining and Knowledge Discovery Handbook edited by Maimon and Rokach, published in 2010

Additional Content:

Tools:

ISODEPTH Algorithm
FDC (Fast Computation of 2-Dimensional Depth Contours)

The Intelligence Community LLC


Published on Sep 23, 2014
Since 2008 we have been assembling an online community of national security professionals. We've crowdsourced open source analysis, held career networking events, hosted a summit for leaders from across the world, and much more. Along the way we've seen some important trends and needs among job seekers, small businesses, and the government, so we're creating a new web based tool for you that we believe will be a game changer for our industry. www.TheIntelligenceCommunity.com will become an online marketplace for freelance jobs in national security. Job seekers, employers and government leaders will be able to connect in this new platform for a more adaptable bench of worldwide talent.

For press inquiries, email press@theintellcomm.com or call 1-800-619-7650 
http://thndr.it/1ruwCB1

Raytheon RIOT software

9 Truths about Big Data: From Patterns to Sense by Seth Grimes

http://breakthroughanalysis.com/2013/08/27/9-truths-about-big-data/

Related: Text Analytics 2014

Shodan: The scariest search engine on the Internet By David Goldman

"When people don't see stuff on Google, they think no one can find it. That's not true."

That's according to John Matherly, creator of Shodan, the scariest search engine on the Internet.
Unlike Google, which crawls the Web looking for websites, Shodan navigates the Internet's back channels. It's a kind of "dark" Google, looking for the servers, webcams, printers, routers and all the other stuff that is connected to and makes up the Internet. 
Shodan runs 24/7 and collects information on about 500 million connected devices and services each month.
It's stunning what can be found with a simple search on Shodan. Countless traffic lights, security cameras, home automation devices and heating systems are connected to the Internet and easy to spot.
Shodan searchers have found control systems for a water park, a gas station, a hotel wine cooler and a crematorium. Cybersecurity researchers have even located command and control systems for nuclear power plants and a particle-accelerating cyclotron by using Shodan.
What's really noteworthy about Shodan's ability to find all of this -- and what makes Shodan so scary -- is that very few of those devices have any kind of security built into them.
"It's a massive security failure," said HD Moore, chief security officer of Rapid 7, who operates a private version of a Shodan-like database for his own research purposes.
A quick search for "default password" reveals countless printers, servers and system control devices that use "admin" as their user name and "1234" as their password. Many more connected systems require no credentials at all -- all you need is a Web browser to connect to them.
In a talk given at last year's Defcon cybersecurity conference, independent security penetration tester Dan Tentler demonstrated how he used Shodan to find control systems for evaporative coolers, pressurized water heaters, and garage doors.
He found a car wash that could be turned on and off and a hockey rink in Denmark that could be defrosted with a click of a button. A city's entire traffic control system was connected to the Internet and could be put into "test mode" with a single command entry. And he also found a control system for a hydroelectric plant in France with two turbines generating 3 megawatts each.
Scary stuff, if it got into the wrong hands.
"You could really do some serious damage with this," Tentler said, in an understatement.
So why are all these devices connected with few safeguards? Some things that are designed to be connected to the Internet, such as door locks that can be controlled with your iPhone, are generally believed to be hard to find. Security is an afterthought.
A bigger issue is that many of these devices shouldn't even be online at all. Companies will often buy systems that can enable them to control, say, a heating system with a computer. How do they connect the computer to the heating system? Rather than connect them directly, many IT departments just plug them both into a Web server, inadvertently sharing them with the rest of the world.
"Of course there's no security on these things," said Matherly, "They don't belong on the Internet in the first place."
The good news is that Shodan is almost exclusively used for good.
Matherly, who completed Shodan more than three years ago as a pet project, has limited searches to just 10 results without an account, and 50 with an account. If you want to see everything Shodan has to offer, Matherly requires more information about what you're hoping to achieve -- and a payment.
Penetration testers, security professionals, academic researchers and law enforcement agencies are the primary users of Shodan. Bad actors may use it as a starting point, Matherly admits. But he added that cybercriminals typically have access to botnets -- large collections of infected computers -- that are able to achieve the same task without detection.
To date, most cyberattacks have focused on stealing money and intellectual property. Bad guys haven't yet tried to do harm by blowing up a building or killing the traffic lights in a city.
Security professionals are hoping to avoid that scenario by spotting these unsecured, connected devices and services using Shodan, and alerting those operating them that they're vulnerable. In the meantime, there are too many terrifying things connected to the Internet with no security to speak of just waiting to be attacked. 
Source: http://money.cnn.com/2013/04/08/technology/security/shodan/

Related: Cosm, ioBridge, ThingWorx

Pneuron and Hadoop: Pointing to the elephant in the room. : Pneuron Blog

Related: Data Location, Location, Location: Hadoop/YARN