As the amount of cyber data continues to grow, cyber network defenders are faced with increasing amounts of data they must analyze to ensure the security of their networks. In addition, new types of attacks are constantly being created and executed globally. Current rules-based approaches are effective at characterizing and flagging known attacks, but they typically fail when presented with a new attack or new types of data. By comparison, unsupervised machine learning offers distinct advantages by not requiring labeled data to learn from large amounts of network traffic. In this paper, we present a natural language-based technique (suffix trees) as applied to cyber anomaly detection. We illustrate one methodology to generate a language using cyber data features, and our experimental results illustrate positive preliminary results in applying this technique to flow-type data. As an underlying assumption to this work, we make the claim that malicious cyber actors leave observables in the data as they execute their attacks. This work seeks to identify those artifacts and exploit them to identify a wide range of cyber attacks without the need for labeled ground-truth data.




An implementation of multiple maps t-distributed stochastic neighbor embedding (t-SNE) in R. Multiple maps t-SNE is a method for projecting high-dimensional data into several low-dimensional maps such that metric space properties are better preserved than they would be by a single map. Multiple maps t-SNE with only one map is equivalent to standard t-SNE. When projecting onto more than one map, multiple maps t-SNE estimates a set of latent weights that allow each point to contribute to one or more maps depending on similarity relationships in the original data. This implementation is a port of the original Matlab library by Laurens van der Maaten.




Event data provide high-resolution and high-volume information about political events and have supported a variety of research efforts across fields within and beyond political science. While these datasets are machine coded from vast amounts of raw text input, the necessary dictionaries require substantial prior knowledge and human effort to produce and update, effectively limiting the application of automated event-coding solutions to those domains for which dictionaries already exist. I introduce a novel method for generating dictionaries appropriate for event coding given only a small sample dictionary. This technique leverages recent advances in natural language processing and machine learning to reduce the prior knowledge and researcher-hours required to go from defining a new domain-of-interest to producing structured event data that describe that domain. I evaluate the method via actor-country classification and demonstrate the method’s ability to generalize to new domains with the production of a novel event dataset on cybersecurity.




We evaluate methods for applying unsupervised anomaly detection to cybersecurity applications on computer network traffic data, or flow. We borrow from the natural language processing literature and conceptualize flow as a sort of “language” spoken between machines. Five sequence aggregation rules are evaluated for their efficacy in flagging multiple attack types in a labeled flow dataset, CICIDS2017. For sequence modeling, we rely on long short-term memory (LSTM) recurrent neural networks (RNN). Additionally, a simple frequency-based model is described and its performance with respect to attack detection is compared to the LSTM models. We conclude that the frequency-based model tends to perform as well as or better than the LSTM models for the tasks at hand, with a few notable exceptions.




We show that a recurrent neural network is able to learn a model to represent sequences of communications between computers on a network and can be used to identify outlier network traffic. Defending computer networks is a challenging problem and is typically addressed by manually identifying known malicious actor behavior and then specifying rules to recognize such behavior in network communications. However, these rule-based approaches often generalize poorly and identify only those patterns that are already known to researchers. An alternative approach that does not rely on known malicious behavior patterns can potentially also detect previously unseen patterns. We tokenize and compress netflow into sequences of “words” that form “sentences” representative of a conversation between computers. These sentences are then used to generate a model that learns the semantic and syntactic grammar of the newly generated language. We use Long-Short-Term Memory (LSTM) cell Recurrent Neural Networks (RNN) to capture the complex relationships and nuances of this language. The language model is then used predict the communications between two IPs and the prediction error is used as a measurement of how typical or atyptical the observed communication are. By learning a model that is specific to each network, yet generalized to typical computer-to-computer traffic within and outside the network, a language model is able to identify sequences of network activity that are outliers with respect to the model. We demonstrate positive unsupervised attack identification performance (AUC 0.84) on the ISCX IDS dataset which contains seven days of network activity with normal traffic and four distinct attack patterns.




There has been much disagreement about the relationship between civil wars and state economic performance. While civil war is often associated with poor economic performance, some states have managed robust growth despite periods of domestic armed conflict. We find this disagreement results from not accounting for the spatial distribution of conflict within a country. A robust literature in economics stresses the role major cities play in economic growth. We hypothesize that the economic impact of civil conflict is contingent on the conflict’s location relative to major urban centers within a state. We use subnational data on the location of conflict relative to urban areas to test the impact of domestic conflict on annual gross domestic product growth. In doing so, we bridge the economic development literature on the importance of cities with extant literature on the effect of armed conflict to provide a novel explanation for the paradox of high macroeconomic growth in conflict-ridden countries.




TaPvex is a trained word2vec model of part-of-speech-tagged and named-entity-tagged words and phrases. The model was trained on a large corpus of English language news text from the early 2010s. Words have been tagged using Stanford CoreNLP to include named entities (NER) and Penn Treebank parts-of-speech (POS). Tagged words have been concatenated into n-gram phrases.




If you like the theme featured on this blog, you can use it too. I’ve published the theme at https://github.com/benradford/Slate-and-Simple-Jekyll-Theme.




In a shocking turn of events, Donald Trump won the 2016 U.S. presidential election. After the shock (and distress) subsided, some (myself included) were left wondering if this twist ending was the work of hackers manipulating electronic voting machines. It is certainly possible to hack e-voting machines. And there is evidence that a state actor did engage in an effort to manipulate the U.S. election. Putting the two together, some experts have advised the Clinton campaign that votes cast on e-voting machines should be audited for the possibility that these machines were compromised by foreign agents. These experts claim to have found evidence that Trump performed disproportionately well in certain counties where electronic voting machines were used. Nate Silver of 538 disagrees. In this article, I provide the data and code necessary to test this hypothesis and to potentially uncover evidence of manipulated, hacked, or otherwise unfair e-voting machines. While I find no evidence that e-voting machines were biased in favor of either candidate, I encourage others to build on this analysis and verify (or contradict) these results.




Politifact is a website the rates the truthfulness of public statements. Generally they focus on statements by political leaders but they’ll also tackle celebrities, chain emails, and bloggers from time to time. Statements are rated on a five point scale: “Pants on Fire!,” “False,” “Mostly False,” “Half-True,” “Mostly True,” and “True.” The script below will scrape Politifact and save every statement in a gzipped CSV file. The data is structured as who, when, validity, subjects, statement.