An implementation of multiple maps t-distributed stochastic neighbor embedding (t-SNE) in R. Multiple maps t-SNE is a method for projecting high-dimensional data into several low-dimensional maps such that metric space properties are better preserved than they would be by a single map. Multiple maps t-SNE with only one map is equivalent to standard t-SNE. When projecting onto more than one map, multiple maps t-SNE estimates a set of latent weights that allow each point to contribute to one or more maps depending on similarity relationships in the original data. This implementation is a port of the original Matlab library by Laurens van der Maaten.
Event data provide high-resolution and high-volume information about political events and have supported a variety of research efforts across fields within and beyond political science. While these datasets are machine coded from vast amounts of raw text input, the necessary dictionaries require substantial prior knowledge and human effort to produce and update, effectively limiting the application of automated event-coding solutions to those domains for which dictionaries already exist. I introduce a novel method for generating dictionaries appropriate for event coding given only a small sample dictionary. This technique leverages recent advances in natural language processing and machine learning to reduce the prior knowledge and researcher-hours required to go from defining a new domain-of-interest to producing structured event data that describe that domain. I evaluate the method via actor-country classification and demonstrate the method’s ability to generalize to new domains with the production of a novel event dataset on cybersecurity.
We evaluate methods for applying unsupervised anomaly detection to cybersecurity applications on computer network traffic data, or flow. We borrow from the natural language processing literature and conceptualize flow as a sort of “language” spoken between machines. Five sequence aggregation rules are evaluated for their efficacy in flagging multiple attack types in a labeled flow dataset, CICIDS2017. For sequence modeling, we rely on long short-term memory (LSTM) recurrent neural networks (RNN). Additionally, a simple frequency-based model is described and its performance with respect to attack detection is compared to the LSTM models. We conclude that the frequency-based model tends to perform as well as or better than the LSTM models for the tasks at hand, with a few notable exceptions.
We show that a recurrent neural network is able to learn a model to represent sequences of communications between computers on a network and can be used to identify outlier network traffic. Defending computer networks is a challenging problem and is typically addressed by manually identifying known malicious actor behavior and then specifying rules to recognize such behavior in network communications. However, these rule-based approaches often generalize poorly and identify only those patterns that are already known to researchers. An alternative approach that does not rely on known malicious behavior patterns can potentially also detect previously unseen patterns. We tokenize and compress netflow into sequences of “words” that form “sentences” representative of a conversation between computers. These sentences are then used to generate a model that learns the semantic and syntactic grammar of the newly generated language. We use Long-Short-Term Memory (LSTM) cell Recurrent Neural Networks (RNN) to capture the complex relationships and nuances of this language. The language model is then used predict the communications between two IPs and the prediction error is used as a measurement of how typical or atyptical the observed communication are. By learning a model that is specific to each network, yet generalized to typical computer-to-computer traffic within and outside the network, a language model is able to identify sequences of network activity that are outliers with respect to the model. We demonstrate positive unsupervised attack identification performance (AUC 0.84) on the ISCX IDS dataset which contains seven days of network activity with normal traffic and four distinct attack patterns.
There has been much disagreement about the relationship between civil wars and state economic performance. While civil war is often associated with poor economic performance, some states have managed robust growth despite periods of domestic armed conflict. We find this disagreement results from not accounting for the spatial distribution of conflict within a country. A robust literature in economics stresses the role major cities play in economic growth. We hypothesize that the economic impact of civil conflict is contingent on the conflict’s location relative to major urban centers within a state. We use subnational data on the location of conflict relative to urban areas to test the impact of domestic conflict on annual gross domestic product growth. In doing so, we bridge the economic development literature on the importance of cities with extant literature on the effect of armed conflict to provide a novel explanation for the paradox of high macroeconomic growth in conflict-ridden countries.