
Deep Learning for Real-Time Crime Forecasting and Its Ternarization

Bao WANG1 Penghang YIN1 Andrea Louise BERTOZZI1 P. Jeffrey BRANTINGHAM2 Stanley Joel OSHER1 Jack XIN3

(Dedicated to Professor Andrew J. Majda on the occasion of his 70th birthday)

Abstract Real-time crime forecasting is important. However, accurate prediction of when and where the next crime will happen is difficult. No known physical model provides a reasonable approximation to such a complex system. Historical crime data are sparse in both space and time and the signal of interest is weak. In this work, the authors first present a proper representation of crime data. The authors then adapt the spatial temporal residual network to the well-represented data to predict the distribution of crime in Los Angeles at the scale of hours in neighborhood-sized parcels. These experiments, as well as comparisons with several existing approaches to prediction, demonstrate the superiority of the proposed model in terms of accuracy. Finally, the authors present a ternarization technique to address the resource consumption issue for its deployment in the real world. This work is an extension of our short conference proceeding paper [Wang, B., Zhang, D., Zhang, D. H., et al., Deep learning for real time crime forecasting, 2017, arXiv:1707.03340].

Keywords Crime representation, Spatial-temporal deep learning, Real-time forecasting, Ternarization

1 Introduction

Forecasting crime at hourly or even finer temporal scales in micro-geographic regions is an important scientific and practical problem. Anticipating where and when crime is most likely to occur creates novel opportunities to prevent crime. However, accurate crime forecasting at fine spatial temporal scales is very challenging. The occurrence of crime depends on complex factors, many of which cannot be described quantitatively. Statistically, crime is extremely stochastic and sparse in both space and time (see [21]). Recent efforts have been devoted to the mathematical and statistical modeling of crime. Short et al. introduced a novel partial differential equations (PDE for short) model to simulate crime hotspots and analyzed the regimes for different dynamical patterns (see [25-27]). The PDE model provides a macroscale description which can be regarded as a continuum limit of the microscopic random walk. Considering crime as self-exciting, Mohler et al. adapted the epidemic type aftershock sequence (ETAS for short) model to crime modeling (see [20, 24]). The ETAS model provides a microscopic representation of the crime events with predictive power. Such point process ideas have been extended to study other crime problems such as the reconstruction of missing crime data (see [28]). Another class of crime predictors uses autoregressive integrated moving average (ARIMA for short) or other simple statistical models (see [1, 8]). The aforementioned models are built only on historical data. There is also interesting work on crime prediction using social network data, e.g., Twitter (see [2, 31]).

Deep learning has recently been used for crime modeling and forecasting. In our previous work, we considered real-time crime forecasting at fine spatial scale (see [30]). Kang et al. studied the crime forecasting problem by transforming it into a binary classification problem (see [14]). The key idea is to have a convolutional neural network (CNN for short) learn the features for crime forecasting with inputs of historical data, weather, geographical information, etc. Finally, they apply a support vector machine (SVM for short) to classify each region into crime or no crime with a posterior probability. This is an interesting idea, but not an optimal approach. Consider two regions. In one region, exactly one crime always happens with certainty. In the other region, many crimes may happen, but only with 90 percent probability. Based on the classification approach, the first region would be flagged for patrol over the second. Moreover, this model does not fully capture the fine scale spatial temporal patterns in the crime data.

Recent advances in deep learning techniques have made forecasting of complex spatial temporal crime patterns more tractable (see [10-11, 13, 16, 19, 33]). Some of the most successful applications include citywide traffic flow forecasting, motion prediction and human-object interaction modeling. Zhang et al. [33] created an ensemble of residual networks (see [9]), called ST-ResNet, to study traffic flow. The key idea is to map the traffic flow at each time slot to an image and to specify the dependencies explicitly. Their model gave excellent traffic flow forecasts in Beijing and New York City. Jain et al. [13] proposed a jointly trainable neural network structure, called a structural recurrent neural network (SRNN for short), which is a feed-forward arrangement of RNN units. The SRNN gives state-of-the-art motion forecasting. Moreover, the SRNN is scalable to massive data sets. For periodic motion forecasting, Holden et al. [11] proposed a phase-functioned neural network for character control; their techniques have been successfully used in the gaming industry.

Despite CNNs' superior performance in various real-world applications including crime prediction, their memory and energy consumption can be a problem, especially when deployed on mobile devices with limited resources, due to the huge number of floating-point parameters in the models. Recent efforts have been made to develop quantization techniques (see [5-6, 17, 23, 32, 35]) for training CNNs with low precision parameters. Thus we are able to compress the model size and speed up computation during inference. For example, in binary weight neural networks (BNNs for short) (see [5-6, 23]), the weights in the same fully-connected or convolutional layer are restricted to have the same magnitude. For a layer with n binary weights, the storage of these parameters only requires the memory for one 32-bit floating-point number and n 1-bit binary numbers (i.e., ±1) instead of that for n 32-bit weights, resulting in approximately 32× memory savings. Moreover, at inference time, the need for floating-point multiplications can be eliminated by leveraging the distributive law during forward propagation, which enables faster deployment and substantial energy savings. More precisely, in BNNs, a weight filter W, which is a matrix or a 4-dimensional tensor, can be expressed as
$$W \approx \alpha B,$$

where α > 0 is the layer-wise scaling factor and B has the same size as W but only contains entries ±1. Given input I, the forward propagation calls for evaluating
$$W * I \approx \alpha (B * I),$$

where ∗ denotes the convolution operation or matrix-vector multiplication. Note that the computation of B ∗ I involves additions and subtractions only. Unfortunately, weight binarization often leads to a nonnegligible loss of prediction accuracy (see [5-6, 17, 23]). Ternary weight neural networks (TNNs for short) (see [17, 32, 35]) strike a balance between accuracy and memory storage. Compared to BNNs, TNNs own an extra state 0 for the weights and thus enjoy a larger model capacity. TNNs benefit in the same way as BNNs do from quantization. Thanks to sparsity, a number of additions/subtractions can be further dropped from forward propagation. To store ternary numbers, we need a 2-bit representation, which results in a 16× model compression rate. BNNs and TNNs theoretically achieve up to 32× faster convolutional operations during forward propagation at inference time. This speedup can be further boosted by specialized AI chips designed for low-bit operations. More importantly, compared with full-precision models, TNNs can achieve nearly lossless accuracy in benchmark tests such as MNIST and CIFAR10 (see [17]). Other methods for training general low-bit CNNs have also been proposed (see [32, 34]).
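To make the storage and arithmetic savings concrete, the following is a minimal NumPy sketch (ours, not from the paper) of binary weight quantization in the form W ≈ αB: the scaling factor is the mean absolute weight, and the product with an input then needs only sign flips, additions and one final multiplication by α.

```python
import numpy as np

def binarize(w):
    """Binary quantization W ~ alpha * B with B in {-1, +1}."""
    alpha = np.mean(np.abs(w))         # layer-wise scaling factor
    B = np.where(w >= 0, 1.0, -1.0)    # 1-bit sign pattern
    return alpha, B

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))            # a small fully-connected layer
x = rng.normal(size=8)

alpha, B = binarize(W)
# W @ x is approximated by alpha * (B @ x); B @ x uses only additions/subtractions.
print(W @ x)
print(alpha * (B @ x))
```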


In this paper, we study crime forecasting at small spatial and hourly temporal scales. We adapt the ST-ResNet structure for our purposes. Compared to the traffic flow data handled by ST-ResNet, crime data are more challenging. Crime data have much less spatial temporal regularity, i.e., the numbers of events in adjacent time intervals and spatial cells differ hugely. Crime data are very sparse in both space and time. Crime types are also diverse (see [21]). Our contribution is four-fold. First, we select the appropriate spatial temporal scales at which crimes are predictable, and we explore a suitable representation for the spatial temporal crime distribution. Second, we provide different approaches for data regularization in both the spatial and temporal dimensions to further enhance the predictable signals. Third, we adapt the deep learning architecture for crime forecasting. Fourth, we study the ternarization of our ST-ResNet model.

All historical crime data are provided by the Los Angeles Police Department (LAPD for short).

We organize the paper as follows: In Section 2, we discuss the crime data sets and preprocessing techniques. In Section 3, we discuss the deep learning algorithms and network structures for crime forecasting. Forecasting results and comparisons with several other methods are presented in Sections 4 and 5, respectively. In Section 6, we explore the ternarization of the ST-ResNet to reduce the model size and speed up forecasting. In Section 7, we summarize this paper's contributions and discuss future work.

2 Data Representation

2.1 Data set description


We consider crime forecasting in Los Angeles (LA for short). In our protocol, historical crime, weather and holiday data are the key ingredients. Since holiday records are easy to obtain, we only provide brief descriptions of the other two data sets.

Crime Data For a simple yet effective demonstration of our framework, we consider all the crimes recorded in LA over the last six months of 2015 without distinguishing their types. In total there were 104,957 crime events. The crime time and location information is used in our forecasting paradigm. Each crime is associated with two times: start and end times. To avoid ambiguity, we regard the start time of each event as the associated time slot. Geographically, the latitude and longitude intervals spanned by these crimes are [33.3427°, 34.6837°] and [-118.8551°, -117.7157°], respectively. The spatial crime distribution is highly heterogeneous; a large portion of the area contains little or no crime. Therefore, we only consider the crimes that happened within the region [33.6927°, 34.3837°] × [-118.7051°, -118.1157°]; this selected region contains more than 95 percent of the total crimes. Nevertheless, there is still spatial redundancy in this data embedding. In our study, we partition the selected region into a 16 × 16 lattice. Each grid cell represents approximately 17.8 km² of land area. Figure 1 shows the crime distribution at 1:00 p.m. on December 20, 2015. The left panel is the crime distribution over the whole LA area. The right panel depicts the crimes in the restricted region.

Figure 1 Crime distribution at 1:00 p.m. on December 20, 2015. Chart (a) depicts the crime distribution over the whole LA area; chart (b) depicts the crime distribution over the selected region. The units are described in Section 2.



Weather Data We collect the weather data from the Weather Underground database available at https://www.wunderground.com/ using a simple web crawler. Special attention should be paid to extracting the data correctly, as the format varies day by day. We select temperature, wind speed and special events, including fog, rain and thunderstorms, as our weather features. Since we study hourly crime forecasting, if more than one weather record is available within an hour, we use the average of the features. For the time intervals without weather data, we use a linear interpolation from the neighboring intervals.
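As an illustration of the hourly averaging and gap filling just described, the following pandas sketch assumes the scraped records sit in a DataFrame with a datetime index and hypothetical numeric columns such as 'temperature' and 'wind_speed'; it sketches only the preprocessing, not the crawler itself.

```python
import pandas as pd

def hourly_weather_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Average sub-hourly weather records into hourly bins and fill gaps.

    `raw` is assumed to have a DatetimeIndex and numeric columns
    (e.g., 'temperature', 'wind_speed'); special events can be encoded
    as 0/1 indicator columns beforehand.
    """
    hourly = raw.resample("1H").mean()          # average multiple records per hour
    return hourly.interpolate(method="time")    # linear fill for hours with no data
```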

2.2 Data preprocessing

Charts (a) and (c) of Figure 2 show the crime intensity functions over the whole of LA and over a randomly selected grid cell during the last two weeks of the year 2015. The intensity functions show low regularity in the temporal dimension. However, the hourly crime time series indicate strongly predictable signals; obviously, the time series over the whole domain is periodic with a period of 24 hours. For selected grid cells, the periodic pattern still exists, but the time series is much more irregular. Deep learning uses combinations of simple linear and nonlinear continuous functions to form a dynamical system, thus approximating the complex input signal. Since deep learning models are essentially continuous, we need to enhance the regularity of the time series data, especially for the grid-wise crime intensity functions. To address this, we map the original crime intensity function {X(t)} to {Y(t)} via a diurnal periodic integral mapping:
$$Y(t) = \int_{nT}^{t} X(\tau)\, d\tau$$
for t within the time interval (nT, (n+1)T], where T = 24 hours. As demonstrated in charts (b) and (d) of Figure 2, after integration, the regularity of the original time series improves dramatically. The periodic signal is amplified.
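Assuming the crime intensity is stored as an hourly count series, the diurnal integral above reduces to a cumulative sum that restarts every 24 hours. A minimal NumPy sketch (function name ours):

```python
import numpy as np

def diurnal_cumsum(x, period=24):
    """Map an hourly intensity series X to Y, the within-day running total,
    restarting the accumulation at the start of every `period`-hour window."""
    x = np.asarray(x, dtype=float)
    n_days = len(x) // period
    y = x[: n_days * period].reshape(n_days, period).cumsum(axis=1)
    return y.reshape(-1)

# Two toy days of hourly counts: the running total resets after hour 24.
x = np.tile(np.arange(24), 2)
print(diurnal_cumsum(x)[:5])   # [ 0.  1.  3.  6. 10.]
```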


Figure 2 Chart (a) depicts the hourly crime intensity of the last two weeks of 2015 over the whole LA area; chart (b) draws the cumulated crime intensity corresponding to (a). Charts (c) and (d) plot the crime density and diurnal cumulated crime intensity on the grid with longitude and latitude range [33.9519°, 33.9951°] × [-118.2635°, -118.2262°], respectively. Units: x-axis: time; y-axis: number of crimes.

To resolve the lack of spatial regularity, we apply a super resolution technique at each time step, e.g., bilinear or cubic spline interpolation. For computational efficiency, we upsample by a factor of 2 in each dimension of the spatial domain. In Figure 3 we see that the bilinear super resolution significantly improves spatial regularity. A merit of this preprocessing is that it improves the signal without losing information associated with the crime data.
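A minimal sketch of this spatial super resolution step using SciPy's order-1 (bilinear) spline interpolation to upsample each crime image by a factor of 2 per dimension; the function and array names are ours:

```python
import numpy as np
from scipy.ndimage import zoom

def super_resolve(frame, factor=2):
    """Bilinear (order-1 spline) upsampling of a single crime image."""
    return zoom(np.asarray(frame, dtype=float), factor, order=1)

frame = np.random.poisson(0.2, size=(16, 16))  # one hourly 16 x 16 crime image
print(super_resolve(frame).shape)              # (32, 32)
```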

Figure 3 Cumulated crime intensity at 11:00 p.m. on December 31, 2015. Chart (a) depicts the crime distribution over the selected area; chart (b) provides the mesh plot of chart (a); chart (c) depicts the super resolution version of chart (a); and chart (d) is the mesh plot of chart (c).

3 Models and Algorithms

3.1 Mathematical problem formulation

For the sake of simplicity, in this work we do not consider the crime type forecasting problem. In our protocol, we only consider how many crimes will happen in the next time step in each grid cell. Mathematically, our paradigm can be formulated as: Given the historical data $\{(X_t, E_t)\}_{t=1,2,\cdots,n}$ and the future external features $E_{n+1}$, predict $X_{n+1}$, where $X_1, X_2, \cdots, X_{n+1}$ are the tensors representing the crime spatial distributions at times $t_1, t_2, \cdots, t_{n+1}$, and $E_1, E_2, \cdots, E_{n+1}$ are the external features that affect the crimes (e.g., holiday, time, weather). The entire procedure of our crime predictor is formulated in the pseudo code described in Algorithm 1. In Algorithm 1, S and I denote the spatial super-resolution and temporal diurnal integration operators, respectively.


Algorithm 1 Real Time Spatial Temporal Deep Learning Crime Predictor

where $|X_{n+1}|_+$ is the positive part of $X_{n+1}$, i.e., $|X_{n+1}|_+ = \max(X_{n+1}, 0)$ applied entrywise.
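The pseudo code of Algorithm 1 is not reproduced here; the sketch below only illustrates, schematically, the workflow it describes, assuming hypothetical operators S (spatial super resolution) and I (diurnal integration) and a trained regression model with a predict method.

```python
import numpy as np

def forecast_next(frames, model, S, I):
    """Schematic of the predictor workflow: enhance the historical frames with
    the spatial super-resolution operator S and the diurnal integration operator I,
    feed them to the trained model, and keep the positive part of the output."""
    enhanced = I(np.stack([S(f) for f in frames]))   # signal enhancement
    x_next = model.predict(enhanced[None, ...])      # predicted (cumulative) intensity
    return np.maximum(x_next, 0.0)                   # |X_{n+1}|_+ = max(X_{n+1}, 0)
```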

3.2 ST-ResNet structure

We test two different deep neural network structures. The first structure is adapted from [33]. The second structure, which excludes convolution, is equivalent to an ensemble of residual networks that learn the time series on each grid cell, without considering the transition of crimes between different grid cells. The first model is more realistic: through convolutional layers, crime dynamics and influences among different grid cells can be captured. In both networks, all features are fused with the crime data via the parametric-matrix based fusion technique used in [33]. A detailed description of the network structure can be found in [33]. We implement our method using Keras (see [4]) on top of Theano (see [29]).

Our models incorporate external features such as weather and holidays. Due to the periodic pattern and self-exciting property of crimes (see [20]), we adopt nearby, periodic and trend features. The time spacings of these features are at the hourly, daily and weekly levels, respectively. For each category of these dependencies, we employ the three nearest previous spatial distributions of crimes. For instance, suppose we wish to predict the crime distribution at $t_{n+1}$; then the past crime distributions $X_n$, $X_{n-1}$, $X_{n-2}$, $X_{n-24}$, $X_{n-48}$, $X_{n-72}$, $X_{n-168}$, $X_{n-336}$, $X_{n-504}$ are utilized as features. We believe that longer dependencies produce better results. We let the algorithm learn the dependencies automatically in an RNN fashion.
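The nine historical frames listed above can be gathered with a few fixed index offsets. A small NumPy sketch, assuming the crime images are stacked in an array of shape [time, height, width] (layout and names are ours):

```python
import numpy as np

NEARBY = (0, 1, 2)          # X_n, X_{n-1}, X_{n-2}           (hourly dependence)
PERIODIC = (24, 48, 72)     # X_{n-24}, X_{n-48}, X_{n-72}    (daily dependence)
TREND = (168, 336, 504)     # X_{n-168}, X_{n-336}, X_{n-504} (weekly dependence)

def lag_features(frames, n):
    """Stack the nine historical frames used to predict frame n + 1.

    `frames` has shape (time, H, W); all lags must exist, i.e. n >= 504."""
    return np.stack([frames[n - lag] for lag in NEARBY + PERIODIC + TREND])
```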

4 Results on Crime Forecasting

Figure 4 Structure of the deep neural network model with convolution.

We ran experiments on the crime data from the last six months of 2015 over LA. The data from the last two weeks are used to test the model. The remaining data are used for training and validating the models, where the validation ratio is 20 percent. We use 6 layers of residual units, a number selected by trial and error, to assemble the ST-ResNet, which is a good compromise between model complexity and accuracy. During training, we first run 200 epochs to train the network with a separate validation set to ensure our models do not over-fit. Subsequently, we schedule another 50 epochs on the combination of the training and validation sets to fine-tune the model. All the experiments are carried out with a single Nvidia Quadro K4000 graphics card. To speed up the training process, we make use of the deep neural network library cuDNN (see [3]). The size of the convolution filters is fixed at 3 × 3. The learning rate is chosen to be 0.0005. The ADAM optimizer is used to optimize the loss function.
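For orientation only, here is a heavily simplified Keras sketch wired with the hyper-parameters reported above (Adam optimizer, learning rate 0.0005, 3 × 3 filters, 20 percent validation split); the single-convolution stand-in model and the random arrays are placeholders, not the ST-ResNet of [33], and the epoch count is shrunk so the snippet runs quickly.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D
from keras.optimizers import Adam

# Stand-in for the ST-ResNet: one 3x3 convolution mapping 9 stacked historical
# frames (32 x 32 after super resolution) to one predicted intensity image.
model = Sequential([Conv2D(1, (3, 3), padding="same", input_shape=(32, 32, 9))])
model.compile(optimizer=Adam(0.0005), loss="mse")

X = np.random.rand(100, 32, 32, 9)   # placeholder training inputs
y = np.random.rand(100, 32, 32, 1)   # placeholder training targets
model.fit(X, y, epochs=2, validation_split=0.2, verbose=0)  # 200 + 50 epochs in the paper
```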

We use the root mean square error (RMSE for short) between the prediction and the ground truth as our measure of prediction accuracy. The RMSE is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\big(I_i^t - \hat{I}_i^t\big)^2},$$
where N is the total number of grid cells that we partition the restricted area into, T is the number of time slots considered, and $I_i^t$ and $\hat{I}_i^t$ are the exact and predicted crime intensities in grid cell i at time t, respectively. When considering the accuracy of the prediction in a single grid cell, we do not sum over the grid index i. Table 1 lists the RMSEs between the predictions and the ground truth cumulative intensity functions with different setups of the network and different treatments of the input data.
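The RMSE above is a plain elementwise computation; a NumPy sketch (names ours):

```python
import numpy as np

def rmse(I_true, I_pred):
    """Root mean square error over all grid cells and time slots.

    Both arrays share a shape such as (T, N) or (T, H, W); the mean runs
    over every entry, matching the N * T terms in the definition above."""
    I_true = np.asarray(I_true, dtype=float)
    I_pred = np.asarray(I_pred, dtype=float)
    return float(np.sqrt(np.mean((I_true - I_pred) ** 2)))
```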


We consider different experimental setups to validate the importance of the signal enhancement treatments. For super resolution comparisons, we consider three cases, namely super resolution in space and time, super resolution in space only, and no super resolution. Bilinear interpolation is employed for all signal enhancements. As demonstrated in Table 1, the best results come from using both spatial and temporal super resolution. In general, these signal enhancement techniques improve model performance. To test the influence of model complexity, we considered different numbers of filters in the convolutional layers. We list the results for different numbers of neurons in Table 1. In general, performance increases as more neurons are involved. These filters capture different scales of the spatial temporal features of the training data set. Currently, the maximum number of filters (64×64) is set by the capacity of our graphics card. We believe the model can give even better results with more filters, since they can capture more detailed information about the spatial temporal distribution of crime. The optimal results are obtained when we use convolutional layers on the signals super resolved in both space and time, which gives an RMSE of 0.207 in the prediction. This performance shows that the convolutional layers capture the spatial influence of crimes, as it is known that crime is self-exciting in both space and time (see [20]). Without convolutional layers, each grid cell is basically treated independently, which leads to an inefficient model.

One key feature of the convolutional neural network is weight sharing, i.e., a given neuron shares a common filter over the whole image domain. This simplifies the neural network model and the training procedure. However, for extremely sparse spatial data, like the crime data we study, weight sharing may lead to filters with all weights being zero. As shown in Table 1, without super resolution, the network with convolutional layers offers the worst forecasting (all the predictions are zero). We conclude that, for sparse spatial data, applying a convolutional network to the super resolved data can give excellent forecasting. On the one hand, it alleviates the weight sharing problem; on the other hand, it captures complex spatial distributions.

Table 1 Performance of the ST-ResNet on crime forecasting under different settings. Units for the Training and Test Error columns: number of crimes.

In Figure 5 we show sample snapshots in time. It is easy to see that all crime hot spots are captured. The ST-ResNet gives satisfactory results both with and without convolutional layers.


For a given grid cell, the crime intensity over a given time interval is also accurately predicted. We randomly select two grid cells with longitude and latitude ranges [33.9519°, 33.9951°] × [-118.2635°, -118.2262°] and [34.0382°, 34.0814°] × [-118.4472°, -118.4104°], respectively. As shown in Figure 6, the maximum difference between the ground truth and the prediction in crime intensity is 3 crimes in absolute value. These results quantitatively confirm that our predictions are accurate. The RMSEs of the prediction over the crime intensity functions are 0.665 and 0.551, respectively, and 0.750 and 0.443 over the cumulated intensity functions. For the first grid cell, there are 131 hourly time slots with crimes over the last two weeks of 2015. Our predictor gives 148 candidate slots, whose intersection with the ground truth contains 106 slots. For the second grid cell, there are 99 hourly time slots with crimes. The prediction gives 104 slots, 69 of which lie in the ground truth set.

Figure 5 Predicted vs. exact crime spatial distribution. Panels (a), (b) plot the crime spatial distribution at 1 p.m. on December 19 and 27, 2015, respectively. Panels (c), (d) are the predicted results without convolution layers. Panels (e), (f) are the predicted results with convolution layers.

Figure 6 Predicted vs. exact crime intensity in two randomly selected grid cells over the last two weeks of 2015. Charts (a) and (b) are the predictions of the crime intensity and cumulated intensity functions on the grid [33.9519°, 33.9951°] × [-118.2635°, -118.2262°], respectively. Charts (c) and (d) are the corresponding intensity and cumulated intensity predictions over the grid [34.0382°, 34.0814°] × [-118.4472°, -118.4104°]. Units: x-axis: time; y-axis: number of crimes.


5 Comparison Between Different Methods

In this section, we compare our approach with several existing methods for crime forecasting. In total, we compare our deep learning approach to ARIMA (see [1]), k nearest neighbors (KNN for short) and the historical average (HA for short). We briefly summarize these methods below; a small code sketch of the three baselines follows the list.


· HA: In this simple empirical model, at each time slot, we regard the historical average at that specific hour as the prediction. This is a parameter-free model. However, the daily crime volatility cannot be captured by this model.

· KNN: In this model, we use the average number of crimes in the closest previous time steps to forecast the number of events at the next time step. The only parameter is k, which represents the number of nearest previous steps involved in the prediction. The parameter k can be determined by simple cross validation; here we adopt five-fold cross validation. It is found that when k equals one, KNN provides the best results. Lag forecasting is the main drawback of this model.

· ARIMA: The general model ARIMA(p, d, q) has three parameters, where p is the order of the autoregressive part, d is the order of differencing needed to make the signal stationary, and q is the order of the moving average. The parameter d is determined by the ADF stationarity test; p and q are determined by the autocorrelation function (ACF for short) and partial autocorrelation function (PACF for short), respectively. Based on our testing, the cumulative crime intensity function itself is stationary. The optimal orders for the autoregressive and moving average parts are 25 and 26, respectively. These two parameters reflect a roughly one-day dependence. Due to the simplicity of training the model, we implement the ARIMA model in a rolling fashion and update the model on the fly as new data are presented. The major deficiency of this model is that it cannot include features other than the time series itself. It is also too simple to capture all the features carried by the signal. In general, ARIMA is only suitable for simple time series that carry all the predictable information.
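A compact sketch of the three baselines on a single grid cell's hourly series, assuming statsmodels for ARIMA; the order (25, 0, 26) is the one quoted above, and the helper names are ours.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def ha_forecast(train, horizon=24):
    """Historical average: predict the mean count observed at each hour of the day.
    Assumes len(train) is a multiple of 24 (hourly data aligned to full days)."""
    by_hour = np.asarray(train, dtype=float).reshape(-1, 24).mean(axis=0)
    return np.tile(by_hour, horizon // 24 + 1)[:horizon]

def knn_forecast(train, k=1):
    """k nearest previous steps: average of the last k observations (k = 1 works best here)."""
    return float(np.mean(train[-k:]))

def arima_forecast(train, horizon=24):
    """ARIMA(25, 0, 26) fitted on the (stationary) cumulative intensity series."""
    fit = ARIMA(np.asarray(train, dtype=float), order=(25, 0, 26)).fit()
    return fit.forecast(steps=horizon)
```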


Figure 7 Comparison of different methods' forecasts of the c.d.f. and p.d.f. over the last two weeks of 2015 on the region [33.9519°, 33.9951°] × [-118.2635°, -118.2262°]. Charts (a) and (b) are the forecasting results for the c.d.f. and p.d.f., respectively. We also provide a zoomed-in plot of the crimes of the first day of this period.

We randomly select a grid cell with the longitude and latitude range [33.9519°, 33.9951°]×[-118.2635°,-118.2262°] for comparison. The comparison of exact and predicted cumulative densities of crime is depicted in panel (a) of Figure 7. Panel (b) of Figure 7 is a comparison of the crime intensity functions. The crime distribution function itself is highly irregular over the time span. The regularity of the signal is enhanced by integration. The cumulative density function is periodic with some fluctuation.

Table 2 Performance comparison between different methods over the last two weeks of 2015 on the region [33.9519°, 33.9951°] × [-118.2635°,-118.2262°]. Units for the Error columns: Number of crimes.

The deep learning model provides the optimal prediction, followed by ARIMA, KNN and HA (Table 2). ARIMA, KNN and HA are not on par with one another. The optimal RMSEs for the cumulative density and the original crime signal are 0.750 and 0.659, respectively. For ARIMA and KNN, the errors in the original signal are larger than those in the cumulative one. Visually, ARIMA and KNN seem to provide excellent predictions. However, this is a misperception due to lagged forecasting. According to our tests, the ST-ResNet shows even stronger performance relative to the other predictors when the data become more sparse.


6 Ternarization of ST-ResNet

In this section, we consider the ternarization of the ST-ResNet. Suppose there are in total l fully-connected and convolutional layers with respective weight filters $W_i$, $i = 1, \cdots, l$. For a fully-connected layer, $W_i$ is a matrix, and for a convolutional layer, it is a high dimensional tensor. Without loss of generality, let us view $W_i$ as a vector of dimension $n_i$. Then the vector $W_i \in \mathbb{R}^{n_i}$ is ternary-valued and takes the form
$$W_i = \alpha_i T_i,$$

where $T_i \in \{-1, 0, 1\}^{n_i}$ has the same size as $W_i$, and $\alpha_i > 0$ is a shared scaling factor. Training TNNs calls for solving the following constrained minimization problem:
$$\min_{W, b}\; f(W, b) \quad \text{subject to} \quad W_i \in \mathcal{T}_i, \quad i = 1, \cdots, l, \tag{6.1}$$

where f denotes the overall energy function determined by the network architecture, $W = \{W_1, \cdots, W_l\}$ the weight parameters, b the other trainable parameters, and
$$\mathcal{T}_i := \{\alpha T : \alpha > 0,\; T \in \{-1, 0, 1\}^{n_i}\}$$
the set of ternary weights for the i-th layer.



The key step for solving (6.1) lies in the ternarization of a given floating-point vector, which we denote by $W_i^f \in \mathbb{R}^{n_i}$. To this end, we seek to minimize the Euclidean distance between $W_i^f$ and $W_i$:
$$\min_{W_i \in \mathcal{T}_i} \|W_i^f - W_i\|_2^2. \tag{6.2}$$
The solution $\mathrm{proj}_{\mathcal{T}_i}(W_i^f)$ to the above problem is simply the projection of $W_i^f$ onto the set $\mathcal{T}_i$. For now, let us ignore the subscript i for notational simplicity and write $W^f \in \mathbb{R}^n$. In an alternative form, the above problem can be formulated as
$$\min_{\alpha > 0,\; T \in \{-1, 0, 1\}^n} \|W^f - \alpha T\|_2^2.$$


After obtaining $(\alpha, T)$, the ternarization of $W^f$ is then given by $\mathrm{proj}_{\mathcal{T}}(W^f) = \alpha T$. The solution to (6.2) was first approximated by Li et al. [17] under unrealistic statistical assumptions on the components of $W^f$, albeit with satisfactory empirical performance. The exact expression for $\mathrm{proj}_{\mathcal{T}}$ was later derived by Yin et al. [32]. We summarize the result in the theorem below.

Theorem 6.1 Suppose $W^f_{[k]}$ keeps the k largest entries in magnitude of $W^f$ and zeros out the others. Then the solution to problem (6.2) is given by
$$\alpha^* = \frac{\|W^f_{[k^*]}\|_1}{k^*}, \qquad T^* = \mathrm{sign}\big(W^f_{[k^*]}\big),$$
where
$$k^* = \arg\max_{1 \le k \le n} \frac{\|W^f_{[k]}\|_1^2}{k}$$
is the sparsity of the optimal ternary weight vector.

For readers’ convenience, we provide a proof here.

Proof Suppose the sparsity of T is k. Since $T \in \{-1, 0, 1\}^n$, we have $\|T\|_2^2 = k$ and hence
$$\|W^f - \alpha T\|_2^2 = \|W^f\|_2^2 - 2\alpha \langle W^f, T\rangle + \alpha^2 k,$$


and thus
$$\|W^f - \alpha T\|_2^2 \ge \|W^f\|_2^2 - \frac{\langle W^f, T\rangle^2}{k} \ge \|W^f\|_2^2 - \frac{\|W^f_{[k]}\|_1^2}{k}, \tag{6.3}$$
where the first inequality is attained by minimizing over $\alpha > 0$ at $\alpha = \langle W^f, T\rangle / k$, and the second follows from $\langle W^f, T\rangle \le \|W^f_{[k]}\|_1$ for any T of sparsity k.

Since $\|W^f\|_2^2$ is a constant, the optimal sparsity $k^*$ maximizes the term $\|W^f_{[k]}\|_1^2 / k$ in (6.3), i.e.,
$$k^* = \arg\max_{1 \le k \le n} \frac{\|W^f_{[k]}\|_1^2}{k}.$$


To achieve the lower bound in (6.3), we must have
$$T^* = \mathrm{sign}\big(W^f_{[k^*]}\big), \qquad \alpha^* = \frac{\langle W^f, T^*\rangle}{k^*} = \frac{\|W^f_{[k^*]}\|_1}{k^*}.$$

According to Theorem 6.1, the ternarization of $W^f$ can be performed by direct enumeration. This involves sorting the magnitudes of the elements of $W^f$ and computing the cumulative sums of the sorted sequence, which requires a computational complexity of O(n log n). Our training of the ternary ST-ResNet is carried out by a projected SGD-like algorithm (see [5, 23]). We keep updating the floating-point weights using the minibatch (sub)gradient of f evaluated at the ternary weights. This is different from the standard projected SGD, in which the ternary weights would be updated in the descent step. The mean convergence of this pseudo projected SGD has been proved under smoothness and convexity assumptions on f (see [18]). In fact, it has demonstrated much stronger empirical performance than the standard version in training quantized neural networks (see [18]). In addition, we adopt popular techniques in deep learning such as ℓ2 regularization, batch normalization (see [12]) and ADAM (see [15]) to improve training efficiency. Our method for training the ternary ST-ResNet is summarized in Algorithm 2.
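The sort-and-scan enumeration of Theorem 6.1 takes only a few lines of NumPy; the sketch below follows the formulas above (function name ours).

```python
import numpy as np

def ternarize(w):
    """Exact projection of a float vector onto {alpha * T : alpha > 0, T in {-1,0,1}^n}.

    Sort |w|, scan the cumulative sums to find the sparsity k* maximizing
    ||w_[k]||_1^2 / k, then set alpha* = ||w_[k*]||_1 / k* and put signs on
    the k* largest-magnitude entries."""
    w = np.asarray(w, dtype=float)
    order = np.argsort(-np.abs(w))               # indices by decreasing magnitude
    csum = np.cumsum(np.abs(w)[order])           # ||w_[k]||_1 for k = 1, ..., n
    k = int(np.argmax(csum ** 2 / np.arange(1, w.size + 1))) + 1
    alpha = csum[k - 1] / k
    T = np.zeros_like(w)
    T[order[:k]] = np.sign(w[order[:k]])
    return alpha, T

alpha, T = ternarize(np.array([0.9, -0.05, 0.4, -0.6, 0.02]))
print(alpha, T)   # alpha * T is the least-squares ternary approximation
```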

We coded and tested the optimizer on the Lasagne/Theano (see [7, 29]) platform in Python on a machine with an Nvidia GeForce GTX Titan X GPU. We trained a fully-ternary ST-ResNet with spatial and temporal super-resolution preprocessing, where all the weight filters are ternary. The training RMSE and testing RMSE with 64×64 neurons are 0.234 and 0.242, respectively. As shown in Table 3, compared to the full-precision model, there is only a small accuracy loss.

Algorithm 2 Training one epoch of ternary weight ST-ResNet.

Table 3 Performance comparison between ST-ResNet and its ternarization. Units for Training and Test Error columns: Number of crimes.

7 Concluding Remarks

In this paper, we present a real-time spatial temporal predictor for end-to-end crime intensity prediction. The key idea of our predictor can be summarized as follows:

· We chose appropriate spatial temporal scales at which the historical crime time series carry sufficient predictable signals. For a given time step, we map the number of events into an image; each pixel value represents the number of crimes in a grid cell at a specific time.

· We developed effective spatial temporal signal enhancement techniques to boost the crime forecasting accuracy. These techniques also address the deficiency of CNNs on sparse data due to weight sharing. More specifically, in the temporal dimension, we compute the diurnal cumulative crime per grid cell. In the spatial dimension, we use bilinear interpolation super resolution.


· We adapted the ST-ResNet for crime prediction.

Our methods provide crime forecasting for each grid cell at hourly temporal scale. The predictions are extremely accurate, which provides reliable guidance for crime control. Our model can be categorized as a deep learning regression method, which provides a better description of crime forecasting than the classification type of methods, since crime prediction is not just a simple yes-or-no problem.

Nevertheless, there are many aspects to improve. On the one hand, the ad hoc grid partitioning of the spatial domain ignores demographic and geographic information. Furthermore, embedding the irregular geometry of the city into a rectangular domain leads to a huge amount of redundant computation. On the other hand, in the ST-ResNet framework, the historical dependencies need to be set explicitly, and longer explicit dependencies cause the network to be extremely complex and difficult to train. Adaptive dependence is hard to incorporate into the ST-ResNet framework.

There are a few lines of research worth exploring in the future. First, a better graphical representation of the spatial temporal data would benefit both the extraction of information from historical data and efficient computation. Second, instead of explicitly specifying the dependencies, an alternative is to use an RNN to learn the dependencies automatically. Third, forecasting crime types is feasible in our framework, although challenging, since the data will be much more sparse compared to the present representation. Fourth, the recently proposed Laplacian smoothing gradient descent (see [22]) could be applied to train the model and boost the prediction accuracy.

Acknowledgement The authors thank the Los Angeles Police Department for providing the crime data for this paper.

References

[1] Chen, P., Yuan, H. and Shu, X., Forecasting crime using the ARIMA model, Proceedings of the 5th IEEE International Conference on Fuzzy Systems and Knowledge Discovery, 5, 2008, 627-630.

[2] Chen, X., Cho, Y. and Jang, S., Crime prediction using twitter sentiment and weather, Systems and Information Engineering Design Symposium, 2015, 63-68, DOI:10.1109/SIEDS.2015.7117012.

[3] Chetlur, S., Woolley, C., Vandermersch, P., et al., cuDNN: Efficient primitives for deep learning, 2014,arXiv:1410.0759.

[4] Chollet, F., Keras: Deep learning for humans, 2015, https://github.com/fchollet/keras.

[5] Courbariaux, M., Bengio, Y. and David, J., Binaryconnect: Training deep neural networks with binary weights during propagations, Advances in Neural Information Processing Systems, 28, 2015, 3123-3131.

[6] Courbariaux, M., Hubara, I., Soudry, D., et al., Binarized neural networks: Training neural networks with weights and activations constrained to +1 or -1, CoRR, 2016, arXiv:1602.02830.

[7] Dieleman, S., Schlüter, J., Raffel, C., et al., Lasagne: First release, 2015, http://lasagne.readthedocs.io/en/latest/.

[8] Gerber, M., Predicting crime using twitter and kernel density estimation, Decision Support System, 61,2014, 115-125.

[9] He, K. M., Zhang, X. Y., Ren, S. Q. and Sun, J., Deep residual learning for image recognition, CVPR,2016, 770-778.

[10] Hochreiter, S. and Schmidhuber, J., Long short-term memory, Neural Comput, 9, 1997, 1735-1780.

[11] Holden, D., Komura, T. and Saito, J., Phase-functioned neural networks for character control, ACM Transactions on Graphics, 36, 2017, 13 pages.

[12] Ioffe, S. and Szegedy, C., Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015, arXiv:1502.03167.

[13] Jain, A., Zamir, A. R., Savarese, S. and Saxena, A., Structural-rnn: Deep learning on spatio-temporal graphs, CVPR, 2016, arXiv:1511.05298.

[14] Kang, H. W. and Kang, H.-B., Prediction of crime occurrence from multi-modal data using deep learning, PLoS ONE, 12, 2017, DOI:10.1371/journal.pone.0176244.

[15] Kingma, D. P. and Ba, J., Adam: A method for stochastic optimization, ICLR, 2015, arXiv:1412.6980.

[16] LeCun, Y., Bengio, Y. and Hinton, G., Deep learning, Nature, 521, 2015, 436-444.

[17] Li, F., Zhang, B. and Liu, B., Ternary weight networks, NIPS Workshop, 2016, http://arxiv.org/abs/1605.04711.

[18] Li, H., De, S., Xu, Z., et al., Training quantized nets: A deeper understanding, 2017, arXiv:1706.02379.

[19] Li, Y., Zemel, R., Brockschmidt, M. and Tarlow, D., Gated graph sequence neural network, ICLR, 2016,http://arxiv.org/abs/1511.05493.

[20] Mohler, G. O., Short, M. B., Brantingham, P. J., et al., Self-exciting point process modeling of crime, J. Amer. Statist. Assoc., 106(493), 2011, 100-108.

[21] Mohler, G. O., Short, M. B. and Brantingham, P. J., The concentration dynamics tradeoff in crime hot spotting, Unraveling the Crime-Place Connection: New Directions in Theory and Policy, 22, 2017, 21 pages.

[22] Osher, S., Wang, B., Yin, P., et al., Laplacian smoothing gradient descent, 2018, arXiv:1806.06317.

[23] Rastegari, M., Ordonez, V., Redmon, J. and Farhadi, A., Xnor-net: Imagenet classification using binary convolutional neural networks, ECCV, 2016, arXiv:1603.05279.

[24] Short, M. B., Mohler, G. O., Brantingham, P. J. and Tita, G. E., Gang rivalry dynamics via coupled point process network, Discrete Contin. Dyn. Syst. Ser. B, 19(5), 2014, 1459-1477.

[25] Short, M. B., Bertozzi, A. L. and Brantingham, P. J., Nonlinear patterns in urban crime: Hotspots, bifurcations, and suppression, SIAM J. Appl. Dyn. Syst., 9(2), 2010, 462-483.

[26] Short, M. B., Brantingham, P. J., Bertozzi, A. L. and Tita, G. E., Dissipation and displacement of hotspots in reaction-diffusion models of crime, Proc. Nat. Acad. Sci., 107(9), 2010, 3961-3965.

[27] Short, M. B., D'Orsogna, M. R., Pasour, V. B., et al., A statistical model of criminal behavior, M3AS: Mathematical Models and Methods in Applied Sciences, 18, 2008, 1249-1267.

[28] Stomakhin, A., Short, M. B. and Bertozzi, A. L., Reconstruction of missing data in social networks based on temporal patterns of interactions, Inverse Problems, 27(11), 2011, 15 pages.

[29] The Theano Development Team, Theano: A Python framework for fast computation of mathematical expressions, 2016, arXiv:1605.02688.

[30] Wang, B., Zhang, D., Zhang, D. H., et al., Deep Learning for Real Time Crime Forecasting, 2017, arXiv:1707.03340.

[31] Wang, X., Gerber, M. S. and Brown, D. E., Automatic crime prediction using events extracted from twitter posts, International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction, 2012, 231-238, DOI:10.1007/978-3-642-29047-3_28.

[32] Yin, P., Zhang, S., Xin, J. and Qi, Y., Quantization and Training of Low Bit-Width Convolutional Neural Networks for Object Detection, 2016, arXiv:1612.06052.

[33] Zhang, J. B., Zheng, Y. and Qi, D. R., Deep spatio-temporal residual networks for citywide crowd flows prediction, AAAI, 2017, arXiv:1610.00081.

[34] Zhou, A. J., Yao, A. B., Guo, Y. W., et al., Incremental network quantization: Towards lossless cnns with low-precision weights, ICLR, 2017, arXiv:1702.03044.

[35] Zhu, C. Z., Han, S., Miao, H. Z. and Dally, W. J., Trained ternary quantization, ICLR, 2017, arXiv:1612.01064.

2000 MR Subject Classification 00A69, 65C50

Manuscript received February 14, 2019.

1Department of Mathematics, University of California, Los Angeles, Westwood, Los Angeles, CA 90095,USA. E-mail: wangbaonj@gmail.com yph@ucla.edu bertozzi@math.ucla.edu sjo@math.ucla.edu

2Department of Anthropology, University of California, Los Angeles, Westwood, Los Angeles, CA 90095,USA. E-mail: branting@ucla.edu

3Department of Mathematics, University of California, Irvine, Irvine, CA 92697, USA.E-mail: jxin@math.uci.edu

This work was supported by ONR Grants N00014-16-1-2119, N00014-16-1-2157, NSF Grants DMS-1417674, DMS-1522383, DMS-1737770 and IIS-1632935.
