Chapter 3 Data transformation

The NYPD Open Data Sets are generally well constructed and with consistent data structure and type, and the sets usually do not have serious missing value issues. Therefore, our focus here would mainly be insights formulation, where we shall decide the specific metric for further exploration.

Specifically, we include the basic data transformation as the pre-processing module to clean the data and obtain some fundamental insights.

Notice that, since all our data listed here have millions of rows, so we turn to the ff library to enable fast loading the data set feature. Besides, we also developed several reusable modules for cleaner and more efficient processing. For more information on ‘ff’ package, please check here.

Our interest lies on the geometry distribution and the time series analysis of our Arrest Data and Complaint Data.

Finally, after successfully cleaned/pre-processed the Arrest Data and the Complaint Data, we introduced a map API as well to include the geometry information for our further visualized and interactive analysis in later chapters.

3.1 Construction of Processing Modules

We developed several reusable modules that we would repetitively use in our analysis below. Specifically, we set our environment as well as a time stamp formatting function to unify our criteria for date data.

One can implement these modules via the syntax of time-and-date manipulation in R and “knitr” as well as “magrittr” packages.

We also developed a mimic function for str() to provide insights of the data sets in a table way.

To formulate proper data structure and columns, we developed a function allowing we derive year and/or month information from the original time stamp column of Arrest Data. Further more, this function also allow us to decide whether an grouping by year and/or month is needed. One can find further details in our repository.

For a detailed implementation, please refer to our repository.

3.2 Preprocessing: Loading and Descriptive Analysis

We created tables for the data set structures to get a first impression on how exactly the data is showed.

(#tab:load the data, arrest)Head 6 Rows in Arrest Dataset for Temporal Analysis
Column1_ Variable Column2_ Classe Column3_ First_value
|| ARREST_KEY || integer || 32311380, 192799737, 193260691
|| ARREST_DATE || integer || 06/18/2007, 01/26/2019, 02/06/2019
|| PD_CD || integer || 511, 177, NA
|| PD_DESC || integer || CONTROLLED SUBSTANCE, POSSESSION 7, SEXUAL ABUSE,
|| KY_CD || integer || 235, 116, NA
|| OFNS_DESC || integer || DANGEROUS DRUGS, SEX CRIMES,
|| LAW_CODE || integer || PL 2200300, PL 1306503, PL 2203400
|| LAW_CAT_CD || integer || M, F, F
|| ARREST_BORO || integer || Q, M, M
|| ARREST_PRECINCT || integer || 27, 25, 14
|| JURISDICTION_CODE || integer || 1, 0, 0
|| AGE_GROUP || integer || 18-24, 45-64, 25-44
|| PERP_SEX || integer || M, M, M
|| PERP_RACE || integer || BLACK, BLACK, UNKNOWN
|| X_COORD_CD || double || NA, 1000555, 986685
|| Y_COORD_CD || double || NA, 230994, 215375
|| Latitude || double || NA, 40.800694331, 40.7578390030001
|| Longitude || double || NA, -73.941109286, -73.991212111
|| Lon_Lat || integer || , POINT (-73.94110928599997 40.800694331000045), POINT (-73.99121211099998 40.75783900300007)
ARREST_KEY ARREST_DATE PD_CD PD_DESC KY_CD OFNS_DESC LAW_CODE LAW_CAT_CD ARREST_BORO ARREST_PRECINCT JURISDICTION_CODE AGE_GROUP PERP_SEX PERP_RACE X_COORD_CD Y_COORD_CD Latitude Longitude Lon_Lat
32311380 06/18/2007 511 CONTROLLED SUBSTANCE, POSSESSION 7 235 DANGEROUS DRUGS PL 2200300 M Q 27 1 18-24 M BLACK NA NA NA NA
192799737 01/26/2019 177 SEXUAL ABUSE 116 SEX CRIMES PL 1306503 F M 25 0 45-64 M BLACK 1000555 230994 40.80069 -73.94111 POINT (-73.94110928599997 40.800694331000045)
193260691 02/06/2019 NA NA PL 2203400 F M 14 0 25-44 M UNKNOWN 986685 215375 40.75784 -73.99121 POINT (-73.99121211099998 40.75783900300007)
149117452 01/06/2016 153 RAPE 3 104 RAPE PL 1302503 F K 67 0 25-44 M BLACK 998032 175598 40.64865 -73.95034 POINT (-73.95033556299995 40.648650085000035)
190049060 11/15/2018 157 RAPE 1 104 RAPE PL 1303501 F K 77 0 25-44 M BLACK 1003606 185050 40.67458 -73.93022 POINT (-73.93022154099998 40.67458330800008)
24288194 09/13/2006 203 TRESPASS 3, CRIMINAL 352 CRIMINAL TRESPASS PL 140100E M K 77 2 45-64 M BLACK 1004580 183838 40.67125 -73.92671 POINT (-73.926713851 40.67125445700003)
Column1_ Variable Column2_ Classe Column3_ First_value
|| CMPLNT_NUM || integer || 394506329, 968873685, 509837549
|| CMPLNT_FR_DT || character || 12/31/2019, 12/29/2019, 12/15/2019
|| CMPLNT_FR_TM || character || 17:30:00, 16:31:00, 18:45:00
|| CMPLNT_TO_DT || character || , 12/29/2019,
|| CMPLNT_TO_TM || character || , 16:54:00,
|| ADDR_PCT_CD || integer || 32, 47, 109
|| RPT_DT || character || 12/31/2019, 12/29/2019, 12/29/2019
|| KY_CD || integer || 118, 113, 578
|| OFNS_DESC || character || DANGEROUS WEAPONS, FORGERY, HARRASSMENT 2
|| PD_CD || integer || 793, 729, 638
|| PD_DESC || character || WEAPONS POSSESSION 3, FORGERY,ETC.,UNCLASSIFIED-FELO, HARASSMENT,SUBD 3,4,5
|| CRM_ATPT_CPTD_CD || character || COMPLETED, COMPLETED, COMPLETED
|| LAW_CAT_CD || character || FELONY, FELONY, VIOLATION
|| BORO_NM || character || MANHATTAN, BRONX, QUEENS
|| LOC_OF_OCCUR_DESC || character || , , FRONT OF
|| PREM_TYP_DESC || character || STREET, STREET, STREET
|| JURIS_DESC || character || N.Y. POLICE DEPT, N.Y. POLICE DEPT, N.Y. POLICE DEPT
|| JURISDICTION_CODE || integer || 0, 0, 0
|| PARKS_NM || character || , ,
|| HADEVELOPT || character || , ,
|| HOUSING_PSA || character || , ,
|| X_COORD_CD || integer || 999937, 1022508, 1034178
|| Y_COORD_CD || integer || 238365, 261990, 209758
|| SUSP_AGE_GROUP || character || , , 25-44
|| SUSP_RACE || character || , , UNKNOWN
|| SUSP_SEX || character || , , M
|| TRANSIT_DISTRICT || integer || NA, NA, NA
|| Latitude || double || 40.820926797, 40.8857014060001, 40.742281156
|| Longitude || double || -73.943324219, -73.861640325, -73.81982408
|| Lat_Lon || character || (40.82092679700002, -73.94332421899996), (40.885701406000074, -73.86164032499995), (40.74228115600005, -73.81982408)
|| PATROL_BORO || character || PATROL BORO MAN NORTH, PATROL BORO BRONX, PATROL BORO QUEENS NORTH
|| STATION_NAME || character || , ,
|| VIC_AGE_GROUP || character || UNKNOWN, UNKNOWN, 25-44
|| VIC_RACE || character || UNKNOWN, UNKNOWN, WHITE HISPANIC
|| VIC_SEX || character || E, E, F

The data is relatively complex.

First, for our Arrest Data, we acquired over 5.1 million rows and 19 variables.

Column1_ Variable Column2_ Classe Column3_ First_value
|| ARREST_KEY || integer || 32311380, 192799737, 193260691
|| ARREST_DATE || integer || 06/18/2007, 01/26/2019, 02/06/2019
|| PD_CD || integer || 511, 177, NA
|| PD_DESC || integer || CONTROLLED SUBSTANCE, POSSESSION 7, SEXUAL ABUSE,
|| KY_CD || integer || 235, 116, NA
|| OFNS_DESC || integer || DANGEROUS DRUGS, SEX CRIMES,
|| LAW_CODE || integer || PL 2200300, PL 1306503, PL 2203400
|| LAW_CAT_CD || integer || M, F, F
|| ARREST_BORO || integer || Q, M, M
|| ARREST_PRECINCT || integer || 27, 25, 14
|| JURISDICTION_CODE || integer || 1, 0, 0
|| AGE_GROUP || integer || 18-24, 45-64, 25-44
|| PERP_SEX || integer || M, M, M
|| PERP_RACE || integer || BLACK, BLACK, UNKNOWN
|| X_COORD_CD || double || NA, 1000555, 986685
|| Y_COORD_CD || double || NA, 230994, 215375
|| Latitude || double || NA, 40.800694331, 40.7578390030001
|| Longitude || double || NA, -73.941109286, -73.991212111
|| Lon_Lat || integer || , POINT (-73.94110928599997 40.800694331000045), POINT (-73.99121211099998 40.75783900300007)

Then, for our Complaints Data, we acquired over 7.3 million rows and 35 variables.

Column1_ Variable Column2_ Classe Column3_ First_value
|| CMPLNT_NUM || integer || 394506329, 968873685, 509837549
|| CMPLNT_FR_DT || character || 12/31/2019, 12/29/2019, 12/15/2019
|| CMPLNT_FR_TM || character || 17:30:00, 16:31:00, 18:45:00
|| CMPLNT_TO_DT || character || , 12/29/2019,
|| CMPLNT_TO_TM || character || , 16:54:00,
|| ADDR_PCT_CD || integer || 32, 47, 109
|| RPT_DT || character || 12/31/2019, 12/29/2019, 12/29/2019
|| KY_CD || integer || 118, 113, 578
|| OFNS_DESC || character || DANGEROUS WEAPONS, FORGERY, HARRASSMENT 2
|| PD_CD || integer || 793, 729, 638
|| PD_DESC || character || WEAPONS POSSESSION 3, FORGERY,ETC.,UNCLASSIFIED-FELO, HARASSMENT,SUBD 3,4,5
|| CRM_ATPT_CPTD_CD || character || COMPLETED, COMPLETED, COMPLETED
|| LAW_CAT_CD || character || FELONY, FELONY, VIOLATION
|| BORO_NM || character || MANHATTAN, BRONX, QUEENS
|| LOC_OF_OCCUR_DESC || character || , , FRONT OF
|| PREM_TYP_DESC || character || STREET, STREET, STREET
|| JURIS_DESC || character || N.Y. POLICE DEPT, N.Y. POLICE DEPT, N.Y. POLICE DEPT
|| JURISDICTION_CODE || integer || 0, 0, 0
|| PARKS_NM || character || , ,
|| HADEVELOPT || character || , ,
|| HOUSING_PSA || character || , ,
|| X_COORD_CD || integer || 999937, 1022508, 1034178
|| Y_COORD_CD || integer || 238365, 261990, 209758
|| SUSP_AGE_GROUP || character || , , 25-44
|| SUSP_RACE || character || , , UNKNOWN
|| SUSP_SEX || character || , , M
|| TRANSIT_DISTRICT || integer || NA, NA, NA
|| Latitude || double || 40.820926797, 40.8857014060001, 40.742281156
|| Longitude || double || -73.943324219, -73.861640325, -73.81982408
|| Lat_Lon || character || (40.82092679700002, -73.94332421899996), (40.885701406000074, -73.86164032499995), (40.74228115600005, -73.81982408)
|| PATROL_BORO || character || PATROL BORO MAN NORTH, PATROL BORO BRONX, PATROL BORO QUEENS NORTH
|| STATION_NAME || character || , ,
|| VIC_AGE_GROUP || character || UNKNOWN, UNKNOWN, 25-44
|| VIC_RACE || character || UNKNOWN, UNKNOWN, WHITE HISPANIC
|| VIC_SEX || character || E, E, F

We processed the Arrest Data, Complaints Data and other data sets separately.

3.3 Pre-processing for Arrest Data

3.3.1 Arrest Data: Time Analysis Pre-processing

To formulate proper data structure and columns, we developed a function allowing we derive year and/or month information from the original time stamp column of Arrest Data. Further more, this function also allow us to decide whether an grouping by year and/or month is needed. One can find further details in our repository.

Table 3.1: Head 6 Rows in Arrest Dataset for Temporal Analysis
Year LAW_CAT_CD count prop
2006 F 103308 0.2786039
2006 I 2040 0.0055015
2006 M 245508 0.6620928
2006 V 19950 0.0538017
2007 F 108617 0.2700982
2007 I 4234 0.0105287

We selected information from year, LAW_CAT_CD (crime category), counting of each category, and probability represented by frequency out via processing on ‘LAW_CAT_CD’, ‘ARREST_DATE’, and ‘Year’. This allowed us to perform a analysis on crime category based on their counting and probability in a time series manner.

Column1_ Variable Column2_ Classe Column3_ First_value
|| Year || character || 2006, 2006, 2006
|| LAW_CAT_CD || integer || F, I, M
|| count || integer || 103308, 2040, 245508
|| prop || double || 0.278603906085662, 0.00550152910147085, 0.662092846394071

3.3.2 Arrest Data: Geometry Anaylsis Pre-processing

We selected out ‘LAW_CAT_CD’, ‘ARREST_DATE’, ‘Year’, ‘ARREST_BORO’, ‘Latitude’, and ‘Longitude’ in order to get a better understanding on the geometry distribution of the crimes.

Table 3.2: Head 6 Rows in Arrest Dataset for Spatial Analysis
LAW_CAT_CD ARREST_DATE Year ARREST_BORO Latitude Longitude
F 01/26/2019 2019 M 40.80069 -73.94111
F 02/06/2019 2019 M 40.75784 -73.99121
F 01/06/2016 2016 K 40.64865 -73.95034
F 11/15/2018 2018 K 40.67458 -73.93022
M 09/13/2006 2006 K 40.67125 -73.92671
F 10/24/2018 2018 M 40.71620 -73.99749

Here is a brief structure of the cleaned data set.

Column1_ Variable Column2_ Classe Column3_ First_value
|| LAW_CAT_CD || integer || F, F, F
|| ARREST_DATE || integer || 01/26/2019, 02/06/2019, 01/06/2016
|| Year || character || 2019, 2019, 2016
|| ARREST_BORO || integer || M, M, K
|| Latitude || double || 40.800694331, 40.7578390030001, 40.648650085
|| Longitude || double || -73.941109286, -73.991212111, -73.950335563

3.4 Pre-processing for Complaint Data

Our modification on Complaint Data set are similar to what we processed on Arrest Data set.

Column1_ Variable Column2_ Classe Column3_ First_value
|| Year || character || 2006, 2006, 2006
|| LAW_CAT_CD || character || FELONY, MISDEMEANOR, VIOLATION
|| count || integer || 171638, 288539, 69417
|| prop || double || 0.324093550908809, 0.544830568322149, 0.131075880769042

3.5 Pre-processing for other Data Sets

No pre-processing is needed for other data sets.

3.6 Next Stage

We have successfully constructed cleaned data sets now. Next, we can start analyzing.