Chapter 3 Data transformation
The NYPD Open Data Sets are generally well constructed and with consistent data structure and type, and the sets usually do not have serious missing value issues. Therefore, our focus here would mainly be insights formulation, where we shall decide the specific metric for further exploration.
Specifically, we include the basic data transformation as the pre-processing module to clean the data and obtain some fundamental insights.
Notice that, since all our data listed here have millions of rows, so we turn to the ff
library to enable fast loading the data set feature. Besides, we also developed several reusable modules for cleaner and more efficient processing. For more information on ‘ff’ package, please check here.
Our interest lies on the geometry distribution and the time series analysis of our Arrest Data and Complaint Data.
Finally, after successfully cleaned/pre-processed the Arrest Data and the Complaint Data, we introduced a map API as well to include the geometry information for our further visualized and interactive analysis in later chapters.
3.1 Construction of Processing Modules
We developed several reusable modules that we would repetitively use in our analysis below. Specifically, we set our environment as well as a time stamp formatting function to unify our criteria for date data.
One can implement these modules via the syntax of time-and-date manipulation in R and “knitr” as well as “magrittr” packages.
We also developed a mimic function for str() to provide insights of the data sets in a table way.
To formulate proper data structure and columns, we developed a function allowing we derive year and/or month information from the original time stamp column of Arrest Data. Further more, this function also allow us to decide whether an grouping by year and/or month is needed. One can find further details in our repository.
For a detailed implementation, please refer to our repository.
3.2 Preprocessing: Loading and Descriptive Analysis
We created tables for the data set structures to get a first impression on how exactly the data is showed.
Column1_ | Variable | Column2_ | Classe | Column3_ | First_value |
---|---|---|---|---|---|
|| | ARREST_KEY | || | integer | || | 32311380, 192799737, 193260691 |
|| | ARREST_DATE | || | integer | || | 06/18/2007, 01/26/2019, 02/06/2019 |
|| | PD_CD | || | integer | || | 511, 177, NA |
|| | PD_DESC | || | integer | || | CONTROLLED SUBSTANCE, POSSESSION 7, SEXUAL ABUSE, |
|| | KY_CD | || | integer | || | 235, 116, NA |
|| | OFNS_DESC | || | integer | || | DANGEROUS DRUGS, SEX CRIMES, |
|| | LAW_CODE | || | integer | || | PL 2200300, PL 1306503, PL 2203400 |
|| | LAW_CAT_CD | || | integer | || | M, F, F |
|| | ARREST_BORO | || | integer | || | Q, M, M |
|| | ARREST_PRECINCT | || | integer | || | 27, 25, 14 |
|| | JURISDICTION_CODE | || | integer | || | 1, 0, 0 |
|| | AGE_GROUP | || | integer | || | 18-24, 45-64, 25-44 |
|| | PERP_SEX | || | integer | || | M, M, M |
|| | PERP_RACE | || | integer | || | BLACK, BLACK, UNKNOWN |
|| | X_COORD_CD | || | double | || | NA, 1000555, 986685 |
|| | Y_COORD_CD | || | double | || | NA, 230994, 215375 |
|| | Latitude | || | double | || | NA, 40.800694331, 40.7578390030001 |
|| | Longitude | || | double | || | NA, -73.941109286, -73.991212111 |
|| | Lon_Lat | || | integer | || | , POINT (-73.94110928599997 40.800694331000045), POINT (-73.99121211099998 40.75783900300007) |
ARREST_KEY | ARREST_DATE | PD_CD | PD_DESC | KY_CD | OFNS_DESC | LAW_CODE | LAW_CAT_CD | ARREST_BORO | ARREST_PRECINCT | JURISDICTION_CODE | AGE_GROUP | PERP_SEX | PERP_RACE | X_COORD_CD | Y_COORD_CD | Latitude | Longitude | Lon_Lat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32311380 | 06/18/2007 | 511 | CONTROLLED SUBSTANCE, POSSESSION 7 | 235 | DANGEROUS DRUGS | PL 2200300 | M | Q | 27 | 1 | 18-24 | M | BLACK | NA | NA | NA | NA | |
192799737 | 01/26/2019 | 177 | SEXUAL ABUSE | 116 | SEX CRIMES | PL 1306503 | F | M | 25 | 0 | 45-64 | M | BLACK | 1000555 | 230994 | 40.80069 | -73.94111 | POINT (-73.94110928599997 40.800694331000045) |
193260691 | 02/06/2019 | NA | NA | PL 2203400 | F | M | 14 | 0 | 25-44 | M | UNKNOWN | 986685 | 215375 | 40.75784 | -73.99121 | POINT (-73.99121211099998 40.75783900300007) | ||
149117452 | 01/06/2016 | 153 | RAPE 3 | 104 | RAPE | PL 1302503 | F | K | 67 | 0 | 25-44 | M | BLACK | 998032 | 175598 | 40.64865 | -73.95034 | POINT (-73.95033556299995 40.648650085000035) |
190049060 | 11/15/2018 | 157 | RAPE 1 | 104 | RAPE | PL 1303501 | F | K | 77 | 0 | 25-44 | M | BLACK | 1003606 | 185050 | 40.67458 | -73.93022 | POINT (-73.93022154099998 40.67458330800008) |
24288194 | 09/13/2006 | 203 | TRESPASS 3, CRIMINAL | 352 | CRIMINAL TRESPASS | PL 140100E | M | K | 77 | 2 | 45-64 | M | BLACK | 1004580 | 183838 | 40.67125 | -73.92671 | POINT (-73.926713851 40.67125445700003) |
Column1_ | Variable | Column2_ | Classe | Column3_ | First_value |
---|---|---|---|---|---|
|| | CMPLNT_NUM | || | integer | || | 394506329, 968873685, 509837549 |
|| | CMPLNT_FR_DT | || | character | || | 12/31/2019, 12/29/2019, 12/15/2019 |
|| | CMPLNT_FR_TM | || | character | || | 17:30:00, 16:31:00, 18:45:00 |
|| | CMPLNT_TO_DT | || | character | || | , 12/29/2019, |
|| | CMPLNT_TO_TM | || | character | || | , 16:54:00, |
|| | ADDR_PCT_CD | || | integer | || | 32, 47, 109 |
|| | RPT_DT | || | character | || | 12/31/2019, 12/29/2019, 12/29/2019 |
|| | KY_CD | || | integer | || | 118, 113, 578 |
|| | OFNS_DESC | || | character | || | DANGEROUS WEAPONS, FORGERY, HARRASSMENT 2 |
|| | PD_CD | || | integer | || | 793, 729, 638 |
|| | PD_DESC | || | character | || | WEAPONS POSSESSION 3, FORGERY,ETC.,UNCLASSIFIED-FELO, HARASSMENT,SUBD 3,4,5 |
|| | CRM_ATPT_CPTD_CD | || | character | || | COMPLETED, COMPLETED, COMPLETED |
|| | LAW_CAT_CD | || | character | || | FELONY, FELONY, VIOLATION |
|| | BORO_NM | || | character | || | MANHATTAN, BRONX, QUEENS |
|| | LOC_OF_OCCUR_DESC | || | character | || | , , FRONT OF |
|| | PREM_TYP_DESC | || | character | || | STREET, STREET, STREET |
|| | JURIS_DESC | || | character | || | N.Y. POLICE DEPT, N.Y. POLICE DEPT, N.Y. POLICE DEPT |
|| | JURISDICTION_CODE | || | integer | || | 0, 0, 0 |
|| | PARKS_NM | || | character | || | , , |
|| | HADEVELOPT | || | character | || | , , |
|| | HOUSING_PSA | || | character | || | , , |
|| | X_COORD_CD | || | integer | || | 999937, 1022508, 1034178 |
|| | Y_COORD_CD | || | integer | || | 238365, 261990, 209758 |
|| | SUSP_AGE_GROUP | || | character | || | , , 25-44 |
|| | SUSP_RACE | || | character | || | , , UNKNOWN |
|| | SUSP_SEX | || | character | || | , , M |
|| | TRANSIT_DISTRICT | || | integer | || | NA, NA, NA |
|| | Latitude | || | double | || | 40.820926797, 40.8857014060001, 40.742281156 |
|| | Longitude | || | double | || | -73.943324219, -73.861640325, -73.81982408 |
|| | Lat_Lon | || | character | || | (40.82092679700002, -73.94332421899996), (40.885701406000074, -73.86164032499995), (40.74228115600005, -73.81982408) |
|| | PATROL_BORO | || | character | || | PATROL BORO MAN NORTH, PATROL BORO BRONX, PATROL BORO QUEENS NORTH |
|| | STATION_NAME | || | character | || | , , |
|| | VIC_AGE_GROUP | || | character | || | UNKNOWN, UNKNOWN, 25-44 |
|| | VIC_RACE | || | character | || | UNKNOWN, UNKNOWN, WHITE HISPANIC |
|| | VIC_SEX | || | character | || | E, E, F |
The data is relatively complex.
First, for our Arrest Data, we acquired over 5.1 million rows and 19 variables.
Column1_ | Variable | Column2_ | Classe | Column3_ | First_value |
---|---|---|---|---|---|
|| | ARREST_KEY | || | integer | || | 32311380, 192799737, 193260691 |
|| | ARREST_DATE | || | integer | || | 06/18/2007, 01/26/2019, 02/06/2019 |
|| | PD_CD | || | integer | || | 511, 177, NA |
|| | PD_DESC | || | integer | || | CONTROLLED SUBSTANCE, POSSESSION 7, SEXUAL ABUSE, |
|| | KY_CD | || | integer | || | 235, 116, NA |
|| | OFNS_DESC | || | integer | || | DANGEROUS DRUGS, SEX CRIMES, |
|| | LAW_CODE | || | integer | || | PL 2200300, PL 1306503, PL 2203400 |
|| | LAW_CAT_CD | || | integer | || | M, F, F |
|| | ARREST_BORO | || | integer | || | Q, M, M |
|| | ARREST_PRECINCT | || | integer | || | 27, 25, 14 |
|| | JURISDICTION_CODE | || | integer | || | 1, 0, 0 |
|| | AGE_GROUP | || | integer | || | 18-24, 45-64, 25-44 |
|| | PERP_SEX | || | integer | || | M, M, M |
|| | PERP_RACE | || | integer | || | BLACK, BLACK, UNKNOWN |
|| | X_COORD_CD | || | double | || | NA, 1000555, 986685 |
|| | Y_COORD_CD | || | double | || | NA, 230994, 215375 |
|| | Latitude | || | double | || | NA, 40.800694331, 40.7578390030001 |
|| | Longitude | || | double | || | NA, -73.941109286, -73.991212111 |
|| | Lon_Lat | || | integer | || | , POINT (-73.94110928599997 40.800694331000045), POINT (-73.99121211099998 40.75783900300007) |
Then, for our Complaints Data, we acquired over 7.3 million rows and 35 variables.
Column1_ | Variable | Column2_ | Classe | Column3_ | First_value |
---|---|---|---|---|---|
|| | CMPLNT_NUM | || | integer | || | 394506329, 968873685, 509837549 |
|| | CMPLNT_FR_DT | || | character | || | 12/31/2019, 12/29/2019, 12/15/2019 |
|| | CMPLNT_FR_TM | || | character | || | 17:30:00, 16:31:00, 18:45:00 |
|| | CMPLNT_TO_DT | || | character | || | , 12/29/2019, |
|| | CMPLNT_TO_TM | || | character | || | , 16:54:00, |
|| | ADDR_PCT_CD | || | integer | || | 32, 47, 109 |
|| | RPT_DT | || | character | || | 12/31/2019, 12/29/2019, 12/29/2019 |
|| | KY_CD | || | integer | || | 118, 113, 578 |
|| | OFNS_DESC | || | character | || | DANGEROUS WEAPONS, FORGERY, HARRASSMENT 2 |
|| | PD_CD | || | integer | || | 793, 729, 638 |
|| | PD_DESC | || | character | || | WEAPONS POSSESSION 3, FORGERY,ETC.,UNCLASSIFIED-FELO, HARASSMENT,SUBD 3,4,5 |
|| | CRM_ATPT_CPTD_CD | || | character | || | COMPLETED, COMPLETED, COMPLETED |
|| | LAW_CAT_CD | || | character | || | FELONY, FELONY, VIOLATION |
|| | BORO_NM | || | character | || | MANHATTAN, BRONX, QUEENS |
|| | LOC_OF_OCCUR_DESC | || | character | || | , , FRONT OF |
|| | PREM_TYP_DESC | || | character | || | STREET, STREET, STREET |
|| | JURIS_DESC | || | character | || | N.Y. POLICE DEPT, N.Y. POLICE DEPT, N.Y. POLICE DEPT |
|| | JURISDICTION_CODE | || | integer | || | 0, 0, 0 |
|| | PARKS_NM | || | character | || | , , |
|| | HADEVELOPT | || | character | || | , , |
|| | HOUSING_PSA | || | character | || | , , |
|| | X_COORD_CD | || | integer | || | 999937, 1022508, 1034178 |
|| | Y_COORD_CD | || | integer | || | 238365, 261990, 209758 |
|| | SUSP_AGE_GROUP | || | character | || | , , 25-44 |
|| | SUSP_RACE | || | character | || | , , UNKNOWN |
|| | SUSP_SEX | || | character | || | , , M |
|| | TRANSIT_DISTRICT | || | integer | || | NA, NA, NA |
|| | Latitude | || | double | || | 40.820926797, 40.8857014060001, 40.742281156 |
|| | Longitude | || | double | || | -73.943324219, -73.861640325, -73.81982408 |
|| | Lat_Lon | || | character | || | (40.82092679700002, -73.94332421899996), (40.885701406000074, -73.86164032499995), (40.74228115600005, -73.81982408) |
|| | PATROL_BORO | || | character | || | PATROL BORO MAN NORTH, PATROL BORO BRONX, PATROL BORO QUEENS NORTH |
|| | STATION_NAME | || | character | || | , , |
|| | VIC_AGE_GROUP | || | character | || | UNKNOWN, UNKNOWN, 25-44 |
|| | VIC_RACE | || | character | || | UNKNOWN, UNKNOWN, WHITE HISPANIC |
|| | VIC_SEX | || | character | || | E, E, F |
We processed the Arrest Data, Complaints Data and other data sets separately.
3.3 Pre-processing for Arrest Data
3.3.1 Arrest Data: Time Analysis Pre-processing
To formulate proper data structure and columns, we developed a function allowing we derive year and/or month information from the original time stamp column of Arrest Data. Further more, this function also allow us to decide whether an grouping by year and/or month is needed. One can find further details in our repository.
Year | LAW_CAT_CD | count | prop |
---|---|---|---|
2006 | F | 103308 | 0.2786039 |
2006 | I | 2040 | 0.0055015 |
2006 | M | 245508 | 0.6620928 |
2006 | V | 19950 | 0.0538017 |
2007 | F | 108617 | 0.2700982 |
2007 | I | 4234 | 0.0105287 |
We selected information from year, LAW_CAT_CD (crime category), counting of each category, and probability represented by frequency out via processing on ‘LAW_CAT_CD’, ‘ARREST_DATE’, and ‘Year’. This allowed us to perform a analysis on crime category based on their counting and probability in a time series manner.
Column1_ | Variable | Column2_ | Classe | Column3_ | First_value |
---|---|---|---|---|---|
|| | Year | || | character | || | 2006, 2006, 2006 |
|| | LAW_CAT_CD | || | integer | || | F, I, M |
|| | count | || | integer | || | 103308, 2040, 245508 |
|| | prop | || | double | || | 0.278603906085662, 0.00550152910147085, 0.662092846394071 |
3.3.2 Arrest Data: Geometry Anaylsis Pre-processing
We selected out ‘LAW_CAT_CD’, ‘ARREST_DATE’, ‘Year’, ‘ARREST_BORO’, ‘Latitude’, and ‘Longitude’ in order to get a better understanding on the geometry distribution of the crimes.
LAW_CAT_CD | ARREST_DATE | Year | ARREST_BORO | Latitude | Longitude |
---|---|---|---|---|---|
F | 01/26/2019 | 2019 | M | 40.80069 | -73.94111 |
F | 02/06/2019 | 2019 | M | 40.75784 | -73.99121 |
F | 01/06/2016 | 2016 | K | 40.64865 | -73.95034 |
F | 11/15/2018 | 2018 | K | 40.67458 | -73.93022 |
M | 09/13/2006 | 2006 | K | 40.67125 | -73.92671 |
F | 10/24/2018 | 2018 | M | 40.71620 | -73.99749 |
Here is a brief structure of the cleaned data set.
Column1_ | Variable | Column2_ | Classe | Column3_ | First_value |
---|---|---|---|---|---|
|| | LAW_CAT_CD | || | integer | || | F, F, F |
|| | ARREST_DATE | || | integer | || | 01/26/2019, 02/06/2019, 01/06/2016 |
|| | Year | || | character | || | 2019, 2019, 2016 |
|| | ARREST_BORO | || | integer | || | M, M, K |
|| | Latitude | || | double | || | 40.800694331, 40.7578390030001, 40.648650085 |
|| | Longitude | || | double | || | -73.941109286, -73.991212111, -73.950335563 |
3.4 Pre-processing for Complaint Data
Our modification on Complaint Data set are similar to what we processed on Arrest Data set.
Column1_ | Variable | Column2_ | Classe | Column3_ | First_value |
---|---|---|---|---|---|
|| | Year | || | character | || | 2006, 2006, 2006 |
|| | LAW_CAT_CD | || | character | || | FELONY, MISDEMEANOR, VIOLATION |
|| | count | || | integer | || | 171638, 288539, 69417 |
|| | prop | || | double | || | 0.324093550908809, 0.544830568322149, 0.131075880769042 |
3.5 Pre-processing for other Data Sets
No pre-processing is needed for other data sets.
3.6 Next Stage
We have successfully constructed cleaned data sets now. Next, we can start analyzing.