1. The Data Processing Pipeline

The analysis code begins with the data processing pipeline. This pipeline starts by downloading the latest economic indicators and ends by outputting a set of features that can be input into the VAR modeling pipeline.

Its main purpose is to identify the series of transformations that will make each time series indicator stationary so that the indicators are suitable for use in a vector autoregression.

The code for the pipeline is available at:

src/fyp_analysis/pipelines/data_processing/ (link)

Running the Pipeline

To run the pipeline, execute:

poetry run fyp-analysis-run --pipeline dp

where dp is short for "data processing".

Parameters

The parameters for the data processing pipeline can be set in the file: conf/base/parameters/data_processing.yml (link). The parameters are:

  • fresh_indicators: whether to download fresh economic indicators
  • seasonal_adjustments: the names of the columns to apply seasonal adjustments to
  • min_feature_year: the minimum year to trim the indicators to

Steps

This section outlines the steps (also called nodes) in the data processing pipeline. The steps are defined in the src/fyp_analysis/pipelines/data_processing/pipeline.py file. In this file, we define the function to run for each step, as well as the inputs and outputs of each function.

This pipeline will download the latest version of a set of economic indicators, perform various transformations, and output a set of features suitable to be used as input to the modeling pipeline.

Warning

Make sure you have properly set up your local API credentials before running this pipeline. Otherwise, you won't be able to download all of the necessary indicators. See the setup instructions for more information.

In Python, the pipeline is defined as follows:

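# Pipeline and node come from Kedro; the node functions used below are
# imported from the pipeline's own modules (imports not shown here).
from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs):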
    return Pipeline(
        [
            node(
                func=get_economic_indicators,
                inputs="params:fresh_indicators",
                outputs="economic_indicators",
                name="economic_indicators_node",
            ),
            node(
                func=get_quarterly_averages,
                inputs="economic_indicators",
                outputs="quarterly_features_raw",
                name="quarterly_features_raw_node",
            ),
            node(
                func=impute_cbo_values,
                inputs=[
                    "quarterly_features_raw",
                    "params:plan_start_year",
                    "params:cbo_forecast_date",
                ],
                outputs="quarterly_features_cbo_imputed",
                name="impute_cbo_node",
            ),
            node(
                func=combine_features_and_bases,
                inputs=["quarterly_features_cbo_imputed", "plan_details"],
                outputs="features_and_bases",
                name="combine_features_bases_node",
            ),
            node(
                func=seasonally_adjust_features,
                inputs=["features_and_bases", "params:seasonal_adjustments"],
                outputs="features_and_bases_sa",
                name="seasonal_adjustment_node",
            ),
            node(
                func=get_stationary_guide,
                inputs="features_and_bases_sa",
                outputs="stationary_guide",
                name="stationary_guide_node",
            ),
            node(
                func=get_final_unscaled_features,
                inputs=["features_and_bases_sa", "params:min_feature_year"],
                outputs="final_unscaled_features",
                name="final_unscaled_features_node",
            ),
            node(
                func=get_final_scaled_features,
                inputs=[
                    "final_unscaled_features",
                    "stationary_guide",
                ],
                outputs="final_scaled_features",
                name="final_scaled_features_node",
            ),
        ]
    )

Reminder

As described here, if you are working with IPython or in a Jupyter notebook, you can load any named dataset (the inputs/outputs above) using the catalog.load() function. For example, to load the "economic_indicators" dataset (the output from step 1), use:

indicators = catalog.load("economic_indicators")

Step 1: Download indicators

  • Function: get_economic_indicators()
  • Purpose: Download the latest set of economic indicators and save them locally
  • Inputs:
    • Parameter: fresh_indicators
  • Outputs:
    • Dataset: economic_indicators in the data/02_intermediate/ folder

Economic indicators are defined in the src/fyp_analysis/pipelines/data_processing/indicators/sources folder. Right now, there are various sources, including FRED, Quandl, CARTO (Philadelphia open data), and Zillow, with a JSON file for each source that lists the information necessary for download. New indicators can be added by adding a new entry to the appropriate JSON file.
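As an illustration of adding a new indicator, the sketch below appends an entry to a source file. The actual schema of the JSON files is not shown here, so everything in it is an assumption for illustration only: the field names (borrowed from the columns of the indicator listing), the `series_id` key, the `fred.json` file name, and the `add_indicator` helper.

```python
import json
import tempfile
from pathlib import Path


def add_indicator(source_file: Path, entry: dict) -> list:
    """Append an indicator entry to a JSON source file (hypothetical schema)."""
    entries = json.loads(source_file.read_text()) if source_file.exists() else []
    # Refuse duplicates so each indicator name stays unique within a source.
    if any(e["name"] == entry["name"] for e in entries):
        raise ValueError(f"Indicator {entry['name']!r} already exists")
    entries.append(entry)
    source_file.write_text(json.dumps(entries, indent=2))
    return entries


# Hypothetical entry; "series_id" is an assumed key holding the FRED series code.
new_entry = {
    "name": "UnemploymentRate",
    "description": "Unemployment Rate",
    "source": "fred",
    "frequency": "monthly",
    "geography": "national",
    "series_id": "UNRATE",
}

# Write to a temporary directory so the sketch does not touch the real sources.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "fred.json"  # assumed file name
    entries = add_indicator(path, new_entry)
    print(len(entries))
```

Check the existing entries in the relevant source file for the real field names before adding one.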

The current set of indicators includes the following:

name | description | source | frequency | geography
ActivityLicensesPhilly | Commercial activity licenses for the City of Philadelphia | carto | monthly | Philadelphia
BizLicensesPhilly | Business licenses for the City of Philadelphia | carto | monthly | Philadelphia
BuildingPermitsPhilly | New construction permits for the City of Philadelphia | carto | monthly | Philadelphia
DeedTransfersPhilly | Deed real estate transfers for the City of Philadelphia | carto | monthly | Philadelphia
10YearTreasury | 10-Year Treasury Constant Maturity Rate | fred | monthly | national
3MonthTreasury | 3-Month Treasury Bill: Secondary Market Rate | fred | monthly | national
AlcoholSales | Retail Sales: Beer, Wine, and Liquor Stores | fred | quarterly | national
BuildingPermitsPhillyMSA | New Private Housing Units Authorized by Building Permits for Philadelphia-Camden-Wilmington, PA-NJ-DE-MD (MSA) | fred | quarterly | Philadelphia MSA
CPIPhillyMSA | Consumer Price Index for All Urban Consumers: All Items in Philadelphia-Camden-Wilmington, PA-NJ-DE-MD | fred | monthly | Philadelphia MSA
CPIU | Consumer Price Index for All Urban Consumers: All Items in U.S. City Average | fred | monthly | national
CarSales | Total Vehicle Sales | fred | daily | national
ConsumerConfidence | Consumer Opinion Surveys: Confidence Indicators: Composite Indicators: OECD Indicator for the United States | fred | monthly | national
ContinuedClaimsPA | Continued Claims (Insured Unemployment) in Pennsylvania | fred | weekly | state
CorporateProfits | Corporate Profits with Inventory Valuation Adjustment (IVA) and Capital Consumption Adjustment (CCAdj) | fred | quarterly | national
EconomicConditionsPhillyMSA | Economic Conditions Index for Philadelphia-Camden-Wilmington, PA-NJ-DE-MD (MSA) | fred | monthly | Philadelphia MSA
EmploymentCostIndex | Employment Cost Index: Wages and Salaries: Private Industry Workers | fred | quarterly | national
FHFAHousePriceIndex | Purchase Only House Price Index for the United States | fred | quarterly | national
FedFundsRate | Effective Federal Funds Rate | fred | monthly | national
GDP | Gross Domestic Product | fred | quarterly | national
GDPPriceIndex | Gross Domestic Product: Chain-type Price Index | fred | quarterly | national
GovtSocialBenefits | Federal government current transfer payments: Government social benefits: to persons | fred | quarterly | national
HousePriceIndexPhillyMSA | All-Transactions House Price Index for Philadelphia, PA (MSAD) | fred | quarterly | Philadelphia MSA
HousingStarts | Housing Starts: Total: New Privately Owned Housing Units Started | fred | monthly | national
HousingSupply | Monthly Supply of Houses in the United States | fred | monthly | national
InitialClaimsPA | Initial Claims in Pennsylvania | fred | weekly | state
JobOpenings | Job Openings: Total Nonfarm | fred | monthly | national
ManufacturingHoursWorked | Weekly Hours Worked: Manufacturing for the United States | fred | quarterly | national
NYCGasPrice | Conventional Gasoline Prices: New York Harbor, Regular | fred | daily | national
NewManufacturingOrders | Manufacturers' New Orders: Nondefense Capital Goods Excluding Aircraft | fred | monthly | national
NonfarmEmployeesPhilly | All Employees: Total Nonfarm in Philadelphia City, PA | fred | monthly | Philadelphia
NonfarmEmployeesPhillyMSA | All Employees: Total Nonfarm in Philadelphia-Camden-Wilmington, PA-NJ-DE-MD (MSA) | fred | monthly | Philadelphia MSA
NonfarmEmployment | All Employees, Total Nonfarm | fred | monthly | national
NonresidentialInvestment | Private Nonresidential Fixed Investment | fred | quarterly | national
OilPriceWTI | Crude Oil Prices: West Texas Intermediate (WTI) - Cushing, Oklahoma | fred | monthly | national
PCE | Personal Consumption Expenditures | fred | monthly | national
PCEPriceIndex | Personal Consumption Expenditures: Chain-type Price Index | fred | monthly | national
PPI | Producer Price Index for All Commodities | fred | monthly | national
PersonalIncome | Personal Income | fred | quarterly | national
PersonalIncomePhillyMSA | Per Capita Personal Income in Philadelphia County/city, PA | fred | annual | Philadelphia MSA
PersonalSavingsRate | Personal Savings Rate | fred | monthly | national
PopulationPhilly | Resident Population in Philadelphia County/city, PA | fred | annual | Philadelphia
PrimeEPOP | Employment-Population Ratio | fred | monthly | national
RealDisposablePersonalIncome | Real Disposable Personal Income | fred | monthly | national
RealGDP | Real Gross Domestic Product | fred | quarterly | national
RealGDPPhillyMSA | Total Real Gross Domestic Product for Philadelphia-Camden-Wilmington, PA-NJ-DE-MD | fred | annual | Philadelphia MSA
RealRetailFoodServiceSales | Advance Real Retail and Food Services Sales | fred | monthly | national
ResidentialInvestment | Private Residential Fixed Investment | fred | quarterly | national
SahmRule | Real-time Sahm Rule Recession Indicator | fred | monthly | national
TotalBusinessSales | Total Business Sales | fred | monthly | national
UncertaintyIndex | Economic Policy Uncertainty Index for United States | fred | monthly | national
UnemploymentPhilly | Unemployment Rate in Philadelphia County/City, PA | fred | monthly | Philadelphia
UnemploymentPhillyMSA | Unemployment Rate in Philadelphia-Camden-Wilmington, PA-NJ-DE-MD | fred | monthly | Philadelphia MSA
UnemploymentRate | Unemployment Rate | fred | monthly | national
Wage&Salaries | Compensation of Employees, Received: Wage and Salary Disbursements | fred | monthly | national
WagesPhillyMSA | Average Weekly Wages for Employees in Total Covered Establishments in Philadelphia-Camden-Wilmington, PA-NJ-DE-MD | fred | quarterly | Philadelphia MSA
WeeklyEconomicIndex | Weekly Economic Index (Lewis-Mertens-Stock) | fred | weekly | national
YieldCurve | 10-Year Treasury Constant Maturity Minus 2-Year Treasury Constant Maturity | fred | daily | national
SP500 | Monthly S&P 500 Price | quandl | monthly | national
HousingInventoryPhillyMSA | For-Sale Inventory (Smooth, All Homes, Monthly) | zillow | monthly | Philadelphia MSA
MeanDaysToSalePhillyMSA | Mean Days to Pending (Smooth, All Homes, Monthly) | zillow | monthly | Philadelphia MSA
MedianHomeValuePhilly | ZHVI All Homes (SFR, Condo/Co-op) Time Series | zillow | monthly | Philadelphia
MedianListPricePhillyMSA | Median List Price (Smooth, All Homes, Monthly) | zillow | monthly | Philadelphia MSA
RentIndexPhillyMSA | ZORI (Smoothed, Seasonally Adjusted) All Homes Plus Multifamily | zillow | monthly | Philadelphia MSA

Step 2: Get quarterly averages

  • Function: get_quarterly_averages()
  • Purpose: Get the quarterly averages of the indicators and remove any indicators with annual frequency.
  • Inputs:
    • Dataset: economic_indicators
  • Outputs:
    • Dataset: quarterly_features_raw in the data/02_intermediate/ folder

Step 3: Impute CBO values

  • Function: impute_cbo_values()
  • Purpose: Impute Congressional Budget Office (CBO) forecast values for Q4 of the current fiscal year.
  • Inputs:
    • Dataset: quarterly_features_raw
    • Parameter: plan_start_year
    • Parameter: cbo_forecast_date
  • Outputs:
    • Dataset: quarterly_features_cbo_imputed in the data/02_intermediate/ folder

For economic indicators that CBO publishes projections for, this step imputes the forecast value for Q4 of the current fiscal year when an actual value is not yet available.
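The quarterly averaging and forecast imputation performed by get_quarterly_averages() and impute_cbo_values() can be sketched with pandas as follows. This is a minimal illustration, not the project's implementation: the data, the forecast value, and the use of calendar quarters (rather than fiscal-year quarters) are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly indicator: 2021 is fully observed, 2022 only through May.
dates = pd.date_range("2021-01-01", "2022-05-01", freq="MS")
monthly = pd.Series(np.arange(len(dates), dtype=float), index=dates, name="Indicator")

# Average to calendar quarters; the period index labels each quarter (e.g. 2021Q1).
# Note that 2022Q2 is averaged over the two months that are available.
quarterly = monthly.groupby(monthly.index.to_period("Q")).mean()

# Hypothetical forecast for a quarter that has no actual value yet.
forecast = pd.Series([20.0], index=pd.PeriodIndex(["2022Q3"], freq="Q"))

# Keep actual values where they exist; use the forecast only to fill gaps.
imputed = quarterly.combine_first(forecast)
print(imputed)
```

Using combine_first() means an actual value always takes precedence over a forecast for the same quarter.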

Step 4: Combine indicators and tax bases

  • Function: combine_features_and_bases()
  • Purpose: Combine the economic indicator features and the tax base data into a single data frame.
  • Inputs:
    • Dataset: quarterly_features_cbo_imputed
    • Dataset: plan_details
  • Outputs:
    • Dataset: features_and_bases in the data/02_intermediate/ folder

Step 5: Seasonally adjust features

  • Function: seasonally_adjust_features()
  • Purpose: Seasonally adjust the specified columns, using the LOESS functionality in statsmodels.
  • Inputs:
    • Dataset: features_and_bases
    • Parameter: seasonal_adjustments
  • Outputs:
    • Dataset: features_and_bases_sa in the data/02_intermediate/ folder

Step 6: Calculate stationary guide

  • Function: get_stationary_guide()
  • Purpose: Make the stationary guide, a spreadsheet which contains the instructions for making each feature stationary.
  • Inputs:
    • Dataset: features_and_bases_sa
  • Outputs:
    • Dataset: stationary_guide in the data/02_intermediate/ folder

For each feature, the stationary guide contains the following information:

  • Can we take the log of the variable (e.g., are all of its values positive)?
  • How many differences are needed to make the variable stationary?
  • Should we normalize the data first?

The spreadsheet is available in the data/02_intermediate/ folder (link).

This step also creates diagnostic stationarity plots for all tax bases and saves them to data/02_intermediate/stationary_figures. These figures show the autocorrelation and partial autocorrelation of each time series. For example, the stationary figure for the Wage Tax is:

[Figure: Wage Tax stationary]

Step 7: Final unscaled features

  • Function: get_final_unscaled_features()
  • Purpose: Get the final unscaled features to input into the modeling pipeline. The only additional preprocessing performed in this step is trimming to the specific minimum year for all features and tax bases.
  • Inputs:
    • Dataset: features_and_bases_sa
    • Parameter: min_feature_year
  • Outputs:
    • Dataset: final_unscaled_features in the data/03_feature/ folder
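The trimming step can be sketched as a simple filter on the index year. The data frame and the value of min_feature_year are hypothetical, and treating the minimum year as inclusive is an assumption.

```python
import pandas as pd

# Hypothetical quarterly features indexed by date.
index = pd.date_range("1995-01-01", periods=12, freq="QS")
features = pd.DataFrame({"GDP": range(12)}, index=index)

min_feature_year = 1996  # stands in for the min_feature_year parameter

# Keep only rows dated in the minimum year or later.
trimmed = features.loc[features.index.year >= min_feature_year]
print(trimmed.index.min())
```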

Step 8: Final scaled features

  • Function: get_final_scaled_features()
  • Purpose: Get the final scaled features to input into the modeling pipeline. This applies the final preprocessor based on the "stationary guide." For each feature, it takes the log of the feature if able (if not, it applies a normalization). Finally, the preprocessor differences the feature until it is stationary.
  • Inputs:
    • Dataset: final_unscaled_features
    • Dataset: stationary_guide
  • Outputs:
    • Dataset: final_scaled_features in the data/03_feature/ folder
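The preprocessing described for this step (log if possible, otherwise normalize, then difference) can be sketched as below. The make_stationary helper and its arguments are hypothetical, and z-score scaling is an assumed form of normalization; the actual preprocessor reads these choices from the stationary guide.

```python
import numpy as np
import pandas as pd


def make_stationary(series: pd.Series, can_log: bool, n_diff: int) -> pd.Series:
    """Apply a log transform (or z-score normalization), then difference n_diff times."""
    if can_log:
        out = np.log(series)
    else:
        out = (series - series.mean()) / series.std()
    for _ in range(n_diff):
        out = out.diff()
    # Each difference leaves one leading missing value; drop them.
    return out.dropna()


# Hypothetical feature growing 2% per quarter.
feature = pd.Series(100.0 * 1.02 ** np.arange(8))
scaled = make_stationary(feature, can_log=True, n_diff=1)
print(scaled)
```

For a series with constant percentage growth, the log difference is constant, which is why the log transform is preferred when it is available.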