Overview
The analysis code is subdivided into four steps:
- Data Processing: a pipeline that prepares the relevant data indicators for input into the modeling regressions.
- Forecast Preparation: a pipeline that prepares for the forecasting step by estimating the best features to use for each tax.
- Exploratory Modeling: an exploratory step that runs a grid search to find the best features for each tax.
- Forecasting: a pipeline that runs the regressions for each tax and produces the final forecasts.
You can think of a pipeline as a series of functions where the inputs to one function depend on the outputs from a previous function.
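For example, here is a schematic (plain Python, not code from this project) of a two-step pipeline where the second function consumes the first function's output:

```python
# Schematic only: each step's output becomes the next step's input.
def clean(raw_values):
    """Step 1: drop missing values."""
    return [v for v in raw_values if v is not None]

def average(cleaned_values):
    """Step 2: summarize the cleaned values."""
    return sum(cleaned_values) / len(cleaned_values)

result = average(clean([1.0, None, 3.0]))  # 2.0
```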
The third step above is an interactive step performed using the Jupyter notebooks in the `notebooks/` folder. There is a notebook file for each tax. These notebooks are used to identify the best-fitting parameters for each tax, e.g., which endogenous and exogenous variables should be used in the vector auto-regressions. Once these best-fit parameters are found, they can be fed into the modeling pipeline.
To manage the pipelines and the data inputs/outputs, the project uses the kedro package. From the kedro documentation:
> Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code. It borrows concepts from software engineering best-practice and applies them to machine-learning code; applied concepts include modularity, separation of concerns and versioning.
Kedro is useful for our purposes because it enables reproducible revenue projections, manages the data inputs and outputs, and tracks any changes in the data and results over time.
There are a few key concepts from kedro that are necessary to understand how this project works. This section provides a brief introduction to these concepts. To fully understand kedro, it is worth going through the spaceflights tutorial in the kedro documentation. The full documentation is available here.
The Data Catalog
This section introduces `catalog.yml`. The file is located in `conf/base` and is a registry of all data sources available for use by the project. It manages loading and saving of data.
The Data Catalog provides instructions for how to load and save the various data inputs and outputs used by the analysis pipelines. The Data Catalog for this project is available here.
The Data Catalog is composed of a series of named entries. Giving official "names" to the data frames the analysis uses is helpful because then we can refer to those data frames in our pipeline code. Because the Data Catalog provides the saving/loading instructions, functions in our pipeline code will automatically "know" about the data and how to load it. For example:
```yaml
economic_indicators:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/economic_indicators_all.csv
  save_args:
    index: True
  load_args:
    index_col: 0
    parse_dates: True
```
We've created a named dataset called `economic_indicators` and specified that it should be saved as a CSV file to the location `data/02_intermediate/economic_indicators_all.csv` (more info on the `data/` folder here). The other arguments are passed to the `read_csv()` and `DataFrame.to_csv()` functions from `pandas`.
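Because the catalog entry carries these load/save instructions, the dataset can be read or written by name. As an illustration (a sketch, assuming an interactive kedro session where the `catalog` object is pre-loaded):

```python
# Inside a kedro Jupyter/IPython session, `catalog` is provided for you.
df = catalog.load("economic_indicators")   # reads the CSV using load_args
# ... transform df as needed ...
catalog.save("economic_indicators", df)    # writes the CSV using save_args
```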
Note
For more information, see the Kedro documentation on the Data Catalog.
Configuration
The analysis depends on a set of input parameters that we define in configuration files, located in the `conf/base/` folder of the repository. There are four relevant parameters files:
- `conf/base/parameters.yml`: This holds general parameters about the analysis, such as the start year for the plan being analyzed.
- `conf/base/parameters/data_processing.yml`: This holds parameters specific to the data processing pipeline.
- `conf/base/parameters/forecast_prep.yml`: This holds parameters specific to the forecast prep pipeline.
- `conf/base/parameters/forecast.yml`: This holds parameters specific to the forecasting pipeline.
When running a pipeline or working in one of the Jupyter notebooks, the parameters will automatically be loaded by `kedro` and available as variables. Magic!
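For example, in a kedro Jupyter session the project context exposes the merged parameters as a plain dictionary (a sketch; `fresh_indicators` is a parameter used by the data processing pipeline, as shown in the Nodes section below):

```python
# In a kedro Jupyter/IPython session, `context` and `catalog` are provided.
params = context.params                          # all parameters as a dict
fresh = params["fresh_indicators"]               # one specific parameter
# Parameters can also be loaded through the Data Catalog:
fresh = catalog.load("params:fresh_indicators")
```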
Note
For more information, see the Kedro documentation on configuration parameters.
Nodes
From the kedro documentation:
> Nodes are the building blocks of pipelines and represent tasks. Pipelines are used to combine nodes to build workflows, which range from simple machine learning workflows to end-to-end production workflows.
Nodes are just Python functions that can be put together in sequential order to form a pipeline. Nodes are useful because we can specify any named dataset from the Data Catalog or configuration parameter as either the input or output of the function.
For example, the first step of the data processing pipeline uses the following node:
```python
node(
    func=get_economic_indicators,
    inputs="params:fresh_indicators",
    outputs="economic_indicators",
    name="economic_indicators_node",
)
```
This node calls the function `get_economic_indicators()` and outputs the `economic_indicators` data frame that we defined earlier in the Data Catalog. When running the pipeline, `kedro` will automatically save the data frame as a CSV to the file location we specified in the Data Catalog.

Note the syntax `params:fresh_indicators`: configuration parameters are referenced by prefixing the name of the parameter with the `params:` tag. In this case, the function takes an input argument that determines whether it should download a fresh copy of the indicators or not.
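For illustration, the function wrapped by this node might look something like the following. This is a minimal sketch, not the project's actual implementation; only the signature (a boolean parameter in, a data frame out) is implied by the node definition above:

```python
import pandas as pd

def get_economic_indicators(fresh_indicators: bool) -> pd.DataFrame:
    """Return the raw economic indicators as a datetime-indexed frame."""
    if fresh_indicators:
        # The real function would download a fresh copy of the
        # indicators here instead of reusing cached data.
        ...
    # Placeholder frame standing in for the real indicator data.
    dates = pd.date_range("2020-01-01", periods=3, freq="MS")
    return pd.DataFrame({"some_indicator": [1.0, 2.0, 3.0]}, index=dates)
```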
This is the second node in the data processing pipeline:
```python
node(
    func=get_quarterly_averages,
    inputs="economic_indicators",
    outputs="quarterly_features_raw",
    name="quarterly_features_raw_node",
)
```
This node calls the function `get_quarterly_averages()`, which takes the quarterly average of the economic indicators. It takes the raw `economic_indicators` data frame as input and outputs a `quarterly_features_raw` dataset (also defined in the Data Catalog).
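Because the indicators are stored with a parsed datetime index (see the `load_args` in the catalog entry above), the quarterly averaging could plausibly be a simple pandas resample. A sketch of the idea, not the project's code:

```python
import pandas as pd

def get_quarterly_averages(economic_indicators: pd.DataFrame) -> pd.DataFrame:
    """Average each economic indicator within each calendar quarter."""
    # "QS" buckets rows by quarter (labeled at the quarter start);
    # mean() averages each column within the bucket.
    return economic_indicators.resample("QS").mean()
```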
Note
For more information, see the Kedro documentation on Nodes.
Pipelines
From the kedro documentation:
> A pipeline organises the dependencies and execution order of your collection of nodes, and connects inputs and outputs while keeping your code modular. The pipeline determines the node execution order by resolving dependencies and does not necessarily run the nodes in the order in which they are passed in.
There are three pipelines in this project: data processing, forecast prep, and forecasting. These are modular and defined separately from each other; the outputs of the data processing pipeline are used as inputs to the forecast prep pipeline, whose outputs in turn feed the forecasting pipeline.
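To make the node-to-pipeline connection concrete, here is how the two nodes from the previous section could be assembled with kedro's `pipeline()` helper (a sketch reusing the functions shown earlier; the actual pipeline definitions live in the repository):

```python
from kedro.pipeline import node, pipeline

# Kedro infers execution order from dataset names: the second node
# consumes "economic_indicators", which the first node produces.
data_processing = pipeline(
    [
        node(
            func=get_economic_indicators,
            inputs="params:fresh_indicators",
            outputs="economic_indicators",
            name="economic_indicators_node",
        ),
        node(
            func=get_quarterly_averages,
            inputs="economic_indicators",
            outputs="quarterly_features_raw",
            name="quarterly_features_raw_node",
        ),
    ]
)
```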
In the repository, the source code for these pipelines is broken out into separate folders (see here). More information is provided for each of these pipelines in the data processing, forecast prep, and forecasting sections of the documentation.
Note
For more information, see the Kedro documentation on Pipelines.
Next Steps
The following sections of the documentation provide more detail on the analysis:
- The `data/` folder: Everything you need to know about the data inputs and outputs in the analysis
- Steps:
- 1. Data Processing: The data processing pipeline
- 2. Forecast Prep: The forecast prep pipeline
- 3. Exploratory Modeling: The exploratory modeling step
- 4. Forecasting: The forecasting pipeline