Spark Dialect Configuration¶
The Spark Dialect Configuration extends the base converter config and allows you to configure additional aspects of the conversion process. The following sections describe the available configuration classes and their fields.
Spark Converter Config¶
| Field | Description | Default Value | Localizable | Environment Variable |
|---|---|---|---|---|
| default_output_extension | Default output file extension. | Defaults to .py when output_format is module and to .ipynb otherwise. | True | |
| user_template_paths | List of folders with user templates to make available for conversion. | Empty list | False | |
| template_configs | List of template configs. | | False | |
| template_tag_filter | Filter config for including/excluding template tags from matching. | StringFilterConfig | True | |
| node_filter | Filter config for including/excluding nodes from rendering. | NodeFilterConfig | True | |
| use_runtime | If True, the converter will use runtime user-defined functions where appropriate. Be aware that setting this to False may reduce the amount of automatically converted code, since some constructs have no non-runtime static inline equivalent. | True | True | ALC_USE_RUNTIME |
| custom_udf_name_mapping | Custom name mapping for runtime functions. This allows using custom names for runtime functions; for the names of specific functions, consult the target dialect documentation. | Empty dict | True | |
| conversion_comment_verbosity_level | Verbosity level for conversion comments (code, todo, warning, debug). Verbosity levels: code outputs only regular code comments retained from the source or considered part of the output; todo outputs code and todo comments (todo comments mark output that has to be adjusted manually); warning outputs code, todo and warning comments (warning comments mark potentially invalid code); debug outputs all comments, including developer warnings (debug comments mark code that is unlikely to be invalid). | todo | True | |
| conversion_mode | Conversion mode (normal, strict, lax). Changes code generation and how the comment verbosity level is handled. NORMAL (the default) balances correctness and readability by allowing some heuristics about common cases where achieving a 100% match would generate overly verbose and complex code, while still allowing short constructs that ensure correctness. STRICT prioritizes correctness over readability, striving to mark anything that is potentially not 100% correct in all scenarios as a TODO item and reducing heuristics to a minimum. LAX prioritizes readability over correctness, assuming the best-case scenario and avoiding additional expressions that would be needed to handle edge cases. The verbosity level of conversion comments is also adjusted based on the mode: in strict mode, warning comments are treated as todo and debug as warning, so more todo comments are generated; in lax mode, todo comments are treated as warning and warning as debug, so no todo comments are generated at all. | normal | True | |
| llm | Configuration for GenAI-based conversion. | LLMConfig | False | |
| disable_conversion_disclaimer_header | If True, Alchemist will not generate header comments that include information about the code source, conversion timestamp and Alchemist version used. | False | False | |
| spark_conf_ansi_enabled | Whether to generate code that assumes ANSI SQL mode is enabled in Spark. | True | True | |
| table_write_format | Table format to use when writing dataframes. Must be one of: auto, delta. | auto | False | |
| sas | SAS to Spark conversion options. | SparkSASConfig | True | |
| output_format | Output format of the generated code. Can be either ipynb for Jupyter notebooks or module for Python modules. | module | True | |
| render_all_source_code | Whether to include the entire original SAS code before the converted code in the markdown output. | True | True | |
| render_markdown_headers | Whether to generate markdown that reflects the original SAS program structure. | True | True | |
| spark_session_identifier | Identifier of the Spark session to use in the generated code. | spark | False | |
| spark_func_namespace | Namespace for Spark functions. If set to None (default), Spark functions used in the generated code are imported directly, e.g. `from pyspark.sql.functions import col, lit`. Otherwise, the functions module is imported under the specified name (`import pyspark.sql.functions as [spark_func_namespace]`) and used accordingly, e.g. `[spark_func_namespace].col("column_name")`, `[spark_func_namespace].lit("literal_value")`. See the example after this table. | None | False | |
| file_path_map | SAS file path mapping to target environment paths. | Empty dict | True |
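To illustrate the spark_func_namespace setting, the sketch below contrasts the two import styles. The DataFrame operations are arbitrary examples for demonstration, not converter output:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# spark_func_namespace: None (default) -- functions are imported directly:
from pyspark.sql.functions import col, lit

df_direct = spark.range(5).select(col("id"), lit("x").alias("tag"))

# spark_func_namespace: F -- the functions module is imported under the given alias:
import pyspark.sql.functions as F

df_namespaced = spark.range(5).select(F.col("id"), F.lit("x").alias("tag"))
```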
File Path Map¶
File path mapping is used to convert a source file location to a new (typically cloud) location.
The mapping specifies a prefix of the full path as it appears in the source and the target prefix it should be replaced with.
The longest matching prefix is used. If no prefix matches, the original path is used unchanged (which is probably not what you want).
The resulting path is always converted to a POSIX path.
Example:
- for the path C:\User\any_local_dir\file.xlsx
- the mapping can be {"C:\\User\\": "/mnt/User/"}
- and the final path will be /mnt/User/any_local_dir/file.xlsx
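The following is a minimal sketch of the longest-prefix-match behaviour described above. The map_file_path helper is hypothetical and only illustrates the rule; it is not the converter's actual implementation:

```python
from pathlib import PureWindowsPath, PurePosixPath

def map_file_path(path: str, file_path_map: dict[str, str]) -> str:
    # Collect all configured prefixes that match the source path.
    matches = [prefix for prefix in file_path_map if path.startswith(prefix)]
    if not matches:
        return path  # no prefix matches: the original path is kept as-is
    # The longest matching prefix wins.
    prefix = max(matches, key=len)
    remainder = path[len(prefix):]
    # Replace the prefix and normalise the remainder to a POSIX path.
    return str(PurePosixPath(file_path_map[prefix]) / PureWindowsPath(remainder).as_posix())

print(map_file_path(r"C:\User\any_local_dir\file.xlsx", {"C:\\User\\": "/mnt/User/"}))
# -> /mnt/User/any_local_dir/file.xlsx
```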
Spark SAS Specific Converter Config¶
| Field | Description | Default Value | Environment Variable |
|---|---|---|---|
| year_cutoff | Year to use as the SAS year cutoff in various date-related operations. This should match the YEARCUTOFF option value, but only the last two digits. For example, if YEARCUTOFF=1940, then year_cutoff=40. Alchemist currently does not support YEARCUTOFF values before 1900 or after 1999. If not set, the conversion assumes the default value of (19)40. See https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lesysoptsref/n0yt2tjgsd5dpzn16wk1m3thcd3c.htm | 40 | |
| libref_to_schema | Mapping of SAS librefs to Spark schemas. | Empty dict |
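As an illustration, the sketch below shows how a two-digit year can be resolved against such a cutoff. The resolve_two_digit_year helper is hypothetical and not part of Alchemist; it only demonstrates the 100-year window implied by year_cutoff:

```python
def resolve_two_digit_year(two_digit_year: int, year_cutoff: int = 40) -> int:
    # With year_cutoff=40 the window is 1940-2039: two-digit years at or above
    # the cutoff fall into the 1900s, the rest into the 2000s.
    century = 1900 if two_digit_year >= year_cutoff else 2000
    return century + two_digit_year

assert resolve_two_digit_year(41) == 1941   # falls in 1940-1999
assert resolve_two_digit_year(39) == 2039   # falls in 2000-2039
assert resolve_two_digit_year(60, year_cutoff=60) == 1960  # YEARCUTOFF=1960
```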
Example¶
Here's a comprehensive example of how you can define the Spark configuration in the configuration file:
```yaml
converter:
  sas:
    year_cutoff: 60
    libref_to_schema:
      libref1: spark_schema1
      libref2: "{spark_schema_var}"
  output_format: ipynb
  spark_session_identifier: my_spark
  spark_func_namespace: F
  render_all_source_code: false
  render_markdown_headers: true
  spark_conf_ansi_enabled: true
  file_path_map:
    "C:\\Data\\": "/mnt/data/"
    "\\\\shared\\folder\\": "/dbfs/shared/"
  template_configs:
    - inline: from migration.helpers import *
      match_patterns:
        - (SASProgram)
      output_type: extra_imports
```
In this example:
SAS Configuration:
- `year_cutoff` is set to 60 (for YEARCUTOFF=1960)
- `libref1` will be converted to `spark_schema1`, and `libref2` will be converted to `{spark_schema_var}`, assuming that it will be used in f-strings and that the variable `spark_schema_var` will be defined in the output code
Output Configuration (see the sketch after this list):
- The output format is set to Jupyter notebooks (`.ipynb`)
- The Spark session identifier is customized to `my_spark`
- Spark functions will be imported under the `F` namespace (e.g., `F.col("column_name")`)
- Original SAS source code will not be included in the output
- Markdown headers reflecting the original SAS program structure will be generated
- Generated code will assume ANSI SQL mode is enabled
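To make the output settings concrete, here is a hedged sketch of roughly what converted code could look like under this configuration. The table name, column names, and the value of `spark_schema_var` are hypothetical; actual Alchemist output will differ:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# spark_session_identifier: my_spark
my_spark = SparkSession.builder.getOrCreate()

# libref2 is mapped to "{spark_schema_var}", so the schema name is resolved
# via a variable that the generated code is expected to define.
spark_schema_var = "analytics"  # hypothetical value

# Hypothetical table and column names, used only to show the F namespace in use.
sales = my_spark.table(f"{spark_schema_var}.sales")
result = sales.select(F.col("region"), F.col("amount")).filter(F.col("amount") > 0)
```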
Path Mapping:
- Windows paths starting with `C:\Data\` will be converted to `/mnt/data/`
- UNC paths starting with `\\shared\folder\` will be converted to `/dbfs/shared/`
Templates:
- Additional imports will be added to all SAS Programs
For more information about template configurations, see the template features documentation.