# SAS Code to Spark Conversion Nuances

This section describes nuances of converting SAS code to Spark.
## BEST format
The SAS BESTw.d format is very complex and hard to replicate exactly in Spark. For example, SAS uses a different scientific notation for large numbers with exponents than Spark, Python, or Java do. The format also controls the maximum length of the output and truncates values to fit the given width.

Alchemist generates a regular cast call, handling values with and without decimals differently.
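As a rough illustration of that decimal handling, consider the sketch below. This is not Alchemist's actual output, just one way to express the idea in Spark SQL; it deliberately ignores width truncation and scientific notation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("SELECT 42.0 AS x UNION ALL SELECT 3.14 AS x").createOrReplaceTempView("t")

# Values without a fractional part render without a trailing ".0";
# values with decimals keep them. (Width truncation and scientific
# notation are NOT handled here.)
spark.sql("""
    SELECT IF(x = CAST(x AS BIGINT),
              CAST(CAST(x AS BIGINT) AS STRING),
              CAST(x AS STRING)) AS x_best
    FROM t
""").show()
```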
## INPUT function with $w. informat
**Caveats**

- The result assumes a correct character expression in the input field. If INPUT was used on a numeric column in the source, the converted code may not match SAS behavior.
- The converted code does not handle a single period (`.`) in an input field (SAS `$w.` converts a single period to a blank).
**Template**

If the out-of-the-box conversion does not work for you, you can use the following inline template to achieve a precise conversion:
```yaml
converter:
  template_configs:
    - inline: "IF(TRIM(render(match[\"source\"])) = '.', '', TRIM(SUBSTR(NVL(render(match[\"source\"]), ''), 1, match[\"informat_width\"])))"
      match_patterns:
        - '(SASCallExpression @arguments=[(*) -> source (* @value=(SASFormatExpression @format_name="\$" @format_width=".+" -> informat_width))] @func_name=(SASCodeName @identifier="INPUT"))'
```
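For reference, this is approximately what the template renders to for a hypothetical `input(acct_id, $8.)` call; the column name `acct_id` and the width 8 are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(".",), ("ABCDEFGHIJ",), (None,)], ["acct_id"])

# A lone period becomes a blank (matching SAS $w.); other values are
# truncated to the informat width and trimmed; NULL maps to ''.
df.select(
    F.expr("IF(TRIM(acct_id) = '.', '', TRIM(SUBSTR(NVL(acct_id, ''), 1, 8)))")
    .alias("acct_id_parsed")
).show()
```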
## Order By
SAS data steps, as well as some procedures, require that the data be pre-sorted, so it is common for users to call `PROC SORT` or include `ORDER BY` in SQL statements. Spark, in turn, does not require pre-sorted data, so in some cases there is no need to preserve ordering.
As of today, Alchemist cannot reliably detect whether ordering is needed, so it errs on the side of caution and preserves it. This should not lead to significant performance degradation, since Spark is smart enough to eliminate all but the last sort.
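You can observe this with `explain()`. The sketch below assumes PySpark and relies on Catalyst dropping a global sort that is immediately followed by another one:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumn("key", F.col("id") % 7)

# Two back-to-back global sorts: explain() prints a single Sort node
# (by "key"), because the earlier sort by "id" is optimized away.
df.orderBy("id").orderBy("key").explain()
```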
However, if you are sure that ordering is not needed, you may use node filtering to prevent `PROC SORT` from being converted.
## Calculated Columns
SAS allows referencing calculated columns in the same `SELECT` statement where they are defined. Although the documentation only mentions `SELECT` and `WHERE`, they may also be referenced in `GROUP BY` and `HAVING` clauses.
Spark has a similar capability called Lateral Column Alias, aka LCA (for more details, read the announcement blog). The key differences from SAS are:
- LCA is only supported in `SELECT` and `HAVING` clauses
- If a source table has a column with the same name as the calculated column, the source column will be used (SAS will use the calculated column)
As of now, Alchemist converts SAS calculated columns to LCA when possible, without checking for the presence of a source column with the same name. In the rare case of a column-name collision, this may lead to incorrect results.
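A minimal sketch of both points, assuming a Spark version with LCA support (3.4+); the table and column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# No name collision: `b` in `b * 10` resolves to the lateral alias,
# so c = (a + 1) * 10 = 20.
spark.sql("SELECT 1 AS a").createOrReplaceTempView("t1")
spark.sql("SELECT a + 1 AS b, b * 10 AS c FROM t1").show()

# Name collision: t2 already has a column `b`, so Spark resolves `b * 10`
# against the source column (c = 1000), while SAS would use the
# calculated column (c = 20).
spark.sql("SELECT 1 AS a, 100 AS b").createOrReplaceTempView("t2")
spark.sql("SELECT a + 1 AS b, b * 10 AS c FROM t2").show()
```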