Skip to content

SAS Code to Spark Conversion Nuances

This section describes nuances of converting from SAS code to Spark.

BEST format

SAS BESTw.d format is very complex, and is very hard to replicate exactly in Spark. E.g. SAS uses a different scientific notation for large number with exponent then Spark, python or Java. It also controls the maximum legnth of the output and truncates it based on the given width.

Alchemist will generate a regular cast call, with a different handling for values with and without decimals.

INPUT function with $w. informat

Caveats

  • The result assumes a correct character expression in an input field. If input was used on a numeric column is the source, the converted code will match SAS behavior.
  • Converted goes does not handle a single period . in an input field (SAS $w. converts a single period to a blank)

Template

If out of the box conversion does not work for you, you can use the following inline template to achieve precise conversion:

converter:
  template_configs:
    - inline: "IF(TRIM(render(match["source"])) = '.', '', TRIM(SUBSTR(NVL(render(match["source"]), ''), 1, match["informat_width"])))"
      match_patterns:
        - '(SASCallExpression @arguments=[(*) -> source (* @value=(SASFormatExpression @format_name="\$" @format_width=".+" -> informat_width))] @func_name=(SASCodeName @identifier="INPUT"))'

Order By

SAS data steps as well as some procedures require that the data have been pre-sorted, thus it is common for users to call either PROC SORT or include ORDER BY in SQL statements. Spark in turn does not require data to be pre-sorted, thus in some cases there is no need to preserve ordering.

As of today, Alchemist can't reliably detect whether ordering is needed or not and thus errs on the side of caution and preserves ordering. This should not lead to significant performance degradation since Spark is smart enough to eliminate all but the last sort.

However, if you are sure that ordering is not needed you may use node filtering to prevent PROC SORT from being converted.

FIRST. and LAST. DATA Step Variables

The generated code may produce different results depending on the sort order of non-BY columns. Achieving deterministic results requires the user to specify the sort order, but this feature will be implemented in a future update.

Calculated Columns

SAS allows referencing calculated columns in the same SELECT statement where they are defined. Despite documentation only mentioning SELECT and WHERE, it may also be used in GROUP BY and HAVING clauses.

Spark has a similar capability called Lateral Column Alias aka LCA (for more details read the announcement blog). The key differences with SAS are: - LCA is only supported in SELECT and HAVING clauses - If a source table has a column of the same name as the calculated column, the source column will be used (SAS will use the calculated column)

As of now, Alchemist converts SAS calculated columns by naming them _CALC_<col_name> to avoid potential column name collisions and then renames them back to the original name, without checking whether LCA is possible.

Spark Runtime compatibility

Alchemist may generate Databricks SQL statements that work in Databricks Runtime but may not work in Apache Spark Runtime. The list of such constructs includes:

  • QUALIFY clause
  • EXCEPT clause (SELECT * EXCEPT(col) FROM ...)

TIMESTAMP arithmetic operations precision

When converting SAS datetimes subtractions, e.g. datetime_col_or_expr1 - datetime_col_or_expr2, into Spark dialect, the resulting precision may be affected. For instance, the expression '07feb21:8:45:32.123'dt - '03feb21:8:45:33.192837'dt yields 345598.9301629060 in SAS, while the converted expression in Spark results in 345598.93016290665. It is important to be mindful of potential precision differences when performing datetime arithmetic operations.

DML operations on SAS views/work tables

SAS supports SQL DML operations (UPDATE/INSERT/DELETE) on views or work tables. In contrast, Spark does not support these operations. Alchemist converts such code marking it as unsupported