{"id":122,"date":"2020-01-24T05:56:37","date_gmt":"2020-01-24T05:56:37","guid":{"rendered":"http:\/\/vargas-solar.com\/data-ml-studios\/?page_id=122"},"modified":"2020-01-24T17:32:21","modified_gmt":"2020-01-24T17:32:21","slug":"ho-6-etl-using-pyspark","status":"publish","type":"page","link":"http:\/\/vargas-solar.com\/data-ml-studios\/ho-6-etl-using-pyspark\/","title":{"rendered":"HO-7: ETL using PySpark"},"content":{"rendered":"\n<body>\n  <div tabindex=\"-1\" id=\"notebook\" class=\"border-box-sizing\">\n    <div class=\"container\" id=\"notebook-container\">\n\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In&nbsp;[&nbsp;]:<\/div>\n<div class=\"inner_cell\">\n    <div class=\"input_area\">\n<div class=\" highlight hl-ipython3\"><pre><span><\/span><span class=\"o\">!<\/span>pip install pyspark\n<\/pre><\/div>\n\n    <\/div>\n<\/div>\n<\/div>\n\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In&nbsp;[&nbsp;]:<\/div>\n<div class=\"inner_cell\">\n    <div class=\"input_area\">\n<div class=\" highlight hl-ipython3\"><pre><span><\/span><span class=\"kn\">from<\/span> <span class=\"nn\">pyspark<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">SparkContext<\/span><span class=\"p\">,<\/span> <span class=\"n\">SparkConf<\/span>\n<span class=\"kn\">from<\/span> <span class=\"nn\">pyspark.sql<\/span> <span class=\"kn\">import<\/span> <span class=\"n\">SparkSession<\/span><span class=\"p\">,<\/span> <span class=\"n\">SQLContext<\/span>\n\n<span class=\"n\">sc<\/span> <span class=\"o\">=<\/span> <span class=\"n\">SparkContext<\/span><span class=\"p\">(<\/span><span class=\"s1\">&#39;local&#39;<\/span><span class=\"p\">,<\/span> <span class=\"s1\">&#39;ml-studio&#39;<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">sc<\/span>\n<\/pre><\/div>\n\n    <\/div>\n<\/div>\n<\/div>\n\n<\/div>\n<div class=\"cell 
border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In&nbsp;[&nbsp;]:<\/div>\n<div class=\"inner_cell\">\n    <div class=\"input_area\">\n<div class=\" highlight hl-ipython3\"><pre><span><\/span><span class=\"n\">sqlContext<\/span> <span class=\"o\">=<\/span> <span class=\"n\">SQLContext<\/span><span class=\"p\">(<\/span><span class=\"n\">sc<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">sqlContext<\/span>\n<\/pre><\/div>\n\n    <\/div>\n<\/div>\n<\/div>\n\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\"><div class=\"prompt input_prompt\">\n<\/div><div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h1 id=\"Outlier-Detection:-An-ETL-Tutorial-with-Spark\">Outlier Detection: An ETL Tutorial with Spark<a class=\"anchor-link\" href=\"#Outlier-Detection:-An-ETL-Tutorial-with-Spark\">&#182;<\/a><\/h1><p>Part of the Industry 4.0 framework is to give manufacturers more visibility into what\u2019s going on with the machines on their factory floors. This is why Industry 4.0 is tightly coupled with the Internet of Things (IoT). IoT makes large-scale, real-time data collection possible from sensors installed in production equipment. Nevertheless, having good data collection agents alone isn\u2019t sufficient. We need an automated way of extracting, analyzing and summarizing information from the large data stream, since it\u2019s impossible for humans to do it manually. In big data terminology, this process is often referred to as ETL (Extract-Transform-Load).<\/p>\n<p>Today, we\u2019ll discuss one family of algorithms that I have personally seen to be useful in industry: outlier detection. The idea is to find any abnormal measurements in the data stream and highlight them to domain experts, e.g. process engineers. I will share an implementation of a basic anomaly detection algorithm in Spark. 
I could have done the same tutorial with a pandas DataFrame in Python, but once we deal with a big dataset (whose size is far larger than the available memory), pandas is no longer suitable.<\/p>\n<p>This is the simple dummy dataset that I use. Suppose we have data streams from 2 sensors. How can we automatically capture the two anomalous points shown below?<\/p>\n<center><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/3900\/1*NT1gMrAKkeGC-YUF-olSDw.png\" alt=\"drawing\" width=\"800\"\/><\/center><h2 id=\"Outlier-Detection\">Outlier Detection<a class=\"anchor-link\" href=\"#Outlier-Detection\">&#182;<\/a><\/h2><p>The model that we use finds regions of values whose probability of occurrence is low under the distribution fitted to the observed data. We assume that each sensor\u2019s readings follow a unimodal Gaussian distribution. With that, we can calculate two thresholds, each three standard deviations away from the distribution\u2019s mean (a band six sigma wide in total).<\/p>\n<center><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/3056\/1*hLuquZaamGS1GdYjen7ieQ.png\" alt=\"drawing\" width=\"800\"\/><\/center><p>Visually, the thresholds are fitted in this manner. 
Any measurement above the Upper Limit (around 25) or below the Lower Limit (around 15) is deemed an outlier.<\/p>\n<center><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2792\/0*FSmI7mMO6Eukukpn.png\" alt=\"drawing\" width=\"800\"\/><\/center><p>Now to implement this in Spark, we first import all of the library dependencies:<\/p>\n\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In&nbsp;[&nbsp;]:<\/div>\n<div class=\"inner_cell\">\n    <div class=\"input_area\">\n<div class=\" highlight hl-ipython3\"><pre><span><\/span><span class=\"kn\">from<\/span> <span class=\"nn\">pyspark.sql<\/span> <span class=\"kn\">import<\/span> <span class=\"o\">*<\/span>\n<span class=\"kn\">from<\/span> <span class=\"nn\">pyspark.sql.types<\/span> <span class=\"kn\">import<\/span> <span class=\"o\">*<\/span>\n<span class=\"kn\">from<\/span> <span class=\"nn\">pyspark.sql.functions<\/span> <span class=\"kn\">import<\/span> <span class=\"o\">*<\/span>\n<\/pre><\/div>\n\n    <\/div>\n<\/div>\n<\/div>\n\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\"><div class=\"prompt input_prompt\">\n<\/div><div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h2 id=\"Extract\">Extract<a class=\"anchor-link\" href=\"#Extract\">&#182;<\/a><\/h2><p>We assume that our data comes in CSV format and has been saved in a file called test.csv. We first specify the data schema explicitly. Note that in production, data could also be obtained from a database or a message broker (e.g. 
MQTT, Kafka etc\u2026).<\/p>\n\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In&nbsp;[&nbsp;]:<\/div>\n<div class=\"inner_cell\">\n    <div class=\"input_area\">\n<div class=\" highlight hl-ipython3\"><pre><span><\/span><span class=\"n\">customSchema<\/span> <span class=\"o\">=<\/span> <span class=\"n\">StructType<\/span><span class=\"p\">([<\/span>\n    <span class=\"n\">StructField<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;sensorId&quot;<\/span><span class=\"p\">,<\/span> <span class=\"n\">StringType<\/span><span class=\"p\">(),<\/span> <span class=\"kc\">True<\/span><span class=\"p\">),<\/span>\n    <span class=\"n\">StructField<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;values&quot;<\/span><span class=\"p\">,<\/span> <span class=\"n\">DoubleType<\/span><span class=\"p\">(),<\/span> <span class=\"kc\">True<\/span><span class=\"p\">)])<\/span>\n<\/pre><\/div>\n\n    <\/div>\n<\/div>\n<\/div>\n\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\"><div class=\"prompt input_prompt\">\n<\/div><div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Then, we read the csv file into a Spark DataFrame. 
Here we can see that there are only two columns: sensorId and values.<\/p>\n\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In&nbsp;[&nbsp;]:<\/div>\n<div class=\"inner_cell\">\n    <div class=\"input_area\">\n<div class=\" highlight hl-ipython3\"><pre><span><\/span><span class=\"c1\"># load dataset<\/span>\n<span class=\"n\">df<\/span> <span class=\"o\">=<\/span> <span class=\"n\">sqlContext<\/span><span class=\"o\">.<\/span><span class=\"n\">read<\/span><span class=\"o\">.<\/span><span class=\"n\">format<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;csv&quot;<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">option<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;header&quot;<\/span><span class=\"p\">,<\/span> <span class=\"s2\">&quot;true&quot;<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">schema<\/span><span class=\"p\">(<\/span><span class=\"n\">customSchema<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">load<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;data\/test.csv&quot;<\/span><span class=\"p\">)<\/span>\n  \n<span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">printSchema<\/span><span class=\"p\">()<\/span>\n<\/pre><\/div>\n\n    <\/div>\n<\/div>\n<\/div>\n\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\"><div class=\"prompt input_prompt\">\n<\/div><div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>We can register the dataset as a table for SQL-style queries:<\/p>\n\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In&nbsp;[&nbsp;]:<\/div>\n<div class=\"inner_cell\">\n    <div class=\"input_area\">\n<div class=\" highlight 
hl-ipython3\"><pre><span><\/span><span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">createOrReplaceTempView<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;sensors&quot;<\/span><span class=\"p\">)<\/span>\n<\/pre><\/div>\n\n    <\/div>\n<\/div>\n<\/div>\n\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In&nbsp;[&nbsp;]:<\/div>\n<div class=\"inner_cell\">\n    <div class=\"input_area\">\n<div class=\" highlight hl-ipython3\"><pre><span><\/span><span class=\"n\">sqlContext<\/span><span class=\"o\">.<\/span><span class=\"n\">sql<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;SELECT * FROM sensors&quot;<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">show<\/span><span class=\"p\">()<\/span>\n<\/pre><\/div>\n\n    <\/div>\n<\/div>\n<\/div>\n\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\"><div class=\"prompt input_prompt\">\n<\/div><div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h2 id=\"Transform\">Transform<a class=\"anchor-link\" href=\"#Transform\">&#182;<\/a><\/h2><p>We would like to calculate the distribution profile for each sensorId, particularly the upper and lower outlier thresholds. To do that, we group the dataframe by sensorId and then aggregate the mean and standard deviation of each sensor\u2019s values. 
We can then create 2 new columns, one for each outlier threshold.<\/p>\n\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In&nbsp;[&nbsp;]:<\/div>\n<div class=\"inner_cell\">\n    <div class=\"input_area\">\n<div class=\" highlight hl-ipython3\"><pre><span><\/span><span class=\"c1\"># calculate statistics<\/span>\n<span class=\"n\">statsDF<\/span> <span class=\"o\">=<\/span> <span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">groupBy<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;sensorId&quot;<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">agg<\/span><span class=\"p\">(<\/span><span class=\"n\">mean<\/span><span class=\"p\">(<\/span><span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">values<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">alias<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;mean&quot;<\/span><span class=\"p\">),<\/span> <span class=\"n\">stddev<\/span><span class=\"p\">(<\/span><span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">values<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">alias<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;stddev&quot;<\/span><span class=\"p\">))<\/span>\n\n<span class=\"c1\"># add columns with upper and lower limits<\/span>\n<span class=\"n\">statsDF<\/span> <span class=\"o\">=<\/span> <span class=\"n\">statsDF<\/span><span class=\"o\">.<\/span><span class=\"n\">withColumn<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;UpperLimit&quot;<\/span><span class=\"p\">,<\/span> <span class=\"n\">statsDF<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span> <span class=\"o\">+<\/span> <span class=\"n\">statsDF<\/span><span class=\"o\">.<\/span><span class=\"n\">stddev<\/span> <span class=\"o\">*<\/span> <span 
class=\"mi\">3<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">withColumn<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;LowerLimit&quot;<\/span><span class=\"p\">,<\/span> <span class=\"n\">statsDF<\/span><span class=\"o\">.<\/span><span class=\"n\">mean<\/span> <span class=\"o\">-<\/span> <span class=\"n\">statsDF<\/span><span class=\"o\">.<\/span><span class=\"n\">stddev<\/span> <span class=\"o\">*<\/span> <span class=\"mi\">3<\/span><span class=\"p\">)<\/span>\n  \n<span class=\"n\">statsDF<\/span><span class=\"o\">.<\/span><span class=\"n\">printSchema<\/span><span class=\"p\">()<\/span>\n<\/pre><\/div>\n\n    <\/div>\n<\/div>\n<\/div>\n\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\"><div class=\"prompt input_prompt\">\n<\/div><div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>We would now like to find which sensor readings in the original dataframe are anomalous. Since the readings and the thresholds live in two different dataframes, we need to join them on the sensorId column.<\/p>\n\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In&nbsp;[&nbsp;]:<\/div>\n<div class=\"inner_cell\">\n    <div class=\"input_area\">\n<div class=\" highlight hl-ipython3\"><pre><span><\/span><span class=\"c1\"># join the two dataframes<\/span>\n<span class=\"n\">joinDF<\/span> <span class=\"o\">=<\/span> <span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">join<\/span><span class=\"p\">(<\/span><span class=\"n\">statsDF<\/span><span class=\"p\">,<\/span> <span class=\"n\">df<\/span><span class=\"o\">.<\/span><span class=\"n\">sensorId<\/span> <span class=\"o\">==<\/span> <span class=\"n\">statsDF<\/span><span class=\"o\">.<\/span><span class=\"n\">sensorId<\/span><span class=\"p\">)<\/span>\n<\/pre><\/div>\n\n    
<\/div>\n<\/div>\n<\/div>\n\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\"><div class=\"prompt input_prompt\">\n<\/div><div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<p>Lastly, we can filter rows whose values lie beyond the range enclosed by the outlier thresholds. Voila! We managed to capture the two anomalous points.<\/p>\n\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In&nbsp;[&nbsp;]:<\/div>\n<div class=\"inner_cell\">\n    <div class=\"input_area\">\n<div class=\" highlight hl-ipython3\"><pre><span><\/span><span class=\"c1\"># outlierDetection<\/span>\n<span class=\"k\">def<\/span> <span class=\"nf\">detect_outlier<\/span><span class=\"p\">(<\/span><span class=\"n\">values<\/span><span class=\"p\">,<\/span> <span class=\"n\">UpperLimit<\/span><span class=\"p\">,<\/span> <span class=\"n\">LowerLimit<\/span><span class=\"p\">):<\/span>\n    <span class=\"c1\"># outliers are points lying below LowerLimit or above upperLimit<\/span>\n    <span class=\"k\">return<\/span> <span class=\"p\">(<\/span><span class=\"n\">values<\/span> <span class=\"o\">&lt;<\/span> <span class=\"n\">LowerLimit<\/span><span class=\"p\">)<\/span> <span class=\"ow\">or<\/span> <span class=\"p\">(<\/span><span class=\"n\">values<\/span> <span class=\"o\">&gt;<\/span> <span class=\"n\">UpperLimit<\/span><span class=\"p\">)<\/span>\n\n<span class=\"n\">udf_detect_outlier<\/span> <span class=\"o\">=<\/span> <span class=\"n\">udf<\/span><span class=\"p\">(<\/span><span class=\"k\">lambda<\/span> <span class=\"n\">values<\/span><span class=\"p\">,<\/span> <span class=\"n\">UpperLimit<\/span><span class=\"p\">,<\/span> <span class=\"n\">LowerLimit<\/span><span class=\"p\">:<\/span> <span class=\"n\">detect_outlier<\/span><span class=\"p\">(<\/span><span class=\"n\">values<\/span><span class=\"p\">,<\/span> <span 
class=\"n\">UpperLimit<\/span><span class=\"p\">,<\/span> <span class=\"n\">LowerLimit<\/span><span class=\"p\">),<\/span> <span class=\"n\">BooleanType<\/span><span class=\"p\">())<\/span>\n\n<span class=\"n\">outlierDF<\/span> <span class=\"o\">=<\/span> <span class=\"n\">joinDF<\/span><span class=\"o\">.<\/span><span class=\"n\">withColumn<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;isOutlier&quot;<\/span><span class=\"p\">,<\/span> <span class=\"n\">udf_detect_outlier<\/span><span class=\"p\">(<\/span><span class=\"n\">joinDF<\/span><span class=\"o\">.<\/span><span class=\"n\">values<\/span><span class=\"p\">,<\/span> <span class=\"n\">joinDF<\/span><span class=\"o\">.<\/span><span class=\"n\">UpperLimit<\/span><span class=\"p\">,<\/span> <span class=\"n\">joinDF<\/span><span class=\"o\">.<\/span><span class=\"n\">LowerLimit<\/span><span class=\"p\">))<\/span><span class=\"o\">.<\/span><span class=\"n\">filter<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;isOutlier&quot;<\/span><span class=\"p\">)<\/span>\n  \n<span class=\"n\">outlierDF<\/span><span class=\"o\">.<\/span><span class=\"n\">createOrReplaceTempView<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;outliers&quot;<\/span><span class=\"p\">)<\/span>\n<\/pre><\/div>\n\n    <\/div>\n<\/div>\n<\/div>\n\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In&nbsp;[&nbsp;]:<\/div>\n<div class=\"inner_cell\">\n    <div class=\"input_area\">\n<div class=\" highlight hl-ipython3\"><pre><span><\/span><span class=\"n\">sqlContext<\/span><span class=\"o\">.<\/span><span class=\"n\">sql<\/span><span class=\"p\">(<\/span><span class=\"s2\">&quot;SELECT * FROM outliers&quot;<\/span><span class=\"p\">)<\/span><span class=\"o\">.<\/span><span class=\"n\">show<\/span><span class=\"p\">()<\/span>\n<\/pre><\/div>\n\n    <\/div>\n<\/div>\n<\/div>\n\n<\/div>\n<div class=\"cell border-box-sizing text_cell 
rendered\"><div class=\"prompt input_prompt\">\n<\/div><div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h2 id=\"Conclusion\">Conclusion<a class=\"anchor-link\" href=\"#Conclusion\">&#182;<\/a><\/h2><p>We have seen how a typical ETL pipeline with Spark works, using anomaly detection as the main transformation process. Note that some of the procedures used here are not suitable for production. For example, reading and writing CSV files is not encouraged. Normally we would store the data on the Hadoop Distributed File System (HDFS) instead. The latter can also sit underneath a database, e.g. HBase. Nonetheless, the main programming paradigm stays the same.<\/p>\n\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"input\">\n<div class=\"prompt input_prompt\">In&nbsp;[&nbsp;]:<\/div>\n<div class=\"inner_cell\">\n    <div class=\"input_area\">\n<div class=\" highlight hl-ipython3\"><pre><span><\/span> \n<\/pre><\/div>\n\n    <\/div>\n<\/div>\n<\/div>\n\n<\/div>\n    <\/div>\n  <\/div>\n","protected":false},"excerpt":{"rendered":"<p>In&nbsp;[&nbsp;]: !pip install pyspark In&nbsp;[&nbsp;]: from pyspark import SparkContext, SparkConf from pyspark.sql import SparkSession, SQLContext sc = SparkContext(&#8216;local&#8217;, &#8216;ml-studio&#8217;) sc In&nbsp;[&nbsp;]: sqlContext = SQLContext(sc) sqlContext Outlier Detection: An ETL Tutorial with Spark&para; Part of the Industry 4.0 framework is to make sure that manufacturers have more visibility over what&rsquo;s going on with their machines in 
[&hellip;]<\/p>\n","protected":false},"author":11,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-122","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/vargas-solar.com\/data-ml-studios\/wp-json\/wp\/v2\/pages\/122","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/vargas-solar.com\/data-ml-studios\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/vargas-solar.com\/data-ml-studios\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/data-ml-studios\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"http:\/\/vargas-solar.com\/data-ml-studios\/wp-json\/wp\/v2\/comments?post=122"}],"version-history":[{"count":3,"href":"http:\/\/vargas-solar.com\/data-ml-studios\/wp-json\/wp\/v2\/pages\/122\/revisions"}],"predecessor-version":[{"id":134,"href":"http:\/\/vargas-solar.com\/data-ml-studios\/wp-json\/wp\/v2\/pages\/122\/revisions\/134"}],"wp:attachment":[{"href":"http:\/\/vargas-solar.com\/data-ml-studios\/wp-json\/wp\/v2\/media?parent=122"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}