Updated on 2024-08-10 GMT+08:00

Spark Core Sample Projects (Python)

Function

Collect statistics on female netizens who dwell on online shopping for more than two hours during weekends.

Sample Code

The following code snippets are used as an example. For complete codes, see collectFemaleInfo.py.

def contains(str, substr):
  if substr in str:
    return True
  return False

if __name__ == "__main__":
  if len(sys.argv) < 2:
    print "Usage: CollectFemaleInfo <file>"
    exit(-1)

  spark = SparkSession \
    .builder \
    .appName("CollectFemaleInfo") \
    .getOrCreate()

  """
  The following functions are performed:
  1. Read data. The input parameter argv[1] specifies the data path. - text
  2. Filter the data information of the time that female netizens spend online.
  3. Summarize the total time that each female netizen spends online. -map/map/reduceByKey.
  4. Filter information about female netizens who spend more than 2 hours online. - filter
  """
  inputPath = sys.argv[1]
  result = spark.read.text(inputPath).rdd.map(lambda r: r[0])\
    .filter(lambda line: contains(line, "female")) \
    .map(lambda line: line.split(',')) \
    .map(lambda dataArr: (dataArr[0], int(dataArr[2]))) \
    .reduceByKey(lambda v1, v2: v1 + v2) \
    .filter(lambda tupleVal: tupleVal[1] > 120) \
    .collect()
  for (k, v) in result:
    print k + "," + str(v)

  # Stop SparkContext.
  spark.stop()