Updated on 2024-10-23 GMT+08:00

Spark Core Sample Projects (Python)

Function

Collects the information of female netizens who spend more than 2 hours in online shopping on the weekend from the log files.

Sample Code

The following code segment is only an example. For details, see collectFemaleInfo.py.

def contains(str, substr):
  if substr in str:
    return True
  return False

if __name__ == "__main__":
  if len(sys.argv) < 2:
    print "Usage: CollectFemaleInfo <file>"
    exit(-1)

  spark = SparkSession \
    .builder \
    .appName("CollectFemaleInfo") \
    .getOrCreate()

  """
  The following programs are used to implement the following functions: 
  1. Read data. This code indicates the data path that the input parameter argv[1] specifies. - text
  2. Filter data about the time that female netizens spend online. - filter
  3. Aggregate the total time that each female netizen spends online. - map/map/reduceByKey
  4. Filter information about female netizens who spend more than two hours online. - filter
  """
  inputPath = sys.argv[1]
  result = spark.read.text(inputPath).rdd.map(lambda r: r[0])\
    .filter(lambda line: contains(line, "female")) \
    .map(lambda line: line.split(',')) \
    .map(lambda dataArr: (dataArr[0], int(dataArr[2]))) \
    .reduceByKey(lambda v1, v2: v1 + v2) \
    .filter(lambda tupleVal: tupleVal[1] > 120) \
    .collect()
  for (k, v) in result:
    print k + "," + str(v)

  # Stop SparkContext.
  spark.stop()