Satya's blog - 2017/11/

Nov 23 2017 15:40 Effective issue tracker usage

Suppose you use an issue tracker like JIRA or Pivotal Tracker (PT). Here's how I use it effectively.

"Stories" (short for "user stories"), "issues", and "tickets" are used interchangeably here. There are subtle differences (or not so subtle ones) that don't matter for this discussion.

In JIRA you have multiple projects. Each project has a list of tickets. This should be called the "Backlog". Additionally, you can create a "Sprint".

In PT, in the default view, there are three columns: the "current sprint", the "backlog", and an "icebox". I'd advise finding the setting to enable stacking of current on top of backlog, because that gives a "truer" view.

The (current) sprint contains stories that are being worked on in this sprint. These stories will be in various states, such as To Do, In Progress, In Review, and Done; those are the four major states for stories. PT enforces the states while JIRA is more configurable. PT places stories in the sprint automatically based on project velocity (which is based on story points). In JIRA, the current sprint is determined manually during your sprint planning meeting (aka iteration planning meeting, or IPM).

Every story in the current sprint must have a story owner and (if your team uses estimates) an estimate. Some shops do not assign a story owner until someone starts the story, and then that person is the owner. This works well in smaller teams consisting of mostly senior devs. PT enables this by auto-assigning owners.

Additionally, stories may belong to epics. That's a good way in JIRA to group stories. PT also has epics but I prefer to use Labels in PT. JIRA has labels but they're less visible to me. I'd advise using Components in JIRA for the same purpose.

The Backlog contains stories for upcoming sprints. They may be un-assigned, un-estimated, not even fully fleshed-out. All of those things can happen during the sprint planning meeting.

PT has an additional space, the Icebox. This is sort-of the ideas bin: stories that haven't been thought through, or that are a reminder to actually add a "real" story, or stuff that's deferred until some later date.

If team members run out of stories for the current sprint, they may be able to pick up stories from the backlog. After checking with the project manager, perhaps. Beware that stories in the backlog may not be quite ready to be worked on.

PT forces all the stories into a force-ranked order, which means someone (the project manager, in consultation with senior developers) needs to order the stories. JIRA can also be configured to do force-ranking. Stories should usually be placed in order of business priority and according to dependencies. Feature Y has priority but also depends on Feature X? Place X first (higher) in the sprint.

(Why the PM? That's who's running the project. Why the senior devs? Because that's who can point out hidden dependencies. In a small enough team, it'd be all the devs... which means you're having a sprint planning meeting. Or a backlog-grooming meeting, which is a pre-planning meeting. Because of time constraints. These meetings need to be extremely focussed, and I will write a separate article about them.)

Tag: howto agile jira

Nov 18 2017 13:41 Straightforward git branching workflow

Here's a simple git branching workflow that I like and use.

All "production" work is on the master branch.

To start a feature, first create a new branch:

git checkout -b feature-name

Perhaps prefixed or suffixed with a ticket/issue identifier, e.g. axp-1090-add-phone-field

Usually useful to push the branch to your git origin (often, GitHub) right away, and have git track the remote. This is done in one command with:

git push -u origin HEAD

HEAD is a special reference pointing to the commit we're "on" right now.

Do the work. Maybe multiple commits, maybe some work-in-progress commits.

git add <files>    # stage specific files
git add -p         # or stage interactively, change by change
git commit -m "AXP-1090 Add phone field"

add with -p makes git prompt for each change. Useful to review the changes and to be selective. Maybe some changes should be in a different commit?

The git commit command can open a text editor for the commit message, or you can specify the message as shown here.

To re-order commits, use `git rebase -i master`, which replays your branch's commits on top of master and lets you edit the list interactively.

In the editor, move the lines into the order you want the commits in. Note that work-in-progress commits can be squashed by changing "pick" to "squash"; each squashed commit becomes part of the commit above it. And yes, you can squash multiple commits in a row.
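
The todo list git opens looks something like this (hashes and messages are invented for illustration); here the two wip commits get folded into the first commit:

pick a1b2c3d AXP-1090 Add phone field
squash 4d5e6f7 wip phone validation
squash 8a9b0c1 wip fix migration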

Resolve any conflicts. In the default configuration git shows plenty of guidance on how.
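
Typically that means editing the conflicted files (the path below is a placeholder), then:

git add path/to/conflicted-file
git rebase --continue    # or: git rebase --abort to back out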

To integrate changes that have landed on master, every now and then (with a clean branch! check `git status`) do a fetch and a rebase against origin/master (and do this again when you're ready to merge your changes back to master):

master:       M1 -> M2 -> M3
                     \
your branch:          B1 -> B2

git fetch
git rebase origin/master

master:       M1 -> M2 -> M3
                           \
your branch:                B1 -> B2

Beware that if you switch to master (git checkout master) you will still be behind origin/master, and will need another `git rebase origin/master` (or a plain `git pull`) *on master* to catch up.
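
Concretely, using the example branch name from above:

git checkout master
git rebase origin/master    # fast-forwards master to match origin/master
git checkout axp-1090-add-phone-field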

When the branch is in a sufficiently clean state, push your work to the remote:

git push

If you've rebased, you'll need to use `git push -f`, i.e. "force-push" (`git push --force-with-lease` is a safer variant). And that is why we use branches, and why everyone should be on their own branch: otherwise, force-pushing would overwrite other people's work on the branch. That is why we never force-push master (except when we do).

Use `git status` often; I have a shell alias `gs` for it. I also have `git` itself aliased as `g`, with various git commands shortened in my ~/.gitconfig.
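
For example (the exact shortcuts are a matter of taste, not my literal config):

# in ~/.bashrc or ~/.zshrc
alias g=git
alias gs='git status'

# in ~/.gitconfig
[alias]
    co = checkout
    br = branch
    rb = rebase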

To merge your changes into master, open a Pull Request on GitHub or, if you're doing it yourself manually, merge locally: first rebase against master, then switch to master and merge your branch.

git rebase master
git checkout master
git merge axp-1090-add-phone-field
git branch -d axp-1090-add-phone-field   # optional: delete the merged branch

At this point you should have a clean master that you can push (not force-push) to origin.

Tag: git techy

Nov 17 2017 12:18 Spark with Databricks and AWS EMR

Databricks is a web-based Spark notebook system, and also the company behind several Spark libraries. The notebooks support Python and Scala.

Databricks runs your jobs in your AWS account, on EC2 instances that Databricks manages. The initial setup is a little tricky.

Once everything is set up, it's a pretty good system for developing and debugging Spark-based jobs. It can run jobs on a cron-like schedule. Databricks has support for retrying failed jobs, and can notify about success/failure by email. It gives good access to the Spark worker logs for debugging.

However, this is expensive to run. I came up with a workflow where development and debugging happen on Databricks, and the notebook is then exported as a script to be run on EMR (Elastic MapReduce, an AWS product). I use Python and pyspark, so this works pretty well.

I do need to import the libraries that Databricks imports automatically. I forget the exact import statements I used, but they are easy enough to figure out.
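
For pyspark, the main one is building the SparkSession that a Databricks notebook provides for free. Something like this at the top of the exported script (a sketch; the app name is made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Databricks notebooks predefine `spark` and `sc`;
# a standalone spark-submit script has to build its own session.
spark = SparkSession.builder.appName("exported-notebook").getOrCreate()
sc = spark.sparkContext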

I used Luigi, which is a Python library for setting up task workflows (see also: Airflow). I set up a few abstract classes to encapsulate "this is an EMR job", then extended one of those classes to actually run a Spark job, something like:

import luigi

class ActualJob(AbstractSparkJob):
    luigi_param = luigi.Parameter()  # I usually pass a timestamp to every task

    def spark_cmd(self):
        return "this_command.py --runs --spark-script --located-in s3://example.com/foo.py"

spark_cmd returns a command line string.
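
Kicking it off is then the usual Luigi invocation (the module name and timestamp format are made up here; drop --local-scheduler if a central scheduler is running):

python -m luigi --module pipeline.jobs ActualJob --luigi-param 2017-11-17T0000 --local-scheduler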

My AbstractSparkJob takes the given command and does one of two things: either it submits an EMR step to the EMR cluster, using command-runner.jar to run the command, or it sshes into the cluster and runs spark-submit with the given command as its parameters.

(I don't have all the code for that available to post right now, sorry.)
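
Roughly, though, the EMR-step flavor boils down to something like this (a from-memory sketch using boto3; the cluster_id parameter and the rest of the Luigi plumbing are illustrative, not the real code):

import boto3
import luigi

class AbstractSparkJob(luigi.Task):
    # Illustrative sketch: the real classes also handle cluster
    # creation/teardown and Luigi's output() bookkeeping.
    cluster_id = luigi.Parameter()

    def spark_cmd(self):
        raise NotImplementedError

    def run(self):
        emr = boto3.client("emr")
        emr.add_job_flow_steps(
            JobFlowId=self.cluster_id,
            Steps=[{
                "Name": self.__class__.__name__,
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit"] + self.spark_cmd().split(),
                },
            }],
        )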

The abstract classes encapsulate all the work of spinning up an EMR cluster (and shutting it down), making the AWS API calls via the boto library, and sshing in to run spark-submit.

The abstract classes make it easy for any other data-infrastructure developer to add more jobs.

The actual script that was exported from Databricks lives in S3 and is referenced by the command shown above. EMR does the work of fetching it from S3 and running it. Part of my data pipeline startup is to copy all the EMR scripts, which I keep in a convenient subdirectory, up to a specific prefix in S3. That way my actual Spark scripts all live in the same code repository as the rest of the pipeline.
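
That startup copy can be a few lines of boto as well (the local directory, bucket, and prefix here are illustrative):

import glob
import os
import boto3

s3 = boto3.client("s3")
for path in glob.glob("emr_scripts/*.py"):   # local subdirectory in the pipeline repo
    s3.upload_file(path, "example.com", "emr-scripts/" + os.path.basename(path))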

Tag: spark aws