Satya's blog

Mar 28 2023 21:32 Algorithms and data structures

In day-to-day programming, I rarely think about algorithms and data structures. Binary trees. Sorting.

That's because algorithms are already encapsulated in the standard libraries of the programming languages I use. Most languages like Python, Ruby, and Golang have sort functions.

For most data structure purposes, lists and associative arrays (dictionaries, aka key-value) are enough.
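As a small illustration of how far the built-ins go, here is a minimal Python sketch. The sample data and the names `languages`, `oldest_first`, and `by_language` are made up for the example.

```python
# Hypothetical sample data: (language, year of first release).
languages = [("Ruby", 1995), ("Go", 2009), ("Python", 1991)]

# The standard library does the sorting; no hand-written algorithm needed.
oldest_first = sorted(languages, key=lambda pair: pair[1])

# An associative array (dict) covers most lookup needs.
by_language = dict(languages)

print(oldest_first[0][0])  # Python
print(by_language["Go"])   # 2009
```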

Faster algorithms? Processors are relatively cheap, parallelize it.

Memory? Memory is cheap and if that's not enough, spool it out to disk. If that's slow, parallelize it.

No, what's expensive today is network calls. And that's only because everything's fast enough that we can make API calls -- that's Application Programming Interface -- over the Internet.

Of course there are still situations where algorithms and data structures matter. The preceding just talks about 90% of general programming these days: websites and backends. Someone still has to make the standard and not-so-standard libraries of all these languages work, and if you're in embedded-devices land then all bets are off.

Jul 13 2022 07:59 Git bisect

For those that don't know, and those that do, git bisect is a very cool tool/strategy/usage to find out the answer to "which commit broke everything?" I used it recently. I have a page which suddenly started showing blanks instead of the numbers that should be there. So I used git bisect. First I did `git bisect start`.

I knew this was on my "test" branch, so I checked that master was good (it was), and that the tip of test was bad (it was), so I marked both of them with `git bisect good` and `git bisect bad`.

Then I just kept reloading the page, marking each commit as good or bad depending on whether the numbers showed or not.

`git bisect` does the work of checking out the "next" commit after you name two commits good and bad. It does a binary search, so it checks out a commit half-way between the last-marked good and bad ones, and if you mark that one good/bad it moves HEAD to half-way between that one and bad/good. It's cool to watch in action.
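The halving can be sketched in a few lines of Python. This is a simplified model over a straight-line history, not git's actual implementation (which has to deal with merges); the commit names and the `page_shows_numbers` check are hypothetical.

```python
def first_bad_commit(commits, is_good):
    """Return the first commit for which is_good() is False.

    commits is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad, like `git bisect good` / `git bisect bad`.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_good(commits[middle]):
            good = middle  # the breakage is somewhere after middle
        else:
            bad = middle   # the breakage is at middle or before it
    return commits[bad]


# Hypothetical history in which the page broke at commit "c5".
history = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
checked = []

def page_shows_numbers(commit):
    checked.append(commit)
    return history.index(commit) < history.index("c5")

culprit = first_bad_commit(history, page_shows_numbers)
print(culprit, "found after", len(checked), "checks")  # c5 found after 3 checks
```

Eight commits need only three checks, which is why bisecting stays quick even on long branches.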

You can also use this to find the "commit that introduced the change", it doesn't have to be a breaking change. Just so long as you have a quick way to test whether a particular commit falls on the "good" or "bad" side. In my case, that was just reloading the page to see if numbers showed up. Other ways are to run a unit test, run a script, or whatever.
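If the check can be scripted, `git bisect run <command>` will do the marking for you: exit code 0 means good, 125 means "skip this commit", and most other non-zero codes (1 through 127) mean bad. Below is a minimal sketch of such a check in Python. `fetch_page` is a placeholder and the "any digit means the numbers are showing" rule is an assumption for illustration.

```python
import re

GOOD, BAD, SKIP = 0, 1, 125  # exit codes that `git bisect run` understands

def classify(page_text):
    """Map the rendered page text to a bisect exit code."""
    if page_text is None:
        return SKIP  # the page could not be built or fetched at this commit
    # Assumption: a good page contains at least one digit.
    return GOOD if re.search(r"\d", page_text) else BAD

def fetch_page():
    """Placeholder: replace with whatever renders or fetches your page."""
    return None

def main():
    return classify(fetch_page())

print(classify("Total: 42"), classify("Total: "), classify(None))  # 0 1 125
```

In a real script the last line would be `sys.exit(main())`, so that git sees the exit code.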

Feb 22 2020 09:08 Smart car

I want a Smart car!

I wish it was cheaper, though :-(

Sep 19 2018 17:32 Agile best practices

There are many things that people call "Agile software development". Here's how I do it. It's similar to how Pivotal Labs did it (and for all I know, still do).

Have a force-ranked issue system

Issues (tickets, bugs, stories, whatever you call them) should be in a single list. The list should be ordered by priority in which your project manager wants things done. That way the people working on things can always just pull the top-most item off the list. Yes, please self-assign items as you start work on them so others know that it is being worked on.

There should be one list per team or whatever unit of organization you have. Any member of the team can pick up the next item. There is no confusion about which team owns an item.

Please include tech-debt items! And please allow ICs to add items.
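The "pull the top-most item" rule is simple enough to model. A minimal Python sketch follows; the `Issue` class, the issue titles, and the names are all hypothetical, and a real tracker does this through its UI.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Issue:
    title: str
    owner: Optional[str] = None

def pull_next(backlog, person):
    """Assign the highest-ranked unowned issue to person and return it."""
    for issue in backlog:  # the backlog is already in priority order
        if issue.owner is None:
            issue.owner = person  # self-assign so others can see it is taken
            return issue
    return None  # nothing left to pick up

backlog = [
    Issue("Fix login redirect", owner="asha"),
    Issue("Add phone field"),
    Issue("Tech debt: remove old importer"),
]
print(pull_next(backlog, "ben").title)  # Add phone field
```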

What counts as a ticket/issue/story?

Preferably include smaller self-contained features or bugs. Try not to put different things in the same issue, even if related or "in the same area". This is of course subjective, but one rule-of-thumb is "can it be tested by the PM or QA independently?"


Sprint length of two weeks seems ideal. One week is too short, four is too long. Three is just odd.

Sprint lifecycle should be something like: backlog grooming before the sprint, planning meeting on the first day (or before), check-ins/standups during, and a retro on the last day.

Backlog grooming: PM, EM, tech leads, and maybe some senior individual contributors (depends on size of team) can go through the list of issues, maybe triage if needed, re-order them if needed. Basically, prepare for the actual planning meeting. Ideally backlog grooming is ongoing, but sometimes it doesn't happen. Should be a 30-minute or less meeting.

Please ensure a healthy, non-zero, number of tech-debt/IC-contributed items are addressed. This is a good time to ask for more info from stakeholders or to close stale issues.

Planning: Whole team, issues should be quickly discussed so the whole team has a general idea of what's involved. Issues can be assigned now, but try to let ICs self-assign rather than handing them down. Should be an hour or less.

Check-ins/standups: Daily fast standup or every-other-day check-ins to ask and answer "are we on target?" (and adjust stakeholder expectations).

If it's a daily fast standup, each person should quickly (within 30 seconds -- do NOT time people) tell what they got done yesterday, what they intend to work on today, and importantly, any blockers or requests for help. No interruptions, jokes, side-tracking, or requests for clarification. Break out after the stand-up if any of this is desired, and no one has any obligation to stay for this followup. Note that this is often done asynchronously in the team chat channel.

The check-ins or standups shouldn't be more than 15 minutes; if you're always going over, you have a big team or a talkative team. Try to find a time that works for everyone. Just before lunch is good motivation!

Retro: Various teams use various forms of retro, but the core seems to be "what went well" and "what we could do better". Let everyone contribute by writing their lists all together on a board or paper or something, and then going around one by one to expand verbally. Shouldn't take more than 2 minutes each, adjust for team size. Try to get action items for improvements. Try to always have some positive items.

The retro can segue into backlog grooming or even planning for the next sprint.

Nov 23 2017 15:40 Effective issue tracker usage

Suppose you use an issue tracker like JIRA or Pivotal Tracker (PT). Here's how I use it effectively.

Stories, short for "user stories", issues, and tickets are used interchangeably here. There are subtle differences (or not so subtle ones) that don't matter for this discussion.

In JIRA you have multiple projects. Each project has a list of tickets. This should be called the "Backlog". Additionally, you can create a "Sprint".

In PT, in the default view, there are three columns: the "current sprint", the "backlog", and an "icebox". I'd advise finding the setting to enable stacking of current on top of backlog, because that gives a "truer" view.

The (current) sprint contains stories that are being worked on in this sprint. These stories will be in various states, such as Done, In Review, In Progress, To Do. Those are the 4 major states for stories. PT enforces the states while JIRA is more configurable. PT places stories in the sprint automatically based on project velocity (which is based on story points). In JIRA, the current sprint is manually determined during your sprint planning meeting (aka iteration planning (IPM)).

Every story in the current sprint must have a story owner and (if your team uses estimates) an estimate. Some shops do not assign a story owner until someone starts the story, and then that person is the owner. This works well in smaller teams consisting of mostly senior devs. PT enables this by auto-assigning owners.

Additionally, stories may belong to epics. That's a good way in JIRA to group stories. PT also has epics but I prefer to use Labels in PT. JIRA has labels but they're less visible to me. I'd advise using Components in JIRA for the same purpose.

The Backlog contains stories for upcoming sprints. They may be un-assigned, un-estimated, not even fully fleshed-out. All of those things can happen during the sprint planning meeting.

PT has an additional space, the Icebox. This is sort-of the ideas bin: stories that haven't been thought through, or that are a reminder to actually add a "real" story, or stuff that's deferred until some later date.

If team members run out of stories for the current sprint, they may be able to pick up stories from the backlog. After checking with the project manager, perhaps. Beware that stories in the backlog may not be quite ready to be worked on.

PT forces all the stories to be in force-ranked order, which means someone (project manager, in consultation with senior developers) needs to order the stories. JIRA can also be configured to do force-ranking. Stories should usually be placed in order of business priority and according to dependencies. Feature Y has priority but also depends on Feature X? Place X first (higher) in the sprint.
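The "priority first, but dependencies go higher" rule amounts to a dependency-aware ordering. Here is a minimal Python sketch; `rank_stories` and the story names are illustrative, and neither PT nor JIRA does this for you automatically as far as this post is concerned.

```python
def rank_stories(priority_order, depends_on):
    """Return stories in priority order, moving prerequisites ahead as needed.

    priority_order: story names, most important first.
    depends_on: dict mapping a story to the stories that must come before it.
    """
    ranked, placed, in_progress = [], set(), set()

    def place(story):
        if story in placed:
            return
        if story in in_progress:
            raise ValueError(f"circular dependency involving {story}")
        in_progress.add(story)
        for prerequisite in depends_on.get(story, []):
            place(prerequisite)  # prerequisites land above the story
        in_progress.discard(story)
        placed.add(story)
        ranked.append(story)

    for story in priority_order:
        place(story)
    return ranked

order = rank_stories(
    ["Feature Y", "Feature Z", "Feature X"],
    {"Feature Y": ["Feature X"]},
)
print(order)  # ['Feature X', 'Feature Y', 'Feature Z']
```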

(Why the PM? That's who's running the project. Why the senior devs? Because that's who can point out hidden dependencies. In a small enough team, it'd be all the devs... which means you're having a sprint planning meeting. Or a backlog-grooming meeting, which is a pre-planning meeting. Because of time constraints. These meetings need to be extremely focussed, and I will write a separate article about them.)

Tag: howto agile jira

Nov 18 2017 13:41 Straightforward git branching workflow

Here's a simple git branching workflow that I like and use.

All "production" work is on the master branch.

Start a feature, first create a new branch:

git checkout -b feature-name

Perhaps prefixed or suffixed with a ticket/issue identifier, e.g. axp-1090-add-phone-field
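If you want the names to be consistent, the branch name can be derived from the ticket. A small Python sketch; the helper `branch_name` is made up for this example.

```python
import re

def branch_name(ticket_id, summary):
    """Build a branch name such as axp-1090-add-phone-field."""
    # Keep only lowercase letters and digits, joined with hyphens.
    words = re.findall(r"[a-z0-9]+", f"{ticket_id} {summary}".lower())
    return "-".join(words)

print(branch_name("AXP-1090", "Add phone field"))  # axp-1090-add-phone-field
```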

Usually useful to push the branch to your git origin (often, GitHub) right away, and have git track the remote. This is done in one command with:

git push -u origin HEAD

HEAD is a special thingy referring to the commit we're "on" now.

Do the work. Maybe multiple commits, maybe some work-in-progress commits.

git add path/to/file
git add -p
git commit -m "AXP-1090 Add phone field"

add with -p makes git prompt for each change. Useful to review the changes and to be selective. Maybe some changes should be in a different commit?

The git commit command can open a text editor for the commit message, or you can specify the message as shown here.

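To re-order commits, use `git rebase -i master`, which replays your branch's commits on top of master and opens the list in an editor. (A bare `git rebase -i` uses the branch's configured upstream, which after the `git push -u` above is your own remote branch, not master.)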

Move the commits as presented in the editor in the order you want. Note that work-in-progress commits can be squashed by changing "pick" to "squash". Each squashed commit will become part of the commit above it. And yes you can squash multiple commits in a row.
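To make the squash rule concrete, here is a toy Python model of the todo list. It only tracks which messages end up grouped together; real git also combines the changes and lets you edit the final message. The commit messages are hypothetical.

```python
def apply_todo(todo):
    """Fold a rebase todo list into groups of commit messages.

    todo is a list of (action, message) pairs, top to bottom as the editor
    shows them. A "squash" entry is folded into the commit above it.
    """
    result = []
    for action, message in todo:
        if action == "pick":
            result.append([message])
        elif action == "squash":
            if not result:
                raise ValueError("cannot squash: there is no commit above")
            result[-1].append(message)
        else:
            raise ValueError(f"unsupported action: {action}")
    return result

todo = [
    ("pick", "AXP-1090 Add phone field"),
    ("squash", "WIP"),
    ("squash", "WIP fix typo"),
    ("pick", "AXP-1090 Validate phone field"),
]
for group in apply_todo(todo):
    print(group)
```

Four entries in the todo list become two commits, with both work-in-progress commits folded into the first.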

Solve any conflicts. In the default configuration git should show you lots of help on how.

To integrate any changes that have happened in master, every now and then (with a clean branch! see `git status`) do a fetch and rebase (and do one of these when you're ready to merge your changes back to master):

master:  M1 -> M2 -> M3
your branch:    B1 -> B2
git fetch
git rebase origin/master
master:  M1 -> M2 -> M3
your branch:           B1 -> B2

Beware that if you switch to master (git checkout master) you will still be behind origin/master, and need to do `git rebase` *on master*.

When the branch is in a sufficiently clean state, push your work to the remote:

git push

If you've rebased, you'll need to use `git push -f` i.e. "force-push". And that is why we use branches, and that is why everyone should be on their own branch. Otherwise, force-pushing will overwrite other people's work on the branch. That is why we never force-push master (except when we do).

Use `git status` often, which is why I have a shell alias `gs` for `git status`. And I have `git` itself aliased as `g`, with various git commands shortened in my ~/.gitconfig

To merge your changes to master, open a Pull Request on GitHub or, if you're doing it yourself manually, you can merge. First, rebase against master, then switch to master and then merge your branch.

git rebase master
git checkout master
git merge axp-1090-add-phone-field
git branch -d axp-1090-add-phone-field # optional, delete your branch

At this point you should have a clean master that you can push (not force-push) to origin.

Tag: git techy

Nov 17 2017 12:18 Spark with Databricks and AWS EMR

Databricks is a web-based Spark notebook system, and also the company that wrote several spark libraries. The notebooks support Python and Scala.

Databricks runs your jobs in your AWS account, on EC2 instances that Databricks manages. The initial setup is a little tricky.

Once everything is set up, it's a pretty good system for developing and debugging Spark-based jobs. It can run jobs on a cron-like schedule. Databricks has support for retrying failed jobs, and can notify about success/failure by email. It gives good access to the Spark worker logs for debugging.

However, this is expensive to run. I came up with a workflow that involves developing and debugging on Databricks, and then exporting the notebook as a script to be run in EMR (Elastic MapReduce, an AWS product). I use Python and pyspark, so this works pretty well.

I do need to import the libraries that Databricks imports automatically. I forget the exact import statements I used, but they are easy enough to figure out.

I used Luigi, which is a Python library for setting up task workflows (see also: Airflow). I set up a few abstract classes to encapsulate "this is an EMR job". Then I extended one of those classes to actually run a spark job, something like:

class ActualJob(AbstractSparkJob):
    luigi_param = luigi.Parameter()  # I usually pass a timestamp to every task

    def spark_cmd(self):
        return " --runs --spark-script --located-in s3://"

spark_cmd returns a command line string.

My AbstractSparkJob takes the given command and does one of two things: either submit an EMR step to the EMR cluster, using command_runner.jar to run the command, or ssh into the cluster and run spark-submit with the given command as its parameters.

(I don't have all the code to do that available to post right now, sorry)

The abstract classes encapsulate all the work of spinning up an EMR cluster (and shutting it down), and making the AWS API calls via the boto library, and the work to ssh in and run spark-submit.
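For a rough idea of the two paths, here is a hedged Python sketch. It is not the original code: the function names, the step name, and the example bucket are hypothetical. The dict follows the step format that boto3's EMR `add_job_flow_steps` call accepts, which is an assumption about the boto version in use.

```python
import shlex

def emr_step(name, spark_cmd):
    """Describe an EMR step that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # EMR wants the command split into separate arguments.
            "Args": ["spark-submit"] + shlex.split(spark_cmd),
        },
    }

def ssh_command(spark_cmd):
    """Build the command line to run after ssh-ing into the cluster."""
    return "spark-submit " + spark_cmd.strip()

# Hypothetical command string, standing in for what spark_cmd returns.
cmd = " --deploy-mode cluster s3://example-bucket/scripts/job.py"
step = emr_step("actual-job", cmd)
print(step["HadoopJarStep"]["Args"])
print(ssh_command(cmd))
```

With boto3, the dict would be passed along the lines of `client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])`.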

The abstract classes make it easy for any other data infrastructure developer person to add more jobs.

The actual script that was exported from Databricks lives in S3 and is referenced by the command shown above. EMR does the work of fetching it from S3 and running it. Part of my data pipeline startup is to copy all the EMR scripts, which I keep in a convenient subdirectory, up to a specific prefix in S3. That way my actual Spark scripts all live in the same code repository as the rest of the pipeline.

Tag: spark aws

Feb 20 2017 14:17 Power consumption

Configuration   Power   Current
CPU+RAM+fans    23 W    210 mA
+ HDD           37 W    240 mA
+ GTX 1050      57 W    500 mA
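Assuming the readings are cumulative (each row adds one component to the previous setup, as the "+" suggests), the per-component draw falls out by subtraction. A quick Python sketch:

```python
# Power readings in watts, copied from the rows above.
readings = [("CPU+RAM+fans", 23), ("+ HDD", 37), ("+ GTX 1050", 57)]

previous = 0
added_watts = {}
for label, watts in readings:
    # Assumption: each reading includes everything from the rows before it.
    added_watts[label] = watts - previous
    previous = watts

print(added_watts)  # {'CPU+RAM+fans': 23, '+ HDD': 14, '+ GTX 1050': 20}
```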

Tag: geeky tech

Jan 22 2017 17:11 How to back up Samsung Galaxy S4 using adb

Connect the phone via USB. Go into Settings, About Phone, click on Build Version about 7 times. Then Developer Options becomes available, use that to turn on USB Debugging.

Use `lsusb` to get the phone's code. Example output:

Bus 001 Device 059: ID 04e8:6866 Samsung Electronics Co., Ltd GT-I9...

Create a file `~/.android/adb_usb.ini` containing the line "0x04e8", the vendor ID from the lsusb output above.

As root, drop a file called /etc/udev/rules.d/s4-android.rules containing the following:

SUBSYSTEM=="usb", ATTR{idVendor}=="04e8", MODE="0666"

Note that the vendor id is again from the lsusb above.
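Pulling the vendor id out of the lsusb line by hand is easy to get wrong, so here is a small Python sketch that does it. The helper names are made up, and the rule uses `ATTR{idVendor}`, the match key current udev versions expect.

```python
import re

def vendor_id(lsusb_line):
    """Extract the 4-digit hex vendor id from one line of lsusb output."""
    match = re.search(r"ID ([0-9a-fA-F]{4}):([0-9a-fA-F]{4})", lsusb_line)
    if match is None:
        raise ValueError("no 'ID vendor:product' field found")
    return match.group(1).lower()

def adb_usb_line(vid):
    """The line that goes into ~/.android/adb_usb.ini."""
    return "0x" + vid

def udev_rule(vid):
    """The udev rule that makes the device world-accessible."""
    return f'SUBSYSTEM=="usb", ATTR{{idVendor}}=="{vid}", MODE="0666"'

line = "Bus 001 Device 059: ID 04e8:6866 Samsung Electronics Co., Ltd GT-I9..."
vid = vendor_id(line)
print(adb_usb_line(vid))  # 0x04e8
print(udev_rule(vid))
```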

Run this too (as root):

chmod 644 /etc/udev/rules.d/s4-android.rules

Chef recipe for that file:

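file 'udev_rule_s4_android' do
    path '/etc/udev/rules.d/s4-android.rules'
    # udev rule for the vendor id reported by lsusb
    content 'SUBSYSTEM=="usb", ATTR{idVendor}=="04e8", MODE="0666"' + "\n"
    mode '0644'
    owner 'root'
    group 'root'
end

# Restart udev only when the rule file changes
execute 'udev-restart' do
    command '/etc/init.d/udev restart'
    subscribes :run, 'file[udev_rule_s4_android]', :immediately
    action :nothing
end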

Connect the phone in 'PTP' mode:
Connect the phone via USB. Drag down the notifications, click on the 'Connected as...' notification, and set it to PTP or Camera mode.

`adb shell` should now get you a shell on the phone.


Tag: android

Dec 18 2016 07:37 Star Wars: Rogue One

Watched Rogue One on Dec 15th at the event.

Spoilers ahead!

Call-backs to the original movies:

  1. Blue milk (is that an even older call-back to "Blue Harvest"?)
  2. The Governator
  3. The Princess
  4. The reprise of the blockade runner boarding scene (at the airlock *to* the blockade runner)
  5. Choking people
  6. Imperial battle station design elements
  7. "Commence primary ignition" (Which I always heard as the nonsensical "Commence spider ignition")
  8. The guy who has the death sentence in twelve systems
  9. Nice reveal at the end there. Saw it coming at the exact right moment (not too early, not too late)

Random observations:

Somewhere, some time, I remember reading a thing that was supposed to be an early draft of the original movie. It had a subtitle "Journal of the Whills", and Biggs Darklighter and his brother Luke were searching for (the? singular?) Kyber crystal. I can't find this thing anywhere, though Gizmodo seems to remember the same thing.

Who cleans all the imperial installations and mirror-polishes everything?

At least that platform at the data archive had a hand-rail. Who builds platforms over sheer drops, and why? Was that control panel there so you can see the dish while aiming it? How would that even work, you don't aim a dish like that by eye!

Things I didn't like:

The punning: "Choke on your aspirations"? (That's a double pun btw). It doesn't fit the character, I'm sorry, I can't take that. It's as bad as Han shot first.

While the quasi-Jedi guy was awesome, I felt that the character was simply shoe-horned in there for comic relief. One droid (played by Alan Tudyk! +1!) is enough for comic relief, thanks.

Inexplicable things:

They have a shield around an entire planet, where is it generated from? The DS-2 shield in ROTJ was generated from a base on (the moon of) Endor.

That's a short list.

Observations about the Governor:

So we know the actor has been dead for two decades. Apparently they got someone else to be the body double (see the Wikipedia article for the movie). I assume they texture-mapped the original actor's face on. To me it seemed as if they'd got a character out of a video game.

The effect was *just* a *little* bit short of real. Maybe because I was looking for it. I'm trying to read it as the Governor has cold dead eyes because he's that much of a stone-cold whatever, not because it's animated.

Sadly I spent so much time staring at the effect that I missed the dialogue.

Feb 15 2016 22:03 Moving things in AWS S3

Recently I had several hundred small files in an AWS S3 bucket, in folders by date.

Something like s3://bucket/2016-02-08/ (and a couple layers deeper), with a few "directories" under the dated "directory". (The sub-directories were the first letter of the ... things, so I had about 62 sub-directories (A-Z, a-z, 0-9). This is a naive hash function.)

I wanted to move them into a year-based "directory", so s3://bucket/2016/2016-02-08/

(Why am I quoting the word "directory"? Because they're not really directories, they're "prefixes". This is relevant if you use the AWS S3 SDK libraries.)
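Since keys are just strings with prefixes, the "move" boils down to computing a new key for each object; S3 has no rename, so a move is a copy to the new key followed by a delete of the old one. A minimal Python sketch of the key mapping, where `year_prefixed` is a made-up helper name and the example key is hypothetical:

```python
def year_prefixed(key):
    """Map '2016-02-08/a/file' to '2016/2016-02-08/a/file'."""
    year = key[:4]
    if not (year.isdigit() and key[4:5] == "-"):
        raise ValueError(f"key does not start with a dated prefix: {key}")
    return f"{year}/{key}"

print(year_prefixed("2016-02-08/a/apple.json"))  # 2016/2016-02-08/a/apple.json
```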

Moving them via the S3 web "console"'s cut/paste interface is slow. REALLLLLY slow. Like, multiple-days slow.

So I (after trying a few other things) pulled out the aws command-line tool (AWS CLI).

Since the sub-directories were letters and numbers, I could do this:

`for x in a b c d;do echo aws s3 mv --recursive s3://bucket/2016-02-08/$x s3://bucket/2016/2016-02-08/$x \&;done >`

The `for` loop runs through a, b, c, d (different from A, B, C, D), and sets up a recursive move operation. This move is much faster using the AWS CLI. Additionally, I background the process of moving the 'a's (using the `\&`) so the 'b's can start right away, and so forth.

But I don't run the commands right away. Notice that they're being `echo`ed. Capture the output in a file, and run that file as a script. Why?

Because I can now set up a second file with e f g h, to go right after the first, or even in parallel. So now I have up to 4 or 8 move operations going at once. Watch the whole thing with `watch "ps axww|grep scr"` in a separate terminal, of course.

But mainly because the `&` backgrounding interacts weirdly with the for loop.
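The same batch files can be generated from Python instead of a shell loop. This is a sketch under assumptions: the function names, the batch size of 4, and the trailing `wait` (a shell builtin that blocks until the backgrounded moves finish) are choices made for this example; the bucket name is the placeholder used above.

```python
import string

def move_commands(date, subdirs):
    """One backgrounded `aws s3 mv --recursive` command per sub-directory."""
    year = date[:4]
    return [
        f"aws s3 mv --recursive s3://bucket/{date}/{x} s3://bucket/{year}/{date}/{x} &"
        for x in subdirs
    ]

def batches(items, size):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# A-Z, a-z, 0-9: the 62 first-letter sub-directories.
all_subdirs = string.ascii_uppercase + string.ascii_lowercase + string.digits

# Each script runs its moves in parallel, then waits for all of them.
scripts = [
    "\n".join(move_commands("2016-02-08", batch)) + "\nwait\n"
    for batch in batches(all_subdirs, 4)
]
print(len(scripts))  # 16
print(scripts[0])
```

Each string in `scripts` can be written to its own file and run with a shell, one after another or a few at a time.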

With this, I was done in, well, a couple of hours. A lot of that was waiting for the last copy-paste I ran in the web console to finish.